Metadata-Version: 2.4
Name: cs-hashindex
Version: 20250531
Summary: A command and utility functions for making listings of file content hashcodes and manipulating directory trees based on such a hash index.
Keywords: python3
Author-email: Cameron Simpson <cs@cskk.id.au>
Description-Content-Type: text/markdown
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Operating System :: OS Independent
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: License :: OSI Approved :: GNU General Public License v3 or later (GPLv3+)
Requires-Dist: blake3
Requires-Dist: cs.cmdutils>=20240211
Requires-Dist: cs.context>=20250528
Requires-Dist: cs.deco>=20250531
Requires-Dist: cs.fs>=20250528
Requires-Dist: cs.fstags>=20250528
Requires-Dist: cs.hashutils>=20250414.1
Requires-Dist: cs.lex>=20250428
Requires-Dist: cs.logutils>=20250323
Requires-Dist: cs.pfx>=20250308
Requires-Dist: cs.psutils>=20250513
Requires-Dist: cs.resources>=20250325
Requires-Dist: cs.upd>=20250426
Requires-Dist: icontract
Requires-Dist: typeguard
Project-URL: MonoRepo Commits, https://bitbucket.org/cameron_simpson/css/commits/branch/main
Project-URL: Monorepo Git Mirror, https://github.com/cameron-simpson/css
Project-URL: Monorepo Hg/Mercurial Mirror, https://hg.sr.ht/~cameron-simpson/css
Project-URL: Source, https://github.com/cameron-simpson/css/blob/main/lib/python/cs/hashindex.py

A command and utility functions for making listings of file content hashcodes
and manipulating directory trees based on such a hash index.

*Latest release 20250531*:
hashindex rsync: bugfix remote rearrange, rearranges dstdir into dstdir using the source index.

This largely exists to solve my "what has changed remotely?" or
"what has been filed where?" problems by comparing file trees
using the files' content hashcodes.

This does require reading every file once to compute its hashcode,
but the hashcodes (and file sizes and mtimes when read) are
stored beside the file in `.fstags` files (see the `cs.fstags`
module), so that a file does not need to be reread on subsequent
comparisons.
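
The caching idea can be illustrated with a minimal sketch. This is not how `cs.fstags` stores its data: the `cache` dict, the `cached_checksum` name and the use of SHA-256 from `hashlib` (standing in for `blake3`) are all assumptions, chosen so that the example runs with only the standard library.

```python
import hashlib
import os


def cached_checksum(fspath, cache):
    """Return the SHA-256 hex digest of the file at `fspath`.

    `cache` is a plain dict (a hypothetical stand-in for the tags
    kept beside the file). The file is only reread if its size or
    mtime differ from the cached values.
    """
    st = os.stat(fspath)
    entry = cache.get(fspath)
    if (entry is not None and entry["size"] == st.st_size
            and entry["mtime"] == st.st_mtime):
        # Size and mtime unchanged: trust the stored hashcode.
        return entry["hashcode"]
    h = hashlib.sha256()
    with open(fspath, "rb") as f:
        # Read in 1 MiB chunks to bound memory use on large files.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    hashcode = h.hexdigest()
    cache[fspath] = {
        "size": st.st_size,
        "mtime": st.st_mtime,
        "hashcode": hashcode,
    }
    return hashcode
```

The first call for a file pays the cost of reading it; later calls return the stored value until the file's size or mtime changes.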

`hashindex` knows how to invoke itself remotely using `ssh`
(this does require `hashindex` to be installed on the remote host)
and can thus be used to compare a local and remote tree, for example:

    hashindex comm -1 localtree remotehost:remotetree

When you point `hashindex` at a remote tree, it uses `ssh` to
run `hashindex` on the remote host, so all the content hashing
is done locally to the remote host instead of copying files
over the network.
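
As a rough illustration of what such a remote invocation looks like, the sketch below only builds the `ssh` command line; it does not run anything. The remote command shape (`hashindex ls -h hashname -r rdirpath`) is taken from the `read_remote_hashindex` description in this document, while the function name `remote_ls_argv` and its defaults are hypothetical. The module's own `run_remote_hashindex` and `read_remote_hashindex` functions are the real implementation.

```python
import shlex


def remote_ls_argv(rhost, rdirpath, *, hashname="blake3",
                   ssh_exe="ssh", hashindex_exe="hashindex",
                   relative=False):
    """Build an argv list which would run `hashindex ls` on `rhost`.

    This is an illustrative helper, not part of cs.hashindex.
    """
    remote_argv = [hashindex_exe, "ls", "-h", hashname]
    if relative:
        remote_argv.append("-r")
    remote_argv.append(rdirpath)
    # ssh passes a single command string to the remote shell,
    # so quote each remote argument before joining.
    return [ssh_exe, rhost, shlex.join(remote_argv)]


print(remote_ls_argv("remotehost", "remotetree", relative=True))
```

Because only the listing of hashcodes and paths crosses the network, the cost of the comparison is dominated by the hashing done on each host.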

You can also use it to rearrange a tree based on the locations
of corresponding files in another tree. Consider a media tree
replicated between 2 hosts. If the source tree gets rearranged,
the destination can be equivalently rearranged without copying
the files, for example:

    hashindex rearrange sourcehost:sourcetree localtree

If `fstags mv` was used to do the original rearrangement then
the hashcodes will be copied to the new locations, saving a
rescan of the source file. I keep a shell alias `mv="fstags mv"`
so this is routine for me.
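
The planning step behind such a rearrangement can be sketched as follows. This is a simplified illustration, not the module's `rearrange` function: `plan_moves` is a hypothetical name, hashcodes are plain strings, and only the first reference location for each hashcode is used.

```python
def plan_moves(src_index, ref_paths_by_hashcode):
    """Yield `(current_relpath, wanted_relpath)` pairs.

    `src_index` is an iterable of `(hashcode, relpath)` pairs for
    the tree to be rearranged. `ref_paths_by_hashcode` maps each
    hashcode to the list of relative paths holding that content
    in the reference tree.
    """
    for hashcode, relpath in src_index:
        if hashcode is None:
            # Unreadable file: nothing to match against.
            continue
        wanted = ref_paths_by_hashcode.get(hashcode, [])
        if not wanted or relpath in wanted:
            # Unknown to the reference tree, or already in place.
            continue
        yield relpath, wanted[0]


src_index = [("h1", "incoming/x.mp4"), ("h2", "films/y.mp4")]
reference = {"h1": ["films/x.mp4"], "h2": ["films/y.mp4"]}
for old, new in plan_moves(src_index, reference):
    print(old, "->", new)
```

Since files are matched by content rather than by name, a file which was both moved and renamed in the reference tree is still found.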

A common "backup to remote" use case of mine is addressed by:

    hashindex rsync src dst

which rearranges `dst` based on `src`, then uses rsync(1) to update `dst`.

I have a backup script [`histbackup`](https://hg.sr.ht/~cameron-simpson/css/browse/bin/histbackup)
which works by making a hard link tree of the previous backup
and `rsync`ing into it.  It has long been subject to huge
transfers if the source tree gets rearranged. Now it has a
`--hashindex` option to get it to run a `hashindex rearrange`
between the hard linking to the new backup tree and the `rsync`
from the source to the new tree.

If network bandwidth is limited or subject to a quota, you can use the
comparison function to prepare a list of files missing from the
remote location and copy them to a transfer drive for carrying
to the remote site when opportune. Example:

    hashindex comm -1 -o '{fspath}' src rhost:dst \
    | rsync -a --files-from=- src/ xferdir/

I've got a script [`prep-xfer`](https://hg.sr.ht/~cameron-simpson/css/browse/bin-cs/prep-xfer)
which does this with some conveniences and sanity checks.
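
The comparison behind `comm -1` amounts to a set difference on hashcodes. A minimal sketch, assuming each index is a list of `(hashcode, fspath)` pairs with plain strings as hashcodes (the function name `only_in_first` is illustrative and not part of the module):

```python
def only_in_first(index1, index2):
    """Return the `(hashcode, fspath)` pairs from `index1`
    whose content does not appear anywhere in `index2`.
    """
    hashcodes2 = {
        hashcode for hashcode, _ in index2 if hashcode is not None
    }
    return [
        (hashcode, fspath)
        for hashcode, fspath in index1
        if hashcode is not None and hashcode not in hashcodes2
    ]


local = [("h1", "src/a.jpg"), ("h2", "src/b.jpg")]
remote = [("h1", "dst/archive/a.jpg")]
print(only_in_first(local, remote))
```

Note that `src/a.jpg` is not reported even though its path differs on the remote side: only the content matters.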

Short summary:
* `dir_filepaths`: Generator yielding the filesystem paths of the files in `dirpath`.
* `dir_remap`: Generator yielding `(srcpath,[remapped_paths])` 2-tuples based on the hashcodes keying `fspaths_by_hashcode`.
* `file_checksum`: Return the hashcode for the contents of the file at `fspath`. Warn and return `None` on `OSError`.
* `hashindex`: Generator yielding `(hashcode,filepath)` 2-tuples for the files in `src`, which may be a file or a `RemotePath` or a `(host,fspath)` 2-tuple or a filesystem path. Note that this yields `(None,filepath)` for files which cannot be accessed.
* `hashindex_map`: Construct a mapping of hashcodes to filesystem paths by walking `dirpath`.
* `HashIndexCommand`: A tool to generate and use indices of file content hashcodes.
* `localpath`: Return a filesystem path modified so that it cannot be misinterpreted as a remote path such as `user@host:path`.
* `main`: Commandline implementation.
* `merge`: Merge `srcpath` to `dstpath`, _preserving `dstpath` if present_. Return `True` if something was done, `False` if this was a no-op. Raise `FileExistsError` if `dstpath` exists with different content.
* `paths_remap`: Generator yielding `(srcpath,fspaths)` 2-tuples.
* `read_hashindex`: A generator which reads lines from the file `f` and yields `(hashcode,fspath)` 2-tuples for each line. If there are parse errors the `hashcode` or `fspath` may be `None`.
* `read_remote_hashindex`: A generator which reads a hashindex of a remote directory. This runs: `hashindex ls -h hashname -r rdirpath` on the remote host. It yields `(hashcode,fspath)` 2-tuples.
* `rearrange`: Rearrange the files in `srcdirpath` according to the hashcode->[relpaths] mapping `rfspaths_by_hashcode`.
* `remote_rearrange`: Rearrange a remote directory `srcdir` on `rhost` into `dstdir` on `rhost` according to the hashcode mapping `fspaths_by_hashcode`.
* `run_remote_hashindex`: Run a remote `hashindex` command. Return the `CompletedProcess` result or `None` if `doit` is false. Note that as with `cs.psutils.run`, the arguments are resolved via `cs.psutils.prep_argv`.
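
To make the hashcode->[fspath,...] mapping concrete, here is a minimal standard-library sketch of the kind of structure `hashindex_map` is described as building. It is an approximation under stated assumptions: SHA-256 stands in for `blake3`, files are read whole, hashcodes are hex strings rather than `BaseHashCode` instances, and skipping dot directories is an assumption made here (the release log only mentions skipping dot files and nonregular files).

```python
import hashlib
import os
from collections import defaultdict


def hashindex_map_sketch(dirpath, *, relative=False):
    """Walk `dirpath` and return a dict mapping each content
    hashcode (a hex string) to the list of paths with that content.
    """
    paths_by_hashcode = defaultdict(list)
    for dirname, subdirs, filenames in os.walk(dirpath):
        # Prune dot directories in place and walk in sorted order.
        subdirs[:] = sorted(d for d in subdirs if not d.startswith("."))
        for filename in sorted(filenames):
            if filename.startswith("."):
                continue
            fspath = os.path.join(dirname, filename)
            # Only regular files; do not follow symlinks.
            if os.path.islink(fspath) or not os.path.isfile(fspath):
                continue
            with open(fspath, "rb") as f:
                hashcode = hashlib.sha256(f.read()).hexdigest()
            if relative:
                fspath = os.path.relpath(fspath, dirpath)
            paths_by_hashcode[hashcode].append(fspath)
    return dict(paths_by_hashcode)
```

Files with identical content end up in the same list, which is what lets a rearrangement find every place a given file should live.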

Module contents:
- <a name="dir_filepaths"></a>`dir_filepaths(dirpath: str, *, fstags: Optional[cs.fstags.FSTags] = <function <lambda> at 0x102d040e0>)`: Generator yielding the filesystem paths of the files in `dirpath`.
- <a name="dir_remap"></a>`dir_remap(srcdirpath: str, fspaths_by_hashcode: Mapping[cs.hashutils.BaseHashCode, List[str]], *, hashname: str)`: Generator yielding `(srcpath,[remapped_paths])` 2-tuples
  based on the hashcodes keying `fspaths_by_hashcode`.
- <a name="file_checksum"></a>`file_checksum(fspath: str, hashname: str = 'blake3', *, fstags: Optional[cs.fstags.FSTags] = <function <lambda> at 0x102d040e0>) -> Optional[cs.hashutils.BaseHashCode]`: Return the hashcode for the contents of the file at `fspath`.
  Warn and return `None` on `OSError`.
- <a name="hashindex"></a>`hashindex(src: Union[io.TextIOBase, cs.fs.RemotePath, str, Tuple[Optional[str], str]], *, hashname: str, relative: bool = False, runstate: Optional[cs.resources.RunState] = <function uses_runstate.<locals>.<lambda> at 0x102e782c0>, **kw) -> Iterable[Tuple[Optional[cs.hashutils.BaseHashCode], Optional[str]]]`: Generator yielding `(hashcode,filepath)` 2-tuples
  for the files in `src`, which may be a file or a `RemotePath`
  or a `(host,fspath)` 2-tuple or a filesystem path.
  Note that this yields `(None,filepath)` for files which cannot be accessed.
- <a name="hashindex_map"></a>`hashindex_map(dirpath: str, *, hashname: str, relative=False) -> dict[cs.hashutils.BaseHashCode, list[str]]`: Construct a mapping of hashcodes to filesystem paths
  by walking `dirpath`.
- <a name="HashIndexCommand"></a>`class HashIndexCommand(cs.cmdutils.BaseCommand)`: A tool to generate and use indices of file content hashcodes.

  Usage summary:

      Usage: hashindex [common-options...] subcommand [options...]
        A tool to generate and use indices of file content hashcodes.
        Subcommands:
          comm [common-options...] {-1|-2|-3|-r} {path1|-} {path2|-}
            Compare the filepaths in path1 and path2 by content.
            Options:
              -1  List hashes and paths only present in path1.
              -2  List hashes and paths only present in path2.
              -3  List hashes and paths present in path1 and path2.
              -r  Emit relative paths in the listing.
          help [common-options...] [-l] [-s] [subcommand-names...]
            Print help for subcommands.
            This outputs the full help for the named subcommands,
            or the short help for all subcommands if no names are specified.
            Options:
              -l  Long listing.
              -r  Recurse into subcommands.
              -s  Short listing.
          info [common-options...] [field-names...]
            Recite general information.
            Explicit field names may be provided to override the default listing.
          ls [common-options...] [options...] [[host:]path...]
            Walk filesystem paths and emit a listing.
            The default path is the current directory.
            In quiet mode (-q) the hash indices are just updated
            and nothing is printed.
            Options:
              -r  Emit relative paths in the listing.
                  This requires each command line path to be a directory.
          rearrange [common-options...] {[[user@]host:]refdir|-} [[user@]rhost:]srcdir [dstdir]
            Rearrange files from srcdir into dstdir based on their positions in refdir.
            Arguments:
              refdir    The reference directory, which may be local or remote
                        or "-" indicating that a hash index will be read from
                        standard input.
              srcdir    The directory containing the files to be rearranged,
                        which may be local or remote.
              dstdir    Optional destination directory for the rearranged files.
                        Default is the srcdir.
            Options:
              -1    Rearrange only one file.
              --ln  Hard link files instead of moving them.
              -s    Symlink mode.
          repl [common-options...]
            Run a REPL (Read Evaluate Print Loop), an interactive Python prompt.
            Options:
              --banner banner  Banner.
          rsync [common-options...] [options] srcdir dstdir
            Rearrange dstdir according to srcdir then rsync srcdir into dstdir.
            Options:
              --bwlimit bwlimit  Rsync bandwidth limit, passed to rsync.
              --delete           Delete from dstdir, passed to rsync.
              --partial          Keep partially transferred files, passed to rsync.
          shell [common-options...]
            Run a command prompt via cmd.Cmd using this command's subcommands.

*`HashIndexCommand.Options`*

*`HashIndexCommand.cmd_comm(self, argv, *, runstate: Optional[cs.resources.RunState] = <function uses_runstate.<locals>.<lambda> at 0x102eb3560>)`*:
Usage: {cmd} {{-1|-2|-3|-r}} {{path1|-}} {{path2|-}}
Compare the filepaths in path1 and path2 by content.
Options:
  -1  List hashes and paths only present in path1.
  -2  List hashes and paths only present in path2.
  -3  List hashes and paths present in path1 and path2.
  -r  Emit relative paths in the listing.

*`HashIndexCommand.cmd_ls(self, argv, *, runstate: Optional[cs.resources.RunState] = <function uses_runstate.<locals>.<lambda> at 0x102eb39c0>)`*:
Usage: {cmd} [options...] [[host:]path...]
Walk filesystem paths and emit a listing.
The default path is the current directory.
In quiet mode (-q) the hash indices are just updated
and nothing is printed.
Options:
  -r  Emit relative paths in the listing.
      This requires each command line path to be a directory.

*`HashIndexCommand.cmd_rearrange(self, argv)`*:
Usage: {cmd} {{[[user@]host:]refdir|-}} [[user@]rhost:]srcdir [dstdir]
Rearrange files from srcdir into dstdir based on their positions in refdir.
Arguments:
  refdir    The reference directory, which may be local or remote
            or "-" indicating that a hash index will be read from
            standard input.
  srcdir    The directory containing the files to be rearranged,
            which may be local or remote.
  dstdir    Optional destination directory for the rearranged files.
            Default is the srcdir.
Options:
  -1    Rearrange only one file.
  --ln  Hard link files instead of moving them.
  -s    Symlink mode.

*`HashIndexCommand.cmd_rsync(self, argv, *, fstags: Optional[cs.fstags.FSTags] = <function <lambda> at 0x102d040e0>)`*:
Usage: {cmd} [options] srcdir dstdir
Rearrange dstdir according to srcdir then rsync srcdir into dstdir.
Options:
  --bwlimit bwlimit  Rsync bandwidth limit, passed to rsync.
  --delete           Delete from dstdir, passed to rsync.
  --partial          Keep partially transferred files, passed to rsync.

*`HashIndexCommand.poppathspec(argv: List[str], name: str = 'dirspec', check_isdir=False) -> cs.fs.RemotePath`*:
Pop a leading dirspec from `argv`, a filesystem path with
an optional leading `[user@]rhost:` prefix.
Return a `(host,fspath)` 2-tuple being the remote host (`None` if omitted)
and the filesystem path.
Raises `GetoptError` on a missing or invalid argument.
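
The split of a dirspec into host and path can be sketched as below. The exact parsing rule used by `poppathspec` is not stated here, so `split_pathspec` is a guess: it treats the text before the first colon as a host only if it is nonempty and contains no `/`. The companion `localpath_sketch` follows the rule documented for `localpath` in this document (unchanged if there is no colon, if the path is absolute, or if it starts with `./`). Both names are hypothetical.

```python
import os.path


def split_pathspec(pathspec):
    """Split `[user@]rhost:path` into `(host, fspath)`.

    `host` is `None` for a local path. Assumed rule: a host prefix
    is nonempty and contains no slash.
    """
    host, colon, fspath = pathspec.partition(":")
    if colon and host and "/" not in host:
        return host, fspath
    return None, pathspec


def localpath_sketch(fspath):
    """Return `fspath`, prefixed with `./` if it could otherwise
    be mistaken for a remote `host:path` specification.
    """
    if ":" not in fspath or os.path.isabs(fspath) or fspath.startswith("./"):
        return fspath
    return "./" + fspath


print(split_pathspec("user@host:media/tree"))
print(split_pathspec(localpath_sketch("odd:name")))
```

Under these assumptions the two functions are consistent: a path passed through `localpath_sketch` is always parsed as local by `split_pathspec`.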

*`HashIndexCommand.run_context(self, *, fstags: Optional[cs.fstags.FSTags] = <function <lambda> at 0x102d040e0>, **kw)`*:
Sanity check the hashname, open the fstags.
- <a name="localpath"></a>`localpath(fspath: str) -> str`: Return a filesystem path modified so that it cannot be
  misinterpreted as a remote path such as `user@host:path`.

  If `fspath` contains no colon (`:`) or is an absolute path
  or starts with `./` then it is returned unchanged.
  Otherwise a leading `./` is prepended.
- <a name="main"></a>`main(argv=None)`: Commandline implementation.
- <a name="merge"></a>`merge(srcpath: str, dstpath: str, *, opname=None, hashname: str, move_mode: bool = False, symlink_mode=False, doit=False, fstags: Optional[cs.fstags.FSTags] = <function <lambda> at 0x102d040e0>, quiet: bool) -> bool`: Merge `srcpath` to `dstpath`, _preserving `dstpath` if present_.
  Return `True` if something was done, `False` if this was a no-op.
  Raise `FileExistsError` if `dstpath` exists with different content.

  This is aimed at situations such as merging downloads with
  an existing corpus, which might have hard links etc, so
  `dstpath` is the important half of the pair.

  NB: `symlink_mode` is currently disabled.

  If `dstpath` does not exist, move/link/symlink `srcpath` to `dstpath`.
  This also merges the fstags from `srcpath` to `dstpath`.

  If `dstpath` exists, checksum both files and raise
  `FileExistsError` if their contents differ.
  Otherwise the files have the same content: merge while preserving `dstpath`.
- <a name="paths_remap"></a>`paths_remap(srcpaths: Iterable[str], fspaths_by_hashcode: Mapping[cs.hashutils.BaseHashCode, List[str]], *, hashname: str)`: Generator yielding `(srcpath,fspaths)` 2-tuples.
- <a name="read_hashindex"></a>`read_hashindex(f, start=1, *, hashname: str) -> Iterable[Tuple[Optional[cs.hashutils.BaseHashCode], Optional[str]]]`: A generator which reads lines from the file `f`
  and yields `(hashcode,fspath)` 2-tuples for each line.
  If there are parse errors the `hashcode` or `fspath` may be `None`.
- <a name="read_remote_hashindex"></a>`read_remote_hashindex(rhost: str, rdirpath: str, *, hashname: str, quiet=True, ssh_exe: str, hashindex_exe: str, relative: bool = False) -> Iterable[Tuple[Optional[cs.hashutils.BaseHashCode], Optional[str]]]`: A generator which reads a hashindex of a remote directory.
  This runs: `hashindex ls -h hashname -r rdirpath` on the remote host.
  It yields `(hashcode,fspath)` 2-tuples.

  Parameters:
  * `rhost`: the remote host, or `user@host`
  * `rdirpath`: the remote directory path
  * `hashname`: the file content hash algorithm name
  * `ssh_exe`: optional `ssh` command
  * `hashindex_exe`: the remote `hashindex` executable
  * `relative`: optional flag, default `False`;
    if true pass `'-r'` to the remote `hashindex ls` command
  * `check`: whether to check that the remote command has a `0` return code,
    default `True`
- <a name="rearrange"></a>`rearrange(srcdirpath: str, rfspaths_by_hashcode, dstdirpath: str | None = None, *, hashname: str, move_mode: bool = False, once: bool = False, symlink_mode=False, doit: bool, fstags: cs.fstags.FSTags, runstate: Optional[cs.resources.RunState] = <function uses_runstate.<locals>.<lambda> at 0x102e78f40>, quiet: bool)`: Rearrange the files in `srcdirpath` according to the
  hashcode->[relpaths] mapping `rfspaths_by_hashcode`.

  Parameters:
  * `srcdirpath`: the directory whose files are to be rearranged
  * `rfspaths_by_hashcode`: a mapping of hashcode to relative
    pathname to which the original file is to be moved
  * `dstdirpath`: optional target directory for the rearranged files;
    defaults to `srcdirpath`, rearranging the files in place
  * `hashname`: the file content hash algorithm name
  * `move_mode`: move files instead of linking them
  * `symlink_mode`: symlink files instead of linking them
  * `doit`: if true do the link/move/symlink, otherwise just print
- <a name="remote_rearrange"></a>`remote_rearrange(rhost: str, srcdir: str, dstdir: str, fspaths_by_hashcode: Mapping[cs.hashutils.BaseHashCode, List[str]], *, doit: bool, hashindex_exe: str, hashname: str, move_mode: bool, once: bool, quiet: bool, symlink_mode: bool)`: Rearrange a remote directory `srcdir` on `rhost` into `dstdir`
  on `rhost` according to the hashcode mapping `fspaths_by_hashcode`.
- <a name="run_remote_hashindex"></a>`run_remote_hashindex(rhost: str, argv, *, hashindex_exe: str, **subp_options)`: Run a remote `hashindex` command.
  Return the `CompletedProcess` result or `None` if `doit` is false.
  Note that as with `cs.psutils.run`, the arguments are resolved
  via `cs.psutils.prep_argv`.

  Parameters:
  * `rhost`: the remote host, or `user@host`
  * `argv`: the command line arguments to be passed to the
    remote `hashindex` command
  * `check`: whether to check that the remote command has a `0` return code,
    default `True`

  Other keyword parameters are passed through to `cs.psutils.run`.
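
The decision logic documented for `merge` (create `dstpath` if missing, refuse if the contents differ, otherwise preserve `dstpath`) can be sketched with the standard library alone. This is an approximation, not the module's code: `merge_sketch` is a hypothetical name, SHA-256 stands in for the configured hash, fstags merging and symlink mode are omitted, and removing an identical `srcpath` in move mode is an assumption about what "merge while preserving `dstpath`" implies.

```python
import hashlib
import os


def _digest(fspath):
    """Return the SHA-256 hex digest of a file's contents."""
    with open(fspath, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def merge_sketch(srcpath, dstpath, *, move_mode=False):
    """Merge `srcpath` into `dstpath`, preserving `dstpath` if present.

    Return True if the filesystem was changed, False for a no-op.
    Raise FileExistsError if `dstpath` exists with different content.
    """
    if not os.path.exists(dstpath):
        os.makedirs(os.path.dirname(dstpath) or ".", exist_ok=True)
        if move_mode:
            os.replace(srcpath, dstpath)
        else:
            os.link(srcpath, dstpath)  # hard link
        return True
    if os.path.samefile(srcpath, dstpath):
        # Already the same file: nothing to do, and nothing to remove.
        return False
    if _digest(srcpath) != _digest(dstpath):
        raise FileExistsError(
            f"{dstpath!r} exists with different content from {srcpath!r}"
        )
    if move_mode:
        # Same content: keep dstpath, drop the duplicate source (assumed).
        os.remove(srcpath)
        return True
    return False
```

Checking `samefile` before comparing digests avoids both a needless read and the risk of removing the only copy when `srcpath` and `dstpath` name the same file.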

# Release Log



*Release 20250531*:
hashindex rsync: bugfix remote rearrange, rearranges dstdir into dstdir using the source index.

*Release 20250528*:
* New hashindex_map(dirpath) function exposing the code to make a hashcode->[fspath,...] mapping.
* New remote_rearrange(rhost,dstdir,fspaths_by_hashcode) function to rearrange a remote directory.
* HashIndexCommand: new cmd_rsync() to rearrange a target then rsync to it.
* HashIndexCommand.cmd_rearrange: honour "-" as the refdir to read the hash index from standard input.
* HashIndexCommand.cmd_rearrange: default to move mode, change the CLI options to have --ln instead of --mv.
* HashIndexCommand.cmd_rearrange: new -1 (once) option to only do a single file rename, handy for testing.
* Redo almost the entire merge() function for clearer logic.

*Release 20241207*:
Mostly CLI usage improvements.

*Release 20241007*:
Small internal changes.

*Release 20240709*:
* Require `blake3` and use it as the default hash algorithm.
* Some internal improvements.

*Release 20240623*:
hashindex: plumb hashname to file_checksum.

*Release 20240412*:
* file_checksum: skip any nonregular file, only use run_task when checksumming more than 1MiB.
* HashIndexCommand.cmd_rearrange: run the refdir index in relative mode.
* Small fixes.

*Release 20240317*:
* HashIndexCommand.cmd_ls: default to listing the current directory.
* HashIndexCommand: new -o output_format to allow outputting only hashcodes or fspaths.
* HashIndexCommand.cmd_comm: new -r (relative) option.

*Release 20240316*:
Fixed release upload artifacts.

*Release 20240305*:
* HashIndexCommand.cmd_ls: support rhost:rpath paths, honour interrupts in the remote mode.
* HashIndexCommand.cmd_rearrange: new optional dstdir command line argument, passed to rearrange.
* merge: symlink_mode: leave identical symlinks alone, just merge tags.
* rearrange: new optional dstdirpath parameter, default srcdirpath.

*Release 20240216*:
* HashIndexCommand.cmdlinkto,cmd_rearrange: run the link/mv stuff with sys.stdout in line buffered mode.
* Do not get hashcodes from symlinks.
* HashIndexCommand.cmd_ls: ignore None hashcodes, do not set xit=1.
* New run_remote_hashindex() and read_remote_hashindex() functions.
* dir_filepaths: skip dot files, the fstags .fstags file and nonregular files.

*Release 20240211.1*:
Better module docstring.

*Release 20240211*:
Initial PyPI release: "hashindex" command and utility functions for listing file hashcodes and rearranging trees based on a hash index.
