Metadata-Version: 2.4
Name: duphunt
Version: 0.1.0
Summary: Find duplicate files by content (SHA-256) — zero install, cross-platform. pip install duphunt — no brew/apt. Zero dependencies.
Author: yyfjj
License: MIT
Project-URL: Homepage, https://github.com/jjdoor/duphunt-py
Project-URL: Repository, https://github.com/jjdoor/duphunt-py
Project-URL: Issues, https://github.com/jjdoor/duphunt-py/issues
Keywords: duplicate,duplicates,dedupe,finder,hash,sha256,cli,filesystem,cleanup,disk
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: System Administrators
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Utilities
Classifier: Topic :: System :: Filesystems
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Dynamic: license-file

# duphunt

**Find duplicate files by content — anywhere, with nothing to install.** The
great duplicate finders (`fdupes`, `jdupes`, `rdfind`, `fclones`) are native
binaries you have to `brew`/`apt`/`cargo install` first — which you can't always
do on a locked-down box, a colleague's laptop, a CI runner, or a container.
`duphunt` runs the moment you have Python or Node: `pip install duphunt` or
`npx duphunt .`. **Zero dependencies, no network.**

```bash
pip install duphunt

$ duphunt ~/Downloads

2 duplicate group(s), 5 files, 8.1 MB reclaimable

  4.1 MB × 2   4.1 MB reclaimable
    /Users/me/Downloads/invoice.pdf
    /Users/me/Downloads/invoice (1).pdf

  2.0 MB × 3   4.0 MB reclaimable
    /Users/me/Downloads/clip.mp4
    /Users/me/Downloads/clip-copy.mp4
    /Users/me/Downloads/old/clip.mp4
```

Groups are sorted biggest-waste-first, so the files worth deleting are at the top.

> This is the Python build. A result-equivalent Node build is on npm:
> `npx duphunt` (<https://github.com/jjdoor/duphunt>).

## How it works

1. **Group by size.** Two files of different sizes can't be identical, so files
   with a unique size are never even read.
2. **Hash the collisions.** Within each size group, each file is SHA-256 hashed
   (streamed in 64 KB chunks, so multi-GB files don't blow up memory).
3. **Report identical content.** Files with the same hash are true byte-for-byte
   duplicates, grouped and ranked by reclaimable space.

It reports — it never deletes. You decide what to remove.

## Usage

```bash
duphunt                      # scan the current directory
duphunt ~/Downloads ~/Desktop   # scan several roots at once
duphunt a.jpg b.jpg c.jpg    # or just compare specific files
duphunt . --json             # machine-readable
duphunt . --min-size 1048576 # ignore files under 1 MB
duphunt . --exit-code        # exit 1 if any duplicates exist (CI gate)
```

## Options

| Flag | Effect |
|------|--------|
| `--json` | Emit `{ groups, summary }` as JSON (raw byte sizes, full paths) |
| `--quiet` | Print only the one-line summary |
| `--min-size <n>` | Ignore files smaller than `n` bytes (default `1` — skips empty files) |
| `--follow` | Follow symlinks (default: skip them, to avoid loops and double-counting) |
| `--exit-code` | Exit `1` when duplicates are found (for CI gates) |
| `-v`, `--version` | Print version |
| `-h`, `--help` | Show help |

## Notes

- **Empty files are skipped by default** (they all hash alike and are rarely what
  you mean); pass `--min-size 0` to include them.
- **Symlinks are skipped** unless `--follow`, so a symlinked tree won't be
  double-counted or loop forever.
- **Each physical file is counted once.** Repeated or overlapping roots and
  symlink aliases (even under `--follow`) are de-duplicated by real path, so they
  never inflate the results — while genuine hard links still surface.
- **Same tool, two builds.** The Python and Node builds hash with SHA-256 and
  produce identical results — use whichever your environment already has.

## `--json` shape

```json
{
  "groups": [
    { "hash": "9f86d0…", "size": 4300000, "count": 2, "wasted": 4300000,
      "paths": ["/a/invoice.pdf", "/b/invoice (1).pdf"] }
  ],
  "summary": { "groups": 1, "files": 2, "wasted": 4300000 }
}
```

## Exit codes

| Code | Meaning |
|------|---------|
| `0` | success (default — even when duplicates are found) |
| `1` | duplicates found **and** `--exit-code` was passed |
| `2` | error (bad option, missing path) |

By default `duphunt` is a viewer and exits `0`; add `--exit-code` to gate a
pipeline on it.

## License

MIT
