Metadata-Version: 2.1
Name: imgduptective
Version: 0.2.0
Summary: Find near-duplicate and exact-duplicate images using perceptual hashing
Author-Email: sacha <sachahony@gmail.com>, Sacha Hony <zazahohonini@gmail.com>
License: MIT
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Project-URL: Homepage, https://github.com/zazaho/imgduptective
Requires-Python: >=3.10
Requires-Dist: Pillow
Description-Content-Type: text/markdown

# Image Duplicates Detective (imgduptective)

Find near-duplicate and exact-duplicate images in your photo collections using perceptual hashing.

## How it works

imgduptective uses a gradient-based horizontal difference hash (dhash) to create a perceptual fingerprint of each image. Images that look similar have similar hashes, even if the files differ in format, resolution, or compression. A Hamming distance threshold (the maximum number of hash bits allowed to differ) controls how similar two images must be to count as duplicates.
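
The idea can be illustrated with a minimal pure-Python sketch. It is not imgduptective's actual code: the function names are hypothetical, and the input is a plain grid of grayscale values rather than a decoded image. A common dhash setup (assumed here, not confirmed for this tool) first shrinks the image to a 9x8 grayscale grid, which yields a 64-bit hash.

```python
def dhash_bits(rows):
    """Horizontal difference hash of a grayscale grid (hypothetical helper).

    `rows` is a list of equal-length rows of brightness values. Each bit
    records whether a pixel is brighter than its right-hand neighbour.
    """
    value = 0
    for row in rows:
        for left, right in zip(row, row[1:]):
            value = (value << 1) | (1 if left > right else 0)
    return value


def hamming_distance(a, b):
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


# Two tiny 2x3 "images"; the second differs in a single brightness value.
img_a = [[10, 20, 15], [30, 25, 40]]
img_b = [[10, 20, 25], [30, 25, 40]]

distance = hamming_distance(dhash_bits(img_a), dhash_bits(img_b))
is_duplicate = distance <= 5  # threshold as passed on the command line
```

Because only the sign of each neighbouring-pixel difference is kept, uniform brightness changes or recompression tend to flip few bits, which is why a small threshold still catches re-encoded copies.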

Results are cached in a local SQLite database (`~/.config/imgduptective/`), so subsequent runs are fast: only new or modified files are processed.

## Installation

```bash
pip install .
```

Or for development:

```bash
pip install -e .
```

Requires Python 3.10+ and Pillow.

## Usage

```bash
# Find near-duplicates with a Hamming distance threshold of 5
imgduptective 5

# Find exact duplicates only (identical file content)
imgduptective --exact

# Add files to the database without comparing
imgduptective --add

# Check what duplicates would be found if current directory were added
imgduptective --check 5

# Show per-directory statistics
imgduptective --stats 5

# Open the built-in viewer to inspect and delete duplicates
imgduptective --view 5
```

## Options

| Argument | Description |
|------|-------------|
| `threshold` | Positional argument: maximum Hamming distance to consider a match (0 = identical perceptual hash) |
| `--view` | Open the tkinter viewer to browse and manage duplicate groups |
| `--stats` | Show per-directory duplicate statistics |
| `--check` | Preview what duplicates would be found without modifying the database |
| `--add` | Scan and hash files into the database without comparing |
| `--photos` | Only process common photo formats (jpg, png, heic, webp, tiff, bmp, gif) |
| `--exact` | Find exact file matches (same content) instead of perceptually similar |
| `--no-scan` | Skip file scanning/hashing entirely, use the database cache only |
| `--full-hash` | Use full-file SHA-1 instead of the default fast 64KB partial hash |
| `--project NAME` | Use a named project database (e.g., work, personal, holidays) |
| `--list-projects` | List available project databases with file counts |

## Projects

Organize separate photo collections into named projects. Each project has its own database:

```bash
# Scan work photos
cd ~/Photos/Work
imgduptective --project work --add

# Scan holiday photos
cd ~/Photos/Holidays
imgduptective --project holidays --add

# Find duplicates within holidays
imgduptective --project holidays 5

# List all projects
imgduptective --list-projects
```

Without `--project`, the default database is used.
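
As a rough sketch of how a project name could map to a database file, the following follows the file naming listed in the Database section. The helper name `database_path` and the constant `CONFIG_DIR` are hypothetical and only serve the illustration.

```python
from pathlib import Path

# Directory named in this README; the constant itself is an assumption.
CONFIG_DIR = Path.home() / ".config" / "imgduptective"


def database_path(project=None):
    """Return the database file for a project, or the default one."""
    if project is None:
        name = "imgduptective.db"
    else:
        name = f"imgduptective-{project}.db"
    return CONFIG_DIR / name


default_db = database_path()
holidays_db = database_path("holidays")
```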

## Performance

The tool uses several strategies to minimize scan time:

- **Partial hashing (default):** Only the first 64KB of each file is hashed (plus file size) for change detection. This is sufficient to distinguish different images while being 10-100x faster than full-file hashing on large files.
- **Stat-based caching:** On repeat scans, files whose size and modification time haven't changed skip hashing entirely (a single `stat()` call per file).
- **`--no-scan`:** For re-running comparisons with different thresholds without any file I/O.
- **`--full-hash`:** Forces full SHA-1 of entire file contents when exact integrity verification is needed.
- **Multiprocessing:** File hashing, image hash computation, and pair comparison all run in parallel.
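
The first two strategies can be sketched as follows. This is an illustration under assumptions, not the tool's implementation: the function names are hypothetical, and using SHA-1 for the partial hash is a guess (the README only names SHA-1 for `--full-hash`).

```python
import hashlib
import os
import tempfile

PARTIAL_BYTES = 64 * 1024  # assumed to match the 64KB mentioned above


def file_hash(path, full=False):
    """SHA-1 of the first 64 KB plus the file size, or of the whole file."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        if full:
            for chunk in iter(lambda: handle.read(1024 * 1024), b""):
                digest.update(chunk)
        else:
            digest.update(handle.read(PARTIAL_BYTES))
            digest.update(str(os.path.getsize(path)).encode("ascii"))
    return digest.hexdigest()


def needs_rehash(path, cached_size, cached_mtime):
    """True when size or modification time differs from the cached values."""
    info = os.stat(path)
    return info.st_size != cached_size or info.st_mtime != cached_mtime


# Two files of equal size that differ only after the first 64 KB.
with tempfile.TemporaryDirectory() as folder:
    first = os.path.join(folder, "a.bin")
    second = os.path.join(folder, "b.bin")
    shared_head = b"\x00" * PARTIAL_BYTES
    with open(first, "wb") as handle:
        handle.write(shared_head + b"A")
    with open(second, "wb") as handle:
        handle.write(shared_head + b"B")

    same_partial = file_hash(first) == file_hash(second)
    same_full = file_hash(first, full=True) == file_hash(second, full=True)

    info = os.stat(first)
    unchanged = not needs_rehash(first, info.st_size, info.st_mtime)
```

The demo also shows the trade-off: two files of equal size that differ only beyond the first 64 KB get the same partial hash, which is the case `--full-hash` is meant to cover.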

## Viewer

The built-in tkinter viewer (`--view`) displays duplicate groups side by side:

- **←/→** or **n/p/space**: Navigate between groups
- **Click**: Select/deselect images for deletion
- **d** or **Delete**: Compress selected files with gzip and remove originals
- **q** or **Escape**: Quit
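
The delete action could look roughly like the sketch below. The helper name `gzip_and_remove` and the choice to write a `.gz` file next to the original are assumptions; the viewer's actual behaviour may differ.

```python
import gzip
import os
import shutil
import tempfile


def gzip_and_remove(path):
    """Write path + ".gz", then delete the original once the copy is done."""
    target = path + ".gz"
    with open(path, "rb") as source, gzip.open(target, "wb") as packed:
        shutil.copyfileobj(source, packed)
    os.remove(path)
    return target


with tempfile.TemporaryDirectory() as folder:
    photo = os.path.join(folder, "dup.jpg")
    payload = b"placeholder bytes, not a real JPEG"
    with open(photo, "wb") as handle:
        handle.write(payload)

    archived = gzip_and_remove(photo)
    original_gone = not os.path.exists(photo)

    # The compressed copy can be read back if a deletion was a mistake.
    with gzip.open(archived, "rb") as handle:
        restored = handle.read()
```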

## Database

Hashes are stored in `~/.config/imgduptective/`:
- `imgduptective.db` — default project
- `imgduptective-{name}.db` — named projects

The database has two tables:
- **HashValueTable**: Content-addressed cache mapping file hashes to image perceptual hashes
- **FileTable**: Maps file paths to their file hash, image hash, size, and modification time

Files that no longer exist are automatically pruned from the database on each scan.
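
A minimal sketch of such a layout and of the pruning step is shown below, using an in-memory database. The table names come from this README, but every column name and the `prune_missing` helper are assumptions made for illustration.

```python
import os
import sqlite3

# Column names are hypothetical; only the table names appear in this README.
SCHEMA = """
CREATE TABLE IF NOT EXISTS HashValueTable (
    file_hash  TEXT PRIMARY KEY,
    image_hash TEXT
);
CREATE TABLE IF NOT EXISTS FileTable (
    path       TEXT PRIMARY KEY,
    file_hash  TEXT,
    image_hash TEXT,
    size       INTEGER,
    mtime      REAL
);
"""


def prune_missing(conn):
    """Delete FileTable rows whose path no longer exists; return the count."""
    paths = [row[0] for row in conn.execute("SELECT path FROM FileTable")]
    gone = [(path,) for path in paths if not os.path.exists(path)]
    conn.executemany("DELETE FROM FileTable WHERE path = ?", gone)
    conn.commit()
    return len(gone)


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
insert = "INSERT INTO FileTable VALUES (?, ?, ?, ?, ?)"
conn.execute(insert, ("/no/such/file.jpg", "abc", "0f0f", 123, 0.0))
conn.execute(insert, (os.getcwd(), "def", "f0f0", 0, 0.0))  # existing path

removed = prune_missing(conn)
remaining = conn.execute("SELECT COUNT(*) FROM FileTable").fetchone()[0]
```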

## License

MIT
