Metadata-Version: 2.4
Name: duperemover
Version: 0.1.0.1
Summary: A Python utility for efficiently deduplicating files in directories
Project-URL: Homepage, https://github.com/daedalus/duperemover
Project-URL: Repository, https://github.com/daedalus/duperemover
Project-URL: Issues, https://github.com/daedalus/duperemover/issues
Author-email: Darío Clavijo <clavijodario@gmail.com>
License: MIT
License-File: LICENSE
Requires-Python: >=3.11
Requires-Dist: blake3
Requires-Dist: mmappickle
Requires-Dist: pybloom-live
Requires-Dist: tqdm
Requires-Dist: xxhash
Provides-Extra: all
Requires-Dist: hatch; extra == 'all'
Requires-Dist: hypothesis; extra == 'all'
Requires-Dist: mypy; extra == 'all'
Requires-Dist: pytest; extra == 'all'
Requires-Dist: pytest-asyncio; extra == 'all'
Requires-Dist: pytest-cov; extra == 'all'
Requires-Dist: pytest-mock; extra == 'all'
Requires-Dist: ruff; extra == 'all'
Provides-Extra: dev
Requires-Dist: hatch; extra == 'dev'
Requires-Dist: mypy; extra == 'dev'
Requires-Dist: ruff; extra == 'dev'
Provides-Extra: lint
Requires-Dist: mypy; extra == 'lint'
Requires-Dist: ruff; extra == 'lint'
Provides-Extra: test
Requires-Dist: hypothesis; extra == 'test'
Requires-Dist: pytest; extra == 'test'
Requires-Dist: pytest-asyncio; extra == 'test'
Requires-Dist: pytest-cov; extra == 'test'
Requires-Dist: pytest-mock; extra == 'test'
Description-Content-Type: text/markdown

# Duperemover

> A Python utility for efficiently deduplicating files in directories.

[![PyPI](https://img.shields.io/pypi/v/duperemover.svg)](https://pypi.org/project/duperemover/)
[![Python](https://img.shields.io/pypi/pyversions/duperemover.svg)](https://pypi.org/project/duperemover/)
[![Coverage](https://codecov.io/gh/daedalus/duperemover/branch/main/graph/badge.svg)](https://codecov.io/gh/daedalus/duperemover)
[![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff)

## Install

```bash
pip install duperemover
```

## Usage

```python
from duperemover import Deduplicator

dedup = Deduplicator(
    directory="/path/to/directory",
    hash_algorithm="xxhash",
    replace_strategy="hardlink",
    progress=True,
)
dedup.deduplicate()
dedup.print_stats()
```

## CLI

```bash
duperemover --help
```

### Command Syntax

```
duperemover <directory> [options]

Arguments:
  <directory>                    Directory to scan for duplicates.

Options:
  --hash-file <file>             File to store hashes (default: .hashes.db).
  --buffer-size <size>           Buffer size in bytes for hashing (default: 65536, i.e. 64 KiB).
  --hash-algorithm <alg>         Hashing algorithm (choices: "xxhash", "blake3", "sha256"; default: "xxhash" if available).
  --replace-strategy <strategy>  Strategy for handling duplicates (choices: "hardlink", "delete", "rename"; default: "hardlink").
  --max-threads <num>            Number of threads to use for processing (default: 4).
  --sync-interval <num>          Sync interval for hashes to disk (default: 100).
  --progress                     Show a progress bar while processing files.
  --dry-run                      Simulate the deduplication process without making any changes.
  --use-bloom-filter             Use a Bloom filter to speed up duplicate checking.
  --exclude <patterns>           Exclude files matching these patterns.
```
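
The pattern syntax accepted by `--exclude` is not specified above. As an illustration only, the sketch below assumes shell-style globs matched against either the file name or the full path. The helper name `matches_any` is hypothetical and is not part of the package.

```python
import os
from fnmatch import fnmatchcase


def matches_any(path: str, patterns: list[str]) -> bool:
    """Return True if the file name or the full path matches any glob pattern.

    Assumption: patterns are shell-style globs (``*``, ``?``, ``[...]``),
    compared case-sensitively.
    """
    name = os.path.basename(path)
    return any(fnmatchcase(name, pat) or fnmatchcase(path, pat) for pat in patterns)
```

Under this assumption, a pattern such as `*.tmp` excludes by file name, while `*/.git/*` excludes everything below a `.git` directory.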

### Examples

```bash
# Basic deduplication (using default hashing algorithm)
duperemover /path/to/directory

# Using SHA256 as the hashing algorithm
duperemover /path/to/directory --hash-algorithm sha256

# Simulate deduplication (dry run)
duperemover /path/to/directory --dry-run

# Create hard links for duplicates, use Bloom filter, and show progress
duperemover /path/to/directory --replace-strategy hardlink --use-bloom-filter --progress
```

## Features

- **Hash Algorithms**: Choose between `xxhash`, `blake3`, and `sha256` for calculating file hashes.
- **Duplicate Handling Strategies**: 
  - `hardlink`: Replace duplicates with hard links.
  - `delete`: Delete duplicate files.
  - `rename`: Rename duplicate files by appending `.duplicate` to their names.
- **Multi-threading**: Process files in parallel to speed up deduplication.
- **Bloom Filter**: Optionally, enable the Bloom filter to speed up duplicate checks by avoiding re-hashing files.
- **Exclusion Patterns**: Exclude files matching specific patterns from the deduplication process.
- **Progress Bar**: Optionally display a progress bar for better visibility during the deduplication process.
- **Dry Run**: Run the deduplication process without making any actual changes (useful for testing).
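
The hashing and duplicate-detection features above can be illustrated with a minimal standard-library sketch. It is not the package's implementation: `hash_file` and `find_duplicates` are hypothetical names, and `hashlib.sha256` stands in for whichever algorithm is selected. The 65536-byte buffer mirrors the documented default.

```python
import hashlib
import os
from collections import defaultdict


def hash_file(path: str, buffer_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Reading in chunks keeps memory use bounded for large files.
        while chunk := fh.read(buffer_size):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(directory: str) -> list[list[str]]:
    """Group regular files under `directory` by content hash.

    Only groups with two or more paths (actual duplicates) are returned.
    """
    by_hash: dict[str, list[str]] = defaultdict(list)
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.isfile(path) and not os.path.islink(path):
                by_hash[hash_file(path)].append(path)
    return [sorted(paths) for paths in by_hash.values() if len(paths) > 1]
```

A dry run in this sketch would simply report the groups returned by `find_duplicates` without touching any file.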

## API

### Deduplicator

```python
from duperemover import Deduplicator
```

#### Constructor

```python
Deduplicator(
    directory: str,
    hash_file: str = ".hashes.db",
    buffer_size: int = 65536,
    hash_algorithm: str = "xxhash",
    replace_strategy: str = "hardlink",
    max_threads: int = 4,
    sync_interval: int = 100,
    progress: bool = False,
    dry_run: bool = False,
    exclude_patterns: list[str] | None = None,
    use_bloom_filter: bool = False,
)
```

#### Methods

- `deduplicate()`: Scan the directory for duplicates and process each file.
- `print_stats()`: Print deduplication statistics.
- `count_files(directory)`: Count the number of files in a directory.
- `get_file_hash(file_path)`: Calculate and return the hash of a file.
- `are_same_file(file1, file2)`: Check if two files are the same based on their inodes.
- `create_hard_link(source, target)`: Create a hard link from the source file to the target file.
- `delete_duplicate(file_path)`: Delete a duplicate file.
- `rename_duplicate(file_path)`: Rename a duplicate file by appending `.duplicate`.
- `is_excluded(file_path)`: Check if a file matches any exclusion pattern.
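
As a rough illustration of what an inode-based identity check and the `hardlink` strategy involve, here is a standard-library sketch. The names `same_inode` and `replace_with_hard_link` are hypothetical and do not come from the package; the real methods may differ, for example in error handling. The sketch also assumes both paths live on the same filesystem, since hard links cannot cross filesystems.

```python
import os


def same_inode(path_a: str, path_b: str) -> bool:
    """Return True if both paths refer to the same file (same device and inode)."""
    a, b = os.stat(path_a), os.stat(path_b)
    return (a.st_dev, a.st_ino) == (b.st_dev, b.st_ino)


def replace_with_hard_link(original: str, duplicate: str) -> bool:
    """Replace `duplicate` with a hard link to `original`.

    Returns False and changes nothing if the two paths are already
    the same file; returns True after a successful replacement.
    """
    if same_inode(original, duplicate):
        return False
    temp = duplicate + ".tmp-link"  # hypothetical temporary name
    os.link(original, temp)         # create the new link first
    os.replace(temp, duplicate)     # then swap it into place
    return True
```

Creating the link under a temporary name before swapping means the duplicate's path never disappears if linking fails partway through.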

## Development

```bash
git clone https://github.com/daedalus/duperemover.git
cd duperemover
pip install -e ".[all]"  # includes pytest, ruff, and mypy

# run tests
pytest

# format
ruff format src/ tests/

# lint
ruff check src/ tests/

# type check
mypy src/
```
