Metadata-Version: 2.4
Name: tdupes
Version: 0.3.0
Summary: Find, review, and safely trash duplicate and near-duplicate files
License: MIT License
        
        Copyright (c) 2026 tdupes contributors
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
        
Project-URL: Homepage, https://github.com/samsy/tdupes
Keywords: duplicates,fdupes,deduplication,files,near-duplicates,locate,plocate,trash
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: POSIX :: Linux
Classifier: Programming Language :: Python :: 3
Classifier: Environment :: Console
Classifier: Topic :: Utilities
Classifier: Topic :: System :: Filesystems
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: PyYAML>=6.0
Provides-Extra: dev
Requires-Dist: pytest>=7; extra == "dev"
Dynamic: license-file

# tdupes

Smartly find, review, and safely trash exact and near-duplicate files on Linux.

`tdupes` detects **exact duplicates** (byte-identical, via `fdupes`) and
optionally **near-duplicates** (same basename, scored by content similarity, via
`plocate`/`locate`). Results are written to a TSV that you review and edit with
your favourite spreadsheet tool before any files are touched. Confirmed deletions
are sent to the trash with `gio trash` and remain recoverable until the bin is
emptied.

Key features:

* Accepts any mix of individual files and directories as arguments
* Near-duplicate detection with `-L`: for files given as arguments it
  finds same-basename files across the filesystem, with a similarity
  score (text %, binary same/different size)
* Preferred-directory protection — files inside configured dirs are never
  proposed to be deleted by default
* Preferred dirs and exclusion patterns can be set in the config file
  or at runtime via the `-p`/`-x` flags
* Writes a proposed action plan to a TSV table that you can edit
  interactively in your favourite spreadsheet tool (opened with `xdg-open`)
* Automated batch mode is also available (the TSV then serves as a log)

## Install

```bash
pip install tdupes
```

System dependencies (Ubuntu/Debian):

```bash
sudo apt install fdupes plocate gvfs-bin xdg-utils
```

## Usage

```
tdupes [OPTIONS] PATH [PATH ...]

Positional arguments:
  PATH               Files or directories to scan for duplicates

Options:
  -l, --locate       Expand file arguments via locatedb (exact basename matches)
  -L, --locate-all   Like -l, but also tabulate near-duplicates (same basename,
                     not byte-identical) with real similarity codes
  -t FILE, --tsv FILE
                     Path for the output TSV (default: temp file)
  -p DIR, --prefer DIR
                     Mark DIR as preferred at runtime (files inside are never
                     proposed for deletion). Additive with config. Repeatable.
  -x PATTERN, --exclude PATTERN
                     Shell glob to exclude files by full path. Additive with
                     config. Repeatable: -x '*.tmp' -x '/mnt/*'
  -b, --batch        Batch mode: no prompts; execute DELETE actions immediately
  -v, --verbose      Increase output verbosity
  -q, --quiet        Reduce output verbosity
  -c, --config FILE  Config file path (default: $XDG_CONFIG_HOME/tdupes.yml)
  -V, --version      Show version and exit
  -h, --help         Show this help message and exit
```
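
The `-x` option matches shell globs against the full path. A minimal sketch of that kind of matching, using Python's standard `fnmatch` module, is shown here. The helper name `is_excluded` is hypothetical, and whether tdupes matches case-sensitively is an assumption.

```python
from fnmatch import fnmatchcase


def is_excluded(path, patterns):
    """Return True if the full path matches any shell-glob pattern."""
    # fnmatchcase matches the whole string, and '*' also matches '/',
    # so '/mnt/*' covers everything below /mnt.
    return any(fnmatchcase(path, pattern) for pattern in patterns)


patterns = ["*.tmp", "/mnt/*"]
print(is_excluded("/home/user/cache/build.tmp", patterns))  # True
print(is_excluded("/mnt/backup/photo.jpg", patterns))       # True
print(is_excluded("/home/user/photo.jpg", patterns))        # False
```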

### Examples

```bash
# Scan two directories interactively
tdupes ~/Pictures ~/Downloads

# Use locate to also find exact-duplicate copies of a specific file
tdupes --locate ~/Downloads/photo.jpg ~/Pictures

# Use locate and also include near-duplicates (same basename, different content)
tdupes -L ~/Downloads/photo.jpg ~/Pictures

# Batch mode (good for scripting / cron)
tdupes --batch ~/Documents

# Write the TSV to a specific path
tdupes -t /tmp/dupes.tsv ~/Music ~/Videos
```

## Config

On first run `tdupes` creates `$XDG_CONFIG_HOME/tdupes.yml` (defaults to
`~/.config/tdupes.yml`):

```yaml
preferred_directories: []   # files here are never proposed to be deleted
verbosity: 1                # 0=quiet, 1=normal, 2=verbose
tsv_output: null            # null = temp file each run
exclusion_patterns: []      # shell glob patterns to skip
batch_mode: false
```
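
As an illustration of how the default location and the "additive with config" behaviour of `-p`/`-x` fit together, here is a small sketch. The function names `default_config_path` and `effective_settings` are hypothetical, and the merge order (config entries first, then command-line entries) is an assumption rather than documented behaviour.

```python
import os
from pathlib import Path

# Defaults mirroring the generated config file shown above.
DEFAULTS = {
    "preferred_directories": [],
    "verbosity": 1,
    "tsv_output": None,
    "exclusion_patterns": [],
    "batch_mode": False,
}


def default_config_path(environ=None):
    """Resolve $XDG_CONFIG_HOME/tdupes.yml, falling back to ~/.config."""
    env = os.environ if environ is None else environ
    base = env.get("XDG_CONFIG_HOME") or str(Path.home() / ".config")
    return Path(base) / "tdupes.yml"


def effective_settings(file_values, prefer=(), exclude=()):
    """Overlay config-file values on the defaults, then append CLI lists."""
    settings = {**DEFAULTS, **(file_values or {})}
    # -p and -x add to the configured lists instead of replacing them.
    settings["preferred_directories"] = (
        list(settings["preferred_directories"]) + list(prefer)
    )
    settings["exclusion_patterns"] = (
        list(settings["exclusion_patterns"]) + list(exclude)
    )
    return settings


print(default_config_path({"XDG_CONFIG_HOME": "/tmp/cfg"}))  # /tmp/cfg/tdupes.yml
```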

**`preferred_directories`** — any file whose path begins with one of these
directories will be marked `keep` regardless of group ordering.
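
One way to express "path begins with a preferred directory" is sketched below. The helper `in_preferred_dir` is hypothetical. It compares whole path components, which is an assumption: tdupes may use a plain string prefix instead, and the two differ for sibling directories such as `Pictures` and `Pictures-old`.

```python
from pathlib import PurePosixPath


def in_preferred_dir(path, preferred_dirs):
    """Return True if path lies inside any of the preferred directories."""
    parents = PurePosixPath(path).parents
    # Comparing against the parent chain avoids treating
    # /home/user/Pictures-old as being inside /home/user/Pictures.
    return any(PurePosixPath(d) in parents for d in preferred_dirs)


preferred = ["/home/user/Pictures"]
print(in_preferred_dir("/home/user/Pictures/2024/photo.jpg", preferred))  # True
print(in_preferred_dir("/home/user/Pictures-old/photo.jpg", preferred))   # False
```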

## TSV format

```
Action  Similarity  Size_KB  Modified              Path                              Comment
keep    100         2048.0   2024-11-01T14:22:10   /home/user/Pictures/photo.jpg     in preferred folder
DELETE  100         2048.0   2024-09-15T08:01:55   /home/user/Downloads/photo.jpg
```

| Column     | Values                                                                        |
|------------|-------------------------------------------------------------------------------|
| Action     | `keep` or `DELETE` — edit freely before confirming                            |
| Similarity | `100` exact · `XXX` binary same size · `NNN` text % match · `!!!` binary diff size |
| Size_KB    | File size in kilobytes                                                        |
| Modified   | Last-modified timestamp (ISO 8601)                                            |
| Path       | Absolute file path                                                            |
| Comment    | Reason for the proposed action (see below) — informational, ignored on re-read |
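
To make the Similarity codes concrete, the following sketch maps two byte strings to one of the four values. It is not tdupes' scoring code. The NUL-byte test for binary content, the use of `difflib.SequenceMatcher` for the text percentage, and the cap at 99 for non-identical text are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def looks_binary(data):
    """Assumption: a NUL byte marks the content as binary."""
    return b"\x00" in data


def similarity_code(a, b):
    """Map two byte strings to a Similarity-column value."""
    if a == b:
        return "100"  # byte-identical
    if looks_binary(a) or looks_binary(b):
        # Binary content is only compared by size.
        return "XXX" if len(a) == len(b) else "!!!"
    ratio = SequenceMatcher(
        None, a.decode("utf-8", "replace"), b.decode("utf-8", "replace")
    ).ratio()
    # Assumption: cap at 99 so that 100 stays reserved for exact matches.
    return str(min(99, round(ratio * 100)))


print(similarity_code(b"same", b"same"))        # 100
print(similarity_code(b"\x00ab", b"\x00cd"))    # XXX
print(similarity_code(b"\x00ab", b"\x00cde"))   # !!!
```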

Groups are separated by blank lines. The first entry in each group is the file
given as a CLI argument if the group contains one, and otherwise the newest copy.

Near-duplicate groups (found with `-L`) are written in a separate section after
the exact-duplicate groups, preceded by a `#` comment line.
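
If you want to post-process the TSV yourself, the layout described above can be read with a few lines of Python. This is a sketch under stated assumptions: the real file uses tab characters (the sample above is aligned for display), the header is on the first line, and `#` lines are section comments. The function name `read_groups` is hypothetical.

```python
import csv


def parse_row(line):
    """Parse one tab-separated line, honouring spreadsheet-style quoting."""
    return next(csv.reader([line], delimiter="\t"))


def read_groups(text):
    """Split TSV text into groups of row dicts, separated by blank lines."""
    lines = text.splitlines()
    if not lines:
        return []
    header = parse_row(lines[0])
    groups, current = [], []
    for line in lines[1:]:
        # A blank line or a '#' section comment closes the current group.
        if not line.strip() or line.startswith("#"):
            if current:
                groups.append(current)
                current = []
            continue
        fields = parse_row(line)
        fields += [""] * (len(header) - len(fields))  # pad a missing Comment
        current.append(dict(zip(header, fields)))
    if current:
        groups.append(current)
    return groups


sample = (
    "Action\tSimilarity\tSize_KB\tModified\tPath\tComment\n"
    "keep\t100\t2048.0\t2024-11-01T14:22:10\t/home/user/Pictures/photo.jpg\tin preferred folder\n"
    "DELETE\t100\t2048.0\t2024-09-15T08:01:55\t/home/user/Downloads/photo.jpg\n"
)
groups = read_groups(sample)
to_trash = [row["Path"] for group in groups for row in group if row["Action"] == "DELETE"]
print(to_trash)  # ['/home/user/Downloads/photo.jpg']
```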

### Default Action logic

**Exact-duplicate groups** (byte-identical per fdupes):

| Comment tag           | Rule                                                         |
|-----------------------|--------------------------------------------------------------|
| `in preferred folder` | File is inside a `preferred_directories` path → **keep**    |
| `last in group`       | Last file in the group (tiebreaker) → **keep**              |
| *(no tag)*            | All other copies → **DELETE**                               |

> **CLI argument files are listed first** in each group so they are *never* the
> last-in-group tiebreaker and therefore receive **DELETE** by default
> (unless they also fall under a preferred folder rule).
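
The exact-duplicate rules can be sketched as a small function. It assumes the paths arrive in TSV order and, going by the sample rows in the TSV format section, that the `last in group` tiebreaker only applies when no file in the group is in a preferred folder. The name `exact_group_actions` and the simple string-prefix test are illustrative, not taken from tdupes.

```python
def exact_group_actions(paths, preferred_dirs):
    """Return one (action, comment) pair per path of an exact-duplicate group."""

    def preferred(path):
        return any(path.startswith(d.rstrip("/") + "/") for d in preferred_dirs)

    flags = [preferred(p) for p in paths]
    any_preferred = any(flags)
    results = []
    for index, is_preferred in enumerate(flags):
        if is_preferred:
            results.append(("keep", "in preferred folder"))
        elif not any_preferred and index == len(paths) - 1:
            # Tiebreaker: keep the last copy when nothing else is protected.
            results.append(("keep", "last in group"))
        else:
            results.append(("DELETE", ""))
    return results


print(exact_group_actions(
    ["/home/user/Pictures/photo.jpg", "/home/user/Downloads/photo.jpg"],
    ["/home/user/Pictures"],
))
# [('keep', 'in preferred folder'), ('DELETE', '')]
```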

**Near-duplicate groups** (`-L`, same basename, not byte-identical):

| Comment tag                   | Rule                                                                  |
|-------------------------------|-----------------------------------------------------------------------|
| `in preferred folder`         | File is inside a `preferred_directories` path → **keep**             |
| `largest in basename group`   | Overall largest file in the group, *only if* no preferred file is larger → **keep** |
| `newest in basename group`    | Overall newest file in the group, *only if* no preferred file is newer → **keep** |
| *(no tag)*                    | Everything else → **DELETE**                                          |

> **CLI argument files are listed first** and may receive **DELETE** if they are
> neither the largest nor the newest (and not in a preferred folder).
>
> If a preferred-folder file is already the overall largest (or newest), no extra
> non-preferred copy is kept for that reason — the preferred file already covers it.

Multiple tags are comma-separated (e.g. `largest in basename group, newest in basename group`).
The Comment column is informational only: tdupes ignores it when it re-reads the TSV after you edit it.
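
The near-duplicate rules can be sketched in the same spirit. The `Candidate` type and `near_group_actions` are hypothetical names. Two details are assumptions: ties go to the first file in the group, and a preferred file carries only the `in preferred folder` tag even when it is also the largest or newest.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    path: str
    size: int        # bytes
    mtime: float     # seconds since the epoch
    preferred: bool  # inside a preferred directory


def near_group_actions(group):
    """Return one (action, comment) pair per file of a near-duplicate group."""
    largest = max(group, key=lambda c: c.size)
    newest = max(group, key=lambda c: c.mtime)
    results = []
    for cand in group:
        tags = []
        if cand.preferred:
            tags.append("in preferred folder")
        else:
            # A non-preferred file is kept only if it is the overall
            # largest or newest, i.e. no preferred file beats it.
            if cand is largest:
                tags.append("largest in basename group")
            if cand is newest:
                tags.append("newest in basename group")
        results.append(("keep" if tags else "DELETE", ", ".join(tags)))
    return results


group = [
    Candidate("/home/user/Downloads/notes.txt", 10, 100.0, False),
    Candidate("/home/user/Documents/notes.txt", 30, 50.0, True),
    Candidate("/tmp/notes.txt", 20, 200.0, False),
]
for cand, (action, comment) in zip(group, near_group_actions(group)):
    print(action, cand.path, comment)
```

In this example the preferred file is already the largest, so only the newest non-preferred copy is kept in addition to it.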

## Interactive flow

1. `tdupes` scans paths and prints the duplicate table.
2. The TSV is opened with `xdg-open` for manual review.
3. You edit the `Action` cells (change `DELETE` to `keep` or vice versa), save the file, and return to the terminal.
4. `tdupes` re-reads the TSV and asks for confirmation.
5. On confirmation, all `DELETE` files are sent to the trash via `gio trash`.
6. A summary shows how many files were trashed and how much space was freed.

Files trashed with `gio trash` remain recoverable from the system trash until
the bin is emptied.
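
Steps 5 and 6 can be approximated with the standard `subprocess` module. This is a sketch, not tdupes' own code: `trash_files` is a hypothetical helper, and the injectable `run` parameter exists only so the logic can be exercised without actually trashing anything.

```python
import os
import subprocess


def trash_files(paths, run=subprocess.run):
    """Send each path to the trash; return (files_trashed, bytes_freed)."""
    count = freed = 0
    for path in paths:
        try:
            size = os.path.getsize(path)
        except OSError:
            continue  # file vanished since the scan; nothing to trash
        # `gio trash PATH` moves the file to the desktop trash.
        result = run(["gio", "trash", path])
        if result.returncode == 0:
            count += 1
            freed += size
    return count, freed
```

A caller would pass the paths whose `Action` is still `DELETE` after review and print the returned totals as the summary.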

## License

MIT
