filoma dedup tutorial¶

This notebook demonstrates the deduplication utilities added to filoma:

  • standalone helpers in filoma.dedup
  • integration via FileProfiler and ImageProfiler
  • a convenience method on filoma.DataFrame to evaluate duplicates across a column of paths

Installing optional dependencies¶

If you want image hashing or MinHash acceleration, install the optional packages. Pillow is required for image hashing. datasketch is optional for MinHash-based workflows. Run the following in a notebook cell if needed:

!pip install --upgrade pillow datasketch

If you run in an environment that already has these packages (for example when using the repository's dev environment), you can skip installation.

In [1]:
# Imports used by the examples
import os
import tempfile

from filoma import dedup
from filoma.dataframe import DataFrame
from filoma.files.file_profiler import FileProfiler
from filoma.images.image_profiler import ImageProfiler

# Optional: check Pillow availability for image examples
try:
    from PIL import Image

    _HAS_PIL = True
except Exception:
    _HAS_PIL = False

print("dedup module:", dedup)
print("Pillow available:", _HAS_PIL)
dedup module: <module 'filoma.dedup' from '/home/kalfasy/repos/filoma/src/filoma/dedup.py'>
Pillow available: True

Standalone text dedup example¶

Create two small text files that are near-duplicates and run find_duplicates with a lower similarity threshold; short texts produce only a handful of shingles, so a single word change removes a large fraction of them and the measured similarity drops accordingly.

In [2]:
with tempfile.TemporaryDirectory() as td:
    p1 = os.path.join(td, "a.txt")
    p2 = os.path.join(td, "b.txt")
    with open(p1, "w") as f:
        f.write("the quick brown fox jumps over the lazy dog")
    with open(p2, "w") as f:
        f.write("the quick brown fox jumped over the lazy dog")

    res = dedup.find_duplicates([p1, p2], text_k=3, text_threshold=0.4)
    print("Standalone text duplicate result:")
    print(res)
Standalone text duplicate result:
{'exact': [], 'text': [['/tmp/tmpcu_ayv9z/a.txt', '/tmp/tmpcu_ayv9z/b.txt']], 'image': []}
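
To get a feel for why the threshold is lowered, here is a rough, standalone sketch of word-shingle Jaccard similarity in plain Python. It is not filoma's internals (which may shingle and score differently), just an illustration of how a one-word edit in a short sentence pushes the score well below 1.0:

# Standalone sketch of shingle-based similarity; filoma's implementation may differ in details.
def word_shingles(text, k=3):
    words = text.split()
    return {" ".join(words[i : i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b) if (a | b) else 0.0

s1 = word_shingles("the quick brown fox jumps over the lazy dog")
s2 = word_shingles("the quick brown fox jumped over the lazy dog")
print("Jaccard similarity:", jaccard(s1, s2))  # 0.4 for these two sentences, hence the lowered threshold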

Standalone image dedup example¶

If Pillow is available, this example creates two identical images and demonstrates perceptual hashing and grouping. Because the files are byte-identical, they also show up in the exact group of the result.

In [3]:
if not _HAS_PIL:
    print("Pillow is not available; skipping image dedup example")
else:
    with tempfile.TemporaryDirectory() as td:
        p1 = os.path.join(td, "img1.png")
        p2 = os.path.join(td, "img2.png")
        img = Image.new("RGB", (64, 64), color=(123, 200, 100))
        img.save(p1)
        img.save(p2)

        # Compute hashes and run find_duplicates
        h1 = dedup.ahash_image(p1)
        h2 = dedup.ahash_image(p2)
        print("aHash 1:", h1)
        print("aHash 2:", h2)
        res = dedup.find_duplicates([p1, p2], image_max_distance=2)
        print("Standalone image duplicate result:")
        print(res)
aHash 1: ffffffffffffffff
aHash 2: ffffffffffffffff
Standalone image duplicate result:
{'exact': [['/tmp/tmpwat5qnud/img1.png', '/tmp/tmpwat5qnud/img2.png']], 'text': [], 'image': [['/tmp/tmpwat5qnud/img1.png', '/tmp/tmpwat5qnud/img2.png']]}
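
Average hashing reduces an image to a small grayscale thumbnail and records which pixels sit at or above the mean brightness, so visually similar images share most of their bits. The following is a generic illustration of the idea using Pillow only; filoma's ahash_image may differ in details such as thumbnail size, the comparison rule, or bit ordering:

# Generic average-hash (aHash) illustration with an assumed 8x8 thumbnail; not filoma's exact implementation.
from PIL import Image

def simple_ahash(path, size=8):
    img = Image.open(path).convert("L").resize((size, size))
    pixels = list(img.getdata())
    mean = sum(pixels) / len(pixels)
    # Pixels at or above the mean become 1-bits; pack the bits into a hex string.
    bits = "".join("1" if p >= mean else "0" for p in pixels)
    return f"{int(bits, 2):0{size * size // 4}x}"

Identical inputs produce identical hashes, and near-duplicates typically differ in only a few bits, which is presumably what the image_max_distance argument bounds.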

Using FileProfiler for dedup fingerprints¶

FileProfiler exposes fingerprint_for_dedup(), which produces a compact dict containing a sha256 digest and, optionally, text_shingles or an image_hash. This is handy for pipeline-style scanning.

In [4]:
prof = FileProfiler()
with tempfile.TemporaryDirectory() as td:
    p = os.path.join(td, "doc.txt")
    with open(p, "w") as f:
        f.write("this is a sample document used for dedup testing")

    fp = prof.fingerprint_for_dedup(p, compute_text=True)
    print("Fingerprint for dedup:")
    print(fp)
Fingerprint for dedup:
{'path': '/tmp/tmpwcnpv8ji/doc.txt', 'size': 48, 'sha256': '064d7354bd3bf25c401f0899a9cde918cedf90f80392ff43080028a551e15782', 'text_shingles': {'a sample document', 'is a sample', 'thi is a', 'for dedup test', 'document used for', 'used for dedup', 'sample document used'}, 'image_hash': None}
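
In a pipeline you might compare such fingerprints pairwise yourself. A minimal sketch, assuming only the key names visible in the output above (sha256 equality for exact duplicates, Jaccard overlap of text_shingles for near-duplicates); the helper and threshold are illustrative, not filoma API:

# Hypothetical pairwise comparison of two fingerprint dicts produced by fingerprint_for_dedup().
def compare_fingerprints(fp_a, fp_b, text_threshold=0.4):
    if fp_a["sha256"] == fp_b["sha256"]:
        return "exact duplicate"
    a, b = fp_a.get("text_shingles"), fp_b.get("text_shingles")
    if a and b:
        overlap = len(a & b) / len(a | b)
        if overlap >= text_threshold:
            return "near-duplicate text"
    return "distinct"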

Using ImageProfiler to compute perceptual hashes¶

ImageProfiler exposes compute_ahash / compute_dhash, which delegate to filoma.dedup so hash computation stays consistent across the library.

In [5]:
if not _HAS_PIL:
    print("Pillow is not available; skip ImageProfiler example")
else:
    ip = ImageProfiler()
    with tempfile.TemporaryDirectory() as td:
        p = os.path.join(td, "img.png")
        Image.new("RGB", (32, 32), color=(10, 20, 30)).save(p)
        print("ahash:", ip.compute_ahash(p))
        print("dhash:", ip.compute_dhash(p))
ahash: ffffffffffffffff
dhash: 0000000000000000
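
Perceptual hashes are compared by Hamming distance, i.e. the number of bits that differ; small distances suggest visually similar images. A tiny helper for comparing the hex strings returned above (a convenience sketch, not part of filoma):

# Hamming distance between two hex-encoded hashes of equal bit length.
def hamming_distance(hash_a: str, hash_b: str) -> int:
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")

print(hamming_distance("ffffffffffffffff", "fffffffffffffffe"))  # 1 differing bit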

DataFrame convenience: evaluate_duplicates()¶

DataFrame.evaluate_duplicates() scans the path column and prints a small Rich summary table. It returns the raw groups for programmatic use.

In [6]:
# Build a DataFrame from the two text files used earlier and run evaluation
with tempfile.TemporaryDirectory() as td:
    p1 = os.path.join(td, "a.txt")
    p2 = os.path.join(td, "b.txt")
    with open(p1, "w") as f:
        f.write("the quick brown fox jumps over the lazy dog")
    with open(p2, "w") as f:
        f.write("the quick brown fox jumped over the lazy dog")

    df = DataFrame([p1, p2])
    groups = df.evaluate_duplicates(text_threshold=0.4, show_table=True)
    print("Returned groups:")
    print(groups)
         Duplicate Summary          
┏━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ Type  ┃ Groups ┃ Files In Groups ┃
┡━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ exact │ 0      │ 0               │
│ text  │ 1      │ 2               │
│ image │ 0      │ 0               │
└───────┴────────┴─────────────────┘
2025-09-13 22:10:51.733 | INFO     | filoma.dataframe:evaluate_duplicates:915 - Duplicate summary: exact=0 groups (0 files), text=1 groups (2 files), image=0 groups (0 files)
Returned groups:
{'exact': [], 'text': [['/tmp/tmpkfv5tp9n/a.txt', '/tmp/tmpkfv5tp9n/b.txt']], 'image': []}
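
The returned dict can feed a cleanup step directly. A minimal sketch, assuming you want to keep the first path in each group and collect the rest for review (plain Python over the structure shown above, not a filoma API):

# Keep the first member of every duplicate group; flag the rest as removal candidates.
to_review = []
for kind in ("exact", "text", "image"):
    for group in groups.get(kind, []):
        to_review.extend(group[1:])

print("Candidates for removal or relabelling:", sorted(set(to_review)))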

Closing notes¶

  • For large datasets, consider datasketch.MinHash + LSH to scale text similarity; a minimal sketch follows this list.
  • For image deduplication at scale, consider perceptual hashes plus a nearest-neighbor index.
  • DataFrame.evaluate_duplicates() is intended as a quick way to get actionable groups; you can export the groups and apply cleaning workflows (drop, label, or move duplicates).
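
As a starting point for the datasketch route mentioned above, here is a minimal MinHash + LSH sketch over word shingles. It is independent of filoma and only illustrates the scaling idea; the shingle size, num_perm, and threshold values are illustrative choices:

# Index documents with MinHash signatures and query the LSH index for near-duplicate candidates.
from datasketch import MinHash, MinHashLSH

def minhash_of(text, k=3, num_perm=128):
    words = text.split()
    m = MinHash(num_perm=num_perm)
    for i in range(len(words) - k + 1):
        m.update(" ".join(words[i : i + k]).encode("utf-8"))
    return m

docs = {
    "a.txt": "the quick brown fox jumps over the lazy dog",
    "b.txt": "the quick brown fox jumped over the lazy dog",
}
lsh = MinHashLSH(threshold=0.3, num_perm=128)
for name, text in docs.items():
    lsh.insert(name, minhash_of(text))

# Candidates whose estimated similarity exceeds the threshold (includes a.txt itself).
print(lsh.query(minhash_of(docs["a.txt"])))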