filoma dedup tutorial¶
This notebook demonstrates the deduplication utilities added to filoma:

- standalone helpers in filoma.dedup
- integration via FileProfiler and ImageProfiler
- a convenience method on filoma.DataFrame to evaluate duplicates across a column of paths
Installing optional dependencies¶
If you want image hashing or MinHash acceleration, install the optional packages: Pillow is required for image hashing, and datasketch is optional for MinHash-based workflows. Run the following in a notebook cell if needed:
!pip install --upgrade pillow datasketch
If you run in an environment that already has these packages (for example when using the repository's dev environment), you can skip installation.
# Imports used by the examples
import os
import tempfile
from filoma import dedup
from filoma.dataframe import DataFrame
from filoma.files.file_profiler import FileProfiler
from filoma.images.image_profiler import ImageProfiler
# Optional: check Pillow availability for image examples
try:
    from PIL import Image

    _HAS_PIL = True
except Exception:
    _HAS_PIL = False
print("dedup module:", dedup)
print("Pillow available:", _HAS_PIL)
dedup module: <module 'filoma.dedup' from '/home/kalfasy/repos/filoma/src/filoma/dedup.py'> Pillow available: True
Standalone text dedup example¶
Create two small text files that are near-duplicates and run find_duplicates with a lower threshold for short texts.
with tempfile.TemporaryDirectory() as td:
    p1 = os.path.join(td, "a.txt")
    p2 = os.path.join(td, "b.txt")
    with open(p1, "w") as f:
        f.write("the quick brown fox jumps over the lazy dog")
    with open(p2, "w") as f:
        f.write("the quick brown fox jumped over the lazy dog")

    res = dedup.find_duplicates([p1, p2], text_k=3, text_threshold=0.4)
    print("Standalone text duplicate result:")
    print(res)
Standalone text duplicate result:
{'exact': [], 'text': [['/tmp/tmpcu_ayv9z/a.txt', '/tmp/tmpcu_ayv9z/b.txt']], 'image': []}
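Here text_threshold is plausibly a Jaccard cutoff over word shingles (the FileProfiler fingerprint later in this notebook shows word 3-grams), though the exact scoring is internal to filoma. A minimal stdlib sketch of that comparison, using hypothetical helper names:

```python
def word_shingles(text, k=3):
    """Set of k-word shingles for a text (mirrors text_k=3 above)."""
    words = text.split()
    return {" ".join(words[i : i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity between two shingle sets."""
    return len(a & b) / len(a | b)


s1 = word_shingles("the quick brown fox jumps over the lazy dog")
s2 = word_shingles("the quick brown fox jumped over the lazy dog")
# 4 shared shingles out of 10 distinct ones across both texts:
jaccard(s1, s2)  # -> 0.4, at the 0.4 threshold, so the pair groups together
```

This is why the example lowers the threshold: short texts produce few shingles, so one changed word ("jumps" vs "jumped") knocks out three of the seven shingles at once.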
Standalone image dedup example¶
If Pillow is available, this example creates two identical images and demonstrates perceptual hashing and grouping.
if not _HAS_PIL:
    print("Pillow is not available; skipping image dedup example")
else:
    with tempfile.TemporaryDirectory() as td:
        p1 = os.path.join(td, "img1.png")
        p2 = os.path.join(td, "img2.png")
        img = Image.new("RGB", (64, 64), color=(123, 200, 100))
        img.save(p1)
        img.save(p2)

        # Compute hashes and run find_duplicates
        h1 = dedup.ahash_image(p1)
        h2 = dedup.ahash_image(p2)
        print("aHash 1:", h1)
        print("aHash 2:", h2)

        res = dedup.find_duplicates([p1, p2], image_max_distance=2)
        print("Standalone image duplicate result:")
        print(res)
aHash 1: ffffffffffffffff
aHash 2: ffffffffffffffff
Standalone image duplicate result:
{'exact': [['/tmp/tmpwat5qnud/img1.png', '/tmp/tmpwat5qnud/img2.png']], 'text': [], 'image': [['/tmp/tmpwat5qnud/img1.png', '/tmp/tmpwat5qnud/img2.png']]}
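image_max_distance presumably bounds the Hamming distance between the 64-bit perceptual hashes: two images group together when their hashes differ in at most that many bits. A stdlib sketch of that comparison (hamming_distance_hex is a hypothetical helper, not filoma API):

```python
def hamming_distance_hex(h1: str, h2: str) -> int:
    """Count differing bits between two hex-encoded hashes of equal width."""
    return bin(int(h1, 16) ^ int(h2, 16)).count("1")


# The two identical images above hash to the same value, so distance 0:
hamming_distance_hex("ffffffffffffffff", "ffffffffffffffff")  # -> 0

# A single flipped bit would still group under image_max_distance=2:
hamming_distance_hex("ffffffffffffffff", "fffffffffffffffe")  # -> 1
```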
Using FileProfiler for dedup fingerprints¶
FileProfiler exposes fingerprint_for_dedup() which produces a compact dict with sha256 and optional text_shingles or image_hash. This is handy for pipeline-style scanning.
prof = FileProfiler()

with tempfile.TemporaryDirectory() as td:
    p = os.path.join(td, "doc.txt")
    with open(p, "w") as f:
        f.write("this is a sample document used for dedup testing")

    fp = prof.fingerprint_for_dedup(p, compute_text=True)
    print("Fingerprint for dedup:")
    print(fp)
Fingerprint for dedup:
{'path': '/tmp/tmpwcnpv8ji/doc.txt', 'size': 48, 'sha256': '064d7354bd3bf25c401f0899a9cde918cedf90f80392ff43080028a551e15782', 'text_shingles': {'a sample document', 'is a sample', 'thi is a', 'for dedup test', 'document used for', 'used for dedup', 'sample document used'}, 'image_hash': None}
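The sha256 field is what backs the "exact" groups: identical bytes produce identical digests. As an illustration of how such fingerprints can be grouped in a pipeline (group_exact_duplicates is a hypothetical helper, not filoma API):

```python
import hashlib
import os
import tempfile
from collections import defaultdict


def group_exact_duplicates(paths):
    """Group paths whose file contents share a sha256 digest."""
    by_digest = defaultdict(list)
    for p in paths:
        with open(p, "rb") as f:
            by_digest[hashlib.sha256(f.read()).hexdigest()].append(p)
    # Only digests seen more than once constitute duplicate groups
    return [group for group in by_digest.values() if len(group) > 1]


with tempfile.TemporaryDirectory() as td:
    paths = [os.path.join(td, name) for name in ("a.txt", "b.txt", "c.txt")]
    for p, text in zip(paths, ["same", "same", "different"]):
        with open(p, "w") as f:
            f.write(text)
    groups = group_exact_duplicates(paths)  # a.txt and b.txt share a digest
```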
Using ImageProfiler to compute perceptual hashes¶
The ImageProfiler exposes compute_ahash / compute_dhash which delegate to filoma.dedup for consistent hash computation.
if not _HAS_PIL:
    print("Pillow is not available; skipping ImageProfiler example")
else:
    ip = ImageProfiler()
    with tempfile.TemporaryDirectory() as td:
        p = os.path.join(td, "img.png")
        Image.new("RGB", (32, 32), color=(10, 20, 30)).save(p)
        print("ahash:", ip.compute_ahash(p))
        print("dhash:", ip.compute_dhash(p))
ahash: ffffffffffffffff
dhash: 0000000000000000
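The standard average-hash algorithm downscales an image to an 8x8 grayscale grid and sets one bit per pixel according to whether it is at least the mean, which is why a uniform image yields an all-ones hash above (implementations differ on how ties are handled; filoma's output suggests ties map to 1). A pure-Python sketch of the bit-setting step, skipping the resize:

```python
def average_hash_bits(pixels):
    """Hex-encode an aHash from a flat list of 64 grayscale values (8x8 grid)."""
    avg = sum(pixels) / len(pixels)
    bits = "".join("1" if p >= avg else "0" for p in pixels)
    return f"{int(bits, 2):016x}"


# A uniform image: every pixel equals the mean, so every bit is set.
average_hash_bits([123] * 64)  # -> 'ffffffffffffffff'

# Half dark, half bright: only the bright half sets bits.
average_hash_bits([0] * 32 + [255] * 32)  # -> '00000000ffffffff'
```

dhash, by contrast, compares each pixel to its horizontal neighbor, so a uniform image produces all zeros, matching the 0000000000000000 output above.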
DataFrame convenience: evaluate_duplicates()¶
DataFrame.evaluate_duplicates() scans the path column and prints a small Rich summary table. It returns the raw groups for programmatic use.
# Build a DataFrame from the two text files used earlier and run evaluation
with tempfile.TemporaryDirectory() as td:
    p1 = os.path.join(td, "a.txt")
    p2 = os.path.join(td, "b.txt")
    with open(p1, "w") as f:
        f.write("the quick brown fox jumps over the lazy dog")
    with open(p2, "w") as f:
        f.write("the quick brown fox jumped over the lazy dog")

    df = DataFrame([p1, p2])
    groups = df.evaluate_duplicates(text_threshold=0.4, show_table=True)
    print("Returned groups:")
    print(groups)
Duplicate Summary
┏━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ Type  ┃ Groups ┃ Files In Groups ┃
┡━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ exact │ 0      │ 0               │
│ text  │ 1      │ 2               │
│ image │ 0      │ 0               │
└───────┴────────┴─────────────────┘
2025-09-13 22:10:51.733 | INFO | filoma.dataframe:evaluate_duplicates:915 - Duplicate summary: exact=0 groups (0 files), text=1 groups (2 files), image=0 groups (0 files)
Returned groups:
{'exact': [], 'text': [['/tmp/tmpkfv5tp9n/a.txt', '/tmp/tmpkfv5tp9n/b.txt']], 'image': []}
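A common cleaning step on the returned groups is to keep one representative per group and collect the rest for removal. A minimal sketch (duplicates_to_remove is a hypothetical helper, not filoma API; the "keeper" here is simply the first path in each group):

```python
def duplicates_to_remove(groups):
    """Flatten duplicate groups, keeping the first path of each group."""
    doomed = set()
    for kind in ("exact", "text", "image"):
        for group in groups.get(kind, []):
            doomed.update(group[1:])  # everything after the first is redundant
    return sorted(doomed)


sample = {
    "exact": [],
    "text": [["/tmp/demo/a.txt", "/tmp/demo/b.txt"]],
    "image": [],
}
duplicates_to_remove(sample)  # -> ['/tmp/demo/b.txt']
```

From there you can os.remove the paths, move them aside, or add a "duplicate_of" column back onto the DataFrame.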
Closing notes¶
- For large datasets consider using datasketch.MinHash + LSH to scale text similarity.
- For image deduping at scale consider perceptual hashes + a nearest-neighbor index.
- DataFrame.evaluate_duplicates() is intended as a quick way to get actionable groups; you can export the groups and apply cleaning workflows (drop, label, or move duplicates).
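To illustrate why MinHash scales: a signature of n min-hashes estimates Jaccard similarity without ever intersecting full shingle sets, and LSH then buckets similar signatures so you avoid all-pairs comparison. datasketch provides a production implementation; the stdlib sketch below only demonstrates the estimation idea, using md5 with different seeds in place of a proper hash family:

```python
import hashlib


def word_shingles(text, k=3):
    words = text.split()
    return {" ".join(words[i : i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(shingle_set, num_perm=64):
    """One min-hash per seeded hash function; equal slots estimate Jaccard."""
    return [
        min(
            int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingle_set
        )
        for seed in range(num_perm)
    ]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


a = word_shingles("the quick brown fox jumps over the lazy dog")
b = word_shingles("the quick brown fox jumped over the lazy dog")
est = estimated_jaccard(minhash_signature(a), minhash_signature(b))
# est approximates the true Jaccard of 0.4, within sampling noise
```

With 64 slots the estimate carries a standard error of roughly sqrt(0.4 * 0.6 / 64) ≈ 0.06, and the comparison cost is fixed regardless of document length.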