Metadata-Version: 2.4
Name: imgraft
Version: 0.1.0
Summary: Semantic image dataset curation using CLIP + HDBSCAN
License: MIT
Project-URL: Homepage, https://github.com/dasarinikhil/imgraft
Project-URL: Issues, https://github.com/dasarinikhil/imgraft/issues
Keywords: clip,hdbscan,dataset-curation,computer-vision,image-deduplication,semantic-clustering,umap
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Scientific/Engineering :: Image Processing
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch>=2.0.0
Requires-Dist: open-clip-torch>=2.20.0
Requires-Dist: hdbscan>=0.8.33
Requires-Dist: umap-learn>=0.5.3
Requires-Dist: numpy>=1.24.0
Requires-Dist: Pillow>=10.0.0
Requires-Dist: tqdm>=4.66.0
Requires-Dist: rich>=13.7.0
Requires-Dist: typer>=0.12.0
Requires-Dist: plotly>=5.18.0
Requires-Dist: scikit-learn>=1.3.0
Provides-Extra: dev
Requires-Dist: pytest>=8.0.0; extra == "dev"
Requires-Dist: pytest-cov>=5.0.0; extra == "dev"
Provides-Extra: vis
Requires-Dist: matplotlib>=3.8.0; extra == "vis"
Requires-Dist: kaleido>=0.2.1; extra == "vis"
Dynamic: license-file

# imgraft

**Semantic image dataset curation via CLIP + HDBSCAN.**

[![PyPI version](https://badge.fury.io/py/imgraft.svg)](https://pypi.org/project/imgraft/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)
[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/)

imgraft helps you clean, balance, and curate any messy image dataset — no labels, no manual annotation needed. Give it a folder of images and it returns a semantically balanced, deduplicated subset with full visualizations.

---

## How it works

![imgraft pipeline](docs/pipeline.png)

Five stages — all automatic:

1. **Embed** — every image gets a CLIP vector (ViT-B/32 or ViT-L/14)
2. **Cluster** — HDBSCAN finds semantic groups, UMAP pre-reduces for speed
3. **Curate** — pick a strategy: centroid, diversity, text-query, or drop-noise
4. **Inspect** — contact sheet PNGs per cluster for visual verification
5. **Export** — structured `core/` + `diverse/` + `noise/` folders, ready for training
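
As a rough illustration of how stages 2 and 3 connect, the sketch below groups image indices by cluster label and derives a per-cluster keep quota. It relies on the HDBSCAN convention that outliers receive the label `-1`. The helper names and the proportional quota rule are assumptions made for this example, not imgraft's internals.

```python
import math
from collections import defaultdict

NOISE_LABEL = -1  # HDBSCAN assigns -1 to points it treats as outliers


def group_by_cluster(labels):
    """Return ({cluster_label: [image indices]}, [noise indices])."""
    clusters = defaultdict(list)
    for index, label in enumerate(labels):
        clusters[label].append(index)
    noise = clusters.pop(NOISE_LABEL, [])
    return dict(clusters), noise


def per_cluster_quota(clusters, keep_ratio):
    """Hypothetical budget rule: keep the same fraction of every cluster, rounded up."""
    return {
        label: math.ceil(len(members) * keep_ratio)
        for label, members in clusters.items()
    }


# Toy labels for ten images; a real run would take them from the clustering stage.
labels = [0, 0, 1, -1, 1, 0, 2, -1, 0, 1]
clusters, noise = group_by_cluster(labels)
quota = per_cluster_quota(clusters, keep_ratio=0.5)

print(clusters)  # image indices per cluster
print(noise)     # indices flagged as outliers
print(quota)     # how many images to keep from each cluster
```

Rounding up means even a one-image cluster keeps its single member, which is one simple way to stop rare groups from vanishing in the curated subset.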

---

## Why imgraft?

Raw image datasets are messy. Web scrapes contain near-duplicates. Crawled sets are class-imbalanced. Annotators mislabel. Training on noisy data can silently degrade model performance.

imgraft solves this using **semantic similarity** — it embeds every image with CLIP, discovers visual groups with HDBSCAN, and selects the best representatives per cluster. No domain assumptions. Works on any image type.

> **Real-world impact**: On a production OCR project, an automated CLIP + HDBSCAN curation pipeline recovered model accuracy from 2% (after discovering 100K+ mislabeled images) to 90% on first retrain, reaching 95% through iteration.

---

## Install

```bash
pip install imgraft
```

For PNG visualization support:
```bash
pip install "imgraft[vis]"   # quotes keep shells such as zsh from expanding the brackets
```

---

## Quick start

### Python API

```python
from imgraft import Curator

curator = Curator(model="ViT-B/32")   # or "ViT-L/14" for higher accuracy

result = curator.run(
    image_dir="./raw_images/",
    keep_ratio=0.25,           # keep 25% of dataset
    strategy="diversity",      # centroid | diversity | text-query | drop-noise
    drop_noise=True,           # remove OOD/noise images
)

# ── verify clusters visually ───────────────────────────────────────────────────
result.inspect(
    output_dir="./cluster_grids/",   # one PNG contact sheet per cluster
    n_per_side=5,                    # 5 centroid + 5 random thumbnails side by side
)
# open cluster_000.png, cluster_001.png etc. — visually verify before committing

# ── export structured dataset ──────────────────────────────────────────────────
result.export_clusters(
    output_dir="./dataset/",
    n_core=50,      # N most representative images per cluster
    n_diverse=50,   # N most diverse images per cluster
)

# ── interactive UMAP explorer ──────────────────────────────────────────────────
result.plot("clusters.html")   # hover any point to see the image

print(result.stats())
# e.g. {'total': 8211, 'kept': 1642, 'clusters': 47, 'noise': 312, ...}  (illustrative values)
```

### CLI

```bash
# basic — keep 25% using diversity sampling
imgraft run ./images/ --keep 0.25 --out ./curated/

# with interactive visualization
imgraft run ./images/ --keep 0.25 --visualize --out ./curated/

# filter by text query (zero-shot, no labels needed)
imgraft run ./images/ \
  --strategy text-query \
  --query "a clear front-facing product photo on white background" \
  --out ./curated/

# drop noise/OOD images only, keep everything else
imgraft run ./images/ --strategy drop-noise --out ./cleaned/

# inspect dataset structure before curating
imgraft info ./images/
```
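
Conceptually, the `text-query` strategy scores each image embedding against the embedding of the prompt. The sketch below shows that ranking step with cosine similarity on toy 2-D vectors. `rank_by_text_query` is a hypothetical helper, not part of the imgraft API, and in practice both embeddings would come from the same CLIP model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_text_query(image_embeddings, text_embedding, top_k):
    """Return the indices of the top_k images most similar to the text embedding."""
    scores = [cosine_similarity(vec, text_embedding) for vec in image_embeddings]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:top_k]


# Toy stand-ins for CLIP image embeddings and one prompt embedding.
image_embeddings = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0], [-1.0, 0.0]]
text_embedding = [1.0, 0.2]

top = rank_by_text_query(image_embeddings, text_embedding, top_k=2)
print(top)  # indices of the two best-matching images
```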

---

## Cluster inspection

Before exporting your dataset, visually verify what each cluster contains:

```python
result.inspect(output_dir="./cluster_grids/", n_per_side=5)
```

Each PNG contact sheet shows **two sides**:

| Left | Right |
|---|---|
| Core — closest to cluster centroid | Random sample from the cluster |

This lets you spot bad clusters at a glance: if `cluster_003` is all blurry images or mislabeled examples, you know to drop it before training.
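
To make the two sides of a contact sheet concrete, here is a minimal sketch of how the images for one cluster could be chosen: the `n_per_side` embeddings nearest the centroid on the left, and a random sample on the right. `pick_for_contact_sheet` and its `seed` parameter are hypothetical, and the toy 2-D vectors stand in for real CLIP embeddings.

```python
import math
import random


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def pick_for_contact_sheet(vectors, n_per_side, seed=0):
    """Return (core, sample): indices nearest the centroid, and a random draw."""
    center = centroid(vectors)
    by_distance = sorted(
        range(len(vectors)), key=lambda i: math.dist(vectors[i], center)
    )
    core = by_distance[:n_per_side]
    rng = random.Random(seed)  # fixed seed so the sheet is reproducible
    sample = rng.sample(range(len(vectors)), min(n_per_side, len(vectors)))
    return core, sample


# Five toy embeddings; the last one sits far from the rest of the cluster.
cluster_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [1.0, 1.0], [6.0, 7.0]]
core, sample = pick_for_contact_sheet(cluster_vectors, n_per_side=2)
print(core)    # indices closest to the centroid
print(sample)  # indices drawn at random
```

Comparing the two sides is informative because the core shows what the cluster is "about", while the random draw shows how far its members stray from that.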

---

## Structured export

```python
result.export_clusters(
    output_dir="./dataset/",
    n_core=50,
    n_diverse=50,
)
```

Output layout:

```
dataset/
  cluster_000/
    core/           ← centroid-closest (most representative)
    diverse/        ← furthest-point sampled (max variety)
  cluster_001/
    core/
    diverse/
  ...
  noise/            ← all HDBSCAN outliers, isolated
  export_summary.json
```

`core/` is best for training classifiers. `diverse/` maximises variety and removes near-duplicates. `noise/` gives a clean view of OOD images to review or discard.
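
Once an export exists, the layout above is easy to audit with plain `pathlib`. The sketch below counts images per `core/` and `diverse/` folder plus the `noise/` folder. `summarize_export` and the suffix list are assumptions for illustration, and the demo builds a throwaway directory tree so it runs without a real export. It deliberately does not read `export_summary.json`, whose schema is not described here.

```python
import tempfile
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp", ".bmp"}  # assumed file types


def count_images(folder):
    """Count image files directly inside folder (0 if the folder is missing)."""
    if not folder.is_dir():
        return 0
    return sum(1 for p in folder.iterdir() if p.suffix.lower() in IMAGE_SUFFIXES)


def summarize_export(root):
    """Return image counts for every cluster_*/core, cluster_*/diverse, and noise/."""
    root = Path(root)
    summary = {}
    for cluster_dir in sorted(root.glob("cluster_*")):
        if cluster_dir.is_dir():
            summary[cluster_dir.name] = {
                "core": count_images(cluster_dir / "core"),
                "diverse": count_images(cluster_dir / "diverse"),
            }
    summary["noise"] = count_images(root / "noise")
    return summary


# Demo on a tiny fake export so the sketch is runnable as-is.
with tempfile.TemporaryDirectory() as tmp:
    fake_files = [
        "cluster_000/core/a.jpg",
        "cluster_000/core/b.png",
        "cluster_000/diverse/c.jpg",
        "noise/d.jpg",
    ]
    for rel in fake_files:
        path = Path(tmp) / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()
    summary = summarize_export(tmp)

print(summary)
```

On a real export you would call `summarize_export("./dataset/")` and look for clusters whose `core/` count falls well below `n_core`, which usually means the cluster itself is small.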

---

## Curation strategies

| Strategy | Best for | How it works |
|---|---|---|
| `diversity` | Removing near-duplicates, max variety | Greedy furthest-point sampling per cluster |
| `centroid` | Balanced, representative subsets | Keeps images closest to each cluster center |
| `text-query` | Domain-specific filtering, no labels | CLIP zero-shot similarity to a text prompt |
| `drop-noise` | Quick clean without size reduction | Removes HDBSCAN outliers only |
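
The `diversity` row refers to greedy furthest-point sampling. A minimal dependency-free sketch of that technique follows: each step picks the point whose distance to the already-selected set is largest. The function name, the `start` parameter, and the toy vectors are assumptions for illustration; imgraft's own implementation may differ in how it picks the first point or measures distance.

```python
import math


def furthest_point_sample(vectors, k, start=0):
    """Greedily select up to k indices that are spread far apart from each other."""
    k = min(k, len(vectors))
    if k == 0:
        return []
    selected = [start]
    chosen = {start}
    # For every point, the distance to its nearest already-selected point.
    min_dist = [math.dist(vec, vectors[start]) for vec in vectors]
    while len(selected) < k:
        candidates = [i for i in range(len(vectors)) if i not in chosen]
        next_idx = max(candidates, key=lambda i: min_dist[i])
        selected.append(next_idx)
        chosen.add(next_idx)
        for i, vec in enumerate(vectors):
            min_dist[i] = min(min_dist[i], math.dist(vec, vectors[next_idx]))
    return selected


# Indices 1 and 4 are near-duplicates of index 0 and should be skipped.
points = [[0.0, 0.0], [0.1, 0.0], [10.0, 0.0], [5.0, 0.0], [0.0, 0.1]]
picked = furthest_point_sample(points, k=3)
print(picked)  # → [0, 2, 3]
```

Because a near-duplicate always sits close to something already selected, it is among the last points this procedure would pick, which is why the strategy suits deduplication.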

---

## Works on any domain

- **Web-scraped datasets** — deduplicate crawled images, remove OOD noise
- **Medical imaging** — balance X-ray / pathology / dermoscopy class distributions
- **Satellite / aerial** — curate geospatial image sets by region and content type
- **E-commerce / products** — deduplicate product catalogs by visual similarity
- **Industrial / manufacturing** — balance defect vs. normal in inspection datasets
- **Document / form images** — group by layout type, sample representative subset
- **General ML training prep** — quality control before sending to annotation

---

## Backbone options

| Model | Embedding dim | Speed | Quality |
|---|---|---|---|
| `ViT-B/32` | 512 | ⚡ Fast | Good |
| `ViT-B/16` | 512 | Medium | Better |
| `ViT-L/14` | 768 | Slower | High |
| `ViT-H/14` | 1024 | Slow | Highest |

Switch with `--model ViT-L/14` or `Curator(model="ViT-L/14")`.

---

## Embedding cache

Re-embedding large folders is slow. Cache embeddings to disk so reruns skip it:

```bash
imgraft run ./images/ --cache ./.imgraft_cache/ --keep 0.25
```
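
The idea behind such a cache can be sketched in a few lines: key each embedding on the model name plus the file's path, size, and modification time, and only call the model on a miss. This is an illustrative design, not imgraft's actual on-disk format. `cache_key`, `get_embedding`, and the JSON storage are assumptions, and `fake_embed` stands in for a real CLIP forward pass.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def cache_key(image_path, model_name):
    """Hash model, path, size, and mtime so edited files or new models miss the cache."""
    path = Path(image_path).resolve()
    stat = path.stat()
    raw = f"{model_name}:{path}:{stat.st_size}:{stat.st_mtime_ns}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def get_embedding(image_path, cache_dir, embed_fn, model_name="ViT-B/32"):
    """Return the cached embedding if present, otherwise compute and store it."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    entry = cache_dir / f"{cache_key(image_path, model_name)}.json"
    if entry.exists():
        return json.loads(entry.read_text())
    vector = [float(x) for x in embed_fn(image_path)]
    entry.write_text(json.dumps(vector))
    return vector


calls = []  # records how often the (fake) model is actually invoked


def fake_embed(path):
    calls.append(path)
    return [0.1, 0.2, 0.3]


with tempfile.TemporaryDirectory() as tmp:
    image = Path(tmp) / "img.jpg"
    image.write_bytes(b"placeholder bytes, not a real image")
    first = get_embedding(image, Path(tmp) / "cache", fake_embed)
    second = get_embedding(image, Path(tmp) / "cache", fake_embed)

print(len(calls))  # the second lookup is served from disk
```

Including the model name in the key matters because embeddings from different backbones are not interchangeable, so switching `--model` should not reuse stale vectors.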

---

## Visualizations

**Interactive HTML** (`result.plot("clusters.html")`):
- UMAP scatter colored by cluster
- Hover any point → see the image thumbnail
- Kept images at full opacity, dropped faded

**Cluster grids** (`result.inspect(output_dir="./cluster_grids/")`):
- One PNG per cluster — centroid sample vs random sample side by side
- Visual verification before committing to training

---

## License

MIT
