Metadata-Version: 2.4
Name: cleanframes
Version: 0.2.6
Summary: A professional tool for cleaning duplicate or near-duplicate image frames using perceptual hashing and embeddings.
Home-page: https://github.com/abdullahalmutairi/cleanframes
Author: Abdullah Almutairi
Author-email: abdullah@example.com
License: MIT
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch
Requires-Dist: torchvision
Requires-Dist: transformers
Requires-Dist: timm
Requires-Dist: open_clip_torch
Requires-Dist: scikit-learn
Requires-Dist: tqdm
Requires-Dist: pillow
Requires-Dist: pandas
Requires-Dist: numpy
Requires-Dist: imagehash
Requires-Dist: matplotlib
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: license
Dynamic: license-file
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# CleanFrames

CleanFrames identifies and removes duplicate or near-duplicate image frames from large datasets. It combines three techniques, each targeting a different kind of redundancy:

- **MD5 hashing** for exact byte-level duplicates.
- **Perceptual hashing** for visually similar images.
- **Deep embeddings** for semantic redundancy detection.

This combination allows CleanFrames to handle a wide range of duplicate detection scenarios, from exact copies to subtle semantic similarities.
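
The first stage, exact byte-level matching, can be illustrated with the standard library alone. The sketch below is not CleanFrames' internal code; `md5_of_file` and `group_exact_duplicates` are hypothetical helper names used only to show how files with identical bytes end up under the same MD5 digest.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_exact_duplicates(folder):
    """Map each MD5 digest to the files sharing it, keeping only groups of two or more."""
    groups = defaultdict(list)
    for path in sorted(Path(folder).iterdir()):
        if path.is_file():
            groups[md5_of_file(path)].append(path)
    return {digest: paths for digest, paths in groups.items() if len(paths) > 1}
```

Files that differ by even one byte (for example, the same frame re-encoded at a different JPEG quality) get different digests, which is why the perceptual and embedding stages are needed for near-duplicates.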

## Features

- Supports multiple embedding models: **Swin**, **CLIP**, **DINO**, and **ResNet**.
- Flexible usage modes: clean images by path only, generate embeddings on the fly, or supply custom embeddings.
- Runs on CPU, CUDA GPUs, and Apple MPS for accelerated processing.
- Outputs cleaned images into organized folders for easy inspection.
- Provides detailed results including removed duplicates and retained images.

## Installation

Install CleanFrames via pip:

```bash
pip install cleanframes
```

## Usage

### 1. Basic Usage: Clean by Path Only

CleanFrames can process a folder of images directly, automatically computing embeddings using the default model (Swin) and removing duplicates.

```python
from clean_frames import CleanFrame

cleaner = CleanFrame(device='cuda')  # or 'cpu', 'mps' depending on your hardware
input_folder = "path/to/images"

# Clean images by path only
cleaner.cleanframes(input_folder)
```

This creates the following inside the input folder:
- `cleaned` - contains unique images after cleaning.
- `duplicates` - contains removed duplicate images.
- `results.json` - detailed report of the cleaning process.

### 2. Generate Embeddings and Clean

You can generate embeddings separately and then clean based on those embeddings. This is useful if you want to inspect or reuse embeddings.

```python
from clean_frames import CleanFrame

cleaner = CleanFrame(device='cuda')
input_folder = "path/to/images"

# Generate embeddings using Swin model
embeddings, paths = cleaner.SwinEmbedding(input_folder)

# Clean images using generated embeddings
cleaner.cleanframes(paths, embeddings_list=[("swin", embeddings)], threshold=0.95)
```

### 3. Clean Using Custom Embeddings

If you have your own embeddings (e.g., from other models or precomputed vectors), you can supply them directly.

```python
from clean_frames import CleanFrame
import numpy as np

cleaner = CleanFrame(device='cpu')
input_folder = "path/to/images"

# Example: Load or create custom embeddings as a numpy array
custom_embeddings = np.load("custom_embeddings.npy")
image_paths = [...]  # list of image file paths corresponding to embeddings

# Clean using custom embeddings with a specified model name
cleaner.cleanframes(image_paths, embeddings_list=[("custom_model", custom_embeddings)], threshold=0.9)
```
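
Custom embeddings are only useful if row `i` of the array describes the image at position `i` in the path list. The sketch below shows one way to build and check that list, assuming the embeddings were computed over the folder in sorted filename order. `list_image_paths`, `check_alignment`, and the extension set are hypothetical and not part of CleanFrames.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to match your dataset.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp", ".webp"}


def list_image_paths(folder):
    """Return image file paths in the folder as strings, sorted for a stable order."""
    return sorted(
        str(p)
        for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS
    )


def check_alignment(paths, n_embeddings):
    """Raise ValueError if the number of paths and embedding rows differ."""
    if len(paths) != n_embeddings:
        raise ValueError(
            f"{len(paths)} image paths but {n_embeddings} embedding rows"
        )
```

With a NumPy array, the row count is `custom_embeddings.shape[0]`, so `check_alignment(image_paths, custom_embeddings.shape[0])` would catch a mismatch before cleaning starts.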

## Supported Embedding Models

- **Swin**: Hierarchical Vision Transformer for image representation.
- **CLIP**: Contrastive Language-Image Pretraining embeddings.
- **DINO**: Self-distillation with no labels for visual features.
- **ResNet**: Classic convolutional neural network embeddings.

You can generate embeddings with any of these models using the corresponding `CleanFrame` method (e.g., `cleaner.CLIPEmbedding()`, `cleaner.DINOEmbedding()`).

## Device Support

CleanFrames supports multiple devices for accelerated embedding computation:

- **CPU**: Default fallback.
- **CUDA GPU**: For NVIDIA GPUs.
- **MPS**: Apple's Metal Performance Shaders for Macs with Apple Silicon.

Specify your device when initializing `CleanFrame`:

```python
cleaner = CleanFrame(device='mps')  # or 'cuda', 'cpu'
```
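
If the same script has to run on different machines, the device string can be chosen at runtime. The helper below is a sketch rather than part of CleanFrames: `pick_device` is a hypothetical name, and it relies only on PyTorch's `torch.cuda.is_available()` and `torch.backends.mps.is_available()` checks, falling back to `'cpu'` when PyTorch is missing or no accelerator is found.

```python
def pick_device(preferred=None):
    """Return 'cuda', 'mps', or 'cpu' depending on what PyTorch reports as usable."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is no accelerator to detect.
        return "cpu"

    available = {"cpu"}
    if torch.cuda.is_available():
        available.add("cuda")
    # Older PyTorch builds have no torch.backends.mps, so guard the lookup.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        available.add("mps")

    # Honor the caller's preference when that device is actually usable.
    if preferred in available:
        return preferred
    for name in ("cuda", "mps", "cpu"):
        if name in available:
            return name
    return "cpu"
```

The result can then be passed straight through, as in `CleanFrame(device=pick_device())`.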

## Output Structure

After cleaning, the tool creates the following inside the input folder or specified path:

- `cleaned/`: Contains the filtered set of unique images.
- `duplicates/`: Contains images identified as duplicates or near-duplicates.
- `results.json`: JSON file summarizing duplicates removed, thresholds used, and other metadata.
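
The exact schema of `results.json` is not documented here, so a first look at the report is best done without assuming any field names. The hypothetical helper below just loads the file and lists its top-level keys with their value types.

```python
import json


def describe_report(report_path):
    """Load a JSON report and return {top-level key: type name} without assuming a schema."""
    with open(report_path, "r", encoding="utf-8") as fh:
        report = json.load(fh)
    if not isinstance(report, dict):
        # The root might be a list or scalar; report its type instead of failing.
        return {"<root>": type(report).__name__}
    return {key: type(value).__name__ for key, value in report.items()}
```

Calling `describe_report("path/to/images/results.json")` after a run shows which fields are available before you write code that depends on them.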

## Notes

- The `threshold` parameter controls sensitivity for near-duplicate detection; lower values treat more image pairs as duplicates and therefore remove more images.
- Combining multiple embedding models can improve detection accuracy.
- CleanFrames is designed to be scalable and efficient for large image datasets.
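
To make the effect of `threshold` concrete, the sketch below assumes it is compared against cosine similarity between embedding vectors, and that an image is dropped when it is at least that similar to one already kept. Both points are assumptions for illustration; the actual metric and selection strategy inside CleanFrames may differ, and `greedy_filter` is a hypothetical name.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def greedy_filter(embeddings, threshold):
    """Keep an item only if it is below the threshold against every item kept so far.

    Returns (kept_indices, removed_indices).
    """
    kept, removed = [], []
    for i, vec in enumerate(embeddings):
        if any(cosine_similarity(vec, embeddings[j]) >= threshold for j in kept):
            removed.append(i)
        else:
            kept.append(i)
    return kept, removed


vectors = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
print(greedy_filter(vectors, 0.999))  # strict threshold: nothing is removed
print(greedy_filter(vectors, 0.95))   # looser threshold: the second vector is removed
```

This pairwise loop is quadratic in the number of images, so it is meant to show the thresholding logic, not to scale to large datasets.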

---

For more information and advanced options, see the official documentation or the GitHub repository: https://github.com/abdullahalmutairi/cleanframes
