Metadata-Version: 2.4
Name: smartdownsample
Version: 0.1.0
Summary: Smart image downsampling for image classification datasets
Home-page: https://github.com/PetervanLunteren/smartdownsample
Author: peteraddax
Author-email: your.email@example.com
License: MIT
Project-URL: Homepage, https://github.com/PetervanLunteren/smartdownsample
Project-URL: Repository, https://github.com/PetervanLunteren/smartdownsample
Project-URL: Issues, https://github.com/PetervanLunteren/smartdownsample/issues
Project-URL: Documentation, https://github.com/PetervanLunteren/smartdownsample#readme
Keywords: image,downsampling,camera-trap,machine-learning,computer-vision,diversity,deduplication
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Scientific/Engineering :: Image Processing
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.21.0
Requires-Dist: Pillow>=8.0.0
Requires-Dist: imagehash>=4.2.0
Requires-Dist: scikit-learn>=1.0.0
Requires-Dist: tqdm>=4.62.0
Requires-Dist: natsort>=8.0.0
Requires-Dist: matplotlib>=3.0.0
Provides-Extra: dev
Requires-Dist: pytest>=6.0; extra == "dev"
Requires-Dist: black>=22.0; extra == "dev"
Requires-Dist: flake8>=4.0; extra == "dev"
Requires-Dist: mypy>=0.900; extra == "dev"
Dynamic: author-email
Dynamic: home-page
Dynamic: license-file
Dynamic: requires-python

# smartdownsample

**Efficient downsampling for image classification datasets**

SmartDownsample selects the most diverse images from large collections, ideal for reducing dataset size while preserving visual variability.

## Installation

```bash
pip install smartdownsample
```

## Usage

```python
from smartdownsample import select_distinct

# Example list of image paths
my_image_list = [
    "path/to/img1.jpg",
    "path/to/img2.jpg",
    "path/to/img3.jpg",
    "path/to/img4.jpg"
]

# Simple selection - get 100 most diverse images
selected = select_distinct(
    image_paths=my_image_list,
    target_count=100
)

# With visual verification to see excluded images in context
selected = select_distinct(
    image_paths=my_image_list,
    target_count=100,
    show_verification=True
)

print(f"Selected {len(selected)} images")

```

## Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `image_paths` | Required | List of image file paths (str or Path objects) |
| `target_count` | Required | Exact number of images to select |
| `window_size` | `100` | Rolling window size (larger = better quality, slower) |
| `random_seed` | `42` | Random seed for reproducible results |
| `show_progress` | `True` | Whether to display progress bars |
| `show_verification` | `False` | Show visual verification comparing excluded vs included images |

## Step by Step

1. **Sort paths** by directory. Within each folder, files are naturally ordered (e.g., `img1.jpg`, `img2.jpg`, `img10.jpg`) so related images remain grouped.  
2. **Compute perceptual hashes** for all valid image paths.  
3. **Apply rolling window selection** on the hash array to choose indices of the most diverse images. This runs in O(n) time, scales to large classes of 100k+ images, and compares each candidate only to a sliding window of recent selections.  
4. **Return results** as `[valid_paths[i] for i in selected_indices]`.  
5. **Optional verification plot**: If `show_verification=True`, the algorithm displays a visual check of 18 randomly selected excluded images and their included counterpart. The visualization opens automatically in your default image viewer without saving files to disk.

## License

MIT License – see LICENSE file.
