Metadata-Version: 2.4
Name: imgeda
Version: 0.0.5
Summary: High-performance image dataset exploratory data analysis CLI tool
Project-URL: Homepage, https://github.com/caylent/imgeda
Project-URL: Repository, https://github.com/caylent/imgeda
Project-URL: Bug Tracker, https://github.com/caylent/imgeda/issues
Author-email: "Randall Hunt (Caylent)" <randall.hunt@caylent.com>
License: MIT
License-File: LICENSE
Keywords: analysis,cli,dataset,duplicates,eda,image,quality
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: Topic :: Scientific/Engineering :: Image Processing
Requires-Python: >=3.10
Requires-Dist: imagehash>=4.3
Requires-Dist: matplotlib>=3.10
Requires-Dist: numpy>=2.0
Requires-Dist: orjson>=3.10
Requires-Dist: pillow>=11.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: questionary>=2.1
Requires-Dist: rich>=13.0
Requires-Dist: typer>=0.15
Provides-Extra: dev
Requires-Dist: moto[s3]>=5.0; extra == 'dev'
Requires-Dist: mypy>=1.14; extra == 'dev'
Requires-Dist: pytest-cov>=6.0; extra == 'dev'
Requires-Dist: pytest-timeout>=2.3; extra == 'dev'
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: ruff>=0.9; extra == 'dev'
Provides-Extra: opencv
Requires-Dist: opencv-python-headless>=4.10; extra == 'opencv'
Provides-Extra: parquet
Requires-Dist: pyarrow>=15.0; extra == 'parquet'
Description-Content-Type: text/markdown

# imgeda

High-performance CLI tool for exploratory data analysis of image datasets.

Scan folders of images, generate JSONL manifests with metadata and pixel statistics, detect quality issues, find duplicates, and produce publication-ready visualizations — all from the command line.

[![PyPI](https://img.shields.io/pypi/v/imgeda)](https://pypi.org/project/imgeda/)
[![Python](https://img.shields.io/pypi/pyversions/imgeda)](https://pypi.org/project/imgeda/)
[![License](https://img.shields.io/pypi/l/imgeda)](https://github.com/caylent/imgeda/blob/main/LICENSE)

## Installation

```bash
pip install imgeda
```

Or with [uv](https://docs.astral.sh/uv/):

```bash
uv tool install imgeda
```

## Quick Start

```bash
# Scan a directory of images
imgeda scan ./images -o manifest.jsonl

# View dataset summary
imgeda info -m manifest.jsonl

# Check for quality issues
imgeda check all -m manifest.jsonl

# Generate all plots
imgeda plot all -m manifest.jsonl

# Generate an HTML report
imgeda report -m manifest.jsonl

# Compare two manifests
imgeda diff --old v1.jsonl --new v2.jsonl

# Run quality gate (exit code 2 on failure — CI-friendly)
imgeda gate -m manifest.jsonl -p policy.yml

# Export to Parquet (requires: pip install imgeda[parquet])
imgeda export parquet -m manifest.jsonl -o manifest.parquet
```

Or just run `imgeda` with no arguments for an interactive wizard that walks you through everything:

```bash
# Interactive mode — auto-detects dataset format (YOLO, COCO, VOC, classification, flat)
imgeda
```

The wizard detects your dataset structure, shows a summary panel with image counts, splits, and class info, then lets you pick which splits and analyses to run.

## Features

- **Fast parallel scanning** with multi-core `ProcessPoolExecutor` and Rich progress bars
- **Resumable**: press Ctrl+C at any time and progress is saved; re-run the same command to pick up where it left off
- **JSONL manifest** — append-only, crash-tolerant, one record per image
- **Per-image analysis**: dimensions, file size, pixel statistics (mean/std per channel), brightness, perceptual hashing (phash + dhash), border artifact detection
- **Quality checks**: corrupt files, dark/overexposed images, border artifacts, exact and near-duplicate detection
- **7 plot types** with automatic large-dataset adaptations
- **Single-page HTML report** with embedded plots and summary tables
- **Dataset format detection** — auto-detects YOLO, COCO, Pascal VOC, classification, and flat image directories with split-aware scanning
- **Interactive configurator** with Rich panels, split selection, and smart defaults
- **Lambda-compatible core** — the analysis functions have zero CLI dependencies, ready for serverless deployment
- **Manifest diff** — compare two manifests to track dataset changes over time
- **Quality gate** — policy-as-code YAML rules with CI-friendly exit codes
- **Parquet export** — streaming JSONL-to-Parquet conversion with flattened nested fields
- **AWS serverless deployment** — CDK + Step Functions + Lambda for S3-scale analysis

## Example Output

All examples below were generated from the [Oxford-IIIT Pet Dataset](https://www.robots.ox.ac.uk/~vgg/data/pets/) (3,680 images).

### Dimensions

Width vs. height scatter plot with reference lines for 720p, 1080p, and 4K resolutions.

![Dimensions](https://raw.githubusercontent.com/caylent/imgeda/main/docs/examples/dimensions.png)

### Brightness Distribution

Histogram of mean brightness per image, with shaded regions for dark (<40) and overexposed (>220) images.

![Brightness](https://raw.githubusercontent.com/caylent/imgeda/main/docs/examples/brightness.png)
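
The shaded regions match the default scan thresholds (`--dark-threshold 40.0` and `--overexposed-threshold 220.0`). As a minimal sketch of that classification, assuming brightness is a per-image mean on a 0 to 255 scale (the function name and labels below are hypothetical, not part of imgeda's API):

```python
def classify_exposure(
    mean_brightness: float,
    dark_threshold: float = 40.0,
    overexposed_threshold: float = 220.0,
) -> str:
    """Label an image by its mean brightness (assumed 0-255 scale)."""
    if mean_brightness < dark_threshold:
        return "dark"
    if mean_brightness > overexposed_threshold:
        return "overexposed"
    return "ok"
```

Values exactly on a threshold are treated as `ok` here, mirroring the strict `<40` and `>220` bounds in the plot description.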

### File Size Distribution

Log-scale histogram with annotated median, P95, and P99 percentile lines.

![File Size](https://raw.githubusercontent.com/caylent/imgeda/main/docs/examples/file_size.png)
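
The annotated lines are ordinary percentiles of the per-image file sizes. One way to reproduce them from a list of sizes, sketched with the standard library (the helper name is hypothetical, and imgeda may use a different interpolation method):

```python
from statistics import quantiles


def size_percentiles(sizes_bytes: list[int]) -> dict[str, float]:
    """Median, P95, and P99 of file sizes (needs at least two values)."""
    # n=100 yields 99 cut points; cuts[i] is the (i + 1)th percentile.
    cuts = quantiles(sizes_bytes, n=100)
    return {"median": cuts[49], "p95": cuts[94], "p99": cuts[98]}
```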

### Aspect Ratio Distribution

Histogram with reference lines at common ratios (1:1, 4:3, 3:2, 16:9).

![Aspect Ratio](https://raw.githubusercontent.com/caylent/imgeda/main/docs/examples/aspect_ratio.png)
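
To relate a single image to those reference lines, a small sketch that computes width divided by height and picks the closest reference ratio (the helper and its return shape are assumptions for illustration, not imgeda's API):

```python
# Reference ratios named in the plot description, as width / height.
COMMON_RATIOS = {"1:1": 1.0, "4:3": 4 / 3, "3:2": 3 / 2, "16:9": 16 / 9}


def nearest_common_ratio(width: int, height: int) -> tuple[str, float]:
    """Return the closest reference ratio label and the image's own ratio."""
    ratio = width / height
    label = min(COMMON_RATIOS, key=lambda name: abs(COMMON_RATIOS[name] - ratio))
    return label, ratio
```

For example, a 500x375 image maps to `4:3`. Portrait images have a ratio below 1 and land nearest `1:1` in this sketch, since only landscape references are listed.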

### Channel Distributions

Violin plots of mean R/G/B channel values across the dataset.

![Channels](https://raw.githubusercontent.com/caylent/imgeda/main/docs/examples/channels.png)

### Border Artifact Analysis

Corner-to-center brightness delta histogram with configurable threshold line.

![Artifacts](https://raw.githubusercontent.com/caylent/imgeda/main/docs/examples/artifacts.png)
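
The README does not spell out how the delta is computed. A rough sketch of one plausible definition, assuming a grayscale image given as a list of rows, square patches sized as a fraction of the shorter side, and the absolute difference between the mean of the four corner patches and the center patch (all names and the `patch_frac` parameter are hypothetical):

```python
from statistics import fmean


def patch_mean(gray, top, left, size):
    """Mean of a size x size patch of a 2D grayscale grid (list of rows)."""
    return fmean(v for row in gray[top:top + size] for v in row[left:left + size])


def corner_center_delta(gray, patch_frac=0.1):
    """Absolute brightness difference between the corners and the center."""
    h, w = len(gray), len(gray[0])
    size = max(1, int(min(h, w) * patch_frac))
    corners = [
        patch_mean(gray, 0, 0, size),
        patch_mean(gray, 0, w - size, size),
        patch_mean(gray, h - size, 0, size),
        patch_mean(gray, h - size, w - size, size),
    ]
    center = patch_mean(gray, (h - size) // 2, (w - size) // 2, size)
    return abs(fmean(corners) - center)
```

Under this definition, an image with a black frame around bright content produces a large delta, which would exceed the default `--artifact-threshold` of 50.0.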

### Duplicate Analysis

Duplicate group sizes and unique vs. duplicate breakdown.

![Duplicates](https://raw.githubusercontent.com/caylent/imgeda/main/docs/examples/duplicates.png)
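
Duplicate detection on perceptual hashes usually comes down to comparing hash strings. A minimal sketch, assuming records are dicts with the `path` and `phash` fields shown in the manifest example, that hashes are hex strings of equal length, and that a Hamming distance of at most 4 bits counts as a near-duplicate (that cutoff and the function names are assumptions, not imgeda's defaults):

```python
from collections import defaultdict


def hamming_distance(hash_a: str, hash_b: str) -> int:
    """Number of differing bits between two equal-length hex hash strings."""
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")


def exact_duplicate_groups(records: list[dict]) -> list[list[str]]:
    """Group paths that share an identical hash; keep groups of 2 or more."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["phash"]].append(rec["path"])
    return [paths for paths in groups.values() if len(paths) > 1]


def near_duplicate_pairs(records: list[dict], max_distance: int = 4):
    """Pairs whose hashes differ by 1..max_distance bits (O(n^2) comparison)."""
    pairs = []
    for i, a in enumerate(records):
        for b in records[i + 1:]:
            distance = hamming_distance(a["phash"], b["phash"])
            if 0 < distance <= max_distance:
                pairs.append((a["path"], b["path"], distance))
    return pairs
```

The pairwise loop is quadratic, so it is only practical for small datasets; a real tool would need indexing or bucketing to scale.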

## CLI Reference

### `imgeda scan <DIR>`

Scan a directory of images and produce a JSONL manifest.

```
Options:
  -o, --output PATH           Output manifest path [default: imgeda_manifest.jsonl]
  --workers INTEGER           Parallel workers [default: CPU count]
  --checkpoint-every INTEGER  Flush interval [default: 500]
  --resume / --no-resume      Auto-resume from existing manifest [default: resume]
  --force                     Force full rescan (ignore existing manifest)
  --skip-pixel-stats          Metadata-only scan (faster)
  --no-hashes                 Skip perceptual hashing
  --extensions TEXT           Comma-separated extensions to include
  --dark-threshold FLOAT      Dark image threshold [default: 40.0]
  --overexposed-threshold FLOAT  Overexposed threshold [default: 220.0]
  --artifact-threshold FLOAT  Border artifact threshold [default: 50.0]
  --max-image-dim INTEGER     Downsample threshold for pixel stats [default: 2048]
```
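
To illustrate what `--max-image-dim` might mean in practice, here is a sketch that assumes the option caps the longest side and preserves aspect ratio before pixel statistics are computed. That interpretation and the helper name are assumptions; the option list above only calls it a downsample threshold.

```python
def downsampled_size(width: int, height: int, max_image_dim: int = 2048) -> tuple[int, int]:
    """Target size with the longest side capped at max_image_dim."""
    longest = max(width, height)
    if longest <= max_image_dim:
        return width, height  # small enough, keep the original size
    scale = max_image_dim / longest
    return max(1, round(width * scale)), max(1, round(height * scale))
```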

### `imgeda info -m <MANIFEST>`

Print a Rich-formatted dataset summary.

### `imgeda check <SUBCOMMAND> -m <MANIFEST>`

Subcommands: `corrupt`, `exposure`, `artifacts`, `duplicates`, `all`

### `imgeda plot <SUBCOMMAND> -m <MANIFEST>`

Subcommands: `dimensions`, `file-size`, `aspect-ratio`, `brightness`, `channels`, `artifacts`, `duplicates`, `all`

```
Common options:
  -o, --output PATH     Output directory [default: ./plots]
  --format TEXT         Output format: png, pdf, svg [default: png]
  --dpi INTEGER         DPI for output [default: 150]
  --sample INTEGER      Sample N records for large datasets
```

### `imgeda report -m <MANIFEST>`

Generate a single-page HTML report with embedded plots and statistics.

### `imgeda diff --old <MANIFEST> --new <MANIFEST>`

Compare two manifests and show added, removed, and changed images with field-level diffs.

```
Options:
  -o, --out PATH    Output JSON path (optional)
```
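
Conceptually, a manifest diff is a keyed comparison of two record sets. A minimal sketch, assuming each manifest has already been loaded into a dict that maps `path` to its record (the function name and the output shape are hypothetical and may differ from the JSON that `imgeda diff` writes):

```python
def diff_records(old: dict[str, dict], new: dict[str, dict]) -> dict:
    """Compare two path -> record mappings with field-level detail."""
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    changed = {}
    for path in sorted(old.keys() & new.keys()):
        before, after = old[path], new[path]
        # Map each differing field to its (old value, new value) pair.
        fields = {
            key: (before.get(key), after.get(key))
            for key in sorted(before.keys() | after.keys())
            if before.get(key) != after.get(key)
        }
        if fields:
            changed[path] = fields
    return {"added": added, "removed": removed, "changed": changed}
```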

### `imgeda gate -m <MANIFEST> -p <POLICY>`

Evaluate a manifest against a YAML quality policy. Exit code 0 = pass, 2 = fail.

```
Options:
  -o, --out PATH    Output JSON path (optional)
```

Example policy (`policy.yml`):

```yaml
max_corrupt_pct: 1.0
max_overexposed_pct: 5.0
max_underexposed_pct: 5.0
max_duplicate_pct: 10.0
min_images_total: 100
```
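
The rules read as percentage ceilings plus a minimum image count. A sketch of how such a policy could be evaluated, using the rule names from the example above; the `counts` keys, the function names, and the failure messages are hypothetical, and imgeda's own evaluation may differ:

```python
# Rule names come from the example policy; the count keys are assumptions.
RULES = {
    "max_corrupt_pct": "corrupt",
    "max_overexposed_pct": "overexposed",
    "max_underexposed_pct": "underexposed",
    "max_duplicate_pct": "duplicate",
}


def evaluate_policy(policy: dict, counts: dict) -> list[str]:
    """Return a human-readable failure for every violated rule."""
    total = counts.get("total", 0)
    failures = []
    minimum = policy.get("min_images_total")
    if minimum is not None and total < minimum:
        failures.append(f"min_images_total: {total} < {minimum}")
    for rule, count_key in RULES.items():
        if rule not in policy or total == 0:
            continue
        pct = 100.0 * counts.get(count_key, 0) / total
        if pct > policy[rule]:
            failures.append(f"{rule}: {pct:.2f}% > {policy[rule]}%")
    return failures


def exit_code(failures: list[str]) -> int:
    """Mirror the documented convention: 0 = pass, 2 = fail."""
    return 2 if failures else 0
```

In a CI job, the return value of `exit_code` would be passed to `sys.exit` so that a failing gate stops the pipeline.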

### `imgeda export parquet -m <MANIFEST> -o <OUTPUT>`

Export manifest to Parquet format with flattened nested fields. Requires `pip install imgeda[parquet]`.
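
Flattening means turning nested objects into top-level columns. A minimal sketch of the idea, assuming an underscore separator and nested dicts only; the field names in the usage note are hypothetical, and the column names imgeda actually emits may differ:

```python
def flatten(record: dict, parent_key: str = "", sep: str = "_") -> dict:
    """Recursively flatten nested dicts into a single-level dict."""
    flat = {}
    for key, value in record.items():
        name = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat
```

With this helper, `{"path": "a.jpg", "pixel_stats": {"mean": {"r": 1.0}}}` becomes `{"path": "a.jpg", "pixel_stats_mean_r": 1.0}`, a shape that maps directly onto Parquet columns.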

## Architecture

See [docs/architecture.md](https://github.com/caylent/imgeda/blob/main/docs/architecture.md) for detailed system diagrams including the local CLI flow, AWS serverless flow, CI/CD quality gate flow, and full module dependency graph.

## Manifest Format

The manifest is a JSONL file (one JSON object per line):

- **Line 1**: Metadata header (input directory, scan settings, schema version)
- **Lines 2+**: One `ImageRecord` per image with all computed fields

```jsonl
{"__manifest_meta__": true, "input_dir": "./images", "created_at": "2026-02-17T12:00:00", ...}
{"path": "./images/cat.jpg", "width": 500, "height": 375, "format": "JPEG", "phash": "a1b2c3d4", ...}
```

The manifest is append-only and crash-tolerant. Resume is keyed on `(path, file_size, mtime)`, so a file whose size or modification time has changed is automatically re-analyzed.

## Performance

Tested on a 10-core Apple M1 Pro with SSD:

| Operation | 3,680 images |
|-----------|-------------|
| Full scan (metadata + pixels + hashes) | ~8s |
| Plot generation | ~3s |
| HTML report | ~4s |

The tool is designed to handle 100K+ image datasets with batched processing, memory-bounded futures, and automatic plot adaptations for large datasets.
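
"Memory-bounded futures" generally means capping how many tasks are submitted but not yet collected, so a 100K-item input never creates 100K pending futures at once. A sketch of that pattern with `concurrent.futures`; the helper names and the `max_in_flight` parameter are hypothetical, and this is not imgeda's actual scheduler:

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def bounded_map(executor, fn, items, max_in_flight=64):
    """Yield fn(item) results while keeping at most max_in_flight futures alive.

    Results arrive in completion order, not input order.
    """
    pending = set()
    for item in items:
        pending.add(executor.submit(fn, item))
        if len(pending) >= max_in_flight:
            # Block until at least one task finishes, then drain the finished ones.
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for future in done:
                yield future.result()
    # Input exhausted: collect whatever is still running.
    done, _ = wait(pending)
    for future in done:
        yield future.result()


def demo_squares(n: int) -> list[int]:
    """Usage example with threads; a ProcessPoolExecutor works the same way
    as long as fn is a picklable top-level function."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        return sorted(bounded_map(pool, lambda x: x * x, range(n), max_in_flight=5))
```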

## Development

```bash
git clone https://github.com/caylent/imgeda.git
cd imgeda
uv sync --all-extras
uv run pytest
uv run ruff check src/ tests/
```

## License

MIT
