Metadata-Version: 2.3
Name: evaluators
Version: 2.1.4
Summary: Various scene understanding and perception evaluation metrics.
Author: Kurt Stolle
Author-email: Kurt Stolle <kurt@computer.org>
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Scientific/Engineering :: Image Recognition
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Operating System :: OS Independent
Requires-Dist: torch>2.7
Requires-Dist: torchvision
Requires-Dist: torchmetrics
Requires-Dist: tensordict>=0.11.0
Requires-Dist: numpy>=2.0.0
Requires-Dist: matplotlib>=3.9.0
Requires-Dist: rich>=13.9.4
Requires-Dist: pydantic
Requires-Dist: opencv-python>=4.13.0.92
Requires-Dist: omegaconf>=2.3.0
Requires-Dist: hydra-zen>=0.16.0
Requires-Dist: blosc2>=4.1.2 ; extra == 'blosc2'
Requires-Dist: lightning>=2.6.1 ; extra == 'lightning'
Requires-Dist: lmdb>=2.2.0 ; extra == 'lmdb'
Requires-Dist: imagecodecs>=2026.3.6 ; extra == 'tiff'
Requires-Dist: tifffile>=2026.3.3 ; extra == 'tiff'
Requires-Python: >=3.13, <3.14
Provides-Extra: blosc2
Provides-Extra: lightning
Provides-Extra: lmdb
Provides-Extra: tiff
Description-Content-Type: text/markdown

# Scalable Distributed Evaluation for Computer Vision

[![PyPI](https://img.shields.io/pypi/v/evaluators)](https://pypi.org/project/evaluators/)
[![Python Version](https://img.shields.io/pypi/pyversions/evaluators)](https://pypi.org/project/evaluators/)
[![License](https://img.shields.io/pypi/l/evaluators)](https://github.com/khwstolle/evaluators/blob/main/LICENSE)
[![Build Status](https://img.shields.io/github/actions/workflow/status/khwstolle/evaluators/ci.yml)](https://github.com/khwstolle/evaluators/actions)

Evaluators is a high-throughput evaluation framework designed for large-scale computer vision research. It specializes in handling video tasks by decoupling inference I/O from metric computation.

This architecture enables offline evaluation workflows: models stream predictions to efficient storage backends (Memory Map or LMDB) during inference, and metrics are computed in a decoupled stage using distributed map-reduce logic. This approach prevents CPU-bound metric calculation from throttling GPU inference.

The key features are:

- **Zero-overhead inference.**  
  Writes predictions to disk using non-blocking I/O, allowing the inference loop to run at full GPU utilization.

- **Distributed by design.**  
  Automatically handles synchronization across multiple nodes and GPUs using `torchmetrics` and custom scheduling logic.

- **Explicit memory schemas.**  
  Uses Pydantic-based schemas to define data formats and encodings (PNG, TIFF, Raw) up front, ensuring type safety and storage efficiency.

- **Lazy loading.**  
  Supports referencing ground truth data from disk rather than duplicating it in memory caches, enabling evaluation of terabyte-scale datasets.

- **Multi-domain.**
  Includes verified implementations for:
  - Segmentation: Panoptic quality (PQ), semantic mIoU.
  - DVPS: Depth-aware video panoptic quality (DVPQ).
  - Depth: Eigen et al. metrics (AbsRel, RMSE).

- **CLI.**
  A command-line interface to inspect, index, and query saved inference results.
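
The non-blocking write path described above can be illustrated with a producer/consumer queue. This is a minimal sketch using only the standard library, not the library's actual writer: `BackgroundWriter` and its `sink` parameter are hypothetical names, and the sink here appends to a list where a real backend would write to disk.

```python
import queue
import threading


class BackgroundWriter:
    """Hypothetical illustration: hand records to a worker thread via a queue."""

    _STOP = object()  # sentinel that tells the worker thread to exit

    def __init__(self, sink, max_pending=64):
        self._sink = sink  # callable that persists one record (e.g. a disk write)
        self._queue = queue.Queue(maxsize=max_pending)
        self._thread = threading.Thread(target=self._drain, daemon=True)
        self._thread.start()

    def _drain(self):
        # Runs in the background thread and persists records in arrival order.
        while True:
            item = self._queue.get()
            if item is self._STOP:
                break
            self._sink(item)

    def write(self, record):
        # Returns as soon as the record is queued; blocks only if the queue is full.
        self._queue.put(record)

    def close(self):
        # Flush everything that is still pending, then stop the worker.
        self._queue.put(self._STOP)
        self._thread.join()


stored = []
writer = BackgroundWriter(sink=stored.append)
for frame_index in range(5):
    writer.write({"frame_index": frame_index})
writer.close()
```

The bounded queue is a deliberate choice in this sketch: if the sink falls behind, `write` eventually blocks instead of letting pending predictions grow without limit.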

---

## Installation

```shell
pip install evaluators
```

## Quick start

### Python API

The core abstractions are `MetricStream` (for writing) and `run_offline_evaluation` (for computing).

#### Step 1: Inference (online)

```python
import torch
from evaluators import MetricStream, MemorySchema, TensorField, DynamicTemporalWriter
from evaluators.metrics.domain.segmentation import SemanticMetric
```

##### 1. Configure metrics and schema

```python
# Define the source for ground truth (lazy loading).
# `CityscapesDataset` is a placeholder for your own ground-truth dataset.
dataset = CityscapesDataset(...)
metric = SemanticMetric(
    num_classes=19,
    target_source=dataset, 
)

# Define the explicit memory schema
schema = MemorySchema(fields={
    "sem_seg": TensorField(dtype="int64", shape=(1024, 2048)),
    "sequence_id": TensorField(dtype="int64", shape=()),
    "frame_index": TensorField(dtype="int64", shape=()),
})
```
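
Because the schema fixes dtype and shape up front, the raw storage cost per frame can be estimated before any inference runs. The sketch below is plain Python and does not use the library; `DTYPE_BYTES` and `raw_bytes_per_frame` are hypothetical helpers, and the field layout simply mirrors the schema above.

```python
import math

# Bytes per element for a few common dtypes (illustrative subset).
DTYPE_BYTES = {"uint8": 1, "int32": 4, "int64": 8, "float32": 4}


def raw_bytes_per_frame(fields):
    """Sum the uncompressed size of every field stored for one frame."""
    return sum(DTYPE_BYTES[dtype] * math.prod(shape) for dtype, shape in fields.values())


# Same fields as the schema above, written as (dtype, shape) pairs.
fields = {
    "sem_seg": ("int64", (1024, 2048)),
    "sequence_id": ("int64", ()),
    "frame_index": ("int64", ()),
}
per_frame = raw_bytes_per_frame(fields)
print(f"{per_frame / 2**20:.2f} MiB per frame")
```

For this schema the estimate is about 16 MiB per frame, almost all of it from the `int64` label map. Numerically, a 19-class label map would also fit in a one-byte dtype, which would cut the raw size by a factor of eight.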

##### 2. Initialize stream and writer

```python
# Create a writer (backend)
writer = DynamicTemporalWriter(output_dir="./inference_cache/semantic", schema=schema)

# Create a stream and bind the writer
stream = MetricStream(
    metrics=[metric],
    name="semantic",
    schema=schema
)
stream.bind(writer)
```

##### 3. Run inference loop

```python
for batch in dataloader:
    # Model forward pass
    preds = model(batch["image"])

    # Push to stream (non-blocking)
    stream.update(
        batch={
            "sem_seg": preds,
            "sequence_id": batch["sequence_id"],
            "frame_index": batch["frame_index"]
        }
    )
    
# Finalize
writer.close()
```

> Note: `sequence_id` and `frame_index` must be `torch.int64`.

##### 4. Finalize and compute

```python
from evaluators import run_offline_evaluation

# Syncs workers, builds catalog, and runs metrics
results = run_offline_evaluation(
    metrics=[metric],
    artifact_dir="./inference_cache/semantic"
)
print(results["SemanticMetric"]["mIoU"])
```

#### Step 2: Re-evaluation (offline)

Because predictions are persisted, metrics can be re-calculated or added without re-running the model.

```python
# Run evaluation on existing artifacts; `new_metric` is any metric instance you want to add
results = run_offline_evaluation(
    metrics=[new_metric],
    artifact_dir="./inference_cache/semantic"
)
```

### CLI tools

The library includes a CLI for managing the inference cache.

_List stored sequences._

```shell
evaluators memory ls ./inference_cache
```

_Inspect specific tensor shapes._

```shell
evaluators memory inspect ./inference_cache --sequence_id frankfurt_000001
```

_Export to standard PyTorch file._

```shell
evaluators memory export ./inference_cache --sequence_id frankfurt_000001 --out my_video.pt
```

---

## Supported metrics

### Depth estimation

Implements standard error metrics (AbsRel, SqRel, RMSE, RMSElog) and threshold accuracies ($\delta < 1.25^n$).
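
For reference, the standard definitions of these metrics can be written out in a few lines. This is a plain-Python sketch over flat lists of depth values, not the library's tensor implementation; `depth_metrics` is a hypothetical helper, and it assumes predictions are strictly positive and that ground-truth values of zero or less mark invalid pixels.

```python
import math


def depth_metrics(pred, target):
    """Compute standard depth errors over paired depth values."""
    # Skip pixels without valid ground truth (assumed to be encoded as <= 0).
    pairs = [(p, t) for p, t in zip(pred, target) if t > 0]
    n = len(pairs)
    abs_rel = sum(abs(p - t) / t for p, t in pairs) / n
    sq_rel = sum((p - t) ** 2 / t for p, t in pairs) / n
    rmse = math.sqrt(sum((p - t) ** 2 for p, t in pairs) / n)
    rmse_log = math.sqrt(sum((math.log(p) - math.log(t)) ** 2 for p, t in pairs) / n)
    # Threshold accuracy: fraction of pixels with max(p/t, t/p) < 1.25^k.
    ratios = [max(p / t, t / p) for p, t in pairs]
    deltas = {f"delta{k}": sum(r < 1.25**k for r in ratios) / n for k in (1, 2, 3)}
    return {"AbsRel": abs_rel, "SqRel": sq_rel, "RMSE": rmse, "RMSElog": rmse_log, **deltas}


metrics = depth_metrics(pred=[2.0, 4.5, 12.0], target=[2.0, 5.0, 8.0])
```

In this toy example two of the three pixels fall within the first threshold, so `delta1` is 2/3 while `delta2` and `delta3` are 1.0.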

### Segmentation

- _Semantic:_ Mean intersection over union (mIoU).
- _Panoptic:_ Panoptic quality (PQ), segmentation quality (SQ), recognition quality (RQ). Supports "Thing" and "Stuff" splits.
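
As a reminder of what these quantities measure, the sketch below computes mIoU from a confusion matrix and PQ, SQ, and RQ from matched segments, following the standard definitions. It is plain Python rather than the library's implementation; `mean_iou` and `panoptic_quality` are hypothetical helpers, and segment matching (IoU above 0.5) is assumed to have happened already.

```python
def mean_iou(confusion):
    """mIoU from a square confusion matrix (rows: ground truth, columns: prediction)."""
    ious = []
    for c in range(len(confusion)):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp
        fp = sum(row[c] for row in confusion) - tp
        denom = tp + fp + fn
        if denom > 0:  # ignore classes absent from both prediction and ground truth
            ious.append(tp / denom)
    return sum(ious) / len(ious)


def panoptic_quality(matched_ious, num_fp, num_fn):
    """Return (PQ, SQ, RQ) given IoUs of matched segments and unmatched counts."""
    tp = len(matched_ious)
    denom = tp + 0.5 * num_fp + 0.5 * num_fn
    if denom == 0:
        return 0.0, 0.0, 0.0
    sq = sum(matched_ious) / tp if tp else 0.0  # mean IoU over true positives
    rq = tp / denom  # F1-style recognition score
    return sq * rq, sq, rq


miou = mean_iou([[3, 1], [1, 5]])
pq, sq, rq = panoptic_quality(matched_ious=[0.8, 0.6], num_fp=1, num_fn=1)
```

PQ factors into SQ times RQ, so a drop in PQ can be traced to either poorer masks (SQ) or missed and spurious segments (RQ).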

### Depth-aware video panoptic segmentation (DVPS)

Implements DVPQ (Depth-aware video panoptic quality). This metric evaluates spatio-temporal consistency using sliding window tubes, gated by pixel-wise depth accuracy.
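
The depth gating step can be sketched as follows, assuming the common DVPQ formulation in which predicted pixels whose absolute relative depth error exceeds a threshold are relabeled as void before video panoptic quality is computed. The names `VOID` and `gate_by_depth` are hypothetical, the data is a flat list of pixels, and keeping the label where ground-truth depth is invalid is an assumption of this sketch rather than documented library behavior.

```python
VOID = 255  # hypothetical label id for ignored pixels


def gate_by_depth(pred_labels, pred_depth, gt_depth, threshold):
    """Relabel predictions as void where absolute relative depth error exceeds threshold."""
    gated = []
    for label, d_pred, d_gt in zip(pred_labels, pred_depth, gt_depth):
        if d_gt > 0 and abs(d_pred - d_gt) / d_gt > threshold:
            gated.append(VOID)  # depth too inaccurate: pixel no longer counts as a match
        else:
            gated.append(label)  # depth acceptable (or no valid ground truth): keep label
    return gated


gated = gate_by_depth(
    pred_labels=[1, 1, 2, 2],
    pred_depth=[10.0, 14.0, 5.0, 5.2],
    gt_depth=[10.0, 10.0, 5.0, 4.0],
    threshold=0.25,
)
```

Here the second and fourth pixels have relative errors of 0.4 and 0.3, so both are gated out and the panoptic score is computed on the remaining two.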

---

## Architecture

The evaluation pipeline consists of three stages.

1. **Write.**  
   Each GPU writes predictions to locally sharded files (e.g. `.memmap` or `.lmdb`). No communication occurs.
2. **Schedule.**  
   A synchronization barrier is reached. The main process aggregates metadata manifests from all shards to build a Global Catalog. It partitions the workload (videos) among workers using a greedy strategy to balance duration.
3. **Compute.**
   Workers iterate through their assigned Virtual Sequences. Data is streamed from disk, processed by torchmetrics, and reduced globally.
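
The scheduling step can be illustrated with a longest-first greedy assignment. The exact rule used by the library is not specified here, so this is a sketch under that assumption: `greedy_partition` is a hypothetical helper, and durations are given as frame counts per sequence.

```python
import heapq


def greedy_partition(durations, num_workers):
    """Assign each sequence to the currently least-loaded worker, longest first."""
    heap = [(0, worker) for worker in range(num_workers)]  # (total frames, worker id)
    assignment = {worker: [] for worker in range(num_workers)}
    by_length = sorted(durations.items(), key=lambda kv: kv[1], reverse=True)
    for seq_id, frames in by_length:
        load, worker = heapq.heappop(heap)  # worker with the smallest load so far
        assignment[worker].append(seq_id)
        heapq.heappush(heap, (load + frames, worker))
    return assignment


durations = {"a": 90, "b": 60, "c": 50, "d": 30, "e": 10}
assignment = greedy_partition(durations, num_workers=2)
loads = {w: sum(durations[s] for s in seqs) for w, seqs in assignment.items()}
```

With these durations, worker 0 receives `a` and `d` and worker 1 receives `b`, `c`, and `e`, for 120 frames each. Sorting longest-first keeps a single long video from landing on an already busy worker late in the assignment.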

See [OFFLINE_EVALUATION.md](OFFLINE_EVALUATION.md) for detailed usage and design principles.

---

## Performance

`evaluators` is built for high-throughput I/O. The following benchmarks were conducted using the **Comprehensive Memory Evaluation Suite (CMES)** on a mobile workstation (i7-12700H, NVMe SSD).

### Throughput (FPS)

| Backend | Codec | Resolution | Write FPS | Read FPS | Stored size (vs. raw) |
| :--- | :--- | :--- | :--- | :--- | :--- |
| **Memmap** | Raw | 512x1024 | 250.8 | 501.6 | 1.0x |
| **Memmap** | Blosc | 512x1024 | 59.7 | 119.3 | 0.85x |
| **LMDB** | Raw | 512x1024 | 13.3 | 26.7 | 1.0x |
| **LMDB** | PNG | 512x1024 | 27.2 | 13.6 | 0.62x |
| **Filesystem** | TIFF | 512x1024 | 124.9 | 249.7 | 0.89x |

### Insights

- **Memmap is king for raw throughput:** The `MemmapTemporalWriter` achieves >500 FPS for read operations on 512x1024 video frames, making it ideal for fast metric computation.
- **Blosc provides the best balance:** Using `BloscCodec` with `Memmap` shrinks stored data to 0.85x of its raw size in the table above, with less CPU overhead than PNG/TIFF encoding.
- **LMDB for stability:** While slower for sequential video access, LMDB provides robust ACID compliance and is preferred for random-access metadata or small feature vectors.

> Full benchmark reports, including memory usage (Peak RSS) and CPU overhead plots, are available in `docs/benchmarks/`.

---

## Development

This project uses modern Python tooling for dependency management and quality assurance.

### Setup

We use `uv` for fast dependency management.

```shell
# Install dependencies
uv sync --all-extras
```

### Testing

Tests are managed by `pytest`.

```shell
# Run all tests
uv run pytest

# Run with coverage
uv run pytest --cov=evaluators
```

### Linting & Formatting

We use `ruff` for all linting and formatting needs.

```shell
# Check code style
uv run ruff check .

# Format code
uv run ruff format .
```

---

## Contributing

Contributions are welcome. Please ensure that:

1. New features are covered by tests.
2. Code passes all static analysis checks (`ruff`).
3. Architecture changes are discussed in an issue first.

## Acknowledgements

This work was developed at the Mobile Perception Systems (MPS) lab at Eindhoven University of Technology.

## License

This project is licensed under the MIT License.
