Metadata-Version: 2.4
Name: mosaic-temporal-gpu
Version: 0.1.0
Summary: High-speed video mosaic on CUDA: NVDEC/NVENC + torch-lap Hungarian (sibling of mosaic-temporal)
Project-URL: Homepage, https://github.com/hinanohart/mosaic-temporal-gpu
Project-URL: Documentation, https://github.com/hinanohart/mosaic-temporal-gpu/blob/main/DESIGN.md
Project-URL: Repository, https://github.com/hinanohart/mosaic-temporal-gpu
Project-URL: Issues, https://github.com/hinanohart/mosaic-temporal-gpu/issues
Project-URL: Changelog, https://github.com/hinanohart/mosaic-temporal-gpu/blob/main/CHANGELOG.md
Project-URL: Sibling, https://github.com/hinanohart/mosaic-temporal
Author: hinanohart
License-Expression: MIT
License-File: LICENSE
License-File: NOTICE
Keywords: cuda,gpu,hungarian,mosaic,nvdec,nvenc,torch,video
Classifier: Development Status :: 3 - Alpha
Classifier: Environment :: GPU :: NVIDIA CUDA :: 12
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: POSIX :: Linux
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Multimedia :: Video
Classifier: Topic :: Scientific/Engineering :: Image Processing
Classifier: Typing :: Typed
Requires-Python: >=3.10
Requires-Dist: mosaicraft<1.0,>=0.3.1
Requires-Dist: numpy>=1.24
Requires-Dist: torch-linear-assignment>=0.0.6
Requires-Dist: torch>=2.2
Requires-Dist: torchvision>=0.17
Provides-Extra: dev
Requires-Dist: av>=12; extra == 'dev'
Requires-Dist: bandit>=1.7; extra == 'dev'
Requires-Dist: hypothesis>=6.100; extra == 'dev'
Requires-Dist: mypy>=1.8; extra == 'dev'
Requires-Dist: pytest-cov>=4; extra == 'dev'
Requires-Dist: pytest-timeout>=2.2; extra == 'dev'
Requires-Dist: pytest>=7; extra == 'dev'
Requires-Dist: ruff>=0.4; extra == 'dev'
Requires-Dist: scikit-image>=0.22; extra == 'dev'
Provides-Extra: fast
Requires-Dist: av>=12; extra == 'fast'
Provides-Extra: nvdec
Requires-Dist: av>=12; extra == 'nvdec'
Description-Content-Type: text/markdown

# mosaic-temporal-gpu

> **The high-speed sibling of [mosaic-temporal](https://github.com/hinanohart/mosaic-temporal).**
> NVDEC/NVENC + torch-lap Hungarian + on-GPU torch kernels (Triton port queued for v0.2).

> ⚠️ **Status: 0.1.0 release candidate.** Public API (`run_pipeline`),
> kernels, solver, NVDEC/NVENC bridge, config schema, and CPU-host tests
> are in place. The remaining work toward 0.1.0 final is the parity-gate
> CI on a CUDA runner and the bench-spike sign-off on Kaggle T4 — see
> [Roadmap](#roadmap). The Quickstart below is the supported API; the
> 3-stream CUDA-overlap optimization that motivated this repo lands in
> 0.2 without changing the signature.

[![CI](https://github.com/hinanohart/mosaic-temporal-gpu/actions/workflows/ci.yml/badge.svg)](https://github.com/hinanohart/mosaic-temporal-gpu/actions/workflows/ci.yml)
[![License](https://img.shields.io/badge/license-MIT-blue.svg)](LICENSE)
[![Python](https://img.shields.io/badge/python-3.10%2B-blue.svg)](https://www.python.org)

## Positioning

This is the **high-speed build** of the video mosaic pipeline. The portable
sibling `mosaic-temporal` keeps a CPU fallback at every step for users without
a GPU; **this repo drops every fallback** so the hot path can be NVDEC →
torch kernels → torch-lap → NVENC end-to-end (the Triton kernels are queued
for v0.2). The cost is a hard requirement: **an NVIDIA GPU with CUDA ≥ 12.0**.
The benefit is higher throughput on long clips.

| Feature                | mosaic-temporal           | mosaic-temporal-gpu (high-speed)       |
| ---------------------- | ------------------------- | -------------------------------------- |
| Hungarian assignment   | scipy CPU (default)       | torch-linear-assignment (only)         |
| Cost matrix            | numpy CPU loop            | torch.cdist on CUDA (Triton in v0.2)   |
| Oklab grid mean        | numpy                     | torch view+reduce on CUDA (Triton v0.2) |
| Video I/O              | cv2 PNG round-trip        | PyAV NVDEC → ndarray → NVENC           |
| RAFT optical flow      | CPU torch (slow)          | not in v0.1.0 — queued for v0.3        |
| Bit-exact CPU output   | yes (`bit-exact-cpu`)     | no — parity gated at SSIM ≥ 0.98       |
| Runtime requirement    | none                      | NVIDIA GPU with CUDA ≥ 12.0            |

If you need the CPU fallback, the bit-exact reference, or Windows/macOS support,
use [mosaic-temporal](https://github.com/hinanohart/mosaic-temporal). If you
have a CUDA GPU and want speed, you're in the right place.

## Install (once 0.1.0 ships to PyPI)

`mosaic-temporal-gpu` requires a CUDA build of PyTorch. Install torch first
from the official CUDA wheel index, then install this package:

```bash
# 1. CUDA 12.1 wheels (adjust cu121 to your CUDA version)
pip install --index-url https://download.pytorch.org/whl/cu121 torch torchvision

# 2. Pure compute kernels only (no video I/O — no PyAV)
pip install mosaic-temporal-gpu

# 2'. With NVDEC/NVENC video I/O (needs a cuvid-enabled FFmpeg + PyAV).
#     The PyPI `av` wheel is software-only — see benchmarks/README.md for
#     the FFmpeg+PyAV self-build recipe. The `[nvdec]` extra declares the
#     `av>=12` dependency; it does NOT build FFmpeg for you.
pip install "mosaic-temporal-gpu[nvdec]"
```

If you skip step 1, pip resolves `torch` from PyPI, where the default wheel
may be CPU-only or built against a CUDA version your driver does not support.
With a CPU-only build every CUDA-only call fails at runtime, because this
package deliberately ships no CPU fallback. NVIDIA driver ≥ R535 and
CUDA ≥ 12.0 are prerequisites. Until 0.1.0 ships to PyPI, install from source:

```bash
git clone https://github.com/hinanohart/mosaic-temporal-gpu
cd mosaic-temporal-gpu
pip install --index-url https://download.pytorch.org/whl/cu121 torch torchvision
pip install -e ".[dev]"
```
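
After either install path, it can help to confirm that the installed `torch` is actually a CUDA build before running anything else. The helper below is a minimal sketch, not part of this package: the name `describe_torch_build` is hypothetical, and it relies only on `torch.version.cuda` and `torch.cuda.is_available()`.

```python
def describe_torch_build() -> str:
    """Return a one-line description of the installed torch build."""
    try:
        import torch
    except ImportError:
        return "torch is not installed"
    # torch.version.cuda is None for builds compiled without CUDA support.
    if torch.version.cuda is None:
        return f"torch {torch.__version__} is not a CUDA build"
    # A CUDA build can still fail to see a GPU (missing or outdated driver).
    if not torch.cuda.is_available():
        return (
            f"torch {torch.__version__} targets CUDA {torch.version.cuda}, "
            "but no usable GPU/driver was found"
        )
    name = torch.cuda.get_device_name(0)
    return f"torch {torch.__version__} with CUDA {torch.version.cuda} on {name}"


print(describe_torch_build())
```

Only the last message indicates an environment in which the CUDA-only code paths can run.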

## Quickstart

```python
from pathlib import Path
from mosaic_temporal_gpu import run_pipeline

stats = run_pipeline(
    input_video=Path("input.mp4"),
    output_video=Path("output.mp4"),
    tile_dir=Path("tiles/"),       # keyword-only
    fps=30,                        # NVENC output frame rate (input fps
                                   # auto-detection lands in 0.2)
    cq=19,                         # h264_nvenc constant-quality (lower = better)
)
print(stats)
# {"frames": 720, "width": 1920, "height": 1080,
#  "fps": 30, "active_codec": "h264_cuvid"}
```

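Pass a `D1Config` via `config=` to set the preset explicitly (the default is
`vivid_b`):

```python
from pathlib import Path
from mosaic_temporal_gpu import D1Config, run_pipeline
cfg = D1Config.from_preset("vivid_b")  # the only preset shipped in 0.1.0
run_pipeline(input_video=Path("input.mp4"), output_video=Path("output.mp4"), tile_dir=Path("tiles/"), config=cfg)
```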

For 0.1.0 we ship the **`vivid_b`** preset only (`saturation_boost=2.10`,
`mkl_hybrid`, `neighbor_swap_rounds=5`). Additional presets and a CLI
front-end are deferred to 0.2 to keep the launch surface narrow.

The `active_codec` field in the return value is how you confirm that NVDEC
engaged on the decode side (`"h264_cuvid"` / `"hevc_cuvid"`). The reader does
not fall back to software decoding silently: if the hardware decoder is not
active, it raises before any frame is processed; see the R8 assertion in
`io/nvdec.py`.
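
If you want the same guarantee in your own scripts, a caller-side check on the returned stats could look like the sketch below. The helper name `require_cuvid` is hypothetical, and the dict shape is assumed to match the Quickstart output shown above.

```python
def require_cuvid(stats: dict) -> str:
    """Return the active decoder name, or raise if it is not a *_cuvid codec."""
    codec = stats.get("active_codec", "")
    if not codec.endswith("_cuvid"):
        raise RuntimeError(f"expected a *_cuvid decoder, got {codec!r}")
    return codec


# Stats dict copied from the Quickstart example output.
stats = {"frames": 720, "width": 1920, "height": 1080,
         "fps": 30, "active_codec": "h264_cuvid"}
print(require_cuvid(stats))  # h264_cuvid
```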

## What works *today* (component-level)

```python
import torch
from mosaic_temporal_gpu import D1Config
from mosaic_temporal_gpu.kernels.cost_matrix import compute_cost_matrix_gpu
from mosaic_temporal_gpu.solvers.torch_lap import TorchLapSolver

cfg = D1Config.from_preset("vivid_b")          # ✅ schema + preset
# `cells` and `tiles` are assumed to be CUDA tensors prepared by the caller.
cost = compute_cost_matrix_gpu(cells, tiles)   # ✅ GPU cost matrix (CUDA req'd)
assignment = TorchLapSolver().solve(cost)      # ✅ GPU Hungarian
```

`NvdecReader` / `NvencWriter` are likewise importable, and their error paths
are tested on a CPU host; a full round-trip needs CUDA.
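
For readers new to the problem these two components solve, here is a CPU-only illustration in plain Python. It is a sketch, not this package's code: it assumes the cost is the plain Euclidean distance between per-cell and per-tile mean colors (which is what `torch.cdist` computes by default), and it replaces the Hungarian solver with a brute-force search that is only feasible for tiny inputs. The function names and sample values are made up for the example.

```python
import math
from itertools import permutations


def cost_matrix(cells, tiles):
    """Pairwise Euclidean distances: cost[i][j] = ||cells[i] - tiles[j]||."""
    return [[math.dist(c, t) for t in tiles] for c in cells]


def brute_force_assignment(cost):
    """Return, per cell, the tile index that minimizes the total cost.

    Tries every permutation (O(n!)); the Hungarian algorithm finds the same
    optimum in polynomial time, which is why a real solver is used instead.
    """
    n = len(cost)
    best = min(
        permutations(range(n)),
        key=lambda perm: sum(cost[i][perm[i]] for i in range(n)),
    )
    return list(best)


# Three grid cells and three candidate tiles, each a mean color triple.
cells = [(0.9, 0.1, 0.0), (0.2, 0.0, 0.1), (0.5, 0.5, 0.5)]
tiles = [(0.5, 0.4, 0.5), (1.0, 0.0, 0.0), (0.1, 0.0, 0.0)]

cost = cost_matrix(cells, tiles)
print(brute_force_assignment(cost))  # [1, 2, 0]
```

Each tile is used exactly once, so the result is a one-to-one matching rather than a per-cell nearest-neighbor lookup.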

## Parity guarantee (planned, not yet wired)

The release contract is: for each frame of a fixed 24-frame synthetic clip,
SSIM(`mosaic_temporal_gpu` candidate, `mosaicraft` CPU reference) **≥ 0.98**.
The test exists (`tests/test_parity_vs_mosaicraft.py`, `@pytest.mark.parity`),
but GitHub's free runners have no CUDA, so the parity job is **not in CI
today** — it runs locally on a CUDA host with `pytest -m parity`. A scheduled
GPU runner (Modal / RunPod) is queued for 0.1.0 final. Output is not bit-exact
(floating-point addition is non-associative, and GPU reductions do not fix the
summation order), so the SSIM gate is the operative contract.
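
To make the shape of the gate concrete, the sketch below computes a simplified SSIM and applies the per-frame threshold. It is an illustration under stated assumptions, not the project's test: it uses a single global window over flat pixel sequences in the range 0 to 1, whereas library implementations such as scikit-image's `structural_similarity` use sliding windows. The function names are hypothetical.

```python
def global_ssim(x, y, data_range=1.0):
    """Single-window SSIM over two equal-length flat pixel sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) * (a - mx) for a in x) / n
    vy = sum((b - my) * (b - my) for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    # Standard stabilizing constants with K1 = 0.01 and K2 = 0.03.
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    numerator = (2 * mx * my + c1) * (2 * cov + c2)
    denominator = (mx * mx + my * my + c1) * (vx + vy + c2)
    return numerator / denominator


def passes_parity(candidate_frames, reference_frames, threshold=0.98):
    """True only if every frame pair meets the SSIM threshold."""
    return all(
        global_ssim(c, r) >= threshold
        for c, r in zip(candidate_frames, reference_frames)
    )


reference = [[0.0, 0.25, 0.5, 0.75, 1.0], [0.1, 0.2, 0.3, 0.4, 0.5]]
inverted = [list(reversed(reference[0])), reference[1]]

print(passes_parity(reference, reference))  # True
print(passes_parity(inverted, reference))   # False
```

Because the gate takes the worst frame rather than the average, a single badly mismatched frame fails the whole clip.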

## Repository layout

```
src/mosaic_temporal_gpu/
  __init__.py            # version, public API (run_pipeline, D1Config, exceptions)
  _version.py            # single source of truth
  config.py              # D1Config schema (mirror of mosaic-temporal's GPU-valid subset)
  kernels/
    cost_matrix.py       # GPU cost matrix (torch.cdist on CUDA; Triton port = v0.2)
    oklab_grid.py        # GPU Oklab grid mean (torch view+reduce; Triton port = v0.2)
  solvers/
    torch_lap.py         # torch-linear-assignment wrapper
  io/
    nvdec.py             # PyAV NVDEC reader
    nvenc.py             # PyAV NVENC writer
  pipeline.py            # end-to-end run_pipeline (single CUDA stream;
                         # 3-stream overlap is v0.2)
tests/
  test_parity_vs_mosaicraft.py   # SSIM ≥ 0.98 gate (xfail until CUDA CI)
  test_pipeline_smoke.py         # run_pipeline public-API contract
  test_kernel_shapes.py
  test_solver_torch_lap.py
  test_io_bridges.py
  test_config_schema.py
  test_version_smoke.py
```

## Roadmap

- **0.1.0** — `run_pipeline()` shipped (single-stream NVDEC → mosaic → NVENC); parity gate green on a CUDA runner (Modal / RunPod queued); bench-spike sign-off on Kaggle T4.
- **0.2** — 3-stream CUDA overlap (`decode | compute | encode`); DLPack zero-copy on both ends of the video bridge; Triton kernels for cost matrix and Oklab grid (replace `torch.cdist` / `torch.view+mean` once we benchmark a real win); CLI front-end; additional presets.
- **0.3** — RAFT optical flow on GPU for temporal coherence; `flow_warp` module.
- **1.0** — Stable parity gate across two driver/CUDA upgrades; one breaking-change cycle behind us.

## Relation to siblings

- [mosaicraft](https://github.com/hinanohart/mosaicraft) (image mosaic, pure
  numpy/cv2/scipy) — used here as the CPU reference for the parity gate
  and for the Oklab / MKL OT / Laplacian primitives.
- [mosaic-temporal](https://github.com/hinanohart/mosaic-temporal) (video
  mosaic, CPU/GPU dual path) — the portable sibling. Same `D1Config` surface,
  so config files port between the two.

## License

MIT. See [LICENSE](LICENSE).
