Metadata-Version: 2.4
Name: lmms-video-utils
Version: 0.1.0
Summary: Codec-stream video frontend for LMMs (LLaVA-OneVision-2 style).
Author: LMMs-Lab
Requires-Python: >=3.10
Requires-Dist: av>=12
Requires-Dist: numpy>=1.26
Requires-Dist: opencv-python>=4.8
Requires-Dist: pillow
Requires-Dist: torch>=2.4
Provides-Extra: all
Requires-Dist: torchcodec>=0.4; extra == 'all'
Provides-Extra: dev
Requires-Dist: pandas; extra == 'dev'
Requires-Dist: pytest; extra == 'dev'
Provides-Extra: gpu
Requires-Dist: torchcodec>=0.4; extra == 'gpu'
Provides-Extra: mv
Description-Content-Type: text/markdown

# lmms-video-utils

A codec-stream-style video frontend for large multimodal models, modeled
after LLaVA-OneVision-2's codec tokenization. Each video is decoded and
partitioned into adaptive GOPs. Each GOP then emits one I-canvas plus a
small number of P-canvases that pack the highest-scoring 2x2 patch blocks
from later frames. The output keeps a `patch_positions` table aligned with
2D-MRoPE block layouts, so downstream VLMs can place every patch back at
its source `(t, h, w)`.
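
To make the block-selection idea concrete, here is a minimal NumPy sketch of frame-diff scoring over one GOP. It is an illustration under stated assumptions, not this package's implementation: the helper name `top_blocks`, the grayscale `(T, H, W)` input, the mean-absolute-difference score, and the block-unit coordinates are all hypothetical, and the real `patch_positions` layout may differ.

```python
import numpy as np


def top_blocks(frames: np.ndarray, k: int, block: int = 28) -> np.ndarray:
    """Return the k highest-scoring blocks as (t, h, w) rows.

    Hypothetical helper. `frames` is a (T, H, W) grayscale array where
    frame 0 plays the role of the I-frame. `block` is the side length in
    pixels of one 2x2 patch block (28 assumes 14-pixel patches).
    """
    t, h, w = frames.shape
    hb, wb = h // block, w // block
    # Crop so the frame divides evenly into blocks.
    cropped = frames[:, : hb * block, : wb * block].astype(np.float32)
    # Absolute difference against the previous frame, for frames 1..T-1.
    diff = np.abs(np.diff(cropped, axis=0))
    # Average the difference inside each block: result is (T-1, hb, wb).
    scores = diff.reshape(t - 1, hb, block, wb, block).mean(axis=(2, 4))
    # Indices of the k largest scores, highest first.
    flat = np.argsort(scores, axis=None)[::-1][:k]
    tt, hh, ww = np.unravel_index(flat, scores.shape)
    # diff index 0 corresponds to frame 1, so shift t by one.
    return np.stack([tt + 1, hh, ww], axis=1)
```

Each returned row records where a block came from, which is the kind of information a position table needs in order to place packed patches back at their source location.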

## Install

```bash
pip install -e .
pip install -e ".[gpu]"    # add TorchCodec (quoted so zsh does not expand the brackets)
pip install -e ".[all]"    # everything
```

PyAV is the default backend for portability; install `[gpu]` for TorchCodec.
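
If you are unsure which decoder packages are present in an environment, a small standard-library check like the one below can help. This is a sketch and not part of the package: `installed_backends` and the labels `"pyav"` and `"torchcodec"` are hypothetical names chosen for illustration. The only facts it relies on are that PyAV is imported as `av` and TorchCodec as `torchcodec`.

```python
from importlib.util import find_spec


def _is_installed(module_name: str) -> bool:
    # find_spec returns None when a top-level module cannot be found.
    return find_spec(module_name) is not None


def installed_backends(is_installed=_is_installed) -> list[str]:
    """Hypothetical helper: list decoder packages that can be imported."""
    candidates = [("pyav", "av"), ("torchcodec", "torchcodec")]
    return [label for label, module in candidates if is_installed(module)]


if __name__ == "__main__":
    print(installed_backends())
```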

## Three usage levels

Level 1 - direct fetch:

```python
from lmms_video_utils import fetch_codec_video
out = fetch_codec_video("clip.mp4", target_canvas=8)
print(out.canvases.shape, out.patch_positions.shape)
```

Level 2 - qwen-vl-utils-like:

```python
from lmms_video_utils import process_video_info
messages = [{"role": "user", "content": [
    {"type": "video", "video": "clip.mp4",
     "video_start": 0.0, "video_end": 5.0,
     "fps": 2.0, "max_pixels": 100_000},
    {"type": "text", "text": "describe"},
]}]
_, videos = process_video_info(messages, video_backend="codec")
```

Inline keys recognized on each `video` / `video_url` item map to
`CodecConfig` fields:

| Inline key (qwen-vl-utils) | CodecConfig field |
| --- | --- |
| `video_start` | `start_time` |
| `video_end`   | `end_time` |
| `fps`         | `target_fps` |
| `nframes`     | `max_frames` |
| `max_pixels`  | `max_pixels` |
| `min_pixels`  | `min_pixels` |

Inline keys take precedence over defaults passed as keyword arguments to
`process_video_info(messages, **defaults)`. The `total_pixels` key is
silently ignored.
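
The precedence rule can be sketched in plain Python. The mapping is copied from the table above. The helper `resolve_config` is hypothetical, and the sketch assumes that keyword defaults are spelled with `CodecConfig` field names. The library's actual merging code may differ.

```python
# Inline key -> CodecConfig field, as listed in the table above.
INLINE_TO_CONFIG = {
    "video_start": "start_time",
    "video_end": "end_time",
    "fps": "target_fps",
    "nframes": "max_frames",
    "max_pixels": "max_pixels",
    "min_pixels": "min_pixels",
}


def resolve_config(item: dict, **defaults) -> dict:
    """Hypothetical helper: merge keyword defaults with inline keys."""
    config = dict(defaults)
    for inline_key, field in INLINE_TO_CONFIG.items():
        if inline_key in item:
            # An inline key overrides the matching default.
            config[field] = item[inline_key]
    # Keys outside the mapping (such as total_pixels) are dropped.
    return config


item = {"type": "video", "video": "clip.mp4", "fps": 2.0, "total_pixels": 10**6}
resolved = resolve_config(item, target_fps=1.0, max_frames=32)
print(resolved)  # {'target_fps': 2.0, 'max_frames': 32}
```

Here the inline `fps` replaces the `target_fps` default, `max_frames` keeps its default, and `total_pixels` never reaches the result.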

Level 3 - reader object:

```python
from lmms_video_utils import CodecVideoReader
reader = CodecVideoReader("clip.mp4")
for i in range(len(reader)):
    canvas = reader[i]
```

## Roadmap

| Feature | Status |
| --- | --- |
| Uniform GOP, frame-diff scoring, PyAV/TorchCodec backends | implemented |
| MV-warp residual scoring (`score_mode="mvwarp"`) | implemented |
| Bit-cost-adaptive GOP (`gop_mode="bitcost"`) + per-frame bit-cost score multiplier | implemented |
| qwen-vl-utils style per-message overrides (`process_video_info`) | implemented |
| On-disk caching, batched GPU scoring | planned |
| Optional patched-libav backend for true codec residual | planned |
