Metadata-Version: 2.4
Name: robovet
Version: 0.2.0
Summary: Vet your robot datasets: diagnose, repair, and quality-score LeRobot-format episode data before you waste a training run.
Author: robovet contributors
License-Expression: Apache-2.0
Project-URL: Homepage, https://github.com/CHANGEME/robovet
Project-URL: Issues, https://github.com/CHANGEME/robovet/issues
Keywords: robotics,lerobot,dataset,imitation-learning,data-quality,vla
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.24
Requires-Dist: pandas>=2.0
Requires-Dist: pyarrow>=14
Requires-Dist: typer>=0.12
Requires-Dist: rich>=13
Requires-Dist: jinja2>=3.1
Provides-Extra: video
Requires-Dist: av>=12; extra == "video"
Provides-Extra: hub
Requires-Dist: huggingface_hub>=0.23; extra == "hub"
Provides-Extra: dev
Requires-Dist: pytest>=8; extra == "dev"
Requires-Dist: av>=12; extra == "dev"
Dynamic: license-file

# robovet

**Vet your robot data.** Diagnose, repair, and quality-score LeRobot-format
datasets — *before* you waste a training run on broken episodes.

```text
$ robovet doctor ./my_dataset

  FAIL DATA-104   1 episode where metadata 'length' disagrees with the parquet
                  row count — the classic signature of a corrupted episode map.
  FAIL STATS-302  1 stat block disagrees with the actual data — every training
                  run normalizes with these numbers.
  WARN TIME-202   Loading this dataset requires tolerance_s ≥ 7.7e-03
                  (77× the default). Worst: episode 2, 7.29 ms off the grid.
  FAIL META-502   Σ episode lengths = 1086 but info.json total_frames = 1037 —
                  the metadata contradicts itself before a single file is read.

  5 fail · 4 warn · 23 pass
  UNSAFE TO TRAIN — fix the FAILs first.        (exit code 1 — CI-gate it)
```

## Why this exists

Robot learning's bottleneck moved from models to data, and the data is quietly
broken. An April 2026 audit of 10 popular open-source robot datasets found
floating-point drift that breaks video decoding after ~45 episodes, a
v2.1→v3.0 conversion bug that **silently** corrupts episode↔frame mapping
(your run "works" — the policy just learns from jumbled sequences), datasets
that only load with `tolerance_s` set to 100× the default, and **no quality
metrics anywhere**. Hugging Face's own community-dataset cleaning run tells the
same story: **111 of 240 datasets failed validation**, and that pipeline is
internal, not something you can run on yours. Meanwhile, the 2026 consensus is
that a well-curated 500-demo fine-tune beats a poorly curated one at 10× the
scale: curation tooling is the gap, not model size.

Every check in robovet maps to a documented, real-world failure. The receipts
— issue numbers, audit findings, papers — live in [PAIN.md](https://github.com/RonaldSit/robovet/blob/main/PAIN.md).

## Try it in 30 seconds (no robot required)

```bash
pip install "robovet[video]"   # quoted so zsh doesn't glob the brackets

robovet demo ./demo          # synthetic SO-100-style dataset, 10 real-world
                             # defect classes injected (each tagged with the
                             # GitHub issue it reproduces)
robovet demo ./demo3 --v3    # same idea in the v3.0 shared-file layout
robovet doctor ./demo        # catches all of them; exit 1
robovet fix    ./demo --apply  # repairs the metadata class; .bak backups
robovet doctor ./demo        # metadata FAILs gone
robovet report ./demo -o report.html   # one self-contained, shareable page
```

`robovet demo ./demo --clean` builds the same dataset with zero defects, so
you can see what all-green looks like.

## Vet a Hub dataset *before* downloading it

```bash
pip install "robovet[hub]"
robovet doctor hf://lerobot/svla_so100_pickplace
```

Fetches **only `meta/`** (a few MB), then runs every metadata-level check:
structure, stats sanity, and the new `META-5xx` ledger cross-checks
(episode↔frame index math, Σlengths vs counters, per-episode stats freshness,
video time windows). The lerobot#2401 corruption class is visible from metadata alone, so
you find out a 4 GB dataset is broken before spending the bandwidth. Honest
scope: the verdict says **META CLEAN**, never CLEAN — values, timestamps and
video decode still need the full local doctor. `--meta-only` works on local
paths too (instant pre-check on slow disks).
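
To make the "Σlengths vs counters" idea concrete, here is a minimal sketch of that kind of ledger cross-check. It is not robovet's implementation: the function name `ledger_mismatch` is hypothetical, and it assumes the v2.x layout where `meta/info.json` carries `total_frames` and `meta/episodes.jsonl` has one JSON object per episode with a `length` field (v3.x keeps episode metadata elsewhere).

```python
import json
from pathlib import Path


def ledger_mismatch(meta_dir: str) -> int:
    """Return sum(episode lengths) - info.json total_frames (0 means consistent).

    Hypothetical helper; assumes a v2.x-style meta/ directory.
    """
    meta = Path(meta_dir)
    info = json.loads((meta / "info.json").read_text())

    total = 0
    with (meta / "episodes.jsonl").open() as fh:
        for line in fh:
            if line.strip():  # tolerate blank lines
                total += json.loads(line)["length"]

    return total - info["total_frames"]
```

A non-zero result means the metadata contradicts itself, which can be detected without opening any parquet or video file.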

## CI gate in 11 lines

```yaml
name: robovet
on: [push, pull_request]
jobs:
  vet:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.12" }
      - run: pip install "robovet[video]"
      - run: robovet doctor ./datasets/my_task   # exit 1 on FAIL blocks the merge
```

## Saw this error? Run this

| You hit | Run | You get |
| --- | --- | --- |
| `ValueError: timestamps … tolerance_s` on load | `robovet doctor` → TIME-202 | the exact minimal `tolerance_s`, per worst episode |
| wrong frames / `IndexError` after a v2→v3 conversion | DATA-104/105 + META-501 | which episodes' ledgers lie, via a three-way cross-check |
| TorchCodec/AV1 decode errors | VIDEO-403 | per-camera codec tiers and what to re-encode |
| `loss=NaN` out of nowhere | DATA-107 + STATS-302 | NaN/Inf locations and stale-normalization blocks |

## What it checks

| Group | Catches | Maps to |
|---|---|---|
| `STRUCT-0xx` | missing/invalid metadata, dangling episodes, orphan files | lerobot#761 (no validator for hand-rolled conversions) |
| `DATA-1xx` | **episode↔frame mapping corruption**, schema drift, NaN/Inf, dead dims | lerobot#2401 (silent v2.1→v3.0 corruption) |
| `TIME-2xx` | off-grid timestamps **with the exact `tolerance_s` you'd need**, non-monotonic time, cumulative FP drift | lerobot#933, lerobot#3177 |
| `STATS-3xx` | stored normalization stats that disagree with the data ("normalization poison"), **broken quantile stats** (q01/q99) | HF docs warning; phospho repair post; lerobot#2189 |
| `VIDEO-4xx` | video/parquet frame-count desync — **including per-episode windows inside shared v3 files**, codec-aware compatibility tiers (h264 ✓ / AV1 info — it's lerobot's own default / mpeg4-hevc warn), fps mismatch | Correll-lab postmortem; phospho notes |
| `META-5xx` | metadata-only ledger cross-checks: episode↔frame index math, Σlengths vs counters, per-episode stats freshness, video time windows | lerobot#2401 (visible from metadata alone) |

`robovet doctor` exits **1** on any FAIL and takes `--json`, so it drops
straight into CI: gate dataset merges the way Codecov gates coverage.
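
As a rough illustration of what a `STATS-3xx`-style check looks for, the sketch below recomputes basic statistics and lists the stored ones that no longer match. The name `stale_stats`, the stat keys, and the tolerances are assumptions for the example, not robovet's API or thresholds.

```python
import numpy as np


def stale_stats(stored: dict, data: np.ndarray,
                rtol: float = 1e-3, atol: float = 1e-6) -> list[str]:
    """Return the names of stored stats that disagree with values recomputed from data.

    `data` is a (frames, dims) array; `stored` maps stat names to per-dim values.
    Hypothetical helper with illustrative tolerances.
    """
    actual = {
        "mean": data.mean(axis=0),
        "std": data.std(axis=0),  # population std (ddof=0), an assumption
        "min": data.min(axis=0),
        "max": data.max(axis=0),
    }
    return [
        name
        for name, value in actual.items()
        if name in stored and not np.allclose(stored[name], value, rtol=rtol, atol=atol)
    ]
```

An empty list means the stored block is consistent with the data; any entry means training would normalize with numbers that do not describe the dataset.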

## Quality scoring (triage, not truth)

```bash
robovet score ./my_dataset --csv scores.csv
```

Per-episode signals, all computed in one pass: jerk smoothness, idle ratio,
gripper chatter, duration outliers, action saturation, exact duplicates.
This is deliberately the *cheap first pass* — the smoothness-first approach
the 2026 curation literature (rinse, Demo-SCORE, QoQ) argues should precede
expensive policy-rollout or influence-function filtering. Scores put the worst
episodes in front of a human in seconds; **review before you delete**.
Statistical flags carry practical-significance guards, so homogeneous datasets
don't self-flag.
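
Two of these signals are simple enough to sketch. The definitions below are assumptions chosen for illustration (robovet's exact formulas and thresholds may differ): idle ratio as the fraction of near-zero action steps, and jerk as the third finite difference of the action trajectory.

```python
import numpy as np


def idle_ratio(actions: np.ndarray, eps: float = 1e-3) -> float:
    """Fraction of steps whose action barely moves relative to the previous step.

    `actions` is a (frames, dims) array; `eps` is an illustrative threshold.
    """
    a = np.asarray(actions, dtype=np.float64)
    if len(a) < 2:
        return 0.0
    step = np.linalg.norm(np.diff(a, axis=0), axis=1)
    return float(np.mean(step < eps))


def mean_squared_jerk(actions: np.ndarray, fps: float) -> float:
    """Mean squared third finite difference of the actions, scaled to per-second units."""
    a = np.asarray(actions, dtype=np.float64)
    if len(a) < 4:
        return 0.0
    jerk = np.diff(a, n=3, axis=0) * fps**3
    return float(np.mean(np.sum(jerk**2, axis=1)))
```

Both are relative signals: they are useful for ranking episodes within one dataset, not as absolute pass/fail numbers.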

## Repair contract

`robovet fix` is **dry-run by default**. With `--apply` it rewrites only
metadata (episode lengths, normalization stats, info.json counters), backs up
every touched file as `.bak`, never modifies parquet or video payloads, and
**preserves everything it doesn't understand**: quantile keys (q01/q99 — the
v3 QUANTILES-normalization era), image-stat blocks, and unknown episode fields
such as tags. A repair tool must never be the thing that deletes your data;
the test suite enforces these guarantees. Frame surgery
(tail-trimming desynced episodes, timestamp re-gridding) is on the
[roadmap](https://github.com/RonaldSit/robovet/blob/main/ROADMAP.md) under the same contract.
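
The shape of that contract can be sketched in a few lines. This is a hypothetical illustration, not robovet's code: `repair_total_frames` is an invented name, and it assumes the episode lengths passed in are already trusted. It plans by default, writes only with `apply=True`, backs up first, and changes a single key while leaving every other field alone.

```python
import json
import shutil
from pathlib import Path


def repair_total_frames(info_path: str, episode_lengths: list[int],
                        apply: bool = False) -> dict:
    """Plan (and optionally apply) a fix to the total_frames counter in info.json."""
    path = Path(info_path)
    info = json.loads(path.read_text())
    plan = {"old": info.get("total_frames"), "new": sum(episode_lengths)}

    if apply and plan["old"] != plan["new"]:
        # Back up before touching anything, e.g. info.json -> info.json.bak
        shutil.copy2(path, path.with_name(path.name + ".bak"))
        info["total_frames"] = plan["new"]  # unknown keys are preserved as-is
        path.write_text(json.dumps(info, indent=2))

    return plan
```

Returning the plan in both modes means a dry run and an applied run report the same thing, so the dry run can be reviewed before anything is written.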

## Scope, honestly

- **LeRobot v2.0 / v2.1 and v3.x are both first-class for diagnosis** — each
  has its own synthetic fixture and test suite, and v3 gets per-episode video
  alignment inside shared files (`VIDEO-405`) plus per-episode stats checks
  parsed from the v3 metadata. `fix` currently rewrites v2.x episode metadata
  and global stats; v3 per-episode stats regeneration is on the roadmap.
- robovet does **not** merge/split/delete episodes — `lerobot` ships that
  natively now. We do what the official stack doesn't: deep validation,
  metadata repair, and quality triage.
- Local-first by design. Your data never leaves your disk — deployment-specific
  data is a competitive asset; treat it like one.

## Library use

```python
from robovet import load_dataset, run_doctor, score_dataset

ds = load_dataset("./my_dataset")
rep = run_doctor(ds)          # rep.exit_code, rep.results, rep.counts
sc  = score_dataset(ds, scan=rep.scan)   # reuses the same single IO pass
```

Apache-2.0. Issues and broken-dataset war stories very welcome — if your
dataset breaks in a way robovet doesn't catch, that's a bug report we want.
