Metadata-Version: 2.4
Name: dataio-rs
Version: 0.2.1
Classifier: Programming Language :: Rust
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Summary: High-performance data processing library for ML workloads.
Home-Page: https://github.com/Mikubill/dataio-rs
Requires-Python: >=3.9
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
Project-URL: Package, https://pypi.org/project/dataio-rs/
Project-URL: Repository, https://github.com/Mikubill/dataio-rs

# dataio

[![CI](https://github.com/Mikubill/dataio-rs/actions/workflows/ci.yml/badge.svg)](https://github.com/Mikubill/dataio-rs/actions/workflows/ci.yml)
[![PyPI](https://img.shields.io/pypi/v/dataio-rs.svg)](https://pypi.org/project/dataio-rs/)

A Rust/PyO3 data plane for Python ML — typed pipelines, async IO,
zero-copy DLPack to PyTorch / NumPy / JAX.

- Zero-copy handoff to torch / numpy / jax
- Slot-write fast path: stackable transforms write straight into the batch buffer
- ~1.85-2× faster than `torch.utils.data.DataLoader` on typical decode-bound workloads (see the benchmark below)

## Install

```bash
pip install dataio-rs
```

## Quickstart

```python
import dataio

dataset = (
    dataio.from_files("data/images/**/*.jpg")
    .shuffle(seed=42)
    .decode_image(mode="rgb")
    .resize_short(256)
    .center_crop(224)
    .normalize("imagenet")
)

with dataset.batched(128).load() as loader:
    for batch in loader:
        train_step(batch.tensor, batch.keys)
```

`load()` returns a `DataLoader` with sane defaults (`output="torch"`,
`concurrency=16`, `prefetch=4`, `pin_memory="auto"`). `batch.tensor` /
`batch.tensors` lazily wrap the batch buffer via DLPack for the requested
framework; `batch.metadata`, `batch.keys`, `batch.indices`, and
`batch.errors` track the surviving samples.
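For example, to override a default and read the batch fields above (the
knob values and the `log_drops` helper are illustrative, not part of the
API):

```python
with dataset.batched(128).load(output="numpy", prefetch=8) as loader:
    for batch in loader:
        array = batch.tensor                    # lazy DLPack wrap for the requested framework
        for key, idx in zip(batch.keys, batch.indices):
            ...                                 # per-sample bookkeeping
        if batch.errors:                        # samples dropped along the way
            log_drops(batch.errors)             # hypothetical logging helper
```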

### Custom batching

When the chain doesn't fit (bucket-uniform sampling, joint task batching,
length-aware packing, multi-output samples), drive the loader with your
own factory. `from_batches(fn)` is pass-through, with no rebatching:

```python
def batches():                          # zero-arg factory, so it can be re-invoked per epoch
    for chunk in batch_sampler:         # your own sampler (placeholder)
        yield [my_to_sample(r) for r in chunk]

with dataio.from_batches(batches).load() as loader:
    for batch in loader: ...
```

For pre-built individual samples without your own batching, use
`dataio.from_samples(it).batched(N).load()`, as sketched below.
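A minimal sketch of that pattern; `my_records` and `make_sample` are
hypothetical stand-ins for however you source and construct samples
(e.g. via `dataio.lib.Sample`):

```python
def samples():                          # plain iterable of pre-built samples
    for record in my_records:           # hypothetical record source
        yield make_sample(record)       # hypothetical sample constructor

with dataio.from_samples(samples()).batched(64).load() as loader:
    for batch in loader:
        ...
```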

## API

```text
dataio.from_files(glob)        → Dataset    typed chain (decode/transform/batched)
dataio.from_records(records)   → Dataset
dataio.from_manifest(path)     → Dataset
dataio.from_batches(fn)        → Batches    pass-through, custom batching
dataio.from_samples(it)        → Samples    auto-batched via .batched(N)

Dataset.batched(N)             → Batches    chain terminal
Samples.batched(N)             → Batches
Batches.load(**knobs)          → DataLoader runtime (iterable + lifecycle)
```

`load()` knobs: `output ∈ {"torch","numpy","jax","dlpack"}`,
`concurrency`, `prefetch`, `order ∈ {"submit","completion"}`,
`pin_memory ∈ {True, False, "auto"}`, `ragged ∈ {"list","error","skip"}`,
`failure_policy ∈ {"drop","raise"}`, `min_survivors`.
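For instance, a completion-ordered NumPy loader; the inline glosses are
inferred from the knob names and values listed above, and the numbers are
illustrative:

```python
loader = dataset.batched(64).load(
    output="numpy",           # one of {"torch","numpy","jax","dlpack"}
    concurrency=32,           # samples in flight
    prefetch=2,               # batches buffered ahead of the consumer
    order="completion",       # yield batches as they finish, not in submit order
    ragged="list",            # hand back non-stackable outputs as a list
    failure_policy="drop",    # drop failed samples, record them in batch.errors
)
```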

The `Batches` spec carries no runtime state, so it is cheap to build,
hold, and pass around; resources are allocated only at `.load(...)`.
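Because of that, one spec can be materialized more than once; a minimal
sketch reusing the Quickstart chain:

```python
spec = (
    dataio.from_files("data/images/**/*.jpg")
    .decode_image(mode="rgb")
    .resize_short(256)
    .center_crop(224)
    .normalize("imagenet")
    .batched(256)
)                                       # inert: no threads, buffers, or file handles yet

with spec.load() as train_loader:       # resources allocated here
    ...
with spec.load(output="numpy") as debug_loader:  # same spec, different runtime knobs
    ...
```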

For multi-output samples, archive entry reads, or hand-built pipelines:
`dataio.lib.{Sample, Pipeline, Source, Decoder, Transform, BytesOp}`.
See `python/dataio/lib.pyi`.

## Errors and diagnostics

```python
loader = dataset.batched(64).load(failure_policy="drop")
batch = next(iter(loader))
print(batch.errors)         # [{index, key, stage, message}, ...]
print(loader.diagnostics()) # samples_submitted/completed/failed, queue stats
```

`failure_policy="raise"` (with `min_survivors=N`) aborts the run when
fewer than `N` samples in a batch survive.
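A sketch of guarding a run with the raise policy; the concrete exception
type isn't named in this README, so the handler catches broadly:

```python
try:
    with dataset.batched(64).load(failure_policy="raise", min_survivors=48) as loader:
        for batch in loader:
            train_step(batch.tensor, batch.keys)
except Exception as exc:    # dataio's concrete exception type isn't documented here
    print(f"data pipeline aborted: {exc}")
```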

## Benchmark

1,024 synthetic RGB images, batch size 32, workers/concurrency 8, CPU-only:

| Loader | samples/s | speedup | p99 |
|---|---:|---:|---:|
| `torch.utils.data.DataLoader` | 3,066 | 1.00× | 38 ms |
| `dataio` (typed chain) | 5,667 | 1.85× | 9 ms |

On real I/O-bound workloads (where S3/R2 fetches dominate), the absolute
samples/s gap shrinks, but the tail-latency improvement remains, and the
trainer thread stays free of data-side blocking.

```bash
uv run python benches/bench_loader_matrix.py --image-dir /path/to/images
# or --synthetic /tmp/imgs --n 1024
```

## Development

```bash
cargo test
uvx ruff check python examples benches
uv run python -m unittest discover -s python/tests -p 'test_*.py'
uvx maturin build --release
```

```bash
# Editable install for local hacking:
uv run --with maturin maturin develop --release
```

CI runs the lint job (`cargo fmt`, `cargo clippy`, `uvx ruff`) and the
builder job (`cargo test`, wheel build, install, Python unittest) on
every push.

Examples in `examples/` cover the Dataset chain, archive reads, `.npy`,
and `.safetensors`. The full PyO3 surface is in `python/dataio/lib.pyi`.

---

Source: https://github.com/Mikubill/dataio-rs · PyPI: https://pypi.org/project/dataio-rs/

