Metadata-Version: 2.3
Name: syvain-training-data
Version: 0.0.120
Summary: Syvain training data manifest, loading, and saving utilities
Requires-Dist: obstore>=0.10.1,<0.11.0
Requires-Dist: pydantic>=2.13.4
Requires-Dist: torch>=2.12.1
Requires-Dist: typing-extensions>=4.15.0
Requires-Dist: pytest>=8.0.0 ; extra == 'dev'
Requires-Dist: ruff>=0.15.12 ; extra == 'dev'
Requires-Dist: ty>=0.0.34 ; extra == 'dev'
Requires-Python: >=3.12, <3.13
Provides-Extra: dev
Description-Content-Type: text/markdown

# syvain-training-data

Internal [Syvain](https://syvain.com/) data utility. No secret sauce here, just
a shared helper.

> This is my dataloader. There are many like it, but this one is mine. My
> dataloader is my best friend. It is my life. I must master it as I must master
> my life. My dataloader, without me, is useless. Without my dataloader, I am
> useless.

## Install

```bash
uv add syvain-training-data
```

## Load data

```python
from syvain_training_data import SyvainTrainingData

training_data = SyvainTrainingData(
    s3_base_url="https://t3.storage.dev",
    region="auto",
    access_key_id="...",
    secret_access_key="...",
)


def collate(records):
    ...


loader = training_data.split_data_loader(
    "s3://my-training-bucket/path/to/data-manifest-v1.json",
    collate_fn=collate,
    dataloader_args={"batch_size": 32, "num_workers": 4},  # ...plus any further loader arguments
)

train_batches = loader.load("train")
valid_batches = loader.load("valid")
easy_batches = loader.load("train", curriculum_stage="easy")
infinite_train_batches = loader.load("train", infinite_iter=True)
```
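
The `collate` stub above is left to the caller. As a minimal sketch, assuming each record is a plain `dict` with the same keys (the record layout is not specified here, and the `tokens` and `label` fields are hypothetical), a collate function could group values by field:

```python
from collections import defaultdict
from typing import Any


def collate(records: list[dict[str, Any]]) -> dict[str, list[Any]]:
    """Turn a list of per-example dicts into one dict of per-field lists."""
    batch: dict[str, list[Any]] = defaultdict(list)
    for record in records:
        for key, value in record.items():
            batch[key].append(value)
    return dict(batch)


# Hypothetical records; real ones depend on what was saved.
example = collate([
    {"tokens": [1, 2, 3], "label": 0},
    {"tokens": [4, 5], "label": 1},
])
```

A real collate function would likely also pad and stack the fields into tensors. That step is omitted here because it depends on the actual record schema.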

## Save data

```python
from concurrent.futures import ProcessPoolExecutor

from syvain_training_data import SyvainTrainingData


def generate_data(split, curriculum_stage, shard_id):
    # Produce the records for one shard.
    ...


def save_shard(job):
    # Each job carries everything a worker process needs for one shard.
    saver, split, curriculum_stage, shard_id = job
    records = generate_data(split, curriculum_stage, shard_id)
    saver.save(split, curriculum_stage, records)


training_data = SyvainTrainingData(
    s3_base_url="https://t3.storage.dev",
    region="auto",
    access_key_id="...",
    secret_access_key="...",
)

saver = training_data.dataset_saver(
    "s3://my-training-bucket/path/to/dataset/data-manifest-v1.json",
)

jobs = [
    (saver, "train", stage, shard_id)
    for stage in ["easy", "medium", "hard"]
    for shard_id in range(32)
] + [
    (saver, "valid", None, shard_id) for shard_id in range(4)
] + [
    (saver, "test", None, shard_id) for shard_id in range(4)
]

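# Guard the pool for platforms that start workers with "spawn" (macOS, Windows).
if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=8) as pool:
        # list() drains the iterator so any worker exception is raised here.
        list(pool.map(save_shard, jobs))

    # Commit the manifest once every shard has been saved.
    manifest = saver.commit_manifest()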
```
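
The `generate_data` stub above is also left to the caller. One possible shape, shown purely as a sketch: the record fields, the `num_records` parameter, and the seeding scheme are all assumptions for illustration, not the library's format. Seeding from the job identity keeps each shard reproducible no matter which worker process runs it.

```python
import random


def generate_data(split, curriculum_stage, shard_id, num_records=4):
    """Return a reproducible list of toy records for one shard."""
    # Seed from the job identity so reruns and parallel workers agree.
    rng = random.Random(f"{split}/{curriculum_stage}/{shard_id}")
    return [
        {
            "split": split,
            "stage": curriculum_stage,
            "shard": shard_id,
            "value": rng.random(),  # hypothetical payload
        }
        for _ in range(num_records)
    ]


first = generate_data("train", "easy", 0)
again = generate_data("train", "easy", 0)
```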

## Copy a manifest

```python
from syvain_training_data import SyvainTrainingData

training_data = SyvainTrainingData(
    s3_base_url="https://t3.storage.dev",
    region="auto",
    access_key_id="...",
    secret_access_key="...",
)

manifest = training_data.load_manifest("s3://my-training-bucket/shared/data-manifest-v1.json")

# Modify the manifest here if needed.

training_data.save_manifest("s3://my-training-bucket/new-run/data-manifest-v1.json", manifest)
```
