Metadata-Version: 2.4
Name: online-dynamic-batching
Version: 0.1.2
Summary: Online Dynamic Batching (ODB) — a PyTorch DataLoader-side integration that dynamically groups sequences by length and adjusts batch sizes on-the-fly.
Author: Online Dynamic Batching Contributors
License-Expression: Apache-2.0
Project-URL: Homepage, https://github.com/online-dynamic-batching/online-dynamic-batching
Project-URL: Documentation, https://github.com/online-dynamic-batching/online-dynamic-batching/tree/main/docs
Project-URL: Repository, https://github.com/online-dynamic-batching/online-dynamic-batching
Project-URL: Issues, https://github.com/online-dynamic-batching/online-dynamic-batching/issues
Project-URL: Changelog, https://github.com/online-dynamic-batching/online-dynamic-batching/blob/main/CHANGELOG.md
Keywords: pytorch,dataloader,dynamic-batching,llm,training
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch>=2.0
Provides-Extra: hf
Requires-Dist: transformers>=4.40; extra == "hf"
Provides-Extra: accelerate
Requires-Dist: accelerate>=0.28; extra == "accelerate"
Provides-Extra: lightning
Requires-Dist: lightning>=2.1; extra == "lightning"
Provides-Extra: all
Requires-Dist: transformers>=4.40; extra == "all"
Requires-Dist: accelerate>=0.28; extra == "all"
Requires-Dist: lightning>=2.1; extra == "all"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-timeout>=2.0; extra == "dev"
Requires-Dist: ruff>=0.4; extra == "dev"
Provides-Extra: release
Requires-Dist: build>=1.2.0; extra == "release"
Requires-Dist: twine>=5.0.0; extra == "release"
Dynamic: license-file

# Online Dynamic Batching

[![CI](https://github.com/online-dynamic-batching/online-dynamic-batching/actions/workflows/ci.yml/badge.svg)](https://github.com/online-dynamic-batching/online-dynamic-batching/actions/workflows/ci.yml)
[![Python](https://img.shields.io/badge/python-3.9%2B-blue.svg)](pyproject.toml)
[![License](https://img.shields.io/badge/license-Apache--2.0-blue.svg)](LICENSE)
[![Status](https://img.shields.io/badge/status-beta-blue.svg)](ROADMAP.md)
[![arXiv](https://img.shields.io/badge/arXiv-2606.19989-b31b1b.svg)](https://arxiv.org/abs/2606.19989)

[![PyTorch](https://img.shields.io/badge/PyTorch-DataLoader-EE4C2C?logo=pytorch&logoColor=white)](docs/integration-guides/pytorch-loop.md)
[![HuggingFace](https://img.shields.io/badge/HuggingFace-Trainer-FFD21E?logo=huggingface&logoColor=black)](docs/integration-guides/hf-trainer.md)
[![LLaMA-Factory](https://img.shields.io/badge/LLaMA--Factory-adapter-4B8BBE?logo=github&logoColor=white)](docs/integration-guides/llamafactory.md)
[![Accelerate](https://img.shields.io/badge/Accelerate-loops-7C3AED?logo=huggingface&logoColor=white)](docs/integration-guides/accelerate.md)
[![Lightning](https://img.shields.io/badge/Lightning-Trainer-792EE5?logo=lightning&logoColor=white)](docs/integration-guides/lightning.md)

Online Dynamic Batching (ODB) speeds up LLM/VLM training with a one-line
change to your PyTorch DataLoader.

Replace your PyTorch `DataLoader` constructor with `odb.ODBDataLoader(...)` to
enable online dynamic batching at the DataLoader boundary. For frameworks that
own the DataLoader or Trainer, see the
[other integration methods](docs/integration-guides/README.md).

ODB waits until each sample has passed through the real input pipeline
(tokenization, chat templates, image-token expansion, truncation, augmentation,
and preparation of collator inputs), then forms token-budgeted batches online.
Short examples get larger batches, long examples get smaller batches, and your
model, optimizer, attention kernels, and dataset format stay where they are.
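
As a rough illustration of the budget arithmetic, the sketch below assumes a simplified model in which every sample in a group is padded to the group's longest length. `max_group_size` is a hypothetical helper, not part of the `odb` package, and ODB's actual grouping policy may differ.

```python
def max_group_size(token_budget: int, longest_len: int) -> int:
    """Samples that fit when each one is padded to `longest_len` tokens."""
    if longest_len <= 0:
        raise ValueError("longest_len must be positive")
    # Always allow at least one sample, even if it exceeds the budget alone.
    return max(1, token_budget // longest_len)


for length in (256, 1024, 4096):
    print(length, max_group_size(16384, length))
```

Under this model, a 16384-token budget holds 64 samples of length 256 but only 4 samples of length 4096.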

ODB is deliberately an adapter-layer package. It does not try to make
different multimodal processors, chat templates, or dataset implementations
produce identical tensors. Bring your framework's existing Dataset/collator
path; ODB starts once that path can emit fully processed single samples.

![ODB online grouping animation](docs/assets/online-grouping.svg)

## Paper

ODB is described in the arXiv paper
[**Online Dynamic Batching with Formal Guarantees for LLM Training**](https://arxiv.org/abs/2606.19989).

```bibtex
@misc{li2026online,
  title         = {Online Dynamic Batching with Formal Guarantees for LLM Training},
  author        = {Dian Li and Zekun Wang and Yaoru Wang and Jiahong Yan},
  year          = {2026},
  eprint        = {2606.19989},
  archivePrefix = {arXiv},
  primaryClass  = {cs.DC},
  url           = {https://arxiv.org/abs/2606.19989}
}
```

```python
import odb

# One-line acceleration path: replace DataLoader(...) with ODBDataLoader(...).
dataloader = odb.ODBDataLoader(
    dataset,
    token_budget=16384,
    batch_size=1,
    shuffle=True,
    num_workers=4,
    prefetch_factor=64,
    collate_fn=collate_fn,
    loss_scaling="exact",
    join=True,                  # default; set join=False only when needed
)

for batch in dataloader:
    info = odb.pop_step_info(batch, loss_scaling="exact")
    loss = model(**batch).loss
    loss = loss * info.loss_scale
    loss.backward()
```

Need a framework-owned DataLoader or Trainer adapter? See
[more integration methods](docs/integration-guides/README.md) for HuggingFace
Trainer, LLaMA-Factory/LLaVA-Factory, Accelerate, and Lightning.

## Why ODB

Modern training pipelines often do not know the true training length at dataset
index time. A multimodal or instruction-tuning sample may change length after:

- applying a chat template;
- expanding images into vision tokens;
- truncating to a cutoff;
- adding stochastic augmentation;
- mixing multiple data sources with different processors.

Classic fixed-size batching wastes padding. Offline length caches can help, but
they need a separate preprocessing pass and can go stale when the runtime input
pipeline changes. ODB moves batching to the point where real length is already
observable: the DataLoader/collate boundary.
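
To make the padding cost concrete, here is a small sketch that measures the fraction of padded tokens when consecutive samples are packed into fixed-size batches. The lengths are made-up numbers and `padding_fraction` is a hypothetical helper used only for illustration.

```python
def padding_fraction(lengths, batch_size):
    """Fraction of padded tokens when consecutive samples form fixed-size batches."""
    real = padded = 0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        real += sum(batch)
        # Every sample in the batch is padded to the longest one.
        padded += max(batch) * len(batch)
    return 1 - real / padded


lengths = [128, 4096, 256, 512, 2048, 128, 1024, 256]  # made-up long-tail mix
print(f"arrival order: {padding_fraction(lengths, 4):.2f}")
print(f"length-sorted: {padding_fraction(sorted(lengths), 4):.2f}")
```

Grouping similar lengths reduces the waste in this toy case, but a fixed sample count per batch still leaves the token count per step uneven. A token budget addresses that part.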

## What You Get

- **DataLoader replacement path**: use `ODBDataLoader(...)` when you control
  DataLoader construction.
- **Existing DataLoader path**: use `odb.apply(dataloader, ...)` when a
  framework has already created the DataLoader.
- **DDP-ready dynamic batching**: ODB aligns grouping across ranks with a small
  metadata exchange.
- **Default join-mode protocol**: strict identity-coverage termination for final
  DDP training runs; set `join=False` only for constrained runtimes that cannot
  support drain-before-finish semantics.
- **Correct loss scaling**: `odb.pop_step_info(...)` returns the current
  all-rank sample count and the per-rank loss multiplier.
- **Trainer integrations**: PyTorch loops, HuggingFace Trainer,
  LLaMA-Factory-style trainers, Accelerate loops, and Lightning modules.
- **Production-shaped benchmark coverage**: text, multimodal, LoRA/full FT,
  single-node, multi-node, oracle baselines, and high-variance production mixes.

## Integration Boundary

ODB operates after your input pipeline has converted raw records into tensors:

```text
raw data
  -> model/framework processor adapter
  -> ODB-ready single-sample tensors
  -> ODB
  -> Trainer/loop
```

The model/framework processor adapter is where model-specific work belongs:
chat templates, tokenization, image or video processors, visual-token expansion,
truncation, and label masking. Different models can use different adapters, but
they should emit the same ODB contract: a single-sample tensor dict with
`input_ids`, `attention_mask`, `labels`, and any model-required multimodal
tensors.
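
A minimal sketch of checking that contract before attaching ODB is shown below. `check_odb_sample` is a hypothetical helper, not part of the `odb` package, and it assumes each required field is an unbatched 1-D sequence (a 1-D tensor or a plain list) of equal length.

```python
REQUIRED_KEYS = ("input_ids", "attention_mask", "labels")


def check_odb_sample(sample: dict) -> int:
    """Validate a single-sample dict and return its sequence length.

    Hypothetical helper for illustration. Values only need to support
    len(), so 1-D tensors and plain lists both work.
    """
    missing = [k for k in REQUIRED_KEYS if k not in sample]
    if missing:
        raise KeyError(f"sample is missing required keys: {missing}")
    lengths = {k: len(sample[k]) for k in REQUIRED_KEYS}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"per-key lengths disagree: {lengths}")
    return lengths["input_ids"]


sample = {
    "input_ids": [101, 2023, 2003, 102],
    "attention_mask": [1, 1, 1, 1],
    "labels": [-100, 2023, 2003, 102],  # -100 masks the first token from the loss
}
print(check_odb_sample(sample))
```

Running a check like this on a few dataset items is a cheap way to catch a collator that still expects raw records.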

The core package and non-LLaMA-Factory adapters do not import or require
LLaMA-Factory. HuggingFace Trainer, Accelerate, and Lightning users should keep
their own tokenizer/processor/template/collator semantics and use ODB only at
the DataLoader/trainer boundary. If you need a paper-style MM-Mix reference,
use the separate LLaMA-Factory-based example project.

For raw multimodal records, the framework adapter is not a replacement for a
model-specific processor pipeline. Make the Dataset emit ODB-ready tensor
samples first, then attach ODB to the DataLoader or Trainer.

## Installation

From PyPI:

```bash
pip install online-dynamic-batching

# HuggingFace Trainer / LLaMA-Factory adapters
pip install "online-dynamic-batching[hf]"

# Accelerate or Lightning adapters
pip install "online-dynamic-batching[accelerate]"
pip install "online-dynamic-batching[lightning]"
```

From GitHub:

```bash
pip install "online-dynamic-batching @ git+https://github.com/online-dynamic-batching/online-dynamic-batching.git"
```

Local development:

```bash
git clone https://github.com/online-dynamic-batching/online-dynamic-batching.git
cd online-dynamic-batching
pip install -e ".[dev,all]"
pytest
```

## Quick Start

### Replace DataLoader Construction

Use this when you own the DataLoader code.

```python
import odb

dataloader = odb.ODBDataLoader(
    dataset,
    token_budget=16384,
    batch_size=1,              # ODB forms the real batch dynamically
    shuffle=True,
    num_workers=4,             # ODB requires worker prefetching
    prefetch_factor=64,
    collate_fn=collate_fn,
    loss_scaling="exact",      # "none", "approx", or "exact"
    join=True,                  # default; set join=False only when needed
)
```

### Patch An Existing DataLoader

Use this when a framework constructs the DataLoader for you.

```python
from torch.utils.data import DataLoader
import odb

dataloader = DataLoader(
    dataset,
    batch_size=1,
    shuffle=True,
    num_workers=4,
    prefetch_factor=64,
    collate_fn=collate_fn,
)

handle = odb.apply(
    dataloader,
    token_budget=16384,
    loss_scaling="exact",
    join=True,                  # default; set join=False only when needed
)
```

### Consume ODB Metadata Before Forward

ODB adds trainer-facing metadata to each yielded batch. Remove it before
`model(**batch)` and use it for correct progress/loss accounting.

```python
emitted_samples = 0  # running all-rank sample count for progress accounting

for batch in dataloader:
    info = odb.pop_step_info(batch, loss_scaling="exact")

    loss = model(**batch).loss
    loss = loss * info.loss_scale
    loss.backward()

    emitted_samples += info.all_samples_this_step
```

`info.all_samples_this_step` is the all-rank emitted sample count for the
current micro-step. `info.loss_scale` is the current-rank multiplier that makes
DDP gradient averaging match the intended global sample/token weighting.
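
The need for a per-rank multiplier can be seen with plain arithmetic. The sketch below assumes each rank computes a mean loss over its own tokens and that DDP then averages across ranks. It shows the standard token-weighted correction, which is not necessarily the exact formula ODB uses for `"exact"` or `"approx"`. `token_weighted_scales` is a hypothetical helper and the numbers are made up.

```python
def token_weighted_scales(tokens_per_rank):
    """Per-rank multipliers so that an average over ranks equals the global token mean."""
    world_size = len(tokens_per_rank)
    total = sum(tokens_per_rank)
    return [world_size * n / total for n in tokens_per_rank]


# Two ranks with different local token counts and local mean losses.
tokens = [300, 100]
local_mean_loss = [2.0, 4.0]

scales = token_weighted_scales(tokens)
# What a plain average over ranks yields after scaling each rank's loss.
ddp_average = sum(s * l for s, l in zip(scales, local_mean_loss)) / len(tokens)
# The loss averaged over all 400 tokens, regardless of rank.
global_mean = sum(n * l for n, l in zip(tokens, local_mean_loss)) / sum(tokens)
print(scales, ddp_average, global_mean)
```

Without scaling, the rank average would be 3.0, which over-weights the rank that holds fewer tokens.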

## More Integration Methods

Start with `ODBDataLoader(...)` when you control DataLoader construction. If a
framework owns the DataLoader or Trainer, choose one of these alternatives.

| Method | Best For | What ODB Handles |
| --- | --- | --- |
| Patch an existing DataLoader | Framework-created DataLoaders | `odb.apply(dataloader, ...)` adds ODB without changing the constructor site. |
| Enable HF Trainer | ODB-ready HuggingFace Trainer pipelines | `enable_odb(...)` wires DataLoader grouping, metadata, and Trainer accounting. |
| Configure an existing Trainer | Existing HuggingFace-style trainer instances needing lower-level control | `configure_trainer(...)` registers callbacks and loss scaling. |
| Enable LLaMA-Factory | ODB-ready LLaMA-Factory data pipelines | `enable_odb(...)` wires DataLoader grouping and Trainer accounting. |
| Native trainer/mixin | Framework forks or new trainers | `ODBTrainerMixin` consumes metadata inside `compute_loss`. |

### HuggingFace Trainer

```python
from transformers import Trainer

from odb.integrations.hf import enable_odb

trainer = Trainer(model=model, args=training_args, train_dataset=dataset)
dataloader = trainer.get_train_dataloader()

enable_odb(
    trainer,
    train_dataloader=dataloader,
    train_dataset=dataset,
    token_budget=16384,
    loss_scaling="exact",
    join=True,
)

trainer.train()
```

### Native Trainer Class

```python
from odb.integrations.hf import ODBTrainerMixin

# CustomTrainer stands for your existing Trainer subclass.
class MyTrainer(ODBTrainerMixin, CustomTrainer):
    pass
```

### LLaMA-Factory-Style Trainers

```python
from odb.integrations.llamafactory import enable_odb

enable_odb(
    trainer=trainer,
    train_dataloader=train_dataloader,
    training_args=training_args,
    train_dataset=train_dataset,
    token_budget=16384,
    loss_scaling="exact",
)
```

The LLaMA-Factory adapter is the complete one-line integration path when the
LLaMA-Factory data pipeline already produces ODB-ready single-sample tensor
dicts. It validates the DataLoader boundary, enables ODB grouping, and resolves
Trainer accounting such as sample-budget stopping, join mode, and exact loss
scaling. The HuggingFace Trainer, Accelerate, and Lightning integrations are
trainer or loop adapters only: they do not yet include raw-data pipeline
adapters, so your Dataset must already emit ODB-ready samples.

See [docs/integration-guides](docs/integration-guides/README.md) for PyTorch,
HuggingFace Trainer, LLaMA-Factory, Accelerate, and Lightning integration details.
The 0.1.1 validation notes are summarized in [docs/validation.md](docs/validation.md).

## Try It Without Private Data

Run a CPU/single-GPU synthetic benchmark that compares fixed-size batching and
ODB on a long-tail sequence distribution:

```bash
python examples/synthetic_benchmark.py --device auto --num-samples 2048
```

For a copy-paste learning path, open
[examples/notebooks/odb_single_gpu_demo.ipynb](examples/notebooks/odb_single_gpu_demo.ipynb).

## How It Works

ODB changes batching without changing your model forward path:

1. DataLoader workers produce fully processed single samples.
2. ODB buffers the samples and observes their true runtime lengths.
3. Samples with similar length are grouped under a token budget.
4. DDP ranks exchange lightweight grouping metadata.
5. Your original `collate_fn` collates each dynamic group.
6. The trainer consumes `ODBStepInfo` for progress and loss scaling.

The resulting step size varies in samples but is much more stable in tokens.
That is the useful operating point for long-tail instruction and multimodal
training.
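
Steps 1 to 3 can be sketched as a small generator over sample lengths. This is a simplified single-process model, not ODB's implementation: it omits the DDP metadata exchange and collation, and it assumes a group costs its longest length times its sample count. `online_groups` and its parameters are hypothetical names.

```python
from typing import Iterable, Iterator, List


def online_groups(
    lengths: Iterable[int], token_budget: int, buffer_size: int
) -> Iterator[List[int]]:
    """Yield groups of sample lengths whose padded size fits `token_budget`."""
    buffer: List[int] = []

    def drain(pending: List[int]) -> Iterator[List[int]]:
        group: List[int] = []
        for length in sorted(pending):
            # Sorted ascending, so `length` is the padded width if it joins the group.
            if group and length * (len(group) + 1) > token_budget:
                yield group
                group = []
            group.append(length)  # an oversize sample ends up alone in its group
        if group:
            yield group

    for length in lengths:
        buffer.append(length)
        if len(buffer) == buffer_size:
            yield from drain(buffer)
            buffer = []
    yield from drain(buffer)  # flush whatever is left at the end of the stream


stream = [900, 100, 120, 1000, 110, 950, 90, 130]
for group in online_groups(stream, token_budget=2048, buffer_size=4):
    print(len(group), group)
```

A larger window gives the grouper more candidates of similar length, at the cost of holding more prefetched samples in memory.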

## API At A Glance

```python
odb.ODBDataLoader(dataset, token_budget=..., **dataloader_kwargs)
odb.apply(dataloader, token_budget=..., loss_scaling="exact")
odb.pop_step_info(batch, loss_scaling="exact")
odb.integrations.hf.enable_odb(...)
odb.integrations.hf.configure_trainer(...)
odb.integrations.hf.ODBTrainerMixin
odb.integrations.hf.ODBTrainer
odb.integrations.llamafactory.enable_odb(...)
odb.integrations.accelerate.configure_accelerator(...)
odb.integrations.lightning.configure_lightning_module(...)
```

### Key Parameters

| Parameter | Meaning |
| --- | --- |
| `token_budget` | Target maximum total input length per dynamic group. Legacy name: `max_input_length`. |
| `loss_scaling` | `"none"`, `"approx"`, or `"exact"`. Use `"exact"` for strict token-weighted DDP loss scaling. |
| `join` | Enables the ODB join-mode protocol; defaults to `True`. Legacy name: `join_mode`. |
| `buffer_size` | Number of prefetched single samples available to the online grouping window. |
| `max_patches` | Optional multimodal compute cap for image-heavy workloads. |

## Benchmark Snapshot

Representative 8xH20 Qwen3-VL full fine-tuning results:

| Workload | Length CV | Standard (samples/s) | ODB (samples/s) | Speedup |
| --- | ---: | ---: | ---: | ---: |
| UltraChat 200K, 8B Full FT | 0.48 | 5.77 | 10.23 | 1.77x |
| LLaVA 150K, 8B Full FT | 0.29 | 14.38 | 24.87 | 1.73x |
| ShareGPT4o 57K, 8B Full FT | 1.00 | 2.37 | 5.83 | 2.46x |

Quality is reported alongside throughput in the paper experiments. The intended
claim is a better throughput-quality operating point under variable-length
training, not identical optimizer-update geometry.

See [docs/benchmarks.md](docs/benchmarks.md) for reporting policy and benchmark
notes.
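
The table does not define "Length CV". Assuming it denotes the coefficient of variation of per-sample token lengths (standard deviation divided by mean), it can be computed as in the sketch below. The choice of population standard deviation is also an assumption, and the lengths are made up.

```python
from statistics import mean, pstdev


def length_cv(lengths):
    """Coefficient of variation: population standard deviation divided by the mean."""
    return pstdev(lengths) / mean(lengths)


uniform = [512, 512, 512, 512]      # no length variation
long_tail = [100, 200, 400, 3300]   # one very long sample
print(round(length_cv(uniform), 2), round(length_cv(long_tail), 2))
```

A higher CV means a heavier length tail.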

## Integration Checklist

Use this as a quick audit before opening a PR in a training stack:

- DataLoader emits one fully processed sample at a time: `batch_size=1`.
- DataLoader uses worker prefetching: `num_workers > 0`.
- ODB is applied after the framework has selected sampler/shuffle behavior.
- Trainer removes ODB metadata before model forward.
- Trainer uses `info.loss_scale` when DDP ranks can process different local
  sample/token counts.
- Trainer tracks progress and stops by emitted sample count during epoch-based
  training.
- Default `join=True` is paired with DDP Join or the framework's equivalent
  uneven-input handling; use `join=False` only when that runtime support is not
  available.
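
The first two items can be checked mechanically. The sketch below is a hypothetical preflight helper, not part of the `odb` package. It relies only on the `batch_size` and `num_workers` attributes that a PyTorch `DataLoader` exposes, and it uses a stand-in object so that it runs without torch installed.

```python
from types import SimpleNamespace


def odb_preflight(dataloader) -> list:
    """Return a list of checklist violations (empty means the basic checks pass)."""
    problems = []
    if getattr(dataloader, "batch_size", None) != 1:
        problems.append("batch_size must be 1 so each item is a single sample")
    if getattr(dataloader, "num_workers", 0) <= 0:
        problems.append("num_workers must be > 0 for worker prefetching")
    return problems


# Stand-in for a misconfigured DataLoader.
fake_loader = SimpleNamespace(batch_size=8, num_workers=0)
for problem in odb_preflight(fake_loader):
    print("FAIL:", problem)
```

The remaining items concern trainer behavior and need to be verified in the training loop itself.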

## Project Layout

```text
src/odb/                     # core package
src/odb/integrations/        # trainer adapters
examples/                    # minimal PyTorch/HF examples and synthetic benchmarks
docs/integration-guides/     # framework-specific integration notes
docs/benchmarks.md           # benchmark reporting policy
agent-skills/                # Codex / Claude Code assisted integration skill
```

## Build And Verify

```bash
python -m pip install -U build twine
python -m build
python -m twine check dist/*
python -m pip install dist/online_dynamic_batching-*.whl
python -c "import odb; print(odb.__version__)"
pytest
```

## Engineering Roadmap

ODB's roadmap is focused on runtime capabilities: stronger distributed-training
semantics, clearer trainer interfaces, additional batching policies, structured
observability, and reproducible benchmarking. See [ROADMAP.md](ROADMAP.md).

## Requirements

- Python 3.9+
- PyTorch 2.0+
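- Optional: `transformers>=4.40` for HuggingFace Trainer integration
- Optional: `accelerate>=0.28` for Accelerate integration
- Optional: `lightning>=2.1` for Lightning integration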

## Citation

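If you find ODB useful, please cite the arXiv paper:

```bibtex
@misc{li2026online,
  title         = {Online Dynamic Batching with Formal Guarantees for LLM Training},
  author        = {Dian Li and Zekun Wang and Yaoru Wang and Jiahong Yan},
  year          = {2026},
  eprint        = {2606.19989},
  archivePrefix = {arXiv},
  primaryClass  = {cs.DC},
  url           = {https://arxiv.org/abs/2606.19989}
}
```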

## License

Apache-2.0
