Metadata-Version: 2.4
Name: traceml-ai
Version: 0.2.14
Summary: TraceML: Lightweight runtime bottleneck diagnostics for PyTorch training.
Author-email: "OptAI UG (haftungsbeschränkt)" <support@traceopt.ai>
Maintainer-email: "OptAI UG (haftungsbeschränkt)" <support@traceopt.ai>
License: Apache 2.0
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: rich>=12.0.0
Requires-Dist: psutil>=5.0.0
Requires-Dist: nvidia-ml-py
Requires-Dist: scikit-learn>=1.3.0
Requires-Dist: numpy<2
Requires-Dist: pandas>=2.0.0
Requires-Dist: ipython>=7.0.0
Requires-Dist: ipywidgets>=7.5.0
Requires-Dist: nicegui
Requires-Dist: plotly
Requires-Dist: msgspec
Provides-Extra: torch
Requires-Dist: torch>=2.5.0; extra == "torch"
Requires-Dist: torchvision>=0.20.0; extra == "torch"
Provides-Extra: dev
Requires-Dist: black>=26.1.0; extra == "dev"
Requires-Dist: ruff>=0.14.14; extra == "dev"
Requires-Dist: isort>=7.0.0; extra == "dev"
Requires-Dist: pre-commit; extra == "dev"
Requires-Dist: codespell>=2.4.1; extra == "dev"
Requires-Dist: nbstripout; extra == "dev"
Requires-Dist: pytest>=8.0.0; extra == "dev"
Requires-Dist: coverage>=7.10.5; extra == "dev"
Requires-Dist: pytest-cov>=7.0.0; extra == "dev"
Requires-Dist: datasets; extra == "dev"
Requires-Dist: transformers; extra == "dev"
Provides-Extra: hf
Requires-Dist: transformers; extra == "hf"
Requires-Dist: accelerate>=0.26.0; extra == "hf"
Provides-Extra: lightning
Requires-Dist: lightning>=2.6.0; extra == "lightning"
Provides-Extra: docs
Requires-Dist: mkdocs>=1.6; extra == "docs"
Requires-Dist: mkdocs-material>=9.5; extra == "docs"
Requires-Dist: mkdocstrings[python]>=0.26; extra == "docs"
Requires-Dist: mkdocs-autorefs>=1.2; extra == "docs"
Requires-Dist: mkdocs-include-markdown-plugin>=7.0; extra == "docs"
Dynamic: license-file

<div align="center">

# TraceML

**Catch PyTorch training slowdowns early, while the job is still running.**

[![PyPI version](https://img.shields.io/pypi/v/traceml-ai.svg)](https://pypi.org/project/traceml-ai/)
[![Python 3.10+](https://img.shields.io/badge/python-3.10%2B-blue)](https://www.python.org/)
[![License](https://img.shields.io/badge/license-Apache%202.0-blue)](./LICENSE)
[![GitHub stars](https://img.shields.io/github/stars/traceopt-ai/traceml?style=social)](https://github.com/traceopt-ai/traceml)

[**Quickstart**](docs/user_guide/quickstart.md) • [**Compare Runs**](docs/user_guide/compare.md) • [**How to Read Output**](docs/user_guide/reading-output.md) • [**FAQ**](docs/user_guide/faq.md) • [**Use with W&B / MLflow**](docs/user_guide/integrations/wandb-mlflow.md) • [**Issues**](https://github.com/traceopt-ai/traceml/issues)

</div>

TraceML is an open-source tool for catching PyTorch training slowdowns early, so bad runs do not quietly waste costly compute.

It gives you lightweight step-level signals while the job is still running, so you can quickly tell whether the slowdown looks input-bound, compute-bound, wait-heavy, imbalanced across ranks, or memory-related.

Use TraceML when you want a fast answer before reaching for a heavyweight profiler.

**⭐ If TraceML helps you, please consider starring the repo.**

> **Upcoming rename:** TraceML will transition to **TraceOpt** in a future release.
> For now, the active package remains `traceml-ai` and Python imports remain `traceml`.
> The future PyPI package name [`traceopt-ai`](https://pypi.org/project/traceopt-ai/) is now in place as we prepare the migration.

---

## The fastest way to try it

Install:

```bash
pip install traceml-ai
```

Initialize TraceML and wrap your training step:

```python
import traceml

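# Initialize TraceML before the training loop starts.
traceml.init()

# model, dataloader, criterion, and optimizer are assumed to be defined earlier in train.py.
for batch in dataloader:
    # Wrap the full training step so TraceML can record step-level signals.
    with traceml.trace_step(model):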
        optimizer.zero_grad(set_to_none=True)
        outputs = model(batch["x"])
        loss = criterion(outputs, batch["y"])
        loss.backward()
        optimizer.step()
```

Run:

```bash
traceml run train.py
```

During training, TraceML opens a live terminal view alongside your logs.

![TraceML terminal dashboard](docs/assets/cli_demo_v1.png)

At the end of the run, it prints a compact summary you can review or share.

![TraceML summary](docs/assets/end-of-run-summary.png)

Start with `traceml run train.py`. Most users do not need `watch` or `deep` to get started.

For custom training loops, manual and selective instrumentation are available in the [Quickstart](docs/user_guide/quickstart.md).

---

## Core workflows

### 1. Live diagnosis

Use the default workflow when you want live step-aware diagnosis during training plus the end-of-run summary.

```bash
traceml run train.py
```

### 2. Low-noise summary runs

Use summary mode when you mainly want the structured final summary for logging to W&B or MLflow.

```bash
traceml run train.py --mode=summary
```

Then call `traceml.final_summary()` near the end of your script.

TraceML also writes canonical summary artifacts for the run, including `final_summary.json`, which is the intended machine-readable output for downstream logging and later run comparison.
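
Experiment trackers generally expect a flat mapping of metric names to numbers. As a rough sketch of how a summary file could be prepared for that, the helper below flattens nested JSON into such a mapping. The field names in `example` are hypothetical and do not reflect the actual `final_summary.json` schema, and `flatten_numeric` and `load_summary_metrics` are illustrative names, not part of TraceML.

```python
import json
from pathlib import Path


def flatten_numeric(data, prefix=""):
    """Flatten nested dicts into {"a/b/c": number}, keeping only numeric leaves."""
    flat = {}
    for key, value in data.items():
        name = f"{prefix}/{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten_numeric(value, name))
        elif isinstance(value, (int, float)) and not isinstance(value, bool):
            flat[name] = float(value)
    return flat


def load_summary_metrics(path):
    """Read a summary JSON file (assumed to hold a JSON object) as flat metrics."""
    with Path(path).open(encoding="utf-8") as f:
        return flatten_numeric(json.load(f))


# Hypothetical summary contents, used only to demonstrate the helper.
example = {
    "step_time": {"median_ms": 41.5, "p95_ms": 63.0},
    "ranks": 2,
    "status": "ok",  # non-numeric fields are skipped
}
metrics = flatten_numeric(example)
print(metrics)
```

A dictionary in this shape can then be passed to `wandb.log(metrics)` or `mlflow.log_metrics(metrics)`.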

### 3. Compare two runs

If you have `final_summary.json` from two runs, compare them directly:

```bash
traceml compare run_a.json run_b.json
```

TraceML writes both a structured comparison JSON and a compact text report.

See [docs/user_guide/compare.md](docs/user_guide/compare.md).
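
To illustrate the idea behind a run-to-run comparison, here is a minimal sketch that computes the percent change for metrics shared by two runs. It is not how `traceml compare` is implemented; the metric names and the flat dictionary layout are assumptions made for the example.

```python
def compare_metrics(baseline, candidate):
    """Return {metric: (baseline, candidate, percent_change)} for shared metrics."""
    report = {}
    for name in sorted(baseline.keys() & candidate.keys()):
        old, new = baseline[name], candidate[name]
        # Percent change is undefined when the baseline value is zero.
        pct = None if old == 0 else (new - old) / old * 100.0
        report[name] = (old, new, pct)
    return report


# Hypothetical flat metrics from two runs (not TraceML's actual schema).
run_a = {"step_ms": 40.0, "dataloader_ms": 8.0, "peak_mem_gb": 10.0}
run_b = {"step_ms": 50.0, "dataloader_ms": 16.0, "peak_mem_gb": 10.0}

report = compare_metrics(run_a, run_b)
for name, (old, new, pct) in report.items():
    change = "n/a" if pct is None else f"{pct:+.1f}%"
    print(f"{name:<14} {old:>8.1f} -> {new:>8.1f}  ({change})")
```

In this made-up example the dataloader time doubles while the step time grows by a quarter, which is the kind of pattern that would point toward an input-bound regression.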

---

## What TraceML helps you see

TraceML helps answer questions like:

- Is the run input-bound, compute-bound, wait-heavy, or memory-constrained?
- Are some distributed ranks slower than others?
- Is memory usage drifting upward over time?
- Where is time showing up across dataloader, forward, backward, and optimizer phases?

It is designed to help you decide quickly whether a run looks healthy or whether it is worth digging deeper.
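
For intuition about what a per-phase breakdown means, the sketch below times each phase of a loop by hand using only the standard library. It is not TraceML's implementation: the `timed` helper is hypothetical, and the `time.sleep` calls stand in for real dataloader, forward, backward, and optimizer work.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated wall-clock seconds per phase name.
phase_seconds = defaultdict(float)


@contextmanager
def timed(phase):
    """Add the wall-clock time spent inside the block to `phase`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        phase_seconds[phase] += time.perf_counter() - start


# Stand-in workloads; a real loop would fetch a batch, run the forward and
# backward passes, and step the optimizer inside these blocks.
for _ in range(3):
    with timed("dataloader"):
        time.sleep(0.004)
    with timed("forward"):
        time.sleep(0.002)
    with timed("backward"):
        time.sleep(0.002)
    with timed("optimizer"):
        time.sleep(0.001)

total = sum(phase_seconds.values())
for phase, seconds in phase_seconds.items():
    print(f"{phase:<10} {seconds / total:6.1%}")
```

Note that CUDA kernels launch asynchronously, so plain wall-clock timers like this one only give accurate GPU phase times if the device is synchronized (for example with `torch.cuda.synchronize()`) or CUDA events are used.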

---

## Overhead

TraceML adds fixed per-step instrumentation overhead, so the relative cost is highest when training steps are very short. In larger or distributed workloads, that fixed cost is amortized over a longer end-to-end step. In our early DDP benchmarks, TraceML did not produce a measurable slowdown beyond normal run-to-run variation.
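
The amortization argument can be made concrete with a little arithmetic. The sketch below assumes a fixed cost of 0.5 ms per step, which is a made-up figure for illustration and not a measured TraceML number.

```python
def relative_overhead(fixed_overhead_ms, step_ms):
    """Fraction by which a fixed per-step cost lengthens a step of `step_ms`."""
    return fixed_overhead_ms / step_ms


# Hypothetical fixed cost per step; not a measured TraceML figure.
FIXED_OVERHEAD_MS = 0.5

for step_ms in (5.0, 50.0, 500.0):
    pct = relative_overhead(FIXED_OVERHEAD_MS, step_ms) * 100
    print(f"step {step_ms:6.1f} ms -> {pct:.1f}% overhead")
```

Under this assumption the same fixed cost is 10% of a 5 ms step but only 0.1% of a 500 ms step.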

---

## When to use TraceML

Use TraceML when training feels:

- slower than expected
- unstable from step to step
- imbalanced across distributed ranks
- fine in dashboards but still underperforming

Start with TraceML when you need a fast answer in the terminal.
Reach for `torch.profiler` once you know where to dig deeper.

---

## How it fits with your stack

TraceML is designed to work alongside tools like W&B, MLflow, and TensorBoard, not replace them.

Use experiment trackers for dashboards, artifacts, and team reporting. Use TraceML for live bottleneck diagnosis, structured final summaries, and simple run-to-run comparison from saved TraceML summary JSON files.

See [Use TraceML with W&B / MLflow](docs/user_guide/integrations/wandb-mlflow.md).

---

## Current support

**Works today:**

- single GPU
- single-node DDP/FSDP

**Next:**

- multi-node training support

---

## Learn more

- [Quickstart](docs/user_guide/quickstart.md)
- [Compare Runs](docs/user_guide/compare.md)
- [Examples](examples/README.md)
- [How to Read TraceML Output](docs/user_guide/reading-output.md)
- [FAQ](docs/user_guide/faq.md)
- [Use TraceML with W&B / MLflow](docs/user_guide/integrations/wandb-mlflow.md)
- [Hugging Face integration](docs/user_guide/integrations/huggingface.md)
- [PyTorch Lightning integration](docs/user_guide/integrations/lightning.md)

Need a lighter zero-code first look or a deeper follow-up run? See the [Quickstart](docs/user_guide/quickstart.md) and [FAQ](docs/user_guide/faq.md) for `watch` and `deep`.

---

## Feedback

If TraceML helped you catch a slowdown, please open an issue and include:

- hardware / CUDA / PyTorch versions
- single GPU or multi-GPU
- whether you used `run`, `watch`, or `deep`
- the end-of-run summary
- a minimal repro if possible

GitHub issues: https://github.com/traceopt-ai/traceml/issues

Email: support@traceopt.ai

---

## Contributing

Contributions are welcome, especially:

- reproducible slowdown cases
- bug reports
- docs improvements
- integrations
- examples

---

## License

Apache 2.0. See [LICENSE](LICENSE).

TraceOpt is a trademark of OptAI UG (haftungsbeschränkt).
