Metadata-Version: 2.4
Name: traceml-ai
Version: 0.2.11
Summary: TraceML: Lightweight training runtime health monitor.
Author-email: Abhinav Srivastav <abhinav@traceopt.ai>
License: Apache 2.0
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: rich>=12.0.0
Requires-Dist: psutil>=5.0.0
Requires-Dist: pynvml>=11.5.0
Requires-Dist: scikit-learn>=1.3.0
Requires-Dist: numpy<2
Requires-Dist: pandas>=2.0.0
Requires-Dist: ipython>=7.0.0
Requires-Dist: ipywidgets>=7.5.0
Requires-Dist: nicegui
Requires-Dist: plotly
Requires-Dist: msgspec
Provides-Extra: torch
Requires-Dist: torch>=2.5.0; extra == "torch"
Requires-Dist: torchvision>=0.20.0; extra == "torch"
Provides-Extra: dev
Requires-Dist: black>=26.1.0; extra == "dev"
Requires-Dist: ruff>=0.14.14; extra == "dev"
Requires-Dist: isort>=7.0.0; extra == "dev"
Requires-Dist: pre-commit; extra == "dev"
Requires-Dist: codespell>=2.4.1; extra == "dev"
Requires-Dist: nbstripout; extra == "dev"
Requires-Dist: pytest>=8.0.0; extra == "dev"
Requires-Dist: coverage>=7.10.5; extra == "dev"
Requires-Dist: pytest-cov>=7.0.0; extra == "dev"
Requires-Dist: datasets; extra == "dev"
Requires-Dist: transformers; extra == "dev"
Provides-Extra: hf
Requires-Dist: transformers; extra == "hf"
Requires-Dist: accelerate>=0.26.0; extra == "hf"
Provides-Extra: lightning
Requires-Dist: lightning>=2.6.0; extra == "lightning"
Dynamic: license-file

<div align="center">

# TraceML

**Catch PyTorch training slowdowns early, while the job is still running.**

[![PyPI version](https://img.shields.io/pypi/v/traceml-ai.svg)](https://pypi.org/project/traceml-ai/)
[![Python 3.10+](https://img.shields.io/badge/python-3.10%2B-blue)](https://www.python.org/)
[![License](https://img.shields.io/badge/license-Apache%202.0-blue)](./LICENSE)
[![GitHub stars](https://img.shields.io/github/stars/traceopt-ai/traceml?style=social)](https://github.com/traceopt-ai/traceml)

[**Quickstart**](docs/quickstart.md) • [**Compare Runs**](docs/compare.md) • [**How to Read Output**](docs/how-to-read-output.md) • [**FAQ**](docs/faq.md) • [**Use with W&B / MLflow**](docs/use-with-wandb-mlflow.md) • [**Issues**](https://github.com/traceopt-ai/traceml/issues)

</div>

TraceML is an open-source tool for catching PyTorch training slowdowns early, so bad runs do not quietly waste costly compute.

It gives you lightweight step-level signals while the job is still running, so you can quickly tell whether the slowdown looks input-bound, compute-bound, wait-heavy, imbalanced across ranks, or memory-related.

Use TraceML when you want a fast answer before reaching for a heavyweight profiler.

**⭐ If TraceML helps you, please consider starring the repo.**

> **Upcoming rename:** TraceML will transition to **TraceOpt** in a future release.
> For now, the active package remains `traceml-ai` and Python imports remain `traceml`.
> The future PyPI package name [`traceopt-ai`](https://pypi.org/project/traceopt-ai/) is now in place as we prepare the migration.

---

## The fastest way to try it

Install:

```bash
pip install traceml-ai
```

Initialize TraceML and wrap your training step:

```python
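import traceml

# Enable the default patch-based instrumentation before training starts.
traceml.init(mode="auto")

# `model`, `optimizer`, `criterion`, and `dataloader` come from your existing training script.
for batch in dataloader:
    # Wrap one full training step so TraceML can attribute time to it.
    with traceml.trace_step(model):
        optimizer.zero_grad(set_to_none=True)
        outputs = model(batch["x"])
        loss = criterion(outputs, batch["y"])
        loss.backward()
        optimizer.step()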
```

Run:

```bash
traceml run train.py
```

During training, TraceML opens a live terminal view alongside your logs.

![TraceML terminal dashboard](docs/assets/cli_demo_v1.png)

At the end of the run, it prints a compact summary you can review or share.

![TraceML summary](docs/assets/end-of-run-summary.png)

Start with `traceml run train.py`. Most users do not need `watch` or `deep` to get started.

> Legacy imports from `traceml.decorators` still work for backward compatibility.
> The preferred interface is now the top-level `traceml.*`.
> Legacy decorator imports are planned for deprecation starting in `v0.3.0`.

TraceML supports three initialization modes:

- `traceml.init(mode="auto")` for the default patch-based workflow
- `traceml.init(mode="manual")` for fully explicit wrapper-based instrumentation
- `traceml.init(mode="selective", ...)` when you want a mix of automatic and explicit instrumentation

Manual and selective flows can use:

- `traceml.wrap_dataloader_fetch(...)`
- `traceml.wrap_forward(...)`
- `traceml.wrap_backward(...)`
- `traceml.wrap_optimizer(...)`

---

## Core workflows

### 1. Live diagnosis

Use the default workflow when you want live step-aware diagnosis during training plus the end-of-run summary.

```bash
traceml run train.py
```

### 2. Low-noise summary runs

Use summary mode when you mainly want the structured final summary for logging into W&B or MLflow.

```bash
traceml run train.py --mode=summary
```

Then call `traceml.final_summary()` near the end of your script.

TraceML also writes canonical summary artifacts for the run, including `final_summary.json`, which is the intended machine-readable output for downstream logging and later run comparison.
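Experiment trackers generally expect a flat mapping of metric names to numbers, so a nested summary file may need flattening before you forward it. The sketch below shows one possible way to do that using only the standard library. It makes no assumption about the actual `final_summary.json` schema: it walks whatever JSON it finds and keeps the numeric leaves. The helper names `flatten_numeric` and `load_summary_metrics` are hypothetical and not part of TraceML.

```python
import json
from pathlib import Path


def flatten_numeric(node, prefix=""):
    """Collect the numeric leaves of a nested JSON value under dotted keys."""
    flat = {}
    if isinstance(node, dict):
        for key, value in node.items():
            flat.update(flatten_numeric(value, f"{prefix}{key}."))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            flat.update(flatten_numeric(value, f"{prefix}{index}."))
    elif isinstance(node, (int, float)) and not isinstance(node, bool):
        # Drop the trailing dot that was added while descending.
        flat[prefix[:-1]] = node
    return flat


def load_summary_metrics(path):
    """Read a summary JSON file and return its numeric fields as a flat dict."""
    text = Path(path).read_text(encoding="utf-8")
    return flatten_numeric(json.loads(text))
```

The resulting dict can then be passed to your tracker's own metric-logging call. For the recommended integration, see [Use TraceML with W&B / MLflow](docs/use-with-wandb-mlflow.md).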


### 3. Compare two runs

If you have `final_summary.json` from two runs, compare them directly:

```bash
traceml compare run_a.json run_b.json
```

TraceML writes both a structured compare JSON and a compact text report.

See [docs/compare.md](docs/compare.md).
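To make the idea of a run-to-run comparison concrete, here is a minimal sketch that computes the relative change of each metric shared by two runs. It is an illustration only, not how `traceml compare` is implemented. It assumes each run has already been reduced to a flat mapping of metric name to number. The function name `relative_changes` and the metric names in the sample data are hypothetical.

```python
def relative_changes(run_a, run_b):
    """Return {metric: fractional change from run_a to run_b} for shared metrics."""
    changes = {}
    for name in sorted(run_a.keys() & run_b.keys()):
        baseline = run_a[name]
        if baseline == 0:
            continue  # relative change is undefined for a zero baseline
        changes[name] = (run_b[name] - baseline) / abs(baseline)
    return changes


# Hypothetical metric names and values, for illustration only.
run_a = {"step_time_ms": 200.0, "peak_memory_mb": 1000.0}
run_b = {"step_time_ms": 250.0, "peak_memory_mb": 1000.0, "extra": 1.0}

for name, change in relative_changes(run_a, run_b).items():
    print(f"{name}: {change:+.1%}")
```

Metrics that appear in only one run are ignored, which keeps the comparison meaningful when the two summaries do not have identical fields.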

---

## What TraceML helps you see

TraceML is currently strongest at surfacing:

- step-time slowdowns while training is still running
- whether the pattern looks input-bound, compute-bound, or wait-heavy
- whether work is uneven across distributed ranks
- whether memory is drifting upward over time
- where time is showing up across dataloader, forward, backward, and optimizer phases

It is designed to help you decide quickly whether a run looks healthy or whether it is worth digging deeper.

---

## When to use TraceML

Use TraceML when training feels:

- slower than expected
- unstable from step to step
- imbalanced across distributed ranks
- fine in dashboards but still underperforming

Start with TraceML when you need a fast answer in the terminal.
Reach for `torch.profiler` once you know where to dig deeper.

---

## How it fits with your stack

TraceML is designed to work alongside tools like W&B, MLflow, and TensorBoard.

Use those for:

- experiment tracking
- artifacts
- dashboards
- team reporting

Use TraceML for:

- bottleneck diagnosis while a run is still in progress
- spotting throughput drift during a run
- checking for rank imbalance or straggler patterns
- checking for memory creep or pressure signals
- structured final summaries you can forward into W&B or MLflow
- simple run-to-run comparison from saved TraceML summary JSON files

See [Use TraceML with W&B / MLflow](docs/use-with-wandb-mlflow.md).

---

## Current support

**Works today:**

- single GPU
- single-node DDP/FSDP

**Not yet:**

- multi-node
- tensor parallel
- pipeline parallel

`deep` remains available for deeper follow-up inspection. If `deep` is important
for your workflow, please let us know in [GitHub issues](https://github.com/traceopt-ai/traceml/issues).

---

## Learn more

- [Quickstart](docs/quickstart.md)
- [Compare Runs](docs/compare.md)
- [Examples](examples/README.md)
- [How to Read TraceML Output](docs/how-to-read-output.md)
- [FAQ](docs/faq.md)
- [Use TraceML with W&B / MLflow](docs/use-with-wandb-mlflow.md)
- [Hugging Face integration](docs/huggingface.md)
- [PyTorch Lightning integration](docs/lightning.md)

Need a lighter zero-code first look or a deeper follow-up run? See the Quickstart and FAQ for `watch` and `deep`.

---

## Feedback

If TraceML helped you catch a slowdown, please open an issue and include:

- hardware / CUDA / PyTorch versions
- single GPU or multi-GPU
- whether you used `run`, `watch`, or `deep`
- the end-of-run summary
- a minimal repro if possible

GitHub issues: https://github.com/traceopt-ai/traceml/issues

Email: support@traceopt.ai

---

## Contributing

Contributions are welcome, especially:

- reproducible slowdown cases
- bug reports
- docs improvements
- integrations
- examples

---

## License

Apache 2.0. See [LICENSE](LICENSE).
