Metadata-Version: 2.4
Name: traceml-ai
Version: 0.2.7
Summary: TraceML: Lightweight training runtime health monitor.
Author-email: Abhinav Srivastav <abhinav@traceopt.ai>
License: Apache 2.0
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: rich>=12.0.0
Requires-Dist: psutil>=5.0.0
Requires-Dist: pynvml>=11.5.0
Requires-Dist: scikit-learn>=1.3.0
Requires-Dist: numpy<2
Requires-Dist: pandas>=2.0.0
Requires-Dist: ipython>=7.0.0
Requires-Dist: ipywidgets>=7.5.0
Requires-Dist: nicegui
Requires-Dist: plotly
Requires-Dist: msgspec
Provides-Extra: torch
Requires-Dist: torch>=2.5.0; extra == "torch"
Requires-Dist: torchvision>=0.20.0; extra == "torch"
Provides-Extra: dev
Requires-Dist: black>=26.1.0; extra == "dev"
Requires-Dist: ruff>=0.14.14; extra == "dev"
Requires-Dist: isort>=7.0.0; extra == "dev"
Requires-Dist: pre-commit; extra == "dev"
Requires-Dist: codespell>=2.4.1; extra == "dev"
Requires-Dist: nbstripout; extra == "dev"
Requires-Dist: pytest>=8.0.0; extra == "dev"
Requires-Dist: coverage>=7.10.5; extra == "dev"
Requires-Dist: pytest-cov>=7.0.0; extra == "dev"
Requires-Dist: datasets; extra == "dev"
Requires-Dist: transformers; extra == "dev"
Provides-Extra: hf
Requires-Dist: transformers; extra == "hf"
Requires-Dist: accelerate>=0.26.0; extra == "hf"
Provides-Extra: lightning
Requires-Dist: lightning>=2.6.0; extra == "lightning"
Dynamic: license-file

<div align="center">

# TraceML

**Find why PyTorch training is slow while the job is still running.**

[![PyPI version](https://img.shields.io/pypi/v/traceml-ai.svg)](https://pypi.org/project/traceml-ai/)
[![Python 3.10+](https://img.shields.io/badge/python-3.10%2B-blue)](https://www.python.org/)
[![License](https://img.shields.io/badge/license-Apache%202.0-blue)](./LICENSE)
[![GitHub stars](https://img.shields.io/github/stars/traceopt-ai/traceml?style=social)](https://github.com/traceopt-ai/traceml)

[**Quickstart**](docs/quickstart.md) • [**How to Read Output**](docs/how-to-read-output.md) • [**FAQ**](docs/faq.md) • [**Use with W&B / MLflow**](docs/use-with-wandb-mlflow.md) • [**Issues**](https://github.com/traceopt-ai/traceml/issues)


</div>

TraceML helps you find PyTorch training bottlenecks while the job is still running, including:

- input bottlenecks
- compute-bound steps
- DDP stragglers
- wait-heavy training
- memory creep over time

without jumping straight to a heavyweight profiler.

**Why this exists:** Dashboards show utilization and curves. TraceML shows **why throughput is poor inside the training step**.

---

## The fastest way to try it

Install:

```bash
pip install traceml-ai
```

Wrap your training step:

```python
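from traceml.decorators import trace_step

# model, dataloader, criterion, and optimizer are your existing training objects.
for batch in dataloader:
    # Wrap one full optimization step (forward, backward, optimizer update).
    with trace_step(model):
        optimizer.zero_grad(set_to_none=True)
        outputs = model(batch["x"])
        loss = criterion(outputs, batch["y"])
        loss.backward()
        optimizer.step()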
```

Run:

```bash
traceml run train.py
```


During training, TraceML opens a live terminal view alongside your logs.


![TraceML terminal dashboard](docs/assets/cli_demo_v1.png)

At the end of the run, it prints a compact summary you can review or share.

![TraceML summary](docs/assets/end-of-run-summary.png)

For full setup details, see [docs/quickstart.md](docs/quickstart.md).

Not sure how to interpret the output? Read [How to Read TraceML Output](docs/how-to-read-output.md).

---

## What TraceML tells you

TraceML helps answer questions like:

- Is training input-bound or compute-bound?
- Is one DDP rank slower than the others?
- Is the job wait-heavy because of uneven progress?
- Is memory drifting upward over time?
- Is the slowdown coming from dataloader, forward, backward, or optimizer work?
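To make the first and last questions concrete, here is a minimal sketch of how a step can be split into phases and labeled input-bound or compute-bound. It is an illustration only, not TraceML's implementation: `PhaseTimer`, `classify`, and the 50% threshold are hypothetical names and values chosen for this example, and `time.sleep` stands in for real work.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class PhaseTimer:
    """Accumulates wall-clock seconds per named phase (hypothetical helper)."""

    def __init__(self):
        self.totals = defaultdict(float)

    @contextmanager
    def phase(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start

    def shares(self):
        """Return each phase's fraction of the total measured time."""
        total = sum(self.totals.values())
        if total == 0:
            return {}
        return {name: secs / total for name, secs in self.totals.items()}


def classify(shares, input_phase="dataloader", threshold=0.5):
    """Label a run by the share of time spent waiting on input (assumed threshold)."""
    if not shares:
        return "unknown"
    input_share = shares.get(input_phase, 0.0)
    return "input-bound" if input_share >= threshold else "compute-bound"


# Simulated steps: sleeps stand in for batch loading and model work.
timer = PhaseTimer()
for _ in range(3):
    with timer.phase("dataloader"):
        time.sleep(0.02)
    for name in ("forward", "backward", "optimizer"):
        with timer.phase(name):
            time.sleep(0.002)

for name, share in timer.shares().items():
    print(f"{name}: {share:.0%}")
print(classify(timer.shares()))
```

Note that on a GPU, CUDA kernels run asynchronously, so plain wall-clock timing like this only reflects device work if you synchronize (for example with `torch.cuda.synchronize()`) before reading the clock.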

---

## When to use TraceML

Use TraceML when training feels:

- slower than expected
- unstable from step to step
- imbalanced across distributed ranks
- fine in dashboards but still underperforming

Start with TraceML when you need a fast answer in the terminal.
Reach for `torch.profiler` once you know where to dig deeper.

---

## How it fits with your stack

TraceML is designed to work alongside tools like W&B, MLflow, and TensorBoard.

Use those for:

- experiment tracking
- artifacts
- dashboards
- team reporting

Use TraceML for:

- bottleneck diagnosis
- rank imbalance / straggler detection
- memory trend debugging

See [Use TraceML with W&B / MLflow](docs/use-with-wandb-mlflow.md).
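For intuition about the last two items, the sketch below shows one simple way to flag a straggler rank and to measure memory drift from logged numbers. It is not how TraceML computes these; `find_stragglers`, `memory_slope`, the 1.2x tolerance, and the sample values are all assumptions made for this example.

```python
from statistics import median


def find_stragglers(step_ms_by_rank, tolerance=1.2):
    """Return ranks whose mean step time exceeds tolerance x the median rank."""
    means = {
        rank: sum(times) / len(times)
        for rank, times in step_ms_by_rank.items()
        if times
    }
    if not means:
        return []
    baseline = median(means.values())
    return sorted(rank for rank, mean in means.items() if mean > tolerance * baseline)


def memory_slope(samples_mb):
    """Least-squares slope of memory (MB) per step; positive means upward drift."""
    n = len(samples_mb)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(samples_mb) / n
    cov = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples_mb))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var


# Made-up per-rank step times in milliseconds; rank 2 is the slow one.
step_ms = {0: [100, 102], 1: [101, 99], 2: [150, 155], 3: [100, 100]}
print("stragglers:", find_stragglers(step_ms))

# Made-up memory readings in MB, one per step.
print("MB per step:", memory_slope([1000, 1010, 1020, 1030]))
```

Comparing against the median rather than the mean keeps a single slow rank from shifting the baseline it is judged against.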


---

## Current support

**Works today:**

- single GPU
- single-node DDP/FSDP

**Not yet:**

- multi-node
- tensor parallel
- pipeline parallel

---

## Next steps

- [Quickstart](docs/quickstart.md)
- [Examples](examples/README.md)
- [How to Read TraceML Output](docs/how-to-read-output.md)
- [FAQ](docs/faq.md)
- [Use TraceML with W&B / MLflow](docs/use-with-wandb-mlflow.md)
- [Hugging Face integration](docs/huggingface.md)
- [PyTorch Lightning integration](docs/lightning.md)

---

## Feedback

If TraceML helped you find a slowdown, please open an issue and include:

- hardware / CUDA / PyTorch versions
- single GPU or multi-GPU
- whether you used `run`, `watch`, or `deep`
- the end-of-run summary
- a minimal repro if possible
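If it helps, the version details can be collected with a short script like the one below. This is a convenience sketch rather than a TraceML command: `environment_report` is a hypothetical helper, and it falls back gracefully when PyTorch is not installed.

```python
import platform


def environment_report():
    """Collect Python, OS, PyTorch, and CUDA details for a bug report."""
    report = {
        "python": platform.python_version(),
        "platform": platform.platform(),
    }
    try:
        import torch
    except ImportError:
        report["torch"] = "not installed"
        return report

    report["torch"] = torch.__version__
    # torch.version.cuda is None on CPU-only builds.
    report["cuda"] = torch.version.cuda or "none (CPU-only build)"
    report["gpu_count"] = torch.cuda.device_count()
    if torch.cuda.is_available():
        report["gpus"] = [
            torch.cuda.get_device_name(i)
            for i in range(torch.cuda.device_count())
        ]
    return report


for key, value in environment_report().items():
    print(f"{key}: {value}")
```

Paste the printed lines into the issue alongside the end-of-run summary.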

GitHub issues: https://github.com/traceopt-ai/traceml/issues

Email: support@traceopt.ai

---

## Contributing

Contributions are welcome, especially:

- reproducible slowdown cases
- bug reports
- docs improvements
- integrations
- examples

---

## License

Apache 2.0. See [LICENSE](LICENSE).
