Metadata-Version: 2.4
Name: trainlens-ai
Version: 1.2.3
Summary: TrainLens: Lightweight training runtime health monitor.
Author-email: Venkata Pydipalli <vsnm.tej@gmail.com>
License: Proprietary
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: rich>=12.0.0
Requires-Dist: psutil>=5.0.0
Requires-Dist: pynvml>=11.5.0
Requires-Dist: scikit-learn>=1.3.0
Requires-Dist: numpy<2
Requires-Dist: pandas>=2.0.0
Requires-Dist: ipython>=7.0.0
Requires-Dist: ipywidgets>=7.5.0
Requires-Dist: nicegui
Requires-Dist: plotly
Requires-Dist: msgspec
Requires-Dist: msgpack
Requires-Dist: nvidia-ml-py>=13.595.45
Requires-Dist: bcrypt>=4.0.0
Provides-Extra: torch
Requires-Dist: torch>=2.5.0; extra == "torch"
Requires-Dist: torchvision>=0.20.0; extra == "torch"
Provides-Extra: wandb
Requires-Dist: wandb>=0.17; extra == "wandb"
Provides-Extra: mlflow
Requires-Dist: mlflow>=2.11; extra == "mlflow"
Provides-Extra: prometheus
Requires-Dist: prometheus-client>=0.16; extra == "prometheus"
Provides-Extra: otel
Requires-Dist: opentelemetry-api>=1.20; extra == "otel"
Requires-Dist: opentelemetry-sdk>=1.20; extra == "otel"
Requires-Dist: opentelemetry-exporter-otlp-proto-grpc>=1.20; extra == "otel"
Requires-Dist: opentelemetry-exporter-otlp-proto-http>=1.20; extra == "otel"
Provides-Extra: emitters
Requires-Dist: wandb>=0.17; extra == "emitters"
Requires-Dist: mlflow>=2.11; extra == "emitters"
Requires-Dist: prometheus-client>=0.16; extra == "emitters"
Requires-Dist: opentelemetry-api>=1.20; extra == "emitters"
Requires-Dist: opentelemetry-sdk>=1.20; extra == "emitters"
Requires-Dist: opentelemetry-exporter-otlp-proto-grpc>=1.20; extra == "emitters"
Requires-Dist: opentelemetry-exporter-otlp-proto-http>=1.20; extra == "emitters"
Provides-Extra: dev
Requires-Dist: black>=26.1.0; extra == "dev"
Requires-Dist: ruff>=0.14.14; extra == "dev"
Requires-Dist: isort>=7.0.0; extra == "dev"
Requires-Dist: pre-commit; extra == "dev"
Requires-Dist: codespell>=2.4.1; extra == "dev"
Requires-Dist: nbstripout; extra == "dev"
Requires-Dist: pytest>=8.0.0; extra == "dev"
Requires-Dist: coverage>=7.10.5; extra == "dev"
Requires-Dist: pytest-cov>=7.0.0; extra == "dev"
Requires-Dist: datasets; extra == "dev"
Requires-Dist: transformers; extra == "dev"
Provides-Extra: hf
Requires-Dist: transformers; extra == "hf"
Requires-Dist: accelerate>=0.26.0; extra == "hf"
Provides-Extra: lightning
Requires-Dist: lightning>=2.6.0; extra == "lightning"
Provides-Extra: k8s
Requires-Dist: fastapi>=0.110.0; extra == "k8s"
Requires-Dist: uvicorn>=0.29.0; extra == "k8s"
Requires-Dist: httpx>=0.27.0; extra == "k8s"
Requires-Dist: kubernetes>=28.0.0; extra == "k8s"
Requires-Dist: scipy>=1.10.0; extra == "k8s"
Requires-Dist: bcrypt>=4.0.0; extra == "k8s"
Requires-Dist: python-multipart>=0.0.9; extra == "k8s"
Provides-Extra: ml
Requires-Dist: lifelines>=0.27; extra == "ml"
Requires-Dist: xgboost>=1.7.0; extra == "ml"
Requires-Dist: aeon>=0.9.0; extra == "ml"
Dynamic: license-file

<div align="center">

# TrainLens

**Find why PyTorch training got slow — while the run is still live.**

[![PyPI version](https://img.shields.io/pypi/v/trainlens-ai.svg)](https://pypi.org/project/trainlens-ai/)
[![Python 3.10+](https://img.shields.io/badge/python-3.10%2B-blue)](https://www.python.org/)
[![License](https://img.shields.io/badge/license-Proprietary-red)](./LICENSE)
[![GitHub stars](https://img.shields.io/github/stars/vsnmtej/trainlens?style=social)](https://github.com/vsnmtej/trainlens)
[![GitHub issues](https://img.shields.io/github/issues/vsnmtej/trainlens)](https://github.com/vsnmtej/trainlens/issues)

[**Quickstart**](docs/quickstart.md) • [**ML Intelligence**](docs/trainlens.md) • [**Server Operations**](docs/public/11-server-side.md) • [**Known Limitations**](docs/public/02-known-limitations.md) • [**Examples**](src/examples)

</div>

TrainLens is a lightweight, step-aware bottleneck finder for PyTorch training runs. It attaches to your training loop and surfaces what is actually slowing things down — per step, per rank — without heavyweight profiling overhead.

**The gap it fills:** system dashboards show utilization over time. TrainLens shows what happened **per training step** and, in DDP, **which rank is holding the run back**.

---

## What it catches

- Input pipeline stalls (dataloader / preprocessing wait)
- Step time drift and jitter over the run
- DDP rank stragglers in single-node and multi-node setups
- Memory creep and OOM trajectory
- Gradient explosions and NaN/Inf conditions
- FSDP and Pipeline Parallel overhead (`--grad-diagnostics`)
- NCCL communication failures with root-cause attribution
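
As a rough illustration of what "step time drift and jitter" can mean numerically, the sketch below computes both from a list of per-step durations. The function name `drift_and_jitter`, the window size, and the two definitions (ratio of medians for drift, coefficient of variation for jitter) are assumptions made for this example, not TrainLens internals.

```python
from statistics import mean, median, pstdev


def drift_and_jitter(step_times, window=4):
    """Summarize per-step durations (seconds).

    drift  : median of the last `window` steps / median of the first `window`
    jitter : coefficient of variation (std / mean) over the last `window` steps
    Both definitions are illustrative assumptions.
    """
    if len(step_times) < 2 * window:
        raise ValueError("need at least 2 * window step times")
    baseline = step_times[:window]
    recent = step_times[-window:]
    drift = median(recent) / median(baseline)
    jitter = pstdev(recent) / mean(recent)
    return drift, jitter


# Hypothetical run: steady at 0.10 s, then slower and noisier.
times = [0.10, 0.10, 0.10, 0.10, 0.15, 0.25, 0.15, 0.25]
drift, jitter = drift_and_jitter(times)
print(f"drift x{drift:.2f}, jitter {jitter:.2f}")  # drift x2.00, jitter 0.25
```

A drift well above 1.0 suggests the run has slowed relative to its own baseline; a high jitter suggests individual steps are unstable even if the median is fine.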

---

## Supported configurations

| Configuration | Status |
|---|---|
| Single GPU | Supported |
| Single-node DDP | Supported |
| Multi-node DDP (2–4 nodes, up to ~32 ranks) | Collector can ingest this scale, but true multi-node launch needs deployment-specific launcher integration |
| Multi-node DDP (8+ nodes, 64+ ranks) | Experimental; load-test runtime, storage, and collector throughput first |
| FSDP diagnostics | Supported (`--grad-diagnostics`) |
| Pipeline Parallel bubble diagnostics | Supported (`--grad-diagnostics`) |
| Tensor Parallel diagnostics | Partial (`trace_tp_model`) |
| Full fleet backend / multi-aggregator coordination | Planned |
| TensorFlow / Keras | Planned |

---

## Quick start

```bash
pip install trainlens-ai
```

Wrap your training step:

```python
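from trainlens.decorators import trace_step

# `dataloader`, `model`, `criterion`, and `optimizer` come from your existing script.
for batch in dataloader:
    # The body of the `with` block is what gets traced as one training step.
    with trace_step(model):
        outputs = model(batch["x"])
        loss = criterion(outputs, batch["y"])
        loss.backward()
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)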
```

Run your script through TrainLens:

```bash
trainlens run train.py
```

TrainLens opens a live terminal view alongside your logs and prints a compact summary when the run ends.

See [`docs/quickstart.md`](docs/quickstart.md) for full setup details.

---

## What TrainLens shows

- Step time and its breakdown (forward / backward / optimizer / overhead)
- Dataloader and input wait per step
- Step jitter and drift over time
- GPU memory trend and OOM trajectory
- CPU / RAM / GPU utilization signals
- In DDP: worst-rank vs. median-rank timing and skew per step

This lets you tell whether a slowdown is coming from input, compute, the optimizer, or rank imbalance — before reaching for `torch.profiler`.
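
To make the "worst-rank vs. median-rank" comparison concrete, here is a minimal sketch that takes one step's duration per rank and reports the slowest rank and its skew. The function `rank_skew` and the skew definition `(worst - median) / median` are assumptions for illustration; the source does not state how TrainLens computes skew.

```python
from statistics import median


def rank_skew(step_times_by_rank):
    """Return (worst_rank, worst_time, median_time, skew) for one step.

    skew is defined here as (worst - median) / median, which is an
    illustrative assumption.
    """
    worst_rank = max(step_times_by_rank, key=step_times_by_rank.get)
    worst = step_times_by_rank[worst_rank]
    med = median(step_times_by_rank.values())
    return worst_rank, worst, med, (worst - med) / med


# Hypothetical per-rank durations (seconds) for a single step of a 4-rank job.
step = {0: 0.20, 1: 0.20, 2: 0.30, 3: 0.20}
rank, worst, med, skew = rank_skew(step)
print(f"rank {rank} is slowest: {worst:.2f}s vs median {med:.2f}s (skew {skew:.0%})")
# rank 2 is slowest: 0.30s vs median 0.20s (skew 50%)
```

In synchronous DDP the gradient all-reduce makes every rank wait for the slowest one, so a comparison like this is most informative on time measured before that synchronization point.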

---

## Integrations

### Plain PyTorch

```python
from trainlens.decorators import trace_step

with trace_step(model):
    ...
```

### Hugging Face Trainer

```python
from trainlens.integrations.huggingface import TrainLensTrainer

trainer = TrainLensTrainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    trainlens_enabled=True,
)
```

See [`docs/huggingface.md`](docs/huggingface.md).

### PyTorch Lightning

```python
import lightning as L
from trainlens.integrations.lightning import TrainLensCallback

trainer = L.Trainer(callbacks=[TrainLensCallback()])
```

See [`docs/lightning.md`](docs/lightning.md).

---

## ML Intelligence

TrainLens watches the run in real time, estimates five outcome probabilities, and surfaces an action recommendation. When a termination-grade condition is confirmed, the aggregator writes a `.trainlens_terminate` signal file; `trace_step()` polls that file at step boundaries.

**Prediction chain:**

1. **ColdStartFallback** — rule-based heuristics, active from step 1
2. **XGBoost predictor** — 90-feature tabular run-outcome classifier trained from RunStore history
3. **ROCKET+Ridge predictor** — optional sequence model over per-step loss, grad norm, memory, and step time when enough labeled `step_series` data exists
4. **Parallel ensemble** — runs XGBoost and ROCKET together when both are available, with an agreement gate for alert-grade actions
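
The chain above can be read as a fallback order plus an agreement gate. The sketch below shows one way such a selection could look. Everything in it is hypothetical: the `choose_action` function, the `"continue"` / `"terminate"` labels, and the rule that a disagreement falls back to `"continue"` are assumptions for illustration, not TrainLens's actual logic.

```python
from typing import Callable, Optional

# A predictor maps a feature dict to a recommended action label.
Predictor = Callable[[dict], str]


def choose_action(
    features: dict,
    cold_start: Predictor,
    tabular: Optional[Predictor] = None,
    sequence: Optional[Predictor] = None,
) -> str:
    """Pick an action using a fallback chain with an agreement gate.

    - No trained model available: use the rule-based fallback.
    - One trained model available: use it.
    - Both available: act only if they agree, otherwise "continue".
    """
    available = [p for p in (tabular, sequence) if p is not None]
    if not available:
        return cold_start(features)
    votes = [p(features) for p in available]
    if len(votes) == 1 or votes[0] == votes[1]:
        return votes[0]
    return "continue"


# Stand-in predictors (hypothetical).
def rules(f):
    return "terminate" if f["loss_is_nan"] else "continue"


def tabular_model(f):
    return "terminate"


def sequence_model(f):
    return "continue"


healthy = {"loss_is_nan": False}
print(choose_action(healthy, rules))                                 # continue
print(choose_action(healthy, rules, tabular_model))                  # terminate
print(choose_action(healthy, rules, tabular_model, sequence_model))  # continue
```

The gate trades recall for precision: a single model can still raise an alert when it is the only one available, but once both exist, a lone "terminate" vote is not enough.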

**Termination signal:** if your loop is wrapped in `trace_step()`, TrainLens checks for the `.trainlens_terminate` signal file at the end of each traced step. You can also check manually:

```python
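from trainlens.decorators import trace_step
from trainlens.runtime.auto_terminate import check_terminate_signal

# `model`, `max_steps`, and `session_dir` are assumed to be defined elsewhere in your script.
for step in range(max_steps):
    with trace_step(model):
        ...
    # Manual check at the step boundary, after the traced step has finished.
    if check_terminate_signal(session_dir):
        print("TrainLens: auto-terminating")
        break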
```
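
For readers unfamiliar with the signal-file pattern, the following standalone sketch shows the idea with the standard library only. The helpers `request_terminate` and `terminate_requested` are hypothetical, and placing the file directly inside the session directory is an assumption; only the file name `.trainlens_terminate` comes from the text above.

```python
import tempfile
from pathlib import Path

SIGNAL_NAME = ".trainlens_terminate"  # file name taken from the text above


def request_terminate(session_dir: Path, reason: str) -> Path:
    """Writer side: create the signal file (hypothetical helper)."""
    path = session_dir / SIGNAL_NAME
    path.write_text(reason)
    return path


def terminate_requested(session_dir: Path) -> bool:
    """Reader side: a cheap existence check, suitable for once-per-step polling."""
    return (session_dir / SIGNAL_NAME).exists()


with tempfile.TemporaryDirectory() as tmp:
    session = Path(tmp)
    stopped_at = None
    for step in range(10):
        if step == 3:  # stand-in for a monitor confirming a bad condition
            request_terminate(session, "nan_loss")
        if terminate_requested(session):
            stopped_at = step
            break
    print("stopped at step", stopped_at)  # stopped at step 3
```

A file on disk works across processes without shared memory or sockets, and polling once per step keeps the cost to a single filesystem stat.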

**Run history:**

```bash
trainlens history                          # auto-discover from ./logs
trainlens history --db path/to/ml.db       # specific database
trainlens history --n 50                   # most recent 50 runs
```

**Train a predictor manually:**

```bash
trainlens train-model --db ./logs/<session>/aggregator/telemetry_ml.db --model-dir ./models
```

See [`docs/trainlens.md`](docs/trainlens.md) for the full ML intelligence reference.

---

## CLI reference

| Subcommand | Description |
|---|---|
| `trainlens run train.py` | Live bottleneck diagnosis |
| `trainlens deep train.py` | Adds per-layer timing and memory signals |
| `trainlens inspect telemetry.msgpack` | Decode and print binary telemetry logs |
| `trainlens history` | Review ML run outcomes from RunStore |
| `trainlens train-model` | Train an XGBoost run-outcome predictor |
| `trainlens serve` | Serve the React/FastAPI dashboard over existing logs |
| `trainlens collect` | Run a standalone TCP collector for remote GPU pods |

Add `--grad-diagnostics` to `run` or `deep` to enable gradient diagnostics:

```bash
trainlens run train.py --grad-diagnostics
trainlens deep train.py --grad-diagnostics --nproc-per-node=4
```

The flag enables gradient norm tracking, NaN/Inf detection, MFU (model FLOPs utilization), FSDP latency, communication overlap, and pipeline bubble ratio. A confirmed NaN/Inf condition writes the termination signal file.
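
Two of these metrics have widely used closed-form estimates, sketched below for orientation. The `6 * N * tokens` FLOPs estimate applies to dense transformer training and ignores attention FLOPs, and the bubble formula assumes a GPipe-style schedule. Whether TrainLens uses these exact formulas is not stated here, and the function names and input numbers are hypothetical.

```python
def mfu(params, tokens_per_step, step_time_s, peak_flops_per_s, n_gpus=1):
    """Model FLOPs utilization using the common 6 * N * tokens estimate."""
    achieved = 6 * params * tokens_per_step / step_time_s
    return achieved / (peak_flops_per_s * n_gpus)


def pipeline_bubble_ratio(stages, microbatches):
    """Idle fraction of a GPipe-style schedule: (p - 1) / (m + p - 1)."""
    return (stages - 1) / (microbatches + stages - 1)


# Hypothetical numbers: 1B-parameter model, 500k tokens per step, 2 s per step,
# 8 GPUs with an assumed peak of 300 TFLOP/s each.
print(f"MFU: {mfu(1e9, 5e5, 2.0, 3e14, n_gpus=8):.1%}")                   # MFU: 62.5%
print(f"bubble: {pipeline_bubble_ratio(stages=4, microbatches=12):.1%}")  # bubble: 20.0%
```

The bubble formula shows why adding microbatches helps: with 4 stages, going from 12 to 28 microbatches drops the idle fraction from 3/15 to 3/31.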

---

## Optional model hooks

```python
from trainlens.decorators import trace_model_instance

trace_model_instance(model)
```

Use alongside `trace_step(model)` for per-layer timing and memory signals. The core step-level view works without it.

---

## Scope

TrainLens is for lightweight diagnosis during real PyTorch training runs.

It is **not**:

- a kernel-level tracer
- a general-purpose auto-tuner
- a replacement for `torch.profiler` for deep kernel analysis
- a managed fleet observability platform

Start with TrainLens when you need a fast answer. Reach for deeper profiling after you know where to look.

---

## Feedback

If TrainLens caught a slowdown, please open an issue and include:

- hardware / CUDA / PyTorch versions
- single GPU or DDP
- whether you used core tracing only or model hooks
- the end-of-run summary
- a minimal repro if possible

Email: vsnm.tej@gmail.com

---

## Contributing

External contributions are currently handled through GitHub issues and email. Please open an issue with a minimal reproduction before sending a patch.

---

## License

Proprietary. Copyright 2026 Venkata Pydipalli. All Rights Reserved.

Use of this software requires explicit written permission. See [`LICENSE`](LICENSE) for details or contact vsnm.tej@gmail.com.
