Metadata-Version: 2.4
Name: trainkeeper
Version: 0.3.0
Summary: Production-grade ML training toolkit with distributed training, GPU profiling, smart checkpointing, and interactive dashboards
Author: Mohamed Salem
License-Expression: Apache-2.0
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pyyaml>=6.0
Requires-Dist: numpy>=1.23
Requires-Dist: pandas>=1.5
Requires-Dist: requests>=2.28
Provides-Extra: torch
Requires-Dist: torch>=2.0; extra == "torch"
Provides-Extra: vision
Requires-Dist: torchvision; extra == "vision"
Provides-Extra: nlp
Requires-Dist: datasets; extra == "nlp"
Requires-Dist: transformers; extra == "nlp"
Requires-Dist: torch; extra == "nlp"
Provides-Extra: tabular
Requires-Dist: scikit-learn; extra == "tabular"
Requires-Dist: openml; extra == "tabular"
Provides-Extra: wandb
Requires-Dist: wandb>=0.16; extra == "wandb"
Provides-Extra: mlflow
Requires-Dist: mlflow>=2.9; extra == "mlflow"
Provides-Extra: dashboard
Requires-Dist: streamlit>=1.28; extra == "dashboard"
Requires-Dist: plotly>=5.17; extra == "dashboard"
Requires-Dist: pyyaml>=6.0; extra == "dashboard"
Provides-Extra: serving
Requires-Dist: onnx>=1.14; extra == "serving"
Requires-Dist: onnxruntime>=1.15; extra == "serving"
Requires-Dist: torch-model-archiver>=0.7; extra == "serving"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0; extra == "dev"
Requires-Dist: mkdocs>=1.5; extra == "dev"
Requires-Dist: mkdocs-material>=9.0; extra == "dev"
Requires-Dist: ruff>=0.1.0; extra == "dev"
Provides-Extra: bench
Requires-Dist: matplotlib>=3.7; extra == "bench"
Provides-Extra: all
Requires-Dist: torch>=2.0; extra == "all"
Requires-Dist: streamlit>=1.28; extra == "all"
Requires-Dist: plotly>=5.17; extra == "all"
Requires-Dist: wandb>=0.16; extra == "all"
Requires-Dist: mlflow>=2.9; extra == "all"
Requires-Dist: onnx>=1.14; extra == "all"
Dynamic: license-file

<div align="center">
  <img src="https://raw.githubusercontent.com/mosh3eb/TrainKeeper/main/assets/branding/trainkeeper-logo.png" alt="TrainKeeper Logo" width="100%">
  
  <br>

  [![PyPI Version](https://img.shields.io/pypi/v/trainkeeper?style=for-the-badge&color=blue)](https://pypi.org/project/trainkeeper/)
  [![Python Versions](https://img.shields.io/pypi/pyversions/trainkeeper?style=for-the-badge&color=green)](https://pypi.org/project/trainkeeper/)
  [![License](https://img.shields.io/badge/license-Apache--2.0-orange?style=for-the-badge)](LICENSE)

  <h3>Production-Grade Training Guardrails for PyTorch</h3>

  <p>Reproducible • Debuggable • Distributed • Efficient</p>
</div>

---

**TrainKeeper** is a minimal-decision, high-signal toolkit for building robust ML training systems. It adds guardrails **inside** your training loops without replacing your existing stack (PyTorch, Lightning, Accelerate).

## ⚡️ Why TrainKeeper?

Most training failures happen **silently** inside the training loop: non-determinism, data drift, unstable gradients, and inconsistent environments. TrainKeeper addresses these with composable modules that need no configuration.

- **🔒 Zero-Surprise Reproducibility**: Automatic seed setting, environment capture, and git state locking.
- **🛡️ Data Integrity**: Schema inference and drift detection catch data problems *before* training wastes GPU hours.
- **🚅 Distributed Made Easy**: Auto-configured DDP and FSDP with a single line of code.
- **📉 Resource Efficiency**: GPU memory profiling and smart checkpointing that respects disk limits.
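
To make the Data Integrity bullet concrete, here is a minimal standard-library sketch of the general idea: infer a schema from a reference batch, compare it with an incoming batch, and measure how far a numeric column has moved. It is not TrainKeeper's implementation; `infer_schema`, `schema_mismatches`, and `mean_shift` are hypothetical names chosen for illustration.

```python
from statistics import mean, pstdev

# Hypothetical helpers for illustration only; not part of the TrainKeeper API.

def infer_schema(rows):
    """Map each column name to the name of the Python type seen in the first row."""
    return {name: type(value).__name__ for name, value in rows[0].items()}

def schema_mismatches(reference, candidate):
    """Return the columns whose presence or type differs between two schemas."""
    columns = set(reference) | set(candidate)
    return sorted(c for c in columns if reference.get(c) != candidate.get(c))

def mean_shift(reference_values, new_values):
    """Distance between the two means, in reference standard deviations."""
    spread = pstdev(reference_values) or 1.0  # avoid dividing by zero on constant columns
    return abs(mean(new_values) - mean(reference_values)) / spread

reference = [{"age": 30, "income": 52000.0}, {"age": 40, "income": 61000.0}]
incoming = [{"age": 31, "income": "n/a"}, {"age": 75, "income": "n/a"}]

# "income" changed from float to str, so it is reported as a mismatch.
mismatches = schema_mismatches(infer_schema(reference), infer_schema(incoming))

# A large shift in "age" suggests the incoming batch no longer resembles the reference.
shift = mean_shift([r["age"] for r in reference], [r["age"] for r in incoming])
print(mismatches, round(shift, 2))
```

Running a check like this before the first training step is what lets a bad batch fail fast instead of after hours of GPU time.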

## 📦 Installation

```bash
pip install trainkeeper
```

The base install does not pull in PyTorch. Optional extras declared in the package metadata (`torch`, `vision`, `nlp`, `tabular`, `wandb`, `mlflow`, `dashboard`, `serving`, `bench`, `dev`, `all`) add the integrations you need.

## 🚀 Quick Start

Wrap your entry point to "freeze" the experimental conditions:

```python
from trainkeeper.experiment import run_reproducible

@run_reproducible(auto_capture_git=True)
def train():
    print("TrainKeeper is running: Experiment is now reproducible.")

if __name__ == "__main__":
    train()
```
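
For intuition about what "freezing" the conditions can involve, the sketch below seeds the standard-library RNG and records the interpreter, platform, and git commit before each call. It is an assumption-laden illustration rather than what `run_reproducible` does internally; `reproducible` and `capture_environment` are hypothetical names.

```python
import functools
import platform
import random
import subprocess
import sys

def capture_environment():
    """Collect a small snapshot of the interpreter, OS, and current git commit."""
    try:
        commit = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        commit = None  # not inside a git checkout, or git is not installed
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "git_commit": commit,
    }

def reproducible(seed=0):
    """Hypothetical decorator: reseed the stdlib RNG and record the environment per call."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            random.seed(seed)
            wrapper.environment = capture_environment()
            return func(*args, **kwargs)
        return wrapper
    return decorator

@reproducible(seed=123)
def sample():
    return [random.random() for _ in range(3)]

# Both calls start from the same seed, so they draw identical numbers.
first, second = sample(), sample()
print(first == second, sample.environment["python"])
```

A real training setup would also seed NumPy and PyTorch (for example with `numpy.random.seed` and `torch.manual_seed`), which this sketch leaves out to stay dependency-free.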

## ✨ Features at a Glance

### 1. Distributed Training (DDP & FSDP)
Stop fighting with `torchrun`.

```python
from trainkeeper.distributed import distributed_training, wrap_model_fsdp

with distributed_training() as dist_config:
    model = MyModel()  # MyModel: your own torch.nn.Module subclass
    model = wrap_model_fsdp(model, dist_config)  # FSDP with auto-wrapping!
```
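
One plausible ingredient of this kind of auto-configuration is reading the `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` variables that `torchrun` exports to each worker, and falling back to single-process defaults when they are absent. The sketch below shows that step alone; `LaunchInfo` and `read_launch_info` are hypothetical names, and the contents of TrainKeeper's `dist_config` may differ.

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class LaunchInfo:
    """Hypothetical container for the process layout; not TrainKeeper's dist_config."""
    rank: int
    local_rank: int
    world_size: int

    @property
    def is_distributed(self):
        return self.world_size > 1

def read_launch_info(environ=None):
    """Read the variables torchrun exports, defaulting to a single process."""
    env = os.environ if environ is None else environ
    return LaunchInfo(
        rank=int(env.get("RANK", 0)),
        local_rank=int(env.get("LOCAL_RANK", 0)),
        world_size=int(env.get("WORLD_SIZE", 1)),
    )

# Plain `python train.py`: no launcher variables, so this is a single process.
single = read_launch_info({})

# What the fourth of four workers would see under a launcher.
worker = read_launch_info({"RANK": "3", "LOCAL_RANK": "1", "WORLD_SIZE": "4"})
print(single.is_distributed, worker.rank, worker.world_size)
```

Passing the environment as an argument keeps the function easy to test without actually launching multiple processes.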

### 2. GPU Memory Profiler
Find leaks and optimize batch sizes automatically.

```python
from trainkeeper.gpu_profiler import GPUProfiler

profiler = GPUProfiler()
profiler.start()
# ... training loop ...
print(profiler.stop().summary())
# Output: "Fragmentation detected (35%). Suggestion: Empty cache at epoch end."
```
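
The fragmentation figure in a summary like the one above can be approximated from two numbers: bytes held by live tensors and bytes reserved by the caching allocator. The sketch below uses the share of reserved memory that is not allocated as a rough proxy. This is an assumption, not TrainKeeper's documented metric; `fragmentation_ratio`, `advise`, and the 30% threshold are hypothetical. In PyTorch the two inputs would typically come from `torch.cuda.memory_allocated()` and `torch.cuda.memory_reserved()`.

```python
def fragmentation_ratio(allocated_bytes, reserved_bytes):
    """Fraction of reserved memory that is not backing live tensors."""
    if reserved_bytes <= 0:
        return 0.0  # nothing reserved yet, so nothing can be fragmented
    return (reserved_bytes - allocated_bytes) / reserved_bytes

def advise(allocated_bytes, reserved_bytes, threshold=0.30):
    """Turn the ratio into a short message; the threshold is an arbitrary example."""
    ratio = fragmentation_ratio(allocated_bytes, reserved_bytes)
    if ratio > threshold:
        return f"Fragmentation detected ({ratio:.0%}). Consider emptying the cache."
    return f"Fragmentation within threshold ({ratio:.0%})."

GIB = 1024 ** 3

# Hard-coded sample values standing in for real allocator readings.
message = advise(allocated_bytes=int(6.5 * GIB), reserved_bytes=10 * GIB)
print(message)
```

Sampling these two values at the end of every epoch, rather than once, is what makes a slow leak visible as a steadily rising allocated figure.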

### 3. Interactive Dashboard
Explore experiments, compare metrics, and analyze drift.

```bash
pip install "trainkeeper[dashboard]"  # quotes keep shells such as zsh from expanding the brackets
tk dashboard
```

## 🔗 Links

- **GitHub Repository**: [mosh3eb/TrainKeeper](https://github.com/mosh3eb/TrainKeeper)
- **Full Documentation**: [docs/ on GitHub](https://github.com/mosh3eb/TrainKeeper/tree/main/docs)

