Metadata-Version: 2.4
Name: qlqoqrqa
Version: 0.1.0
Summary: PyTorch training accelerator: AMP, torch.compile, fused kernels, EMA, SWA, schedulers, memory tools, DDP — all in one drop-in wrapper.
License: MIT
Project-URL: Homepage, https://github.com/yourusername/qlqoqrqa
Project-URL: Issues, https://github.com/yourusername/qlqoqrqa/issues
Keywords: deep-learning,pytorch,training,accelerator,mixed-precision,optimization,distributed
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch>=2.0.0
Provides-Extra: dev
Requires-Dist: pytest; extra == "dev"
Requires-Dist: black; extra == "dev"
Requires-Dist: ruff; extra == "dev"
Requires-Dist: build; extra == "dev"
Requires-Dist: twine; extra == "dev"
Dynamic: license-file

# qlqoqrqa ⚡

> A serious PyTorch training accelerator. Drop in one wrapper, eliminate common bottlenecks.

---

## Install

```bash
pip install qlqoqrqa
```

**Requirements:** Python ≥ 3.9, PyTorch ≥ 2.0

---

## Quickstart

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
from qlqoqrqa import Accelerator, AcceleratorConfig

model     = nn.TransformerEncoder(...)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
loader    = DataLoader(TensorDataset(X, y), batch_size=256, shuffle=True)

cfg = AcceleratorConfig(
    dtype              = "bf16",   # BF16 AMP on Ampere+, FP16+scaler elsewhere
    compile            = True,     # torch.compile with Triton kernel fusion
    grad_accum_steps   = 4,        # effective batch_size = 256 × 4 = 1024
    gradient_checkpointing = False,
    fused_optimizer    = True,
    turbo_dataloader   = True,
    max_grad_norm      = 1.0,
)
acc = Accelerator(model, optimizer, loader, config=cfg)

for epoch in range(10):
    for batch in acc.dataloader:
        loss = acc.step(batch, forward_fn=lambda m, b: F.cross_entropy(m(b[0]), b[1]))
```

---

## Optimization techniques

| Technique | What it does |
|-----------|-------------|
| **BF16 / FP16 AMP** | Half-precision forward + backward; FP16 uses loss scaling, BF16 doesn't need it |
| **torch.compile** | Triton-generated fused GPU kernels via max-autotune |
| **Fused optimizer** | `fused=True` Adam/SGD: per-parameter updates batched into fused CUDA kernels instead of a per-param Python loop |
| **Turbo DataLoader** | `pin_memory`, `persistent_workers`, CUDA stream prefetch |
| **set_to_none zero_grad** | Avoids memset overhead on gradient buffers |
| **channels_last** | Contiguous NHWC layout; often a sizeable speedup on conv-heavy nets with Tensor Cores |
| **TF32** | Ampere+ matmuls with 10-bit-mantissa TF32 inputs and full FP32 accumulation |
| **cuDNN benchmark** | Auto-selects fastest conv algorithm per input shape |
| **Gradient accumulation** | Simulates large batches without extra VRAM |
| **Gradient checkpointing** | Recomputes activations during backward, trading extra compute for a large cut in peak VRAM |
| **Gradient clipping** | Stabilises training, prevents loss spikes |
| **DDP (multi-GPU)** | Auto-detected via `RANK` env var (torchrun) |
| **EMA weights** | Exponential moving average — better generalisation |
| **SWA** | Stochastic Weight Averaging — flatter minima |
| **Warmup + cosine LR** | Standard LR schedule for transformers |
| **Early stopping** | Stops when validation stops improving |
| **Memory tracker** | Measures peak VRAM usage per phase |
| **Auto batch size** | Binary-searches the largest batch that fits in VRAM |
| **Activation checkpointing** | Per-module gradient checkpointing for arbitrary architectures |
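Most of the one-line wins in the table map to standard PyTorch switches. A minimal sketch of roughly what the wrapper flips on under the hood (illustrative plain-PyTorch code, not the library's actual implementation):

```python
import torch
import torch.nn as nn

# TF32 matmuls and cuDNN autotuning (take effect on Ampere+ GPUs; harmless no-ops on CPU)
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
torch.backends.cudnn.benchmark = True

# channels_last: NHWC memory layout for conv-heavy models
model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.ReLU()).to(memory_format=torch.channels_last)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(8, 3, 32, 32).to(memory_format=torch.channels_last)
model(x).sum().backward()
optimizer.step()
optimizer.zero_grad(set_to_none=True)  # drop grad buffers instead of zero-filling them
```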

---

## High-level Trainer

```python
from qlqoqrqa import Trainer, EMA, get_scheduler

ema = EMA(model, decay=0.9999)
scheduler = get_scheduler("cosine", optimizer, warmup_steps=500, total_steps=10000)

trainer = Trainer(
    model=model,
    optimizer=optimizer,
    train_loader=train_loader,
    val_loader=val_loader,
    forward_fn=lambda m, b: m(b["input_ids"]),
    loss_fn=lambda out, b: F.cross_entropy(out, b["labels"]),
    metric_fn=lambda out, b: {"loss": F.cross_entropy(out, b["labels"]).item()},
    scheduler=scheduler,
    ema=ema,
    epochs=20,
    early_stopping_patience=5,
    checkpoint_path="checkpoints/best.pt",
)
history = trainer.fit()
```
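Internally, `Trainer.fit()` is presumably a standard train / validate / early-stop cycle. A self-contained sketch of that shape (the `fit_sketch` name and exact signature are illustrative, not the library's API):

```python
import math
import torch

def fit_sketch(model, optimizer, train_loader, val_loader,
               forward_fn, loss_fn, epochs=20, patience=5):
    # Plausible outline of a fit loop: train each epoch, validate,
    # and stop early once val loss hasn't improved for `patience` epochs.
    best, bad_epochs, history = math.inf, 0, []
    for _ in range(epochs):
        model.train()
        for batch in train_loader:
            loss = loss_fn(forward_fn(model, batch), batch)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad(set_to_none=True)
        model.eval()
        with torch.no_grad():
            val = sum(loss_fn(forward_fn(model, b), b).item() for b in val_loader)
        history.append(val)
        if val < best:
            best, bad_epochs = val, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    return history
```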

---

## Optimizers

```python
from qlqoqrqa import FusedAdamW, Lion

# foreach-vectorized AdamW
opt = FusedAdamW(model.parameters(), lr=3e-4, weight_decay=0.01)

# Lion — sign-only momentum updates; half the optimizer state of Adam (one buffer instead of two)
opt = Lion(model.parameters(), lr=1e-4, betas=(0.9, 0.99))
```
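Lion's update rule (from the "Symbolic Discovery of Optimization Algorithms" paper) is short enough to show in full. A single-tensor sketch of one step, not the library's implementation:

```python
import torch

def lion_update(p, grad, m, lr=1e-4, beta1=0.9, beta2=0.99, wd=0.0):
    # Lion rule: step in the *sign* of an interpolated momentum,
    # then update the momentum buffer with the second beta.
    update = (beta1 * m + (1 - beta1) * grad).sign()
    p.mul_(1 - lr * wd).add_(update, alpha=-lr)  # decoupled weight decay + signed step
    m.mul_(beta2).add_(grad, alpha=1 - beta2)
    return p, m
```

Because the step is `sign()`-based, every parameter moves by exactly `±lr`, which is why Lion typically wants a smaller learning rate than AdamW.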

---

## Memory tools

```python
from qlqoqrqa import MemoryTracker, find_batch_size, apply_activation_checkpointing, empty_cache

# Find the biggest batch that fits in VRAM
bs = find_batch_size(model, sample_input_fn=lambda n: torch.randn(n, 3, 224, 224))

# Track peak VRAM usage
tracker = MemoryTracker()
tracker.start()
train_one_epoch(...)
print(tracker.stop())
# {'peak_mb': 8192.0, 'delta_mb': 7680.0, 'utilization_pct': 80.0, ...}

# Apply activation checkpointing to specific layer types
from mymodel import TransformerBlock
model = apply_activation_checkpointing(model, module_types=(TransformerBlock,))

# Release memory between train and eval
empty_cache()
```
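Peak-VRAM tracking of this kind can be approximated with PyTorch's built-in CUDA memory counters. A rough stand-in (class name and return keys are illustrative; it returns zeros on CPU-only machines):

```python
import torch

class PeakMemoryTracker:
    # Sketch of a MemoryTracker-style helper built on torch.cuda counters.
    def start(self):
        self.start_mb = 0.0
        if torch.cuda.is_available():
            torch.cuda.reset_peak_memory_stats()
            self.start_mb = torch.cuda.memory_allocated() / 2**20

    def stop(self):
        if not torch.cuda.is_available():
            return {"peak_mb": 0.0, "delta_mb": 0.0}
        peak_mb = torch.cuda.max_memory_allocated() / 2**20
        return {"peak_mb": peak_mb, "delta_mb": peak_mb - self.start_mb}
```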

---

## Benchmark

```python
from qlqoqrqa import benchmark, compare_speedup

baseline  = benchmark(lambda: model_eager(x),   batch_size=64, n_runs=100)
compiled  = benchmark(lambda: model_compiled(x), batch_size=64, n_runs=100)
compare_speedup(baseline, compiled)
```

```
────────────────────────────────────────────────────────
  Metric                      Baseline       qlqoqrqa
────────────────────────────────────────────────────────
  Mean latency (ms)             124.50           5.20
  Throughput (samp/s)              513          12307
  Speedup                         1.00x          24.0x
────────────────────────────────────────────────────────
```
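A `benchmark()`-style measurement is easy to reproduce with the standard library. A wall-clock sketch under the assumption that the real function reports mean latency and derived throughput (the name `benchmark_sketch` is illustrative):

```python
import time
import statistics

def benchmark_sketch(fn, batch_size, n_runs=100, warmup=10):
    # Warm up, time n_runs calls, report mean latency and throughput.
    for _ in range(warmup):
        fn()
    times_ms = []
    for _ in range(n_runs):
        t0 = time.perf_counter()
        fn()
        times_ms.append((time.perf_counter() - t0) * 1000)
    mean_ms = statistics.mean(times_ms)
    return {"mean_latency_ms": mean_ms, "throughput": batch_size / (mean_ms / 1000)}
```

For GPU workloads, a `torch.cuda.synchronize()` before each timestamp is needed for honest latencies, since CUDA kernel launches are asynchronous.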

---

## Profiler

```python
from qlqoqrqa.profiler.trace import profile_training

with profile_training(active_steps=20, output_path="trace.json") as prof:
    for i, batch in enumerate(loader):
        loss = acc.step(batch, forward_fn)
        prof.step()

prof.print_top(20)   # operator-level bottleneck table
# Open trace.json in https://ui.perfetto.dev
```
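`profile_training` presumably wraps `torch.profiler`. One plausible equivalent, assuming a skip-one / warm-up-one / record-N schedule and a Chrome-trace dump (the wrapper name and schedule values are assumptions):

```python
import torch
from torch.profiler import ProfilerActivity, profile, schedule

def profile_training_sketch(output_path="trace.json", active_steps=5):
    # Skip one step, warm up one step, record `active_steps`,
    # then dump a Chrome/Perfetto-compatible trace to output_path.
    activities = [ProfilerActivity.CPU]
    if torch.cuda.is_available():
        activities.append(ProfilerActivity.CUDA)
    return profile(
        activities=activities,
        schedule=schedule(wait=1, warmup=1, active=active_steps),
        on_trace_ready=lambda p: p.export_chrome_trace(output_path),
    )
```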

---

## LR Schedulers

```python
from qlqoqrqa import get_scheduler

sched = get_scheduler("cosine", optimizer, warmup_steps=200, total_steps=5000)
sched = get_scheduler("linear", optimizer, warmup_steps=100, total_steps=5000)
sched = get_scheduler("onecycle", optimizer, total_steps=5000, max_lr=1e-3)
sched = get_scheduler("constant", optimizer)
```
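The `"cosine"` schedule is presumably the standard linear-warmup-then-cosine-decay curve used for transformers. A self-contained equivalent via `LambdaLR` (the `warmup_cosine` name is illustrative, not the library's API):

```python
import math
import torch
from torch.optim.lr_scheduler import LambdaLR

def warmup_cosine(optimizer, warmup_steps, total_steps):
    # Linear warmup from 0 to the base LR, then cosine decay down to 0.
    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))
    return LambdaLR(optimizer, lr_lambda)
```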

---

## EMA & SWA

```python
from qlqoqrqa import EMA, SWA

# EMA — call after every optimizer.step()
ema = EMA(model, decay=0.9999)
ema.update()
with ema.average_parameters():
    val_loss = evaluate(model, val_loader)

# SWA — averages weights across epochs
swa = SWA(model, swa_start_epoch=7, swa_lr=5e-4)
swa.attach_optimizer(optimizer)
for epoch in range(10):
    train(...)
    swa.update(epoch)
swa.finalize(train_loader)
```
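The EMA update itself is one line of math: `shadow ← decay · shadow + (1 − decay) · param` after every optimizer step. A minimal sketch of a shadow-copy EMA (class name illustrative, not the library's `EMA`):

```python
import copy
import torch

class SimpleEMA:
    # Keeps a frozen shadow copy of the model and blends it toward
    # the live parameters after each optimizer step.
    def __init__(self, model, decay=0.999):
        self.decay = decay
        self.shadow = copy.deepcopy(model).eval()
        for p in self.shadow.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model):
        for s, p in zip(self.shadow.parameters(), model.parameters()):
            s.lerp_(p, 1.0 - self.decay)  # in-place: s = decay*s + (1-decay)*p
```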

---

## Multi-GPU (torchrun)

```bash
torchrun --nproc_per_node=4 train.py
```

`qlqoqrqa` auto-detects `RANK` / `LOCAL_RANK` and wraps the model in DDP.
No code changes needed.
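The detection logic is plausibly a few lines around `torch.distributed`. An illustrative sketch (not the library's actual code) of RANK-based auto-wrapping, with the backend chosen by device type:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def maybe_wrap_ddp(model):
    # Wrap in DDP only when torchrun has exported RANK; otherwise no-op.
    if "RANK" not in os.environ:
        return model  # single-process run: leave the model untouched
    if not dist.is_initialized():
        dist.init_process_group("nccl" if torch.cuda.is_available() else "gloo")
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    if torch.cuda.is_available():
        torch.cuda.set_device(local_rank)
        return DDP(model.cuda(local_rank), device_ids=[local_rank])
    return DDP(model)
```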

---

## AcceleratorConfig reference

```python
AcceleratorConfig(
    dtype                  = "auto",        # "auto"|"bf16"|"fp16"|"fp32"
    compile                = True,
    compile_mode           = "max-autotune",
    grad_accum_steps       = 1,
    max_grad_norm          = 1.0,           # 0 = disabled
    gradient_checkpointing = False,
    channels_last          = True,
    turbo_dataloader       = True,
    num_workers            = -1,            # -1 = auto
    prefetch_factor        = 4,
    tf32                   = True,
    cudnn_benchmark        = True,
    distributed            = False,
    fused_optimizer        = True,
    verbose                = True,
)
```
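`dtype="auto"` presumably resolves to BF16 on GPUs that support it, FP16 on older CUDA GPUs, and FP32 on CPU. A plausible sketch of that resolution (the function name and exact fallback order are assumptions):

```python
import torch

def resolve_dtype(choice="auto"):
    # Hypothetical mapping of the dtype config knob to a torch dtype.
    explicit = {"bf16": torch.bfloat16, "fp16": torch.float16, "fp32": torch.float32}
    if choice in explicit:
        return explicit[choice]
    if not torch.cuda.is_available():
        return torch.float32  # AMP off on CPU
    return torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
```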

---

## Publish to PyPI

```bash
pip install build twine
python -m build
twine upload dist/*
```

---

## License

MIT
