Metadata-Version: 2.4
Name: outofcuda
Version: 0.1.0
Summary: Zero-code CUDA memory & compute optimizer — works the moment you pip install it
Author: outofcuda contributors
License: MIT
Project-URL: Homepage, https://github.com/outofcuda/outofcuda
Project-URL: Documentation, https://outofcuda.readthedocs.io
Project-URL: Repository, https://github.com/outofcuda/outofcuda
Project-URL: Bug Tracker, https://github.com/outofcuda/outofcuda/issues
Project-URL: Changelog, https://github.com/outofcuda/outofcuda/blob/main/CHANGELOG.md
Keywords: cuda,gpu,pytorch,memory,optimization,deep-learning,machine-learning,vram,out-of-memory,oom,torch,autocast,amp,performance
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: System :: Hardware
Classifier: Topic :: Utilities
Classifier: Typing :: Typed
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Provides-Extra: torch
Requires-Dist: torch>=2.0; extra == "torch"
Provides-Extra: dev
Requires-Dist: torch>=2.0; extra == "dev"
Requires-Dist: pytest>=7; extra == "dev"
Requires-Dist: pytest-cov; extra == "dev"
Requires-Dist: mypy; extra == "dev"
Requires-Dist: ruff; extra == "dev"
Dynamic: license-file

# outofcuda 🚀

> **Zero-code CUDA memory & compute optimizer for PyTorch.**  
> Install once. It works. No imports, no function calls, no config needed.

[![PyPI version](https://img.shields.io/pypi/v/outofcuda.svg)](https://pypi.org/project/outofcuda/)
[![Python](https://img.shields.io/pypi/pyversions/outofcuda.svg)](https://pypi.org/project/outofcuda/)
[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)
[![Tests](https://github.com/outofcuda/outofcuda/actions/workflows/ci.yml/badge.svg)](https://github.com/outofcuda/outofcuda/actions)

---

## The problem

You're training a model, everything's going great, then:

```
torch.cuda.OutOfMemoryError: CUDA out of memory.
Tried to allocate 2.50 GiB.
```

Or your GPU is sitting at 40 % utilization because PyTorch's defaults were
written for correctness, not speed.

**outofcuda** fixes both — automatically, the moment you install it.

---

## Install

```bash
pip install outofcuda
```

That's it. The package ships a Python site hook (a `.pth` file) that registers
itself at interpreter startup and applies every optimization listed below the
moment `torch` is first imported. **You don't write a single line of code.**

---

## How it works

At interpreter startup, Python's `site` module executes every `.pth` file in
`site-packages`. `outofcuda` installs `outofcuda_hook.pth`, which imports a tiny
bootstrap module. The bootstrap registers a meta-path watcher that fires once,
right after `torch` is imported, and applies the full optimization suite:

```
Python starts
  └─ site processes outofcuda_hook.pth
       └─ _hook.py registers a meta-path watcher
            └─ torch is imported (by your code or a library)
                 └─ outofcuda.apply() runs automatically ✓
```

Everything is **lazy** — if torch is never imported, outofcuda does nothing.
If CUDA is unavailable, it silently skips all GPU-specific steps.
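
For the curious, the watcher follows the standard post-import-hook pattern (the same one libraries like `wrapt` use). Below is a minimal, self-contained sketch of that pattern; the class names are illustrative, not outofcuda's actual internals:

```python
import importlib.abc
import importlib.util
import sys


class _ChainedLoader(importlib.abc.Loader):
    """Runs the real loader, then fires the callback."""

    def __init__(self, loader, callback):
        self._loader = loader
        self._callback = callback

    def create_module(self, spec):
        return self._loader.create_module(spec)

    def exec_module(self, module):
        self._loader.exec_module(module)  # let torch import normally
        self._callback(module)            # then apply the optimizations


class _PostImportFinder(importlib.abc.MetaPathFinder):
    """Watches sys.meta_path for one module and chains its loader."""

    def __init__(self, name, callback):
        self._name = name
        self._callback = callback
        self._busy = False  # guards against recursing into ourselves

    def find_spec(self, fullname, path=None, target=None):
        if fullname != self._name or self._busy:
            return None
        self._busy = True
        try:
            # Let the normal finders locate the module...
            spec = importlib.util.find_spec(fullname)
        finally:
            self._busy = False
        if spec is None or spec.loader is None:
            return None
        # ...then swap in a loader that calls us back after exec.
        spec.loader = _ChainedLoader(spec.loader, self._callback)
        return spec


sys.meta_path.insert(0, _PostImportFinder("torch", lambda mod: print("torch ready")))
```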

---

## What gets optimized

### 1 · TF32 Matmul & cuDNN (Ampere+)

```python
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32       = True
```

On RTX 30xx / A100 / H100, TF32 delivers up to **10× faster** matrix
multiplications with negligible precision loss for most models.

### 2 · cuDNN autotuner

```python
torch.backends.cudnn.benchmark = True
```

cuDNN benchmarks candidate convolution algorithms on the first batch and caches
the fastest one per input shape. That one-time cost pays off across every
subsequent forward pass, provided your input shapes stay stable; wildly varying
shapes trigger repeated re-benchmarking (set `OUTOFCUDA_CUDNN_BENCHMARK=0` in
that case).

### 3 · CUDA memory allocator tuning

```
PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128,garbage_collection_threshold:0.8,expandable_segments:True
```

(one line by design: the value is a single comma-separated string and must contain no whitespace)

- **`max_split_size_mb:128`** — stops the caching allocator from splitting
  blocks larger than 128 MB, reducing OOM errors caused by fragmentation.
- **`garbage_collection_threshold:0.8`** — triggers the allocator's internal GC
  once 80 % of reserved memory is in use, proactively reclaiming unused cached
  blocks.
- **`expandable_segments:True`** — allows the allocator to grow segments
  on demand instead of reserving large chunks upfront.
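
If you want to reproduce the allocator setup by hand, the key constraint is ordering: torch reads `PYTORCH_CUDA_ALLOC_CONF` once, at CUDA initialization, so it must be set before `import torch`. A sketch (using `setdefault` so an explicit user setting wins; treat that precedence as an assumption, not documented outofcuda behavior):

```python
import os

# Must be set before `import torch`: the allocator reads it once at CUDA init.
os.environ.setdefault(
    "PYTORCH_CUDA_ALLOC_CONF",
    "max_split_size_mb:128,"
    "garbage_collection_threshold:0.8,"
    "expandable_segments:True",
)

import torch  # noqa: E402  (deliberately imported after the env var is set)
```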

### 4 · Per-device memory fraction

```python
for i in range(torch.cuda.device_count()):
    torch.cuda.set_per_process_memory_fraction(0.95, device=i)
```

Prevents a single process from reserving 100 % of VRAM, leaving headroom for
the driver, NCCL buffers, and peer processes.

### 5 · Scaled Dot-Product Attention backends

```python
torch.backends.cuda.enable_flash_sdp(True)
torch.backends.cuda.enable_mem_efficient_sdp(True)
torch.backends.cuda.enable_math_sdp(True)
```

Enables all three SDPA backends — PyTorch picks **Flash Attention** when
possible (O(N) memory vs O(N²)), falling back to memory-efficient or math
attention automatically.
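
Nothing changes at the call site: any code that reaches `torch.nn.functional.scaled_dot_product_attention` gets the dispatch for free. For example, on a CUDA machine with a shape and dtype Flash Attention accepts:

```python
import torch
import torch.nn.functional as F

# (batch, heads, seq_len, head_dim) in fp16 qualifies for the Flash backend
q = torch.randn(8, 16, 1024, 64, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# With all three backends enabled, PyTorch routes this call to
# Flash Attention when eligible, with no code change on your side.
out = F.scaled_dot_product_attention(q, k, v)
```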

### 6 · Automatic bfloat16 on Ampere+

When the GPU supports it (compute capability ≥ 8.0), outofcuda switches
the default AMP dtype from `float16` → `bfloat16`, giving better numerical
range without gradient underflow.
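
The selection logic amounts to a capability check. A hypothetical sketch (not outofcuda's source) of how such a choice can be made:

```python
import torch

# Compute capability (8, 0) and up means Ampere or newer: native bfloat16.
if torch.cuda.is_available() and torch.cuda.get_device_capability(0) >= (8, 0):
    amp_dtype = torch.bfloat16  # fp32-like dynamic range, no loss scaling needed
else:
    amp_dtype = torch.float16   # pre-Ampere cards: fp16 + GradScaler

with torch.autocast(device_type="cuda", dtype=amp_dtype):
    ...  # forward pass here
```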

### 7 · OOM recovery hook

A `sys.excepthook` wrapper catches uncaught `torch.cuda.OutOfMemoryError`
exceptions and immediately clears the CUDA cache and runs Python GC, releasing
cached VRAM so notebooks, REPLs, and supervising processes can recover instead
of being left with a wedged allocator.
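
A minimal sketch of such a wrapper, using only public `sys` and `torch.cuda` APIs (the actual outofcuda hook may differ):

```python
import gc
import sys

import torch


def install_oom_hook() -> None:
    """Chain onto sys.excepthook so an uncaught CUDA OOM frees cached VRAM."""
    previous = sys.excepthook

    def hook(exc_type, exc_value, traceback):
        if issubclass(exc_type, torch.cuda.OutOfMemoryError):
            torch.cuda.empty_cache()  # hand cached blocks back to the driver
            gc.collect()              # drop dead Python-side tensor references
        previous(exc_type, exc_value, traceback)  # normal traceback printing

    sys.excepthook = hook
```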

---

## Optional: use the Python API for more control

Even though you don't need to write code, outofcuda exposes a full API for
power users.

### AMP context manager

```python
import outofcuda

with outofcuda.autocast_context():
    logits = model(input_ids)
    loss = criterion(logits, labels)
```

Automatically picks `bfloat16` on Ampere+ and `float16` on older cards.

### Optimize a model

```python
model = outofcuda.compile_model(model)
```

Converts to channels-last memory layout (better tensor-core utilization) and,
if `OUTOFCUDA_COMPILE=1` is set, wraps with `torch.compile(mode="reduce-overhead")`.
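
Reading its knobs from the env vars documented below, `compile_model` behaves roughly like this sketch (a hypothetical reimplementation, not the library's source):

```python
import os

import torch


def compile_model(model: torch.nn.Module) -> torch.nn.Module:
    # channels-last (NHWC) is the layout tensor cores prefer for convolutions
    if os.environ.get("OUTOFCUDA_CHANNELS_LAST", "1") == "1":
        model = model.to(memory_format=torch.channels_last)
    # torch.compile stays opt-in behind OUTOFCUDA_COMPILE=1
    if os.environ.get("OUTOFCUDA_COMPILE", "0") == "1":
        model = torch.compile(
            model,
            mode=os.environ.get("OUTOFCUDA_COMPILE_MODE", "reduce-overhead"),
            fullgraph=os.environ.get("OUTOFCUDA_COMPILE_FULLGRAPH", "0") == "1",
        )
    return model
```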

### Smart DataLoader

```python
loader = outofcuda.smart_dataloader(
    dataset,
    batch_size=64,
    shuffle=True,
)
```

Pre-configured with `pin_memory=True`, `num_workers=4`, `prefetch_factor=2`,
and `persistent_workers=True` — the settings most people forget to set.

### Memory report

```python
print(outofcuda.memory_report())
# {
#   "cuda:0": {
#     "name": "NVIDIA A100-SXM4-80GB",
#     "total_gb": 80.0,
#     "allocated_gb": 12.4,
#     "reserved_gb": 14.1,
#     "free_gb": 65.9,
#     "peak_allocated_gb": 18.7
#   }
# }
```

### Clear CUDA cache

```python
outofcuda.clear_cache()   # empty_cache() + gc.collect()
```

### VRAM watchdog

```python
from outofcuda import MemoryMonitor

with MemoryMonitor(threshold=0.85, interval=2.0):
    train(model, loader)
# auto-clears cache whenever VRAM > 85 %
```

### All-in-one class

```python
opt = outofcuda.CudaOptimizer()

model  = opt.prepare(model)          # channels-last + compile
loader = opt.dataloader(dataset)     # smart DataLoader
with opt.autocast():                 # AMP
    loss = model(x)
opt.clear()                          # cache flush
print(opt.report())                  # VRAM stats
```

---

## Environment variables

Every knob is tunable via env vars — no code changes needed.

| Variable | Default | Description |
|---|---|---|
| `OUTOFCUDA_DISABLE` | `0` | Set to `1` to fully disable |
| `OUTOFCUDA_VERBOSE` | `0` | Print optimization report on startup |
| `OUTOFCUDA_TF32` | `1` | Enable TF32 for matmul / cuDNN |
| `OUTOFCUDA_CUDNN_BENCHMARK` | `1` | Enable cuDNN autotuner |
| `OUTOFCUDA_CUDNN_DETERMINISTIC` | `0` | Force deterministic cuDNN |
| `OUTOFCUDA_CHANNELS_LAST` | `1` | Prefer channels-last layout |
| `OUTOFCUDA_COMPILE` | `0` | Use `torch.compile()` in `compile_model()` |
| `OUTOFCUDA_COMPILE_MODE` | `reduce-overhead` | `torch.compile` mode |
| `OUTOFCUDA_COMPILE_FULLGRAPH` | `0` | `fullgraph=True` for compile |
| `OUTOFCUDA_AMP_DTYPE` | `float16` | AMP dtype (`float16` / `bfloat16`) |
| `OUTOFCUDA_MEMORY_FRACTION` | `0.95` | Max VRAM fraction per device |
| `OUTOFCUDA_ALLOC_CONF` | *(see above)* | `PYTORCH_CUDA_ALLOC_CONF` string |
| `OUTOFCUDA_GC_COLLECT` | `1` | Run `gc.collect()` on cache clear |
| `OUTOFCUDA_FLASH_ATTN` | `1` | Enable Flash Attention SDPA backend |
| `OUTOFCUDA_MEM_EFF_ATTN` | `1` | Enable memory-efficient SDPA backend |
| `OUTOFCUDA_MATH_ATTN` | `1` | Enable math SDPA backend |
| `OUTOFCUDA_PIN_MEMORY` | `1` | `pin_memory` for smart DataLoader |
| `OUTOFCUDA_NUM_WORKERS` | `4` | `num_workers` for smart DataLoader |
| `OUTOFCUDA_PREFETCH_FACTOR` | `2` | `prefetch_factor` for smart DataLoader |

**Example** — turn off the compute-side tweaks while keeping the allocator fix:

```bash
OUTOFCUDA_TF32=0 \
OUTOFCUDA_CUDNN_BENCHMARK=0 \
OUTOFCUDA_CHANNELS_LAST=0 \
python train.py
```

---

## Expected gains

Results vary by GPU generation, model architecture, and batch size.
Here are typical observations on an A100-80 GB:

| Optimization | Speedup / Saving |
|---|---|
| TF32 matmul | 2–10× on dense layers |
| cuDNN benchmark | 5–20 % on conv-heavy models |
| Flash Attention | 2–4× less VRAM, 1.5–3× faster attention |
| channels-last | 10–30 % faster on CNNs |
| AMP bfloat16 | 1.5–2× throughput, ~50 % less VRAM |
| Allocator tuning | Fewer OOM crashes, less fragmentation |

---

## Requirements

- Python ≥ 3.9
- PyTorch ≥ 2.0 *(optional — outofcuda installs without torch)*
- CUDA ≥ 11.8 *(optional — CPU-only is safe)*

No other dependencies.

---

## Design philosophy

> **The best optimization is the one that happens without you thinking about it.**

outofcuda is deliberately a **zero-footprint** library:

- No monkey-patching of `torch` internals
- No global state mutation beyond documented CUDA / cuDNN flags
- No background threads unless you explicitly start `MemoryMonitor`
- Safe to import in tests, notebooks, scripts, and production services
- Can be disabled entirely with `OUTOFCUDA_DISABLE=1`

---

## Contributing

```bash
git clone https://github.com/outofcuda/outofcuda
cd outofcuda
pip install -e ".[dev]"
pytest
```

Issues and PRs welcome.

---

## License

[MIT](LICENSE) © outofcuda contributors
