Metadata-Version: 2.4
Name: memscale
Version: 1.1.0
Summary: Drop-in memory optimizer for PyTorch training. Reduce VRAM significantly with one line of code.
Author-email: MemScale Team <team@memscale.id>
Maintainer-email: MemScale Team <team@memscale.id>
License: Proprietary
Project-URL: Homepage, https://memscale.id
Project-URL: Documentation, https://memscale.id
Project-URL: Changelog, https://app.memscale.id/changelog
Project-URL: Pricing, https://app.memscale.id/pricing
Keywords: pytorch,deep-learning,memory-optimization,vram,gpu,llm-training,fsdp,distributed-training,training,checkpointing,offloading,mixed-precision
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: Other/Proprietary License
Classifier: Operating System :: POSIX :: Linux
Classifier: Operating System :: MacOS
Classifier: Operating System :: Microsoft :: Windows
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Typing :: Typed
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch<2.10.0,>=2.1.0
Requires-Dist: numpy>=1.21.0
Requires-Dist: psutil>=5.9.0
Requires-Dist: httpx>=0.25.0
Provides-Extra: hf
Requires-Dist: transformers>=4.36.0; extra == "hf"
Requires-Dist: datasets>=2.16.0; extra == "hf"
Requires-Dist: accelerate>=0.26.0; extra == "hf"
Requires-Dist: peft>=0.7.0; extra == "hf"
Provides-Extra: lightning
Requires-Dist: lightning>=2.1.0; extra == "lightning"
Requires-Dist: torchmetrics>=1.2.0; extra == "lightning"
Provides-Extra: observability
Requires-Dist: prometheus-client>=0.19.0; extra == "observability"
Requires-Dist: tqdm>=4.66.0; extra == "observability"
Provides-Extra: quantization
Requires-Dist: bitsandbytes>=0.42.0; extra == "quantization"
Provides-Extra: all
Requires-Dist: memscale[hf,lightning,observability,quantization]; extra == "all"
Dynamic: license-file

# MemScale

**Drop-in memory optimizer for PyTorch training. Cut VRAM up to ~76% — and train models that otherwise don't fit — with 1 line of code.**

[![PyPI version](https://img.shields.io/pypi/v/memscale.svg)](https://pypi.org/project/memscale/)
[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
[![License](https://img.shields.io/badge/license-Proprietary-blue.svg)](LICENSE)

---

## The problem

Training large models on GPUs hits a wall: **VRAM**.

- GPT-2 Large training → 14.9 GB on a 24 GB RTX 3090, with little headroom
- 1.5B parameter model → out of memory on a single 24 GB GPU
- DeepSpeed ZeRO setup → 2 weeks of configuration

**MemScale solves this.** Wrap your model in 1 line, cut VRAM up to ~76%, and train models that otherwise run out of memory, with no other code changes.

## Benchmarks

**Validated on RTX 3090 24GB** (PyTorch 2.12, CUDA 13)
**Reproducible:** `python -m memscale.benchmarks --output results.json`

| Model | Params | Batch × Seq | Baseline | MemScale | Reduction |
|-------|--------|-------------|----------|----------|-----------|
| BERT-Base | 110M | 16 × 128 | 3.14 GB | 0.84 GB | 73.1% |
| BERT-Large | 340M | 16 × 128 | 7.60 GB | 2.02 GB | 73.4% |
| GPT-2 Medium | 355M | 4 × 512 | 10.87 GB | 2.61 GB | 76.0% |
| GPT-2 Large | 774M | 2 × 512 | 14.87 GB | 4.68 GB | 68.5% |
| GPT-2 XL | 1.5B | 1 × 512 | OOM | 9.25 GB | enables training |

Configuration: AGGRESSIVE mode (8-bit Adam + BF16 + checkpointing).
Reduction % scales with workload size — larger batches and longer
sequences typically yield higher percentages. See [BENCHMARK_REPORT.md](BENCHMARK_REPORT.md)
for methodology details.
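
The Reduction column can be recomputed from the Baseline and MemScale columns. The sketch below is plain arithmetic on three rows of the table, not part of the MemScale API; `reduction_pct` is a hypothetical helper name.

```python
def reduction_pct(baseline_gb: float, optimized_gb: float) -> float:
    """Percentage of peak VRAM saved relative to the baseline."""
    return (1.0 - optimized_gb / baseline_gb) * 100.0


# Rows copied from the benchmark table above (values in GB).
rows = {
    "BERT-Large": (7.60, 2.02),
    "GPT-2 Medium": (10.87, 2.61),
    "GPT-2 Large": (14.87, 4.68),
}

for name, (baseline, optimized) in rows.items():
    print(f"{name}: {reduction_pct(baseline, optimized):.1f}%")
```

Because the GB values in the table are rounded to two decimals, a recomputed percentage can differ from the published one in the last digit.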

## Quick start

```bash
pip install memscale
pip install bitsandbytes  # optional, for additional 8-bit Adam savings
```

```python
import memscale
from transformers import Trainer, TrainingArguments

trainer = Trainer(
    model=model,
    args=TrainingArguments(per_device_train_batch_size=16),
    train_dataset=dataset,
)

# Add this one line:
trainer = memscale.wrap(trainer)

trainer.train()  # Up to ~76% less VRAM (checkpointing adds per-step compute overhead; see FAQ)
```

That's it. MemScale automatically:
1. **Profiles** your model's memory usage per layer
2. **Decides** which optimization technique fits each layer best
3. **Applies** boundary checkpointing, 8-bit Adam, mixed precision
4. **Reports** memory savings and throughput in real time

**No API key required.** Library works fully offline. Anonymous telemetry is enabled by default to help improve the decision engine — opt out anytime with `memscale.disable_telemetry()` or `MEMSCALE_TELEMETRY=0`. See [Telemetry](#telemetry) below for what's collected.

## Experimental: async CPU offload (v1.1+)

v1.1 adds an **experimental** async CPU offload engine. It overlaps
GPU→CPU transfers with compute using tier-aware CUDA streams, while keeping
the same VRAM savings as the default sync path.

```python
import memscale

model = memscale.wrap(model, async_offload=True)
# Same VRAM savings, async transfer.
# Note: speedup is not yet benchmarked — see the v1.2 roadmap.
```

`async_offload` defaults to `False`; existing users are unaffected.
Enabling it emits an `ExperimentalFeatureWarning`. Numerical equivalence
with the sync path is validated (rtol 1e-6 / atol 1e-7 on RTX 3090); the
training-loop speedup is deferred to a dedicated v1.2 benchmarking phase.
For production, use the default sync offload. See the
[async offload user guide](docs/user_guide_async_offload.md) for details.

| GPU Tier | Examples | `async_offload` support |
|----------|----------|-------------------------|
| High | RTX 30xx+, A100, H100 | Full (dual-stream) |
| Low | RTX 20xx, GTX 10xx | Single-stream |
| CPU | No CUDA | Sync only (no async benefit) |

## Maximum reduction (combined techniques)

For maximum savings, enable all techniques:

```python
import torch
from memscale import Config, OptimizationMode
from memscale.phase_f import apply_all_optimizations

model = YourModel()
optimizer = torch.optim.AdamW(model.parameters())

config = Config(
    mode=OptimizationMode.AGGRESSIVE,
    use_8bit_optimizer=True,    # bitsandbytes 8-bit Adam
    use_mixed_precision=True,   # BF16 on Ampere+, FP16 fallback
)

# One call applies all techniques
model, optimizer = apply_all_optimizations(model, optimizer, config)

# Train normally
for batch in dataloader:
    loss = model(batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

This stack achieved **73.4% reduction on BERT-Large** (7.60 GB → 2.02 GB) in the v1.1 benchmarks — see [BENCHMARK_REPORT.md](BENCHMARK_REPORT.md).

## How it works

MemScale combines proven memory optimization techniques and chooses what fits each layer:

| Technique | Saves | When applied |
|-----------|-------|--------------|
| Boundary checkpointing | ~70% (activations) | Transformer blocks (BertLayer, GPT2Block, TransformerEncoderLayer, ViTLayer, etc.) |
| 8-bit Adam (bitsandbytes) | ~75% (optimizer state) | When `use_8bit_optimizer=True` and bitsandbytes installed |
| Mixed precision (BF16/FP16) | ~50% (params/activations) | When `use_mixed_precision=True` on Ampere+ GPUs |
| CPU offload (sync + experimental async) | Variable | Large layers when checkpointing is not enough |

The decision engine analyzes your model and picks the right technique per layer — you don't need to configure individual layers.
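
The ~75% and ~50% figures in the table follow from bytes stored per parameter. The sketch below is a back-of-envelope calculation under common assumptions (FP32 baseline, two Adam states per parameter, one byte per 8-bit state). It is not MemScale code, and it ignores the small per-block quantization constants that 8-bit optimizers also store.

```python
# Bytes stored per value; rule-of-thumb figures, not MemScale measurements.
FP32_BYTES = 4
BF16_BYTES = 2
INT8_BYTES = 1
ADAM_STATES = 2  # Adam keeps two moment estimates per parameter


def saving(before_bytes: int, after_bytes: int) -> float:
    """Fraction of memory saved when going from before_bytes to after_bytes."""
    return 1.0 - after_bytes / before_bytes


adam_fp32 = ADAM_STATES * FP32_BYTES  # 8 bytes per parameter
adam_8bit = ADAM_STATES * INT8_BYTES  # 2 bytes per parameter

print(f"8-bit Adam state: {saving(adam_fp32, adam_8bit):.0%} smaller")
print(f"BF16 vs FP32 tensors: {saving(FP32_BYTES, BF16_BYTES):.0%} smaller")

# Optimizer state for a 774M-parameter model (GPT-2 Large in the benchmark table).
params = 774_000_000
print(f"FP32 Adam state: {params * adam_fp32 / 1e9:.2f} GB")
print(f"8-bit Adam state: {params * adam_8bit / 1e9:.2f} GB")
```

Activation savings from checkpointing depend on model depth and where the checkpoint boundaries fall, so they are not reducible to a per-parameter constant like this.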

### HuggingFace integration

For HuggingFace autoregressive models (GPT-2, Llama, Mistral, T5), MemScale automatically disables `config.use_cache` when checkpointing is enabled. This prevents the `CheckpointError` that occurs when KV-cache concatenation conflicts with backward recompute. No code changes needed — just `memscale.wrap()`.
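
For reference, the equivalent manual step looks roughly like the sketch below. `disable_kv_cache` is a hypothetical helper written for illustration, not a MemScale function, and a stand-in object replaces the real model so the snippet runs without `transformers` installed.

```python
from types import SimpleNamespace


def disable_kv_cache(model) -> bool:
    """Set model.config.use_cache = False if the model exposes that flag.

    Returns True when the flag was present and is now False.
    """
    config = getattr(model, "config", None)
    if config is None or not hasattr(config, "use_cache"):
        return False
    config.use_cache = False
    return True


# Stand-in for a HuggingFace model: only the config attribute matters here.
fake_model = SimpleNamespace(config=SimpleNamespace(use_cache=True))
changed = disable_kv_cache(fake_model)
print(changed, fake_model.config.use_cache)
```

The KV cache only helps during generation, so turning it off for training does not change the loss or the gradients.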

## Multi-GPU support

Multi-GPU training works via standard PyTorch DDP. MemScale's per-GPU optimizations apply on each GPU:

```bash
torchrun --nproc_per_node=2 your_training_script.py
```

```python
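import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel
from memscale import Config, OptimizationMode
from memscale.phase_f import apply_all_optimizations

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun for each process

model = YourModel().to(local_rank)
optimizer = torch.optim.AdamW(model.parameters())
config = Config(mode=OptimizationMode.AGGRESSIVE)
model, optimizer = apply_all_optimizations(model, optimizer, config)
model = DistributedDataParallel(model, device_ids=[local_rank])
# Train normally - 87% per-GPU reduction with 2x throughput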
```

Validated on 2x RTX 3090: **1.69 GB per GPU** (vs 13 GB baseline single-GPU).

### Distributed sharding (research preview)

`memscale.distributed` provides ZeRO-3-inspired building blocks for parameter and optimizer sharding. Full integration with model forward/backward hooks is planned for a future release. For production multi-GPU training that needs 95%+ reduction today, FSDP or DeepSpeed remain the recommended choice.

## Usage modes

### HuggingFace Trainer

```python
import memscale
trainer = memscale.wrap(your_hf_trainer)
trainer.train()
```

### PyTorch Lightning

```python
from lightning import Trainer
from memscale.integrations.lightning import MemScaleLightningCallback

trainer = Trainer(
    callbacks=[MemScaleLightningCallback()],
    max_epochs=10,
)
trainer.fit(model, dataloader)
```

### Custom training loop

```python
import memscale

with memscale.optimize(model, optimizer) as ms:
    for batch in dataloader:
        loss = model(batch).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```

### Configuration

Most users don't need this. Defaults work for 90% of cases.

```python
from memscale import wrap, Config, OptimizationMode

config = Config(
    mode=OptimizationMode.AGGRESSIVE,  # or BALANCED (default), CONSERVATIVE
    enable_checkpointing=True,
    enable_offloading=True,
    use_8bit_optimizer=False,    # set True for max reduction
    use_mixed_precision=False,   # set True for max reduction
    target_gpu_utilization=0.85,
)

trainer = wrap(trainer, config=config)
```

## Cost attribution

Track how much money MemScale saves on cloud GPU bills:

```python
from memscale.cost_attribution import CostTracker, estimate_savings

# Quick estimate
report = estimate_savings(
    baseline_vram_gb=70.0,
    memscale_vram_gb=35.0,
    training_hours=10.0,
    gpu_type='A40',
    baseline_gpu_type='A100 80GB',  # GPU you'd need WITHOUT MemScale
)
print(report)
# Baseline (A100 80GB): $24.90
# MemScale (A40):       $15.30
# Savings:              $9.60 (38.6%)
```
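
The arithmetic behind that report is straightforward. The sketch below reproduces it with hourly rates back-derived from the sample output ($24.90 and $15.30 over 10 hours). The rates and the `cost_savings` helper are illustrative assumptions, not MemScale's built-in pricing table or API.

```python
def cost_savings(baseline_rate: float, optimized_rate: float, hours: float):
    """Return (baseline_cost, optimized_cost, savings, savings_pct) in dollars."""
    baseline_cost = baseline_rate * hours
    optimized_cost = optimized_rate * hours
    savings = baseline_cost - optimized_cost
    return baseline_cost, optimized_cost, savings, savings / baseline_cost * 100.0


# Assumed hourly rates: $2.49 for the larger GPU, $1.53 for the smaller one.
baseline, optimized, saved, pct = cost_savings(2.49, 1.53, hours=10.0)
print(
    f"Baseline: ${baseline:.2f}  MemScale: ${optimized:.2f}  "
    f"Savings: ${saved:.2f} ({pct:.1f}%)"
)
```

The saving comes from fitting the job on a cheaper GPU class, so it only materializes when the reduced footprint actually crosses a pricing tier.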

Built-in pricing for 16 GPU types (V100, A100, H100, RTX series, AMD MI300X, etc.) plus auto-inference of the cheapest GPU sufficient for your workload.

## OOM prediction

Catch out-of-memory before training starts:

```python
from memscale import OOMPredictor

predictor = OOMPredictor(model_params_bytes=2_000_000_000)  # 2 GB params
risk = predictor.predict(batch_size=16, sequence_length=2048, optimizer='adamw')

if risk.level == 'CRITICAL':
    print(f"⚠️ {risk.message}")
    print(f"Recommendations: {risk.recommendations}")
```
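
As a sanity check on any prediction, a common rule of thumb gives a lower bound on training memory before activations are counted. The sketch below is not how `OOMPredictor` works internally. It assumes gradients and each Adam state occupy as much memory as the parameters, and `static_training_bytes` is a hypothetical helper name.

```python
def static_training_bytes(param_bytes: int, optimizer: str = "adamw") -> int:
    """Rough lower bound on training memory, excluding activations.

    Assumes gradients take as much memory as the parameters and that
    Adam-style optimizers keep two extra states of the same size.
    """
    grads = param_bytes
    if optimizer in ("adam", "adamw"):
        opt_state = 2 * param_bytes
    elif optimizer == "sgd":
        opt_state = 0  # plain SGD without momentum keeps no state
    else:
        raise ValueError(f"unknown optimizer: {optimizer}")
    return param_bytes + grads + opt_state


total = static_training_bytes(2_000_000_000, optimizer="adamw")
print(f"{total / 1e9:.1f} GB before activations")
```

Activations grow with batch size and sequence length, so the real peak sits above this figure.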

## Telemetry

MemScale ships with anonymous telemetry **enabled by default** to improve the decision engine across diverse hardware and workloads. To opt out:

```python
import memscale
memscale.disable_telemetry()
```

Or via environment variable (set before importing MemScale):

```bash
export MEMSCALE_TELEMETRY=0
```
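
In a notebook or launcher script where exporting a shell variable is inconvenient, the same variable can be set from Python, provided it happens before the first MemScale import. A minimal sketch:

```python
import os

# Must run before the first `import memscale` anywhere in the process.
os.environ["MEMSCALE_TELEMETRY"] = "0"

# import memscale  # import only after the variable is set
print(os.environ["MEMSCALE_TELEMETRY"])
```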

### What's collected (~1 KB per training run)

- Anonymous client ID (random UUID, stored locally at `~/.memscale/client_id`)
- Library version, Python version, PyTorch version, OS
- Hardware: GPU model, VRAM, CUDA version, number of GPUs
- Model architecture: layer types, parameter count (no weights)
- Optimization outcome: techniques applied, memory saved, throughput overhead

### What's NEVER collected

- ❌ Model weights or training data
- ❌ Code or scripts
- ❌ File paths, hostnames, IP addresses
- ❌ Email or any identifying information
- ❌ Layer-level activations or gradients

Telemetry is fire-and-forget (silent failure on network error), sent over HTTPS to `api.memscale.id/v1/telemetry`. See [our privacy policy](https://memscale.id/privacy) for details.

### Re-enable after opting out

```python
memscale.enable_telemetry()
```

Or:
```bash
export MEMSCALE_TELEMETRY=1
```

## Compatibility

| Component | Min Version | Tested |
|-----------|------------|--------|
| Python | 3.10 | 3.10, 3.11, 3.12 |
| PyTorch | 2.1 | 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 2.9 |
| CUDA | 11.8 | 11.8, 12.1, 12.4, 12.8 |
| GPU | Compute capability 7.0+ | V100, A100, H100, RTX 3090/4090 |
| BF16 mixed precision | Compute capability 8.0+ | A100, H100, RTX 3090/4090 |
| OS | Linux/macOS/Windows | Ubuntu 20.04+, macOS 14+ (arm64), Windows 11 |

AMD GPU support (ROCm) is planned for a future release.
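
To check a GPU against the compute capability rows above, compare its `(major, minor)` tuple with the table's minimums. The sketch below is a standalone illustration; `check_gpu` is a hypothetical helper, not part of MemScale. On a CUDA machine the tuple would come from `torch.cuda.get_device_capability()`.

```python
MIN_CAPABILITY = (7, 0)   # minimum from the GPU row in the table above
BF16_CAPABILITY = (8, 0)  # minimum from the BF16 mixed precision row


def check_gpu(capability: tuple[int, int]) -> dict[str, bool]:
    """Compare a (major, minor) compute capability against the table's minimums."""
    return {
        "supported": capability >= MIN_CAPABILITY,
        "bf16": capability >= BF16_CAPABILITY,
    }


print(check_gpu((8, 6)))  # RTX 3090 reports compute capability 8.6
print(check_gpu((7, 0)))  # V100 reports compute capability 7.0
```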

## FAQ

**Q: Does MemScale change my training results?**
Activation checkpointing and DDP are mathematically lossless. BF16/FP16 mixed precision introduces small numerical differences, the same as standard PyTorch AMP. 8-bit Adam quantizes optimizer state, so it is also not bit-exact with full-precision Adam.

**Q: How does this compare to DeepSpeed and FSDP?**
DeepSpeed and FSDP are powerful but require significant configuration and distributed training expertise. MemScale's value is plug-and-play: 1-line wrap with auto-detection. For 95%+ reduction in production multi-GPU setups, DeepSpeed ZeRO-3 is more mature. For single-GPU and DDP workloads, MemScale is competitive and easier to use.

**Q: Will this slow down my training?**
Activation checkpointing adds 20-30% compute overhead (the standard tradeoff). 8-bit Adam adds ~2-5%. Net effect: training is slower per step, but you can use larger batches (better hardware utilization), so end-to-end time often improves.

**Q: What if my model has a custom architecture?**
The decision engine handles standard transformers (PyTorch native, HuggingFace BERT/GPT2/Llama/Mistral, vision transformers) automatically. Custom architectures fall back to per-module heuristics. Both are tested.

**Q: Why a range instead of a flat number?**
Reduction depends on model architecture, batch size, sequence length, and which techniques you enable. The v1.1 benchmarks show roughly 68-76% on standard transformers at the default workload, and MemScale enables training models that otherwise run out of memory. Larger batches and longer sequences typically yield higher percentages.

**Q: Is MemScale open source?**
Source code is currently proprietary. PyPI distribution is public (free to install and use). We may open source later based on community feedback. See [memscale.id](https://memscale.id) for licensing.

## Roadmap

- **v1.0.4**: 5 medium bug fixes (seq_len awareness, dtype-aware param count, tiling outputs, GPU downgrade cost, estimate_training_memory)
- **v1.1** (current, Q3 2026): Reproducible benchmark CLI, bug fixes (F-1…F-4), honest cost attribution, and **experimental async CPU offload** (opt-in, correctness-validated; speedup deferred to v1.2)
- **v1.2** (Q4 2026): Async offload speedup benchmarking + older-GPU validation, tensor splitting, ML-based decision policy trained on telemetry data
- **v2.0** (2027): Multi-framework support (JAX, TensorFlow) + MemScale Serve (inference)

## Architecture

MemScale's optimization happens in stages:

1. **Profiling**: Static analysis with empirical fallback for dynamic models
2. **Decision engine**: Per-layer technique selection based on memory profile, hardware budget, and configuration
3. **Execution**: Apply chosen techniques via PyTorch hooks
4. **Observation**: Track memory and throughput, report to user

## Reporting Issues

For bug reports, please include:
1. Minimal reproducible example
2. Hardware (GPU model, VRAM)
3. PyTorch version
4. Output of `memscale.profile_model(model)` if relevant

Email: [team@memscale.id](mailto:team@memscale.id)
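
The version and hardware details requested above can be gathered with a short script. The sketch below uses only the standard library plus PyTorch's public attributes and degrades gracefully when PyTorch or CUDA is missing; `environment_report` is a hypothetical helper, not a MemScale function.

```python
import platform
import sys


def environment_report() -> dict[str, str]:
    """Collect version and hardware details for a bug report."""
    report = {
        "python": sys.version.split()[0],
        "os": platform.platform(),
    }
    try:
        import torch
    except ImportError:
        report["torch"] = "not installed"
        return report

    report["torch"] = torch.__version__
    report["cuda"] = str(torch.version.cuda)
    if torch.cuda.is_available():
        props = torch.cuda.get_device_properties(0)
        report["gpu"] = props.name
        report["vram_gb"] = f"{props.total_memory / 1e9:.1f}"
    return report


for key, value in environment_report().items():
    print(f"{key}: {value}")
```

Paste the printed lines into the report alongside the minimal reproducible example.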

## License

Proprietary. Full terms: contact [team@memscale.id](mailto:team@memscale.id) or visit [memscale.id](https://memscale.id).

## Citation

If you use MemScale in your research, please cite:

```bibtex
@software{memscale2026,
  title={MemScale: Drop-in Memory Optimization for PyTorch Training},
  author={MemScale Team},
  year={2026},
  url={https://memscale.id}
}
```

---

**Built for ML practitioners.** Questions? team@memscale.id
