Metadata-Version: 2.4
Name: memscale
Version: 1.0.4
Summary: Drop-in memory optimizer for PyTorch training. Reduce VRAM significantly with one line of code.
Author-email: MemScale Team <team@memscale.id>
Maintainer-email: MemScale Team <team@memscale.id>
License: Proprietary
Project-URL: Homepage, https://memscale.id
Project-URL: Documentation, https://memscale.id
Project-URL: Changelog, https://memscale.id/changelog
Project-URL: Pricing, https://app.memscale.id/pricing
Keywords: pytorch,deep-learning,memory-optimization,vram,gpu,llm-training,fsdp,distributed-training,training,checkpointing,offloading,mixed-precision
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: Other/Proprietary License
Classifier: Operating System :: POSIX :: Linux
Classifier: Operating System :: MacOS
Classifier: Operating System :: Microsoft :: Windows
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Typing :: Typed
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch<2.10.0,>=2.1.0
Requires-Dist: numpy>=1.21.0
Requires-Dist: psutil>=5.9.0
Requires-Dist: httpx>=0.25.0
Provides-Extra: hf
Requires-Dist: transformers>=4.36.0; extra == "hf"
Requires-Dist: datasets>=2.16.0; extra == "hf"
Requires-Dist: accelerate>=0.26.0; extra == "hf"
Requires-Dist: peft>=0.7.0; extra == "hf"
Provides-Extra: lightning
Requires-Dist: lightning>=2.1.0; extra == "lightning"
Requires-Dist: torchmetrics>=1.2.0; extra == "lightning"
Provides-Extra: observability
Requires-Dist: prometheus-client>=0.19.0; extra == "observability"
Requires-Dist: tqdm>=4.66.0; extra == "observability"
Provides-Extra: quantization
Requires-Dist: bitsandbytes>=0.42.0; extra == "quantization"
Provides-Extra: all
Requires-Dist: memscale[hf,lightning,observability,quantization]; extra == "all"
Dynamic: license-file

# MemScale

**Drop-in memory optimizer for PyTorch training. Reduce VRAM usage by up to 88% with one line of code.**

[![PyPI version](https://img.shields.io/pypi/v/memscale.svg)](https://pypi.org/project/memscale/)
[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
[![License](https://img.shields.io/badge/license-Proprietary-blue.svg)](LICENSE)

---

## The problem

Training large models on GPUs hits a wall: **VRAM**.

- BERT-Large with batch size 16 → 17.6 GB on an RTX 3090
- 1.5B-parameter model → out of memory on a single 24 GB GPU
- DeepSpeed ZeRO setup → 2 weeks of configuration

**MemScale solves this.** Wrap your trainer or model in one line and get up to 88% VRAM reduction with no other code changes.

## Real benchmarks (validated on RTX 3090 24GB)

| Model | Params | Baseline | MemScale | Reduction |
|-------|--------|----------|----------|-----------|
| BERT-Base | 85M | 6.39 GB | 1.87 GB | **70.8%** |
| BERT-Large | 302M | 17.57 GB | 2.11 GB | **88.0%** |
| GPT-2 Medium | 302M | 19.02 GB | 7.16 GB | **62.4%** |
| GPT-2 Large | 708M | 21.78 GB | 4.88 GB | **77.6%** |
| 1.3B model | 1.3B | OOM | 8.86 GB | **Enables training** |
| GPT-2 XL | 1.5B | OOM | 12.72 GB | **Enables training** |

**Comparison:** PyTorch native checkpointing achieves ~70% reduction on the same workloads. MemScale matches or exceeds it with the right configuration, and **enables training models that PyTorch alone cannot fit**.
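
The Reduction column follows from the Baseline and MemScale columns as `(baseline - memscale) / baseline`. A minimal sketch of that arithmetic, using figures copied from the table (the helper name `reduction_pct` is illustrative and not part of MemScale):

```python
def reduction_pct(baseline_gb: float, optimized_gb: float) -> float:
    """Percentage of peak VRAM saved relative to the baseline run."""
    return round(100 * (baseline_gb - optimized_gb) / baseline_gb, 1)


# (baseline GB, MemScale GB) pairs taken from the benchmark table
rows = {
    "BERT-Large": (17.57, 2.11),
    "GPT-2 Medium": (19.02, 7.16),
    "GPT-2 Large": (21.78, 4.88),
}

for name, (baseline, optimized) in rows.items():
    print(f"{name}: {reduction_pct(baseline, optimized)}%")
```

Because the GB values in the table are already rounded, recomputing from them can differ from the published percentage in the last digit (BERT-Base comes out as 70.7 this way).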

## Quick start

```bash
pip install memscale
pip install bitsandbytes  # optional, for additional 8-bit Adam savings
```

```python
import memscale
from transformers import Trainer, TrainingArguments

trainer = Trainer(
    model=model,
    args=TrainingArguments(per_device_train_batch_size=16),
    train_dataset=dataset,
)

# Add this one line:
trainer = memscale.wrap(trainer)

trainer.train()  # Up to 88% less VRAM (see the FAQ for the speed tradeoff)
```

That's it. MemScale automatically:
1. **Profiles** your model's memory usage per layer
2. **Decides** which optimization technique fits each layer best
3. **Applies** boundary checkpointing, plus 8-bit Adam and mixed precision when enabled
4. **Reports** memory savings and throughput in real time

**No API key required.** Library works fully offline. Anonymous telemetry is enabled by default to help improve the decision engine — opt out anytime with `memscale.disable_telemetry()` or `MEMSCALE_TELEMETRY=0`. See [Telemetry](#telemetry) below for what's collected.

## Maximum reduction (combined techniques)

For maximum savings, enable all techniques:

```python
import torch
from memscale import Config, OptimizationMode
from memscale.phase_f import apply_all_optimizations

model = YourModel()
optimizer = torch.optim.AdamW(model.parameters())

config = Config(
    mode=OptimizationMode.AGGRESSIVE,
    use_8bit_optimizer=True,    # bitsandbytes 8-bit Adam
    use_mixed_precision=True,   # BF16 on Ampere+, FP16 fallback
)

# One call applies all techniques
model, optimizer = apply_all_optimizations(model, optimizer, config)

# Train normally
for batch in dataloader:
    loss = model(batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

This stack achieved **88.0% reduction on BERT-Large** in our benchmarks.
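
As a rough guide to why 8-bit optimizer state and 16-bit tensors help, the per-value storage arithmetic can be sketched as follows. This is generic bookkeeping, not MemScale code: it assumes AdamW keeps two state values per parameter, ignores the small per-block quantization constants that 8-bit optimizers also store, and uses a hypothetical helper name `saving`.

```python
def saving(bytes_before: float, bytes_after: float) -> float:
    """Fraction of memory saved when a value shrinks from bytes_before to bytes_after."""
    return 1 - bytes_after / bytes_before


# AdamW keeps two state values (first and second moment) per parameter.
adam_state_fp32 = 2 * 4  # bytes per parameter with 32-bit state
adam_state_8bit = 2 * 1  # bytes per parameter with 8-bit state

print(f"optimizer state: {saving(adam_state_fp32, adam_state_8bit):.0%}")
# Any tensor held in BF16/FP16 (2 bytes) instead of FP32 (4 bytes)
print(f"16-bit tensors:  {saving(4, 2):.0%}")
```

The overall reduction for a given model depends on how much of its peak memory is optimizer state versus activations, which is why the combined figure varies by workload.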

## How it works

MemScale combines proven memory optimization techniques and chooses what fits each layer:

| Technique | Saves | When applied |
|-----------|-------|--------------|
| Boundary checkpointing | ~70% (activations) | Transformer blocks (BertLayer, GPT2Block, TransformerEncoderLayer, ViTLayer, etc.) |
| 8-bit Adam (bitsandbytes) | ~75% (optimizer state) | When `use_8bit_optimizer=True` and bitsandbytes installed |
| Mixed precision (BF16/FP16) | ~50% (params/activations) | When `use_mixed_precision=True` (BF16 on Ampere+ GPUs, FP16 otherwise) |
| CPU offload | Variable | Large layers when checkpointing is not enough |

The decision engine analyzes your model and picks the right technique per layer — you don't need to configure individual layers.
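
For readers unfamiliar with activation checkpointing, here is a minimal sketch of the general technique in plain PyTorch using `torch.utils.checkpoint`. It is not MemScale's implementation: the `CheckpointedStack` class and the toy blocks are hypothetical, and the example only illustrates checkpointing at block boundaries and checks that gradients are unchanged.

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint


class CheckpointedStack(nn.Module):
    """Runs each block under activation checkpointing (illustrative only)."""

    def __init__(self, blocks, enabled=True):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)
        self.enabled = enabled

    def forward(self, x):
        for block in self.blocks:
            if self.enabled and torch.is_grad_enabled():
                # Keep only the block's input; its inner activations are
                # recomputed during the backward pass.
                x = checkpoint(block, x, use_reentrant=False)
            else:
                x = block(x)
        return x


def make_blocks():
    torch.manual_seed(0)  # identical weights on every call
    return [nn.Sequential(nn.Linear(32, 32), nn.GELU()) for _ in range(4)]


def input_grad(enabled):
    x = torch.ones(8, 32, requires_grad=True)
    CheckpointedStack(make_blocks(), enabled=enabled)(x).sum().backward()
    return x.grad


print(torch.allclose(input_grad(True), input_grad(False)))
```

The gradients match with and without checkpointing; the cost is the extra forward recomputation during backward.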

### HuggingFace integration

For HuggingFace autoregressive models (GPT-2, Llama, Mistral, T5), MemScale automatically disables `config.use_cache` when checkpointing is enabled. This prevents the `CheckpointError` that occurs when KV-cache concatenation conflicts with backward recompute. No code changes needed — just `memscale.wrap()`.

## Multi-GPU support

Multi-GPU training works via standard PyTorch DDP. MemScale's per-GPU optimizations apply on each GPU:

```bash
torchrun --nproc_per_node=2 your_training_script.py
```

```python
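import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel
from memscale import Config, OptimizationMode
from memscale.phase_f import apply_all_optimizations

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
torch.cuda.set_device(local_rank)
model = YourModel().to(local_rank)  # YourModel: placeholder for your nn.Module
optimizer = torch.optim.AdamW(model.parameters())
config = Config(mode=OptimizationMode.AGGRESSIVE)  # or reuse your existing Config
model, optimizer = apply_all_optimizations(model, optimizer, config)
model = DistributedDataParallel(model, device_ids=[local_rank])
# Train normally - 87% per-GPU reduction with 2x throughput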
```

Validated on 2x RTX 3090: **1.69 GB per GPU** (vs 13 GB baseline single-GPU).

### Distributed sharding (research preview)

`memscale.distributed` provides ZeRO-3-inspired building blocks for parameter and optimizer sharding. Full integration with model forward/backward hooks is planned for v1.1. For production multi-GPU training requiring 95%+ reduction today, FSDP or DeepSpeed remain the recommended choice.

## Usage modes

### HuggingFace Trainer

```python
import memscale
trainer = memscale.wrap(your_hf_trainer)
trainer.train()
```

### PyTorch Lightning

```python
from lightning import Trainer
from memscale.integrations.lightning import MemScaleLightningCallback

trainer = Trainer(
    callbacks=[MemScaleLightningCallback()],
    max_epochs=10,
)
trainer.fit(model, dataloader)
```

### Custom training loop

```python
import memscale

with memscale.optimize(model, optimizer) as ms:
    for batch in dataloader:
        loss = model(batch).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```

### Configuration

Most users don't need this. Defaults work for 90% of cases.

```python
from memscale import wrap, Config, OptimizationMode

config = Config(
    mode=OptimizationMode.AGGRESSIVE,  # or BALANCED (default), CONSERVATIVE
    enable_checkpointing=True,
    enable_offloading=True,
    use_8bit_optimizer=False,    # set True for max reduction
    use_mixed_precision=False,   # set True for max reduction
    target_gpu_utilization=0.85,
)

trainer = wrap(trainer, config=config)
```

## Cost attribution

Track how much money MemScale saves on cloud GPU bills:

```python
from memscale.cost_attribution import CostTracker, estimate_savings

# Quick estimate
report = estimate_savings(
    baseline_vram_gb=70.0,
    memscale_vram_gb=35.0,
    training_hours=10.0,
    gpu_type='A40',
    baseline_gpu_type='A100 80GB',  # GPU you'd need WITHOUT MemScale
)
print(report)
# Baseline (A100 80GB): $24.90
# MemScale (A40):       $15.30
# Savings:              $9.60 (38.6%)
```

Built-in pricing for 16 GPU types (V100, A100, H100, RTX series, AMD MI300X, etc.) plus auto-inference of the cheapest GPU sufficient for your workload.
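
The figures in the example output are consistent with a simple `hours × hourly rate` calculation. The sketch below reproduces them; the hourly rates are back-derived from that output (cost divided by 10 hours) and are not taken from MemScale's price table, and `estimate_cost` is a hypothetical helper, not a library function.

```python
def estimate_cost(hours: float, hourly_rate_usd: float) -> float:
    """Cost of a run in USD, rounded to cents."""
    return round(hours * hourly_rate_usd, 2)


# Rates implied by the example output above; treat them as illustrative.
baseline = estimate_cost(10.0, 2.49)   # A100 80GB
optimized = estimate_cost(10.0, 1.53)  # A40

savings = round(baseline - optimized, 2)
savings_pct = round(100 * savings / baseline, 1)
print(f"Savings: ${savings:.2f} ({savings_pct}%)")
```

The saving comes from fitting the job on a cheaper GPU, so it only materializes when the reduced footprint actually allows a smaller card.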

## OOM prediction

Catch out-of-memory before training starts:

```python
from memscale import OOMPredictor

predictor = OOMPredictor(model_params_bytes=2_000_000_000)  # 2 GB params
risk = predictor.predict(batch_size=16, sequence_length=2048, optimizer='adamw')

if risk.level == 'CRITICAL':
    print(f"⚠️ {risk.message}")
    print(f"Recommendations: {risk.recommendations}")
```
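
To build intuition for what such a prediction has to account for, here is a back-of-envelope estimate of the static training footprint (weights, gradients, optimizer state). It is a generic sketch, not how `OOMPredictor` works internally: the function names are hypothetical, activations are deliberately left out, and the 0.85 headroom simply borrows the `target_gpu_utilization` value from the configuration example.

```python
GIB = 1024 ** 3


def static_training_bytes(num_params, weight_bytes=4, grad_bytes=4, optimizer_bytes=8):
    """Weights + gradients + optimizer state. Activations are NOT included."""
    return num_params * (weight_bytes + grad_bytes + optimizer_bytes)


def fits(num_params, vram_bytes, headroom=0.85, **per_param_bytes):
    """True if the static footprint stays within the usable share of VRAM."""
    return static_training_bytes(num_params, **per_param_bytes) <= headroom * vram_bytes


params = 1_500_000_000  # a 1.5B-parameter model
print(fits(params, 24 * GIB))                     # FP32 weights, FP32 AdamW state
print(fits(params, 24 * GIB, optimizer_bytes=2))  # same model with 8-bit optimizer state
```

With FP32 AdamW, a 1.5B-parameter model needs about 24 GB before any activations are stored, which is consistent with the OOM rows in the benchmark table.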

## Telemetry

MemScale ships with anonymous telemetry **enabled by default** to improve the decision engine across diverse hardware and workloads. To opt out:

```python
import memscale
memscale.disable_telemetry()
```

Or via environment variable (set before importing MemScale):

```bash
export MEMSCALE_TELEMETRY=0
```

### What's collected (~1 KB per training run)

- Anonymous client ID (random UUID, stored locally at `~/.memscale/client_id`)
- Library version, Python version, PyTorch version, OS
- Hardware: GPU model, VRAM, CUDA version, number of GPUs
- Model architecture: layer types, parameter count (no weights)
- Optimization outcome: techniques applied, memory saved, throughput overhead

### What's NEVER collected

- ❌ Model weights or training data
- ❌ Code or scripts
- ❌ File paths, hostnames, IP addresses
- ❌ Email addresses or any personally identifying information
- ❌ Layer-level activations or gradients

Telemetry is fire-and-forget (silent failure on network error), sent over HTTPS to `api.memscale.id/v1/telemetry`. See [our privacy policy](https://memscale.id/privacy) for details.

### Re-enable after opting out

```python
memscale.enable_telemetry()
```

Or:

```bash
export MEMSCALE_TELEMETRY=1
```

## Compatibility

| Component | Min Version | Tested |
|-----------|------------|--------|
| Python | 3.10 | 3.10, 3.11, 3.12 |
| PyTorch | 2.1 | 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 2.9 |
| CUDA | 11.8 | 11.8, 12.1, 12.4, 12.8 |
| GPU | Compute capability 7.0+ | V100, A100, H100, RTX 3090/4090 |
| BF16 mixed precision | Compute capability 8.0+ | A100, H100, RTX 3090/4090 |
| OS | Linux/macOS/Windows | Ubuntu 20.04+, macOS 14+ (arm64), Windows 11 |

AMD GPU support (ROCm) is planned for v1.1 (see [Roadmap](#roadmap)).

## FAQ

**Q: Does MemScale change my training results?**
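The activation checkpointing and DDP techniques are mathematically lossless. BF16/FP16 mixed precision introduces small numerical differences, the same as standard PyTorch AMP. 8-bit Adam stores optimizer state in quantized form, so it can also introduce small numerical differences.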

**Q: How does this compare to DeepSpeed and FSDP?**
DeepSpeed and FSDP are powerful but require significant configuration and distributed training expertise. MemScale's value is plug-and-play: 1-line wrap with auto-detection. For 95%+ reduction in production multi-GPU setups, DeepSpeed ZeRO-3 is more mature. For single-GPU and DDP workloads, MemScale is competitive and easier to use.

**Q: Will this slow down my training?**
Activation checkpointing adds 20-30% compute overhead (the standard tradeoff). 8-bit Adam adds ~2-5%. Net effect: training is slower per step, but you can use larger batches (better hardware utilization), so end-to-end time often improves.

**Q: What if my model has custom architecture?**
The decision engine handles standard transformers (PyTorch native, HuggingFace BERT/GPT2/Llama/Mistral, vision transformers) automatically. Custom architectures fall back to per-module heuristics. Both are tested.

**Q: Why "up to 88%" instead of a flat number?**
Reduction depends on model architecture, batch size, sequence length, and which techniques you enable. Our benchmarks show 62-88% on standard transformers. Smaller and older models show less; large modern models with long sequences see the most savings.

**Q: Is MemScale open source?**
The source code is currently proprietary. The PyPI distribution is public (free to install and use). We may open-source it later based on community feedback. See [memscale.id](https://memscale.id) for licensing.

## Roadmap

- **v1.0.4** (current): 5 medium bug fixes (seq_len awareness, dtype-aware param count, tiling outputs, GPU downgrade cost, estimate_training_memory)
- **v1.1** (Q3 2026): Stability release, multi-GPU verified, AMD GPU (ROCm), full ZeRO-3 integration
- **v1.2** (Q4 2026): ML-based decision policy trained on telemetry data
- **v2.0** (2027): Multi-framework support (JAX, TensorFlow) + MemScale Serve (inference)

## Architecture

MemScale's optimization happens in stages:

1. **Profiling**: Static analysis with empirical fallback for dynamic models
2. **Decision engine**: Per-layer technique selection based on memory profile, hardware budget, and configuration
3. **Execution**: Apply chosen techniques via PyTorch hooks
4. **Observation**: Track memory and throughput, report to user
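
The profiling stage can be pictured with a small plain-PyTorch sketch: a static pass that sums parameter bytes per child module, and an empirical pass that records output sizes with forward hooks. This only illustrates the idea; the function names are hypothetical and MemScale's own profiler may work differently.

```python
import torch
from torch import nn


def param_bytes(model):
    """Static pass: parameter memory per direct child module."""
    return {
        name: sum(p.numel() * p.element_size() for p in child.parameters())
        for name, child in model.named_children()
    }


def activation_bytes(model, example_input):
    """Empirical pass: output tensor size per child for one forward pass."""
    sizes, handles = {}, []
    for name, child in model.named_children():
        def hook(_module, _inputs, output, name=name):
            if isinstance(output, torch.Tensor):
                sizes[name] = output.numel() * output.element_size()
        handles.append(child.register_forward_hook(hook))
    try:
        with torch.no_grad():
            model(example_input)
    finally:
        for handle in handles:
            handle.remove()  # always detach hooks, even on error
    return sizes


model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 8))
static = param_bytes(model)
dynamic = activation_bytes(model, torch.zeros(4, 16))
print(static, dynamic)
```

A per-layer profile like this is the kind of input a decision step could use to choose between checkpointing and offloading for each layer.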

## Reporting Issues

For bug reports, please include:
1. Minimal reproducible example
2. Hardware (GPU model, VRAM)
3. PyTorch version
4. Output of `memscale.profile_model(model)` if relevant

Email: [team@memscale.id](mailto:team@memscale.id)

## License

Proprietary. Full terms: contact [team@memscale.id](mailto:team@memscale.id) or visit [memscale.id](https://memscale.id).

## Citation

If you use MemScale in your research, please cite:

```bibtex
@software{memscale2026,
  title={MemScale: Drop-in Memory Optimization for PyTorch Training},
  author={MemScale Team},
  year={2026},
  url={https://memscale.id}
}
```

---

**Built for ML practitioners.** Questions? team@memscale.id
