Metadata-Version: 2.4
Name: memscale
Version: 1.0.2
Summary: Drop-in memory optimizer for PyTorch training. Reduce VRAM significantly with one line of code.
Author-email: MemScale Team <team@memscale.id>
Maintainer-email: MemScale Team <team@memscale.id>
License: Proprietary
Project-URL: Homepage, https://memscale.id
Project-URL: Documentation, https://memscale.id
Project-URL: Changelog, https://memscale.id/changelog
Project-URL: Pricing, https://app.memscale.id/pricing
Keywords: pytorch,deep-learning,memory-optimization,vram,gpu,llm-training,fsdp,distributed-training,training,checkpointing,offloading,mixed-precision
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: Other/Proprietary License
Classifier: Operating System :: POSIX :: Linux
Classifier: Operating System :: MacOS
Classifier: Operating System :: Microsoft :: Windows
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Typing :: Typed
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch>=2.1.0
Requires-Dist: numpy>=1.21.0
Requires-Dist: psutil>=5.9.0
Requires-Dist: httpx>=0.25.0
Provides-Extra: hf
Requires-Dist: transformers>=4.36.0; extra == "hf"
Requires-Dist: datasets>=2.16.0; extra == "hf"
Requires-Dist: accelerate>=0.26.0; extra == "hf"
Requires-Dist: peft>=0.7.0; extra == "hf"
Provides-Extra: lightning
Requires-Dist: lightning>=2.1.0; extra == "lightning"
Requires-Dist: torchmetrics>=1.2.0; extra == "lightning"
Provides-Extra: observability
Requires-Dist: prometheus-client>=0.19.0; extra == "observability"
Requires-Dist: tqdm>=4.66.0; extra == "observability"
Provides-Extra: quantization
Requires-Dist: bitsandbytes>=0.42.0; extra == "quantization"
Provides-Extra: all
Requires-Dist: memscale[hf,lightning,observability,quantization]; extra == "all"
Dynamic: license-file

# MemScale

**Drop-in memory optimizer for PyTorch training. Reduce VRAM up to 88% with 1 line of code.**

[![PyPI version](https://img.shields.io/pypi/v/memscale.svg)](https://pypi.org/project/memscale/)
[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
[![License](https://img.shields.io/badge/license-Proprietary-blue.svg)](LICENSE)
[![Tests](https://img.shields.io/badge/tests-233%20passing-brightgreen.svg)](#)

---

## The problem

Training large models on GPUs hits a wall: **VRAM**.

- BERT-Large with batch 16 → 17.6 GB on RTX 3090
- 1.5B-parameter model → out of memory on a single 24 GB GPU
- DeepSpeed ZeRO setup → 2 weeks of configuration

**MemScale solves this.** Wrap your trainer or model in one line and get up to 88% VRAM reduction, with no other code changes.

## Real benchmarks (validated on RTX 3090 24GB)

| Model | Params | Baseline | MemScale | Reduction |
|-------|--------|----------|----------|-----------|
| BERT-Base | 85M | 6.39 GB | 1.87 GB | **70.8%** |
| BERT-Large | 302M | 17.57 GB | 2.11 GB | **88.0%** |
| GPT-2 Medium | 302M | 19.02 GB | 7.16 GB | **62.4%** |
| GPT-2 Large | 708M | 21.78 GB | 4.88 GB | **77.6%** |
| 1.3B model | 1.3B | OOM | 8.86 GB | **Enables training** |
| GPT-2 XL | 1.5B | OOM | 12.72 GB | **Enables training** |
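
The Reduction column follows from the two memory columns as (baseline - MemScale) / baseline. A minimal sketch that recomputes three rows from the rounded values shown above (the helper name is ours, for illustration only; the published percentages may have been derived from unrounded measurements):

```python
def vram_reduction_pct(baseline_gb: float, optimized_gb: float) -> float:
    """Percentage of peak VRAM saved relative to the baseline run."""
    if baseline_gb <= 0:
        raise ValueError("baseline_gb must be positive")
    return (1.0 - optimized_gb / baseline_gb) * 100.0


# (baseline GB, MemScale GB) copied from the benchmark table
rows = {
    "BERT-Large": (17.57, 2.11),
    "GPT-2 Medium": (19.02, 7.16),
    "GPT-2 Large": (21.78, 4.88),
}

for name, (baseline, optimized) in rows.items():
    print(f"{name}: {vram_reduction_pct(baseline, optimized):.1f}%")
```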

**Comparison:** PyTorch's native activation checkpointing achieves roughly 70% on the same workloads. MemScale matches or exceeds it with the right configuration, and **enables training models that PyTorch alone cannot fit**.
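
For reference, "native checkpointing" here presumably means `torch.utils.checkpoint`. The sketch below shows that baseline technique on a toy model (the `CheckpointedMLP` class is a made-up example, not part of MemScale or the benchmark setup): each block keeps only its input and recomputes inner activations during the backward pass.

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint


class CheckpointedMLP(nn.Module):
    """Toy stack whose blocks recompute activations during backward."""

    def __init__(self, width: int = 64, depth: int = 4) -> None:
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.Sequential(nn.Linear(width, width), nn.GELU()) for _ in range(depth)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            # Only the block input is stored; activations inside the block
            # are recomputed when gradients are needed.
            x = checkpoint(block, x, use_reentrant=False)
        return x


model = CheckpointedMLP()
out = model(torch.randn(8, 64, requires_grad=True))
out.sum().backward()
```

This trades extra forward compute for lower activation memory, which is the same tradeoff discussed in the FAQ.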

## Quick start

```bash
pip install memscale
pip install bitsandbytes  # optional, for additional 8-bit Adam savings
```

```python
import memscale
from transformers import Trainer, TrainingArguments

trainer = Trainer(
    model=model,
    args=TrainingArguments(per_device_train_batch_size=16),
    train_dataset=dataset,
)

# Add this one line:
trainer = memscale.wrap(trainer)

trainer.train()  # Up to 88% less VRAM
```

That's it. MemScale automatically:
1. **Profiles** your model's memory usage per layer
2. **Decides** which optimization technique fits each layer best
3. **Applies** boundary checkpointing, plus 8-bit Adam and mixed precision when enabled
4. **Reports** memory savings and throughput in real time

## Maximum reduction (combined techniques)

For maximum savings, enable all techniques:

```python
import torch
from memscale import Config, OptimizationMode
from memscale.phase_f import apply_all_optimizations

model = YourModel()
optimizer = torch.optim.AdamW(model.parameters())

config = Config(
    mode=OptimizationMode.AGGRESSIVE,
    use_8bit_optimizer=True,    # bitsandbytes 8-bit Adam
    use_mixed_precision=True,   # BF16 on Ampere+, FP16 fallback
)

# One call applies all techniques
model, optimizer = apply_all_optimizations(model, optimizer, config)

# Train normally
for batch in dataloader:
    loss = model(batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

This stack achieved **88.0% reduction on BERT-Large** in our benchmarks.

## How it works

MemScale combines proven memory optimization techniques and chooses what fits each layer:

| Technique | Saves | When applied |
|-----------|-------|--------------|
| Boundary checkpointing | ~70% (activations) | Transformer blocks (BertLayer, GPT2Block, TransformerEncoderLayer, ViTLayer, etc.) |
| 8-bit Adam (bitsandbytes) | ~75% (optimizer state) | When `use_8bit_optimizer=True` and bitsandbytes installed |
| Mixed precision (BF16/FP16) | ~50% (params/activations) | When `use_mixed_precision=True` on Ampere+ GPUs |
| CPU offload | Variable | Large layers when checkpointing is not enough |
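
As a rough check on the ~75% figure for optimizer state: standard Adam/AdamW keeps two FP32 moment tensors per parameter (8 bytes), while an 8-bit optimizer stores each moment in one byte (2 bytes), ignoring the small per-block quantization metadata. A back-of-the-envelope sketch (the helper name is illustrative, not part of MemScale):

```python
def adam_state_gb(num_params: float, bytes_per_moment: int) -> float:
    """Memory for Adam's two moment estimates, in decimal GB (1e9 bytes)."""
    return 2 * num_params * bytes_per_moment / 1e9


params = 1.5e9  # e.g. a 1.5B-parameter model
fp32_state = adam_state_gb(params, 4)   # 12.0 GB
int8_state = adam_state_gb(params, 1)   # 3.0 GB
saving = 1 - int8_state / fp32_state    # 0.75
print(f"{fp32_state:.1f} GB -> {int8_state:.1f} GB ({saving:.0%} saved)")
```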

The decision engine analyzes your model and picks the right technique per layer — you don't need to configure individual layers.
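
MemScale's actual selection logic is not documented here. Purely as an illustration of what per-layer selection can look like, the sketch below uses hypothetical names (`LayerProfile`, `choose_technique`) and invented thresholds; it is not MemScale's API or policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerProfile:
    name: str
    activation_mb: float        # activations kept for the backward pass
    param_mb: float             # parameter memory
    is_transformer_block: bool


def choose_technique(layer: LayerProfile, free_budget_mb: float) -> str:
    """Toy policy: checkpoint transformer blocks, offload what still does not fit."""
    if layer.is_transformer_block and layer.activation_mb > 0:
        return "checkpoint"
    if layer.param_mb + layer.activation_mb > free_budget_mb:
        return "cpu_offload"
    return "none"


# Hypothetical profiles; the numbers are invented for the example.
layers = [
    LayerProfile("encoder.layer.0", activation_mb=512.0, param_mb=48.0, is_transformer_block=True),
    LayerProfile("embeddings", activation_mb=16.0, param_mb=1200.0, is_transformer_block=False),
    LayerProfile("classifier", activation_mb=1.0, param_mb=4.0, is_transformer_block=False),
]
plan = {layer.name: choose_technique(layer, free_budget_mb=1000.0) for layer in layers}
print(plan)
```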

## Multi-GPU support

Multi-GPU training works via standard PyTorch DDP. MemScale's per-GPU optimizations apply on each GPU:

```bash
torchrun --nproc_per_node=2 your_training_script.py
```

```python
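import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel
from memscale.phase_f import apply_all_optimizations

dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
torch.cuda.set_device(local_rank)

model = YourModel().to(local_rank)
optimizer = torch.optim.AdamW(model.parameters())
model, optimizer = apply_all_optimizations(model, optimizer, config)  # config as above
model = DistributedDataParallel(model, device_ids=[local_rank])

# Train normally - 87% per-GPU reduction with 2x throughput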
```

Validated on 2x RTX 3090: **1.69 GB per GPU** (vs 13 GB baseline single-GPU).

### Distributed sharding (research preview)

Phase G provides ZeRO-3-inspired building blocks for parameter and optimizer sharding:

```python
from memscale.distributed import (
    init_distributed,
    shard_model_parameters,
    ShardedOptimizer,
)
```

**Note:** Phase G provides the `ShardedParameter` and `ShardedOptimizer` classes with NCCL-based all-gather. Full integration with model forward/backward hooks is planned for v1.1. For production multi-GPU training requiring 95%+ reduction today, use FSDP or DeepSpeed.
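
For readers unfamiliar with ZeRO-3-style sharding, the core idea is that each rank stores only a 1/world_size slice of every flattened parameter and gathers the full tensor just before it is used. The pure-Python sketch below illustrates the partitioning arithmetic only; the function names are hypothetical and do not reflect `memscale.distributed` internals or any NCCL calls.

```python
def shard_bounds(numel: int, world_size: int, rank: int) -> tuple[int, int]:
    """Half-open [start, end) slice of a flat parameter owned by `rank`."""
    shard = -(-numel // world_size)  # ceil division; the last shard may be short
    start = min(rank * shard, numel)
    return start, min(start + shard, numel)


def all_gather(shards: list[list[float]]) -> list[float]:
    """Stand-in for a collective all-gather: concatenate shards in rank order."""
    return [x for shard in shards for x in shard]


flat_param = [float(i) for i in range(10)]
world_size = 4
shards = [
    flat_param[slice(*shard_bounds(len(flat_param), world_size, r))]
    for r in range(world_size)
]
print([len(s) for s in shards])
```

Gathering the shards in rank order reconstructs the original parameter, so each rank only needs the full tensor transiently.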

## Usage modes

### HuggingFace Trainer

```python
import memscale
trainer = memscale.wrap(your_hf_trainer)
trainer.train()
```

### PyTorch Lightning

```python
from lightning import Trainer
from memscale.integrations.lightning import MemScaleLightningCallback

trainer = Trainer(
    callbacks=[MemScaleLightningCallback()],
    max_epochs=10,
)
trainer.fit(model, dataloader)
```

### Custom training loop

```python
import memscale

with memscale.optimize(model, optimizer) as ms:
    for batch in dataloader:
        loss = model(batch).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```

### Configuration

Most users don't need this; the defaults are intended to cover the common cases.

```python
from memscale import wrap, Config, OptimizationMode

config = Config(
    mode=OptimizationMode.AGGRESSIVE,  # or BALANCED (default), CONSERVATIVE
    enable_checkpointing=True,
    enable_offloading=True,
    use_8bit_optimizer=False,    # set True for max reduction
    use_mixed_precision=False,   # set True for max reduction
    target_gpu_utilization=0.85,
)

trainer = wrap(trainer, config=config)
```

## Compatibility

| Component | Min Version | Tested |
|-----------|------------|--------|
| Python | 3.10 | 3.10, 3.11, 3.12 |
| PyTorch | 2.1 | 2.1, 2.2, 2.3, 2.4 |
| CUDA | 11.8 | 11.8, 12.1, 12.4, 12.8 |
| GPU | Compute capability 7.0+ | V100, A100, H100, RTX 3090/4090 |
| BF16 mixed precision | Compute capability 8.0+ | A100, H100, RTX 3090/4090 |
| OS | Linux | Ubuntu 20.04, 22.04, 24.04 |
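
The GPU and BF16 rows can be turned into a runtime check. A small sketch of the rule implied by the table (BF16 at compute capability 8.0 or higher, FP16 on 7.x); `pick_amp_dtype` is a hypothetical helper, not a MemScale function:

```python
def pick_amp_dtype(capability: tuple[int, int]) -> str:
    """Return the mixed-precision dtype name for a CUDA compute capability."""
    major, _minor = capability
    if major < 7:
        raise ValueError("compute capability 7.0+ is required")
    return "bfloat16" if major >= 8 else "float16"


print(pick_amp_dtype((8, 6)))  # RTX 3090
print(pick_amp_dtype((7, 0)))  # V100
```

On a CUDA machine, `torch.cuda.get_device_capability()` returns the `(major, minor)` tuple to pass in.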

AMD GPU support (ROCm) coming in a future release.

## FAQ

**Q: Does MemScale change my training results?**
Activation checkpointing and DDP are mathematically lossless. BF16/FP16 mixed precision introduces small numerical differences, the same as standard PyTorch AMP. 8-bit Adam, if enabled, quantizes optimizer state and can likewise introduce small differences.

**Q: How does this compare to DeepSpeed and FSDP?**
DeepSpeed and FSDP are powerful but require significant configuration and distributed training expertise. MemScale's value is plug-and-play: 1-line wrap with auto-detection. For 95%+ reduction in production multi-GPU setups, DeepSpeed ZeRO-3 is more mature. For single-GPU and DDP workloads, MemScale is competitive and easier to use.

**Q: Will this slow down my training?**
Activation checkpointing adds 20-30% compute overhead (the standard tradeoff). 8-bit Adam adds ~2-5%. Net effect: training is slower per step, but you can use larger batches (better hardware utilization), so end-to-end time often improves.

**Q: What if my model has custom architecture?**
The decision engine handles standard transformers (PyTorch native, HuggingFace BERT/GPT2/Llama/Mistral, vision transformers) automatically. Custom architectures fall back to per-module heuristics. Both are tested.

**Q: Why "up to 88%" instead of a flat number?**
Reduction depends on model architecture, batch size, sequence length, and which techniques you enable. Our benchmarks show 62-88% on standard transformers. Smaller and older models show less; large modern models with long sequences see the most savings.

## Roadmap

- **v1.0** (current): Boundary checkpointing, 8-bit Adam, BF16 mixed precision, DDP support, sharding building blocks
- **v1.1**: Phase G full integration (ZeRO-3 forward/backward hooks), AMD GPU (ROCm)
- **v1.2**: FSDP/DeepSpeed integration helpers, web dashboard
- **v2.0**: Learned decision policy (model-specific tuning)

## Architecture

MemScale's optimization happens in stages:

1. **Profiling**: Static analysis via `torch.fx`, with empirical fallback for dynamic models
2. **Decision engine**: Per-layer technique selection based on memory profile, hardware budget, and configuration
3. **Execution**: Apply chosen techniques via PyTorch hooks
4. **Observation**: Track memory and throughput, report to user
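
The profiling stage is described only at a high level. As a rough illustration of the kind of per-layer measurement involved, the sketch below sums parameter bytes per child module using plain PyTorch. It covers parameters only (not activations or optimizer state) and is not MemScale's profiler; `param_bytes_per_child` is a hypothetical helper.

```python
from torch import nn


def param_bytes_per_child(model: nn.Module) -> dict[str, int]:
    """Parameter memory in bytes for each direct child module."""
    return {
        name: sum(p.numel() * p.element_size() for p in child.parameters())
        for name, child in model.named_children()
    }


model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))
print(param_bytes_per_child(model))
```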

Source code is organized as:

```
memscale/
├── core/           # Profiler, decision engine, executor, config
├── techniques/     # Checkpointing, 8-bit optimizer, mixed precision
├── distributed/    # FSDP integration + ZeRO-3 inspired sharding (Phase G)
├── integrations/   # HuggingFace, Lightning adapters
└── phase_f.py      # apply_all_optimizations one-line API
```

## Contributing

Issues and PRs welcome. Please include:
1. Minimal reproducible example
2. Hardware (GPU model, VRAM)
3. PyTorch version
4. Output of `memscale.profile_model(model)` if relevant

## License

Proprietary. See [LICENSE](LICENSE) for terms.

## Citation

If you use MemScale in your research, please cite:

```bibtex
@software{memscale2026,
  title={MemScale: Drop-in Memory Optimization for PyTorch Training},
  author={MemScale Team},
  year={2026},
  url={https://github.com/MrGinkaku/MemScale}
}
```

---

**Built for ML practitioners.** Questions? team@memscale.id
