Metadata-Version: 2.4
Name: mamba-scan-lite
Version: 0.1.0
Summary: Memory-efficient Mamba2 scan for HuggingFace Transformers. Fixes Zamba2 OOM on small GPUs.
Author: EchoLabs
License: MIT
Project-URL: Homepage, https://github.com/echo313unfolding/mamba-scan-lite
Project-URL: Repository, https://github.com/echo313unfolding/mamba-scan-lite
Project-URL: Issues, https://github.com/echo313unfolding/mamba-scan-lite/issues
Keywords: mamba,mamba2,zamba2,ssm,state-space-model,memory-efficient,oom-fix,transformers,huggingface,small-gpu,quantization
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch>=2.0
Requires-Dist: transformers>=4.45
Dynamic: license-file

# mamba-scan-lite

**Memory-efficient Mamba2 scan for HuggingFace Transformers. Fixes Zamba2 OOM on small GPUs. No CUDA compilation required.**

## The Problem

Running Zamba2 models on GPUs with less than 8 GB VRAM fails with `OutOfMemoryError`, even though the model weights fit. The cause is HuggingFace's naive Mamba2 scan implementation, which materializes GB-sized intermediate tensors for the SSM computation.

The official fix is to install `mamba-ssm`, but that requires compiling CUDA kernels, which fails on many setups (driver mismatches, missing headers, ABI conflicts).

## The Fix

```bash
pip install mamba-scan-lite
```

```python
import torch
import mamba_scan_lite  # patches HF Zamba2 automatically


from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "Zyphra/Zamba2-2.7B-instruct",
    device_map="cuda",
    torch_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained("Zyphra/Zamba2-2.7B-instruct")

inputs = tokenizer("Hello, world!", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

That's it. One import, no configuration.
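
Under the hood, the import swaps a method on the HF mixer class (detailed in the next section). Below is a minimal sketch of how an import-time patch of this kind can be wired; the module path is an assumption and the placeholder body just delegates, so this is illustrative rather than the package's actual source:

```python
# Illustrative sketch of an import-time patch, not the package's actual source.
# Class and method names come from this README; the module path is assumed.
from transformers.models.zamba2 import modeling_zamba2

_original_torch_forward = modeling_zamba2.Zamba2MambaMixer.torch_forward

def _patched_torch_forward(self, *args, **kwargs):
    # The real patch would run the memory-efficient sequential scan here
    # (see "How It Works" below); this placeholder simply delegates.
    return _original_torch_forward(self, *args, **kwargs)

modeling_zamba2.Zamba2MambaMixer.torch_forward = _patched_torch_forward
```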

## What It Does

Replaces two components in HF's `Zamba2MambaMixer.torch_forward`:

1. **Sequential SSM scan** instead of the chunked SSD that allocates GB-sized 6D tensors. Processes tokens one at a time, maintaining an `[batch, heads, head_dim, state_size]` state (~1 MB) instead of materializing the full scan matrix.

2. **Manual conv1d** via unfold+einsum instead of `F.conv1d`; this sidesteps cuDNN initialization failures on older drivers (Turing/SM75 GPUs). A sketch of the technique appears at the end of this section.

The decode path (single-token generation with KV cache) is unchanged.
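
For the second replacement, here is a minimal sketch of a depthwise causal conv1d built from `unfold` + `einsum`; it illustrates the generic technique rather than the package's exact code, and the function name and shapes are assumptions:

```python
import torch
import torch.nn.functional as F

def depthwise_causal_conv1d(x, weight, bias=None):
    """Depthwise causal conv1d via unfold + einsum, so cuDNN is never invoked.

    x:      [batch, channels, seq_len]
    weight: [channels, kernel_size]  (one kernel per channel)
    """
    kernel_size = weight.shape[-1]
    # left-pad so each output position sees only current and past inputs
    x = F.pad(x, (kernel_size - 1, 0))
    # sliding windows over time: [batch, channels, seq_len, kernel_size]
    windows = x.unfold(-1, kernel_size, 1)
    # per-channel dot product of each window with that channel's kernel
    y = torch.einsum("bclk,ck->bcl", windows, weight)
    if bias is not None:
        y = y + bias[None, :, None]
    return y
```

The output matches a grouped `F.conv1d` with causal padding, but it routes through plain padding, slicing, and matmul kernels instead of cuDNN's convolution paths.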

## Benchmarks

Tested on NVIDIA Quadro T2000 (4 GB VRAM, Turing SM75):

| Model | Without Patch | With Patch | VRAM Peak |
|-------|--------------|------------|-----------|
| Zamba2-1.2B | OOM | **2.3 tok/s** | 1,560 MB |
| Zamba2-2.7B-Instruct | OOM | **1.0 tok/s** | 3,134 MB |

Also fixes `RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED` on Turing GPUs.

## When to Use This

- You get `OutOfMemoryError` running Zamba2 on a small GPU
- `mamba-ssm` won't compile on your system
- You get `cuDNN error: CUDNN_STATUS_NOT_INITIALIZED`
- You want to run Zamba2 without any CUDA kernel compilation

## When NOT to Use This

- You have `mamba-ssm` installed and working (the official kernels are faster)
- You're on a large GPU (24+ GB) where the naive path doesn't OOM
- You need maximum throughput (this trades speed for memory)

## How It Works

The HF naive Mamba2 scan computes the structured state space dual (SSD) via chunked matrix operations:

```python
# HF naive path (abridged): materializes a GB-scale intermediate
G = C[:, :, :, None, :, :] * B[:, :, None, :, :, :]               # 6D tensor over chunk pairs
M = (G * L).sum(-1)                                               # L: lower-triangular decay mask
Y_diag = (M[..., None] * hidden_states[:, :, None]).sum(3)        # OOM here
```

The patched path computes the same recurrence sequentially:

```python
# mamba-scan-lite: sequential recurrence with a ~1 MB running state
# (broadcasting over the batch/head dims is elided for clarity)
for t in range(seq_len):
    # decay the previous state, then add a rank-1 update: outer product of (dt*x) and B
    h = torch.exp(A * dt[t]) * h + torch.einsum("...d,...n->...dn", dt[t] * x[t], B[t])
    # readout: contract the state with C over state_size, plus the D skip connection
    y[t] = torch.einsum("...n,...dn->...d", C[t], h) + D * x[t]
```

Same math, but the scan's working memory scales with the state size instead of growing with the sequence length.
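
For a rough sense of scale, here is a back-of-the-envelope comparison; the dimensions are illustrative placeholders, not Zamba2's actual configuration:

```python
# Illustrative dimensions only, not taken from any Zamba2 config.
batch, heads, head_dim, state_size = 1, 64, 64, 128
seq_len, chunk_len = 2048, 256
bytes_fp16 = 2

# Sequential scan: a single running state tensor, updated in place each step.
state_bytes = batch * heads * head_dim * state_size * bytes_fp16
print(f"recurrent state:      {state_bytes / 2**20:.1f} MiB")   # 1.0 MiB

# Chunked SSD: pairwise (chunk_len x chunk_len) interactions per chunk and head,
# with a state_size factor before the reduction; this is the tensor that OOMs.
ssd_bytes = batch * (seq_len // chunk_len) * chunk_len**2 * heads * state_size * bytes_fp16
print(f"chunked intermediate: {ssd_bytes / 2**30:.1f} GiB")     # 8.0 GiB
```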

## Requirements

- Python >= 3.9
- PyTorch >= 2.0
- Transformers >= 4.45 (Zamba2 support)

## Also From EchoLabs

- [helix-substrate](https://pypi.org/project/helix-substrate/) — Universal model weight compression via HXQ (2D Vector Quantization). Faster-than-dense inference on compressed models.
- [EchoLabs33 on HuggingFace](https://huggingface.co/EchoLabs33) — 12 compressed models across Transformer, SSM, and Hybrid architectures.

## License

MIT
