Metadata-Version: 2.4
Name: mamba-scan-lite
Version: 0.2.0
Summary: Memory-efficient Mamba2 scan for HuggingFace Transformers. Fixes Zamba2 OOM on small GPUs.
Author: EchoLabs
License: MIT
Project-URL: Homepage, https://github.com/echo313unfolding/mamba-scan-lite
Project-URL: Repository, https://github.com/echo313unfolding/mamba-scan-lite
Project-URL: Issues, https://github.com/echo313unfolding/mamba-scan-lite/issues
Keywords: mamba,mamba2,zamba2,ssm,state-space-model,memory-efficient,oom-fix,transformers,huggingface,small-gpu,quantization
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch>=2.0
Requires-Dist: transformers>=4.45
Dynamic: license-file

# mamba-scan-lite

**Memory-efficient Mamba2 scan for HuggingFace Transformers. Fixes Zamba2 OOM on small GPUs. No CUDA compilation required.**

## The Problem

Running Zamba2 models on GPUs with less than 8 GB VRAM fails with `OutOfMemoryError`, even though the model weights fit. The cause is HuggingFace's naive Mamba2 scan implementation, which materializes GB-sized intermediate tensors for the SSM computation.

The official fix is to install `mamba-ssm`, but that requires compiling custom CUDA kernels, and compilation fails on many setups (driver mismatches, missing headers, ABI conflicts).

## The Fix

```bash
pip install mamba-scan-lite
```

```python
import torch
import mamba_scan_lite  # importing this patches HF Zamba2 automatically

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "Zyphra/Zamba2-2.7B-instruct",
    device_map="cuda",
    torch_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained("Zyphra/Zamba2-2.7B-instruct")

inputs = tokenizer("Hello, world!", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

That's it. One import, no configuration.

## What It Does

Replaces two components in HF's `Zamba2MambaMixer.torch_forward`:

1. **Chunked vectorized SSM scan** (v0.2.0) instead of the chunked SSD that allocates GB-sized 6D tensors. Within each chunk, the linear recurrence is solved in closed form via cumulative products and prefix sums, with no Python loop. Across chunks, state is carried sequentially. Memory stays at a ~32 MB working set instead of 1+ GB.

2. **Manual conv1d** via unfold+einsum instead of `F.conv1d`; this avoids cuDNN initialization failures on older drivers (Turing/SM75 GPUs). A minimal sketch of the trick follows below.

The decode path (single-token generation with KV cache) is unchanged.
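For reference, the unfold+einsum trick for a depthwise causal conv1d is easy to verify on CPU. This is a minimal sketch (shapes and names are illustrative, not the package's exact code):

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: batch 2, channels 8, sequence 16, kernel 4 (depthwise, causal)
B, C, L, K = 2, 8, 16, 4
x = torch.randn(B, C, L)
w = torch.randn(C, 1, K)  # depthwise weights, groups == C

# cuDNN path: F.conv1d with left-only padding for causality
y_ref = F.conv1d(F.pad(x, (K - 1, 0)), w, groups=C)

# cuDNN-free path: sliding windows via unfold, reduced with einsum
windows = F.pad(x, (K - 1, 0)).unfold(-1, K, 1)  # (B, C, L, K)
y_manual = torch.einsum("bclk,ck->bcl", windows, w.squeeze(1))

assert torch.allclose(y_ref, y_manual, atol=1e-6)
```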

## Benchmarks

Tested on NVIDIA Quadro T2000 (4 GB VRAM, Turing SM75):

| Model | Without Patch | v0.1.0 Sequential | v0.2.0 Chunked | Peak VRAM (patched) |
|-------|--------------|-------------------|----------------|-----------|
| Zamba2-1.2B | OOM | 2.3 tok/s | **2.3 tok/s** (decode unchanged) | 1,560 MB |
| Zamba2-2.7B-Instruct | OOM | 1.0 tok/s | **1.0 tok/s** (decode unchanged) | 3,134 MB |

Prefill speedup (v0.2.0 vs v0.1.0, Zamba2-2.7B on T2000):

| Sequence Length | v0.1.0 Prefill | v0.2.0 Prefill | Speedup |
|----------------|---------------|----------------|---------|
| 32 tokens | 1.56s | 0.81s | 1.9x |
| 64 tokens | 3.14s | 0.85s | 3.7x |
| 128 tokens | 6.37s | 1.23s | 5.2x |
| 256 tokens | 13.04s | 2.24s | 5.8x |

Decode speed (single-token generation) is identical between versions — the chunked scan only affects prefill.

## What's New in v0.2.0

- **Chunked vectorized scan** replaces the token-by-token sequential loop from v0.1.0
- **1.9x–5.8x prefill speedup** depending on sequence length (longer sequences benefit more)
- **NaN-safe fallback**: automatically falls back to sequential processing when the cumulative decay spans more than 80 orders of magnitude (a rare edge case with extreme decay rates); a sketch of such a guard follows this list
- **Token-identical output**: verified via comparison tests on real Zamba2 weights; the chunked and sequential scans produce bit-identical results
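The fallback trigger can be pictured as a log-space check. This is hypothetical code illustrating the idea; `needs_sequential_fallback` is not the package's API:

```python
import math

import torch

def needs_sequential_fallback(decay: torch.Tensor, max_orders: float = 80.0) -> bool:
    """Return True when cumulative decay spans too many orders of magnitude."""
    # Working in log space keeps the check itself free of under/overflow.
    log_cum = torch.cumsum(torch.log(decay.double()), dim=-1)
    span = (log_cum.max() - log_cum.min()).item() / math.log(10.0)
    return span > max_orders

# 4096 steps at decay 0.95 span ~91 orders of magnitude -> take the sequential path
print(needs_sequential_fallback(torch.full((4096,), 0.95)))  # True
```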

## When to Use This

- You get `OutOfMemoryError` running Zamba2 on a small GPU
- `mamba-ssm` won't compile on your system
- You get `cuDNN error: CUDNN_STATUS_NOT_INITIALIZED`
- You want to run Zamba2 without any CUDA kernel compilation

## When NOT to Use This

- You have `mamba-ssm` installed and working (the official kernels are faster)
- You're on a large GPU (24+ GB) where the naive path doesn't OOM
- You need maximum throughput (this trades speed for memory, though v0.2.0 closes the gap significantly)

## How It Works

The HF naive Mamba2 scan computes the structured state space dual (SSD) via chunked matrix operations:

```python
# HF naive path (abridged) — allocates ~1+ GB of intermediates
G_intermediate = C[:, :, :, None, :, :] * B[:, :, None, :, :, :]  # 6D tensor
G = G_intermediate.sum(-1)
M = (G * L).sum(-1)                                               # L: decay mask
Y_diag = (M[..., None] * hidden_states[:, :, None]).sum(3)        # OOM here
```

The v0.2.0 chunked scan solves the recurrence in closed form per chunk:

```python
# mamba-scan-lite v0.2.0 (pseudocode) — ~32 MB working set
for chunk in chunks(seq_len, chunk_size=32):
    cumA = cumprod(decay_factors)          # cumulative decay, one vectorized op
    scaled_b = inputs / cumA               # rescale inputs by decay, element-wise
    prefix = cumsum(scaled_b)              # running sum, one vectorized op
    h_chunk = cumA * (h_init + prefix)     # every state in the chunk, closed form
    y_chunk = readout(h_chunk, C) + D * x  # project states to outputs
    h_init = h_chunk[-1]                   # carry final state to the next chunk
```

Same math. O(state_size) memory. 4 vectorized ops per chunk instead of chunk_size loop iterations.
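To see that the chunked closed form matches the loop, here is a toy scalar version in float64 (illustrative only; the real scan is batched per head and state dimension, and takes the NaN-safe fallback when `cumA` underflows):

```python
import torch

torch.manual_seed(0)
T, chunk = 128, 32
a = 0.5 + 0.5 * torch.rand(T, dtype=torch.float64)  # decay factors in (0.5, 1.0)
b = torch.randn(T, dtype=torch.float64)             # inputs

# Reference: token-by-token recurrence h_t = a_t * h_{t-1} + b_t
h, seq = torch.zeros((), dtype=torch.float64), []
for t in range(T):
    h = a[t] * h + b[t]
    seq.append(h)
seq = torch.stack(seq)

# Closed form per chunk: h_t = cumA_t * (h_init + sum_{s<=t} b_s / cumA_s)
h_init, outs = torch.zeros((), dtype=torch.float64), []
for s in range(0, T, chunk):
    a_c, b_c = a[s:s + chunk], b[s:s + chunk]
    cumA = torch.cumprod(a_c, dim=0)
    h_c = cumA * (h_init + torch.cumsum(b_c / cumA, dim=0))
    outs.append(h_c)
    h_init = h_c[-1]  # carry the final state across the chunk boundary

assert torch.allclose(seq, torch.cat(outs))
```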

## Requirements

- Python >= 3.9
- PyTorch >= 2.0
- Transformers >= 4.45 (Zamba2 support)

## Also From EchoLabs

- [helix-substrate](https://pypi.org/project/helix-substrate/) — Universal model weight compression via HXQ (2D Vector Quantization). Faster-than-dense inference on compressed models.
- [EchoLabs33 on HuggingFace](https://huggingface.co/EchoLabs33) — 14 compressed models across Transformer, SSM, and Hybrid architectures.

## License

MIT
