Metadata-Version: 2.4
Name: spectralquant
Version: 0.3.1
Summary: Eigenspectral KV cache compression for transformer inference. Up to 6.55x compression with FP16-equivalent quality, drop-in for HuggingFace LLMs and vision transformers.
Author-email: Anirudh Bharadwaj Vangara <anirudh@sentra.app>, Ashwin Gopinath <ashwin@sentra.app>
License: MIT
Project-URL: Homepage, https://github.com/Dynamis-Labs/spectralquant_package
Project-URL: Repository, https://github.com/Dynamis-Labs/spectralquant_package
Project-URL: Documentation, https://github.com/Dynamis-Labs/spectralquant_package#readme
Project-URL: Issues, https://github.com/Dynamis-Labs/spectralquant_package/issues
Keywords: kv-cache,compression,quantization,llm,attention,transformer,inference,spectral,eigenspectral,water-filling,huggingface,vision-transformer
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch>=2.2.0
Requires-Dist: numpy>=1.24
Requires-Dist: scipy>=1.11
Requires-Dist: tqdm>=4.65
Provides-Extra: hf
Requires-Dist: transformers>=4.40.0; extra == "hf"
Requires-Dist: accelerate>=0.27.0; extra == "hf"
Provides-Extra: vit
Requires-Dist: transformers>=4.40.0; extra == "vit"
Requires-Dist: Pillow>=10.0; extra == "vit"
Provides-Extra: alphafold
Requires-Dist: transformers>=4.40.0; extra == "alphafold"
Provides-Extra: esmfold
Requires-Dist: transformers>=4.40.0; extra == "esmfold"
Provides-Extra: videomae
Requires-Dist: transformers>=4.40.0; extra == "videomae"
Requires-Dist: Pillow>=10.0; extra == "videomae"
Provides-Extra: video
Requires-Dist: transformers>=4.40.0; extra == "video"
Requires-Dist: Pillow>=10.0; extra == "video"
Requires-Dist: av>=10.0.0; extra == "video"
Provides-Extra: examples
Requires-Dist: transformers>=4.40.0; extra == "examples"
Requires-Dist: accelerate>=0.27.0; extra == "examples"
Requires-Dist: datasets>=2.14; extra == "examples"
Requires-Dist: Pillow>=10.0; extra == "examples"
Requires-Dist: requests>=2.28; extra == "examples"
Requires-Dist: av>=10.0.0; extra == "examples"
Requires-Dist: numpy>=1.24; extra == "examples"
Provides-Extra: dev
Requires-Dist: pytest; extra == "dev"
Requires-Dist: pytest-cov; extra == "dev"
Requires-Dist: build; extra == "dev"
Requires-Dist: twine; extra == "dev"
Requires-Dist: ruff; extra == "dev"
Provides-Extra: all
Requires-Dist: spectralquant[alphafold,dev,esmfold,examples,hf,videomae,vit]; extra == "all"
Dynamic: license-file

<p align="center">
  <img src="https://raw.githubusercontent.com/Dynamis-Labs/spectralquant_package/main/assets/spectralquant_banner.png" alt="SpectralQuant" width="100%">
</p>

# SpectralQuant

Eigenspectral KV cache compression for transformer inference. Up to 6.55x
compression of the KV cache with FP16-equivalent output quality.

```bash
pip install spectralquant
```

## What it does

LLM inference at long context is often bottlenecked by the size of the KV
cache. The cache grows linearly with sequence length and batch size, and at
long context it can consume more memory than the model weights themselves.
SpectralQuant compresses that cache by exploiting the fact that, after a
per-head spectral rotation, only a small number of dimensions carry most of
the information.
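
To make the linear growth concrete, here is a back-of-the-envelope sketch of FP16 cache size. The helper `kv_cache_bytes` is hypothetical and not part of the package. The layer and head counts are the ones in Mistral 7B's public config, and the 6.55x column simply divides by the headline ratio.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   bytes_per_elem=2, batch=1):
    """Bytes needed to cache keys and values for `seq_len` tokens per sequence."""
    # Leading factor of 2: one tensor for keys and one for values, in every layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len * batch


# Mistral 7B shapes: 32 layers, 8 KV heads (grouped-query attention), head_dim 128.
MISTRAL_7B = dict(n_layers=32, n_kv_heads=8, head_dim=128)

print(f"{kv_cache_bytes(1, **MISTRAL_7B) // 1024} KiB per token")
for seq_len, batch in [(4_096, 1), (32_768, 1), (32_768, 4)]:
    gib = kv_cache_bytes(seq_len, batch=batch, **MISTRAL_7B) / 2**30
    print(f"{seq_len:>6} tokens x batch {batch}: "
          f"{gib:5.2f} GiB FP16, {gib / 6.55:5.2f} GiB at 6.55x")
```

For comparison, the FP16 weights of a 7B-class model take roughly 14 GB, so a batch of four 32k-token sequences already needs more memory for the cache than for the weights.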

A short calibration step measures the eigenstructure of each attention head.
Each head's keys and values are then split into a high-variance "semantic"
band and a low-variance "tail" band. The semantic band gets a generous bit
budget; the tail gets one or two bits. Total cache size shrinks by up to
6.55x (the `high` preset) with output quality indistinguishable from FP16.
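
The rotation and band split can be illustrated on synthetic data. This is a minimal NumPy sketch, not the package's implementation: the data is random, and reading `d_eff_variance` as a cumulative explained-variance threshold is an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
head_dim, n_tokens = 64, 4096

# Synthetic "keys" for one head: variance decays quickly along hidden directions,
# then a random orthogonal mix hides that structure from the raw coordinates.
scales = np.exp(-np.arange(head_dim) / 4.0)
mix, _ = np.linalg.qr(rng.standard_normal((head_dim, head_dim)))
keys = (rng.standard_normal((n_tokens, head_dim)) * scales) @ mix.T

# Calibration: covariance of the head, then its eigendecomposition.
cov = np.cov(keys, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)            # eigh returns ascending order
order = np.argsort(eigvals)[::-1]                 # reorder to descending variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Rotation: project onto the eigenvectors, so coordinate i has variance eigvals[i].
rotated = keys @ eigvecs

# Band split (assumed rule): smallest number of leading coordinates that
# explains at least 93% of the total variance.
explained = np.cumsum(eigvals) / eigvals.sum()
d_eff = int(np.searchsorted(explained, 0.93) + 1)
print(f"semantic band: {d_eff} of {head_dim} dims, tail band: {head_dim - d_eff}")
```

Because the rotation is orthogonal, `rotated @ eigvecs.T` recovers the original keys exactly; all loss comes from quantizing the rotated coordinates.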

The package ships pure-PyTorch kernels and HuggingFace integrations. There
are no custom CUDA dependencies. It runs anywhere torch runs.

## Quickstart

```python
import torch
import spectralquant as sq
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.3",
    torch_dtype=torch.float16,
    device_map="auto",
)
tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")

engine = sq.SpectralQuant(compression="high")  # 6.55x preset

out = engine.generate(
    model, tok,
    "Explain water-filling bit allocation in two sentences.",
    max_new_tokens=120,
)

print(out["text"])
print(f"{out['stats']['ratio']:.2f}x compression, "
      f"{out['stats']['tokens_per_second']:.1f} tok/s")
```

The first call to `engine.generate(...)` runs a one-time calibration with a
bundled 64-sentence corpus. Subsequent calls reuse it. You can also pass your
own domain-specific corpus.

## Compression presets

```python
print(sq.describe_presets())
```

| preset     | ratio  | risk       | notes                                             |
|------------|--------|------------|---------------------------------------------------|
| `standard` | 5.95x  | safe       | Paper baseline. Production default.               |
| `high`     | 6.55x  | safe       | Validated on Mistral 7B and Qwen 2.5 7B.          |
| `max`      | 6.68x  | edge       | First paragraph clean. Light repetition possible. |

You can also override individual dials when you need them:

```python
engine = sq.SpectralQuant(
    compression="high",
    d_eff_variance=0.93,   # override one knob
)
```

The dials are `avg_bits`, `noise_bits`, `value_noise_bits`, and
`d_eff_variance`. Anything unset falls back to the named preset.
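
As a rough guide to how these dials trade off, the sketch below computes the mean bits per coordinate for a two-band split and the resulting ideal ratio against FP16. Both helpers are hypothetical, the example settings are not the values behind any named preset, and the estimate ignores any storage overhead (codebooks, scales, rotation matrices), so ratios reported by the package will differ.

```python
def effective_bits(head_dim, d_eff, avg_bits, noise_bits):
    """Mean bits per coordinate when d_eff coordinates get avg_bits and the rest noise_bits."""
    return (d_eff * avg_bits + (head_dim - d_eff) * noise_bits) / head_dim


def ideal_ratio(bits_per_coord, fp16_bits=16):
    """Compression vs FP16, ignoring all metadata overhead."""
    return fp16_bits / bits_per_coord


# Hypothetical settings for illustration only.
bits = effective_bits(head_dim=128, d_eff=16, avg_bits=6, noise_bits=2)
print(bits, ideal_ratio(bits))  # prints 2.5 6.4
```

The arithmetic shows why the tail budget dominates: with most coordinates in the tail band, moving `noise_bits` from 2 to 1 changes the ratio far more than a one-bit change to `avg_bits`.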

## Supported models

Tested and verified:

| family            | example                                       | works     |
|-------------------|-----------------------------------------------|-----------|
| Mistral           | `mistralai/Mistral-7B-Instruct-v0.3`          | yes       |
| Qwen 2.5          | `Qwen/Qwen2.5-7B-Instruct`                    | yes       |
| Llama 3.x         | `NousResearch/Meta-Llama-3.1-8B-Instruct`     | yes       |
| SmolLM2           | `HuggingFaceTB/SmolLM2-135M`                  | yes       |
| Gemma 2           | `google/gemma-2-9b`                           | expected  |

The cache-level integration works with any HuggingFace causal LM that uses
`DynamicCache` (transformers >= 4.40). RoPE-based architectures with grouped
query attention are the primary target.

For non-LLM transformers (ViT, ESMFold, VideoMAE, AlphaFold) see the modules
in `spectralquant.integrations`. Vision transformers can actually see a
quality *improvement* over FP16 because the eigenspectral filtering removes
noise in the low-variance directions.

## Hardware

| GPU                    | memory    | recommended for                      |
|------------------------|-----------|--------------------------------------|
| H100 / H200            | 80–141 GB | 7B, 13B, 70B inference, batch decode |
| A100 80 GB             | 80 GB     | 7B and 13B inference                 |
| A100 40 GB / A6000     | 40–48 GB  | 7B inference, short context          |
| RTX 4090 / 3090        | 24 GB     | 7B inference at FP16, short context  |
| RTX 4080 / T4 / 3060   | 12–16 GB  | smaller models, demo runs            |
| CPU                    | n/a       | works, but slow                      |

The compression ratios above were measured on H200 with Mistral 7B and Qwen
2.5 7B at sequence length 512. Compression is sequence-length agnostic, so
the ratios hold at longer contexts. Speed gains grow with context length,
because the FP16 baseline gets slower as its cache grows while the SQ decode
stays linear.

## Generating with a pre-compressed prefix

Useful when you want to keep one compressed cache and reuse it across many
completions of the same long prefix.

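```python
import copy

long_prefix = "..."   # your shared long context
question = "..."      # a follow-up appended to that context

result = engine.compress_prefill(model, tok, long_prefix)
cache  = result["cache"]                 # a fresh DynamicCache, FP16 surface
print(f"prefix compression: {result['stats']['ratio']:.2f}x")

# Use cache as past_key_values for any number of follow-ups. In recent
# transformers releases, generate() expects input_ids that cover the cached
# prefix as well as the new text, and it appends to the cache in place, so
# pass prefix + question and give each follow-up its own copy of the cache.
inputs = tok(long_prefix + question, return_tensors="pt").to(model.device)
ids = model.generate(
    **inputs,
    past_key_values=copy.deepcopy(cache),
    max_new_tokens=200,
)
print(tok.decode(ids[0], skip_special_tokens=True))
```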

## Custom calibration

The bundled corpus works for general English. For domain-specific workloads
(code, biomedical text, legal filings), pass your own:

```python
my_corpus = [...]   # 32–128 representative samples
engine = sq.SpectralQuant(compression="high")
engine.calibrate(model, tok, my_corpus)
```

Calibration takes a few seconds on H200. You can save the result once and
reload it in any later process:

```python
engine.save_calibration("/path/to/calib")
fresh = sq.SpectralQuant(compression="high")
fresh.load_calibration("/path/to/calib", head_dim=128)
```
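
A common pattern is to calibrate only when no saved calibration exists yet. The helper below is a hypothetical convenience wrapper, not something the package ships. It calls only the three engine methods shown above and assumes that `save_calibration` creates something at `path` that `os.path.exists` can detect.

```python
import os


def load_or_calibrate(engine, model, tok, corpus, path, head_dim=128):
    """Reuse a saved calibration if one exists at `path`; otherwise calibrate and save."""
    if os.path.exists(path):
        engine.load_calibration(path, head_dim=head_dim)
        return "loaded"
    engine.calibrate(model, tok, corpus)
    engine.save_calibration(path)
    return "calibrated"
```

Keep one path per model and per corpus: a calibration measures the eigenstructure of one model's attention heads, so it should not be reused across models.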

## How it works (one paragraph)

For each attention head, calibration accumulates the key and value covariance
matrices and eigendecomposes them. The eigenvectors define a per-head
rotation that aligns coordinates with directions of decreasing variance.
After rotation, a *water-filling* allocator distributes bits across
coordinates so that high-variance dimensions get more bits and tail
dimensions get fewer. Two bit budgets are used: a "semantic" budget
(`avg_bits`) for the high-variance band and a "tail" budget (`noise_bits`,
`value_noise_bits`) for the rest. Each coordinate is quantized with a
Lloyd-Max scalar codebook fit to a Gaussian whose variance equals that
coordinate's eigenvalue. Decode rotates back, dequantizes, and the rest of
attention proceeds at full FP16. The math is in
[`engine.py`](src/spectralquant/engine.py).
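
For intuition about the allocation step, here is the textbook reverse water-filling rule for independent Gaussian coordinates, solved by bisection. It is a sketch of the general technique, not necessarily the allocator in `engine.py`; the function name is hypothetical, and it returns real-valued bit counts that a real quantizer would still have to round to integers.

```python
import math


def water_fill_bits(eigvals, total_bits, iters=100):
    """Reverse water-filling: b_i = max(0, 0.5 * log2(eigval_i / theta)).

    `eigvals` must be positive. The water level theta is chosen so that the
    bit counts sum to `total_bits`.
    """
    def bits_at(theta):
        return [max(0.0, 0.5 * math.log2(v / theta)) for v in eigvals]

    # Bisect on log2(theta): raising the water level lowers the total bit count.
    lo = math.log2(min(eigvals)) - 2.0 * total_bits   # total at lo is >= total_bits
    hi = math.log2(max(eigvals))                      # total at hi is 0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sum(bits_at(2.0 ** mid)) > total_bits:
            lo = mid
        else:
            hi = mid
    return bits_at(2.0 ** hi)


# Toy spectrum: each eigenvalue is a quarter of the previous one.
bits = water_fill_bits([16.0, 4.0, 1.0, 0.25], total_bits=4)
print([round(b, 3) for b in bits])
```

Coordinates whose eigenvalue falls below the water level receive zero bits under this rule, which is the idealized counterpart of giving the tail band only one or two bits.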

## Demo notebook

A full end-to-end notebook is included at
[`notebooks/spectralquant_demo.ipynb`](notebooks/spectralquant_demo.ipynb).
It walks through:

1. Install + GPU sanity check
2. The three presets
3. Loading Mistral 7B
4. Side-by-side FP16 vs SpectralQuant on four diverse prompts, for each preset
5. Power-user override
6. Custom calibration
7. Final summary table
8. Save / load round-trip

To run it on a fresh GPU instance:

```bash
unzip -oq spectralquant.zip -d spectralquant
pip install -e ./spectralquant
jupyter notebook spectralquant/notebooks/spectralquant_demo.ipynb
```

## API surface

```python
sq.SpectralQuant(
    compression="standard" | "high" | "max",
    device=None,                       # "cuda" | "mps" | "cpu" | None (auto)
    head_dim=None,                     # inferred from model
    avg_bits=None, noise_bits=None,
    value_noise_bits=None,
    d_eff_variance=None,
)

engine.generate(model, tokenizer, prompt, *, max_new_tokens=128, ...)
engine.compress_prefill(model, tokenizer, prompt)
engine.calibrate(model, tokenizer, calibration_texts=None)
engine.compression_stats()
engine.save_calibration(path)
engine.load_calibration(path, head_dim=128)
```

The lower-level `sq.SpectralQuantEngine` is also exported for users who want
direct access to per-head bit allocations or to use the legacy
attention-level monkey-patch path.

## Measuring quality

The package reports the following fields in `engine.compression_stats()` and
in the `stats` field returned by `.generate(...)`:

* `ratio`: observed prefix-cache compression vs FP16 (bytes / bytes)
* `tokens_per_second`: measured decode throughput
* `seconds`: wall clock for the decode step
* `compressed_bytes`, `fp16_bytes`: raw byte counts
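
When logging runs, it can help to cross-check the reported ratio against the raw byte counts. The helper below is a hypothetical sketch that assumes `ratio` is `fp16_bytes / compressed_bytes`. The numbers in `example` are made up for illustration; real values come from `engine.compression_stats()`.

```python
def summarize_stats(stats):
    """Recompute the ratio from raw byte counts and format a one-line summary."""
    recomputed = stats["fp16_bytes"] / stats["compressed_bytes"]
    saved_mib = (stats["fp16_bytes"] - stats["compressed_bytes"]) / 2**20
    return (f"{recomputed:.2f}x (reported {stats['ratio']:.2f}x), "
            f"{saved_mib:.1f} MiB saved, "
            f"{stats['tokens_per_second']:.1f} tok/s")


# Made-up numbers, not measurements.
example = {
    "ratio": 6.55,
    "fp16_bytes": 67_108_864,
    "compressed_bytes": 10_245_628,
    "tokens_per_second": 40.0,
    "seconds": 3.0,
}
print(summarize_stats(example))
```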

For independent quality validation you can run perplexity on WikiText:

```bash
python examples/run_perplexity.py --model mistralai/Mistral-7B-Instruct-v0.3
```

Or sweep parameters to find the sweet spot for a model not in our test set:

```bash
python examples/sweep_compression.py --model <hf_repo>
```
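
Perplexity itself is the exponential of the mean negative log-likelihood per token, so comparing an FP16 run with a compressed run reduces to a few lines. The helpers below are a generic sketch and make no claim about how `examples/run_perplexity.py` is written. The log-probabilities are toy values.

```python
import math


def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood (natural log) per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def relative_ppl_increase(ppl_fp16, ppl_compressed):
    """Fractional perplexity change of the compressed run relative to FP16."""
    return (ppl_compressed - ppl_fp16) / ppl_fp16


# Toy log-probs, not measurements: a model that gives every token probability 0.25.
print(perplexity([math.log(0.25)] * 4))
```

A relative increase close to zero on held-out text is a stronger signal than eyeballing generations, because it averages over thousands of tokens.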

## Authors

- Anirudh Bharadwaj Vangara — <anirudh@sentra.app>
- Ashwin Gopinath — <ashwin@sentra.app>

Bug reports, feature requests, and pull requests are welcome on
[GitHub](https://github.com/Dynamis-Labs/spectralquant_package).

## License

MIT.

## Citation

```bibtex
@misc{spectralquant2026,
  title  = {SpectralQuant: Eigenspectral KV Cache Compression},
  author = {Vangara, Anirudh Bharadwaj and Gopinath, Ashwin},
  year   = {2026},
}
```
