Metadata-Version: 2.4
Name: tiny-turboquant
Version: 0.15.0
Summary: Tiny, drop-in Hugging Face KV cache compression using KIVI-style low-bit KV storage.
Author: tiny-turboquant contributors
License: MIT
Project-URL: Homepage, https://github.com/pradeepboopathy/tiny-turboquant
Project-URL: Issues, https://github.com/pradeepboopathy/tiny-turboquant/issues
Keywords: llm,kv-cache,quantization,inference,turboquant,transformers
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch>=2.2.0
Requires-Dist: transformers>=4.40.0
Requires-Dist: scipy>=1.10.0
Requires-Dist: numpy>=1.24.0
Provides-Extra: server
Requires-Dist: uvicorn; extra == "server"
Requires-Dist: fastapi; extra == "server"
Provides-Extra: dev
Requires-Dist: pytest; extra == "dev"
Requires-Dist: pytest-benchmark; extra == "dev"
Dynamic: license-file

# tiny-turboquant

A `transformers.DynamicCache` you can swap in for `past_key_values=` to
shrink KV-cache storage by 3–4× on any HuggingFace causal LM. No
calibration set, no fine-tuning, no model surgery — just construct and
swap.

```python
from tiny_turboquant import TinyKVCache

cache = TinyKVCache(compression="safe")
outputs = model(**inputs, past_key_values=cache, use_cache=True)
```

That's the whole integration. Everything below is optional.

---

## Install

```bash
pip install tiny-turboquant
```

Or from source:

```bash
git clone https://github.com/pradeepboopathy/tiny-turboquant
cd tiny-turboquant && pip install -e .
```

`torch >= 2.2`, `transformers >= 4.40`, `numpy`, `scipy`. Python 3.10+.

## Quick Start

### Drop into any HuggingFace model

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from tiny_turboquant import TinyKVCache
import torch

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-3B-Instruct", dtype=torch.float16, device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B-Instruct")

# Create a compressed cache
cache = TinyKVCache(compression="safe")   # ~3× smaller, near-lossless

# Use it like a normal past_key_values
inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
outputs = model(**inputs, past_key_values=cache, use_cache=True)
```

### Asymmetric K/V (more savings, similar quality)

```python
cache = TinyKVCache(key_bits=4, value_bits=2)   # ~4× smaller
```

### Protect boundary layers (best quality at the same bits)

```python
cache = TinyKVCache.for_model(
    model, compression="balanced", protect_first=2, protect_last=2,
)
```

### Let the library pick bits for you

```python
cache = TinyKVCache.from_pretrained_model(model, tokenizer)
```

Runs one short forward pass to measure K/V dynamic range, then constructs
the cache with the recommended `key_bits`/`value_bits`. Pass
`return_report=True` to also get the `KVNormReport` back.

### Run the inference server

Ships with an OpenAI-compatible HTTP server (FastAPI, SSE streaming).
Point any OpenAI client at it:

```bash
pip install "tiny-turboquant[server]"
tiny-turboquant-server --model Qwen/Qwen2.5-3B-Instruct --compression safe --port 8000
```

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Hello!"}],"max_tokens":100}'
```

Set `"stream": true` in the request body to get token-by-token SSE chunks.

## Knobs that actually move the needle

**Compression preset.** The default knob is a named preset, not a bit
count. Four settings cover every real use case:

```python
TinyKVCache(compression="safe")        # K4/V4   default, best quality (~3.3x)
TinyKVCache(compression="balanced")    # K4/V3   recommended tradeoff
TinyKVCache(compression="compact")     # K3/V3   storage-focused
TinyKVCache(compression="aggressive")  # K3/V2 group=32   experimental
```

Under the hood these map to `(key_bits, value_bits)` of `(4, 4)`,
`(4, 3)`, `(3, 3)`, and `(3, 2)` respectively. If you want to pick the
bits yourself, pass them directly instead of `compression=`:

```python
TinyKVCache(key_bits=4, value_bits=2)
```

The two paths are mutually exclusive — mixing them raises `ValueError`.

**Pick a preset from the model itself.** Keys usually carry more dynamic
range than values in Llama/Qwen/Mistral-family models. The diagnostic runs
one forward pass with hooks on every `k_proj` / `v_proj` and tells you
whether the asymmetric preset will help:

```python
from tiny_turboquant import measure_kv_norm_ratio, recommend_bits

report = measure_kv_norm_ratio(model, tokenizer)

# Quality-first default: safest public recommendation.
rec = recommend_bits(report)  # usually K4/V4

# Tradeoff mode: more compression, validate on your task.
# rec = recommend_bits(report, mode="balanced")  # often K4/V3

cache = TinyKVCache.for_model(
    model,
    key_bits=rec["key_bits"],
    value_bits=rec["value_bits"],
)
```

**Boundary-layer protection.** Embedding-adjacent and logit-adjacent layers
are noticeably more bit-sensitive than the middle. Keeping the outer
layers at FP16 costs little memory and recovers most of the small PPL
delta:

```python
TinyKVCache.for_model(
    model, compression="balanced", protect_first=2, protect_last=2,
)
```

Use `for_model` (not the bare constructor) whenever you set `protect_last`
or pass negative indices — the cache needs to know
`model.config.num_hidden_layers`.

| arg | default | notes |
|---|---|---|
| `compression` | – | preset: `"safe"` \| `"balanced"` \| `"compact"` \| `"aggressive"` |
| `bits` | 4 | (advanced) symmetric bit width, used when no preset |
| `key_bits` / `value_bits` | – | (advanced) per-stream bit widths; override `bits` |
| `protect_first` / `protect_last` | 0 | keep this many boundary layers at FP16 |
| `protected_layers` | – | explicit layer indices (negatives allowed via `for_model`) |

## Command-line tools

Three ship with the package — all thin wrappers; the actual work happens
in `TinyKVCache`.

```bash
# OpenAI-compatible HTTP server (FastAPI + SSE streaming)
pip install "tiny-turboquant[server]"
tiny-turboquant-server --model Qwen/Qwen2.5-3B-Instruct --compression safe

# Sliding-window WikiText-2 perplexity, writes a JSON report
tiny-turboquant-eval --model Qwen/Qwen2.5-3B-Instruct --key-bits 4 --value-bits 2

# Token-by-token agreement A/B against stock DynamicCache
tiny-turboquant-validate --model Qwen/Qwen2.5-3B-Instruct --compression balanced
```

Server endpoints: `POST /v1/chat/completions` (supports `"stream": true` SSE),
`GET /v1/models`, `GET /health`.

## The algorithm in three lines

```
1. for keys:   per-channel asymmetric affine quant (min/max along token axis)
               within a block of 128 tokens — absorbs outlier channels
2. for values: per-token asymmetric affine quant (min/max along channel axis)
               within the same block — matches the granularity attention
               actually reads at
3. keep the most recent N tokens (default 128) uncompressed in an FP16
               residual window
```

This matches the KIVI-2 scheme. Earlier versions used random-rotation +
a closed-form Beta-distribution codebook from the TurboQuant paper; that
path was elegant but assumed a marginal distribution real attention
streams don't have, and degraded badly at low bits. Empirical per-channel
scaling is both simpler and substantially more accurate on real models.

## Measured quality

WikiText-2-raw test split, sliding window 2048, prompt 5230 tokens. Strictly
monotone Pareto frontier across both model sizes:

| preset | bits | 3B ratio / ΔPPL | 0.5B ratio / ΔPPL | status |
|---|---|---|---|---|
| `safe`       | K4/V4         | 3.34x / +0.07 | 3.26x / +0.23 | default |
| `balanced`   | K4/V3         | 3.71x / +0.24 | 3.61x / +0.59 | recommended tradeoff |
| `compact`    | K3/V3         | 4.18x / +0.45 | 4.05x / +1.41 | storage-focused |
| `aggressive` | K3/V2 gs=32   | 4.31x / +0.81 | 4.31x / +2.80 | experimental |

`aggressive` applies **grouped V quantization** (`value_group_size=32`):
each token's value vector is split into `head_dim / 32` groups and each
group gets its own scale/zero pair. This is what lets 2-bit V stay
usable. To opt out, pass `value_group_size=None`.

`safe` is near-lossless and scales with model size — the PPL gap shrinks
from +0.23 on 0.5B to +0.07 on 3B. Reach for `balanced` when you want
more compression with mild quality cost; `compact` when storage matters
more than a small PPL bump; `aggressive` only after validating on your
task (the 0.5B numbers warn that quality cliffs exist on small models).

The MSE-only variant (`TinyQuantizer`) is what the cache uses. A two-stage
MSE + QJL variant (`TinyQuantizerIP`) ships for completeness but is
deprecated for attention — softmax exponentially amplifies the
JL-projection noise.

## CUDA fast path

A fused dequant-then-attention CUDA kernel is planned for the new
per-channel-K / per-token-V layout. The kernel in `cuda/` was written
against the old rotation-based path and is not used by the current
cache; dequant runs in PyTorch on the read path until the rewrite
lands.

## Where it earns its keep, where it doesn't

Useful when KV memory is the bottleneck — long contexts on a single GPU,
many concurrent serving sessions, or running one model size larger by
buying back VRAM from the cache. Not useful for short contexts (< 1k
tokens; the cache is already small), for hybrid / recurrent architectures
that don't keep a standard KV cache (Mamba, RWKV), or for tasks that need
bit-exact reproducibility.

## References

- KIVI: [A Tuning-Free Asymmetric 2bit Quantization for KV Cache](https://arxiv.org/abs/2402.02750)
  (Liu et al., ICML 2024). The current cache uses the KIVI-2 layout:
  per-channel K, per-token V.
- TurboQuant: [Online Vector Quantization with Near-optimal Distortion Rate](https://arxiv.org/abs/2504.19874)
  (Zandieh et al., ICLR 2026). Earlier versions of this package
  implemented the rotation + Beta-codebook scheme from this paper;
  it ships as `TinyQuantizer` for standalone vector quantization but
  is no longer used by `TinyKVCache`.

Architecture deep-dive: [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md).

## License

MIT — see [LICENSE](LICENSE).
