Metadata-Version: 2.4
Name: turbo-attn
Version: 0.1.1
Summary: Optimized CUDAgraph-enabled kernels and attention backend for vLLM, SGLang and more based on TurboQuant near-lossless KV cache compression. SOTA performance with Gemma 4, Qwen 3.6 and other modern LLMs.
Author-email: "Dmitri Evseev (Arbi City)" <dmitri.evseev@arbi.city>
Maintainer-email: "Dmitri Evseev (Arbi City)" <dmitri.evseev@arbi.city>
License: MPL-2.0
Project-URL: Homepage, https://github.com/arbi-dev/turbo_attn
Project-URL: Repository, https://github.com/arbi-dev/turbo_attn
Project-URL: Issues, https://github.com/arbi-dev/turbo_attn/issues
Project-URL: Changelog, https://github.com/arbi-dev/turbo_attn/blob/main/CHANGELOG.md
Keywords: kv-cache,quantization,vllm,sglang,flash-attention,llm-inference,transformers,cuda
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Mozilla Public License 2.0 (MPL 2.0)
Classifier: Operating System :: POSIX :: Linux
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
License-File: NOTICE
Requires-Dist: torch>=2.1
Provides-Extra: codebook
Requires-Dist: scipy>=1.10; extra == "codebook"
Provides-Extra: triton
Requires-Dist: triton>=3.0; extra == "triton"
Provides-Extra: flashinfer
Requires-Dist: flashinfer>=0.6; extra == "flashinfer"
Provides-Extra: vllm
Requires-Dist: vllm>=0.19; extra == "vllm"
Provides-Extra: flash-attn
Requires-Dist: flash-attn>=2.5; extra == "flash-attn"
Provides-Extra: eval
Requires-Dist: lm-eval>=0.4.5; extra == "eval"
Requires-Dist: ray; extra == "eval"
Requires-Dist: datasets; extra == "eval"
Requires-Dist: langdetect; extra == "eval"
Requires-Dist: immutabledict; extra == "eval"
Requires-Dist: nltk; extra == "eval"
Requires-Dist: sacrebleu; extra == "eval"
Requires-Dist: absl-py; extra == "eval"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: scipy>=1.10; extra == "dev"
Requires-Dist: triton>=3.0; extra == "dev"
Provides-Extra: all
Requires-Dist: turbo-attn[codebook,eval,flash-attn,flashinfer,triton,vllm]; extra == "all"
Dynamic: license-file

# Turbo Attention

[![CI](https://github.com/arbi-dev/turbo_attn/actions/workflows/ci.yml/badge.svg)](https://github.com/arbi-dev/turbo_attn/actions/workflows/ci.yml)
[![PyPI](https://img.shields.io/pypi/v/turbo-attn.svg)](https://pypi.org/project/turbo-attn/)
[![License](https://img.shields.io/badge/license-MPL--2.0-blue.svg)](LICENSE)

**A modular attention backend for vLLM and SGLang. Custom CUDA kernels,
full CUDAGraph capture, asymmetric K/V quantization, hybrid-model
support. 3.8× more KV context on the same GPU, under 1% accuracy loss.
Built on FlashAttention.**

PyPI: `turbo-attn` · Import: `tqkv` · License: MPL-2.0

## Install

```bash
pip install turbo-attn                  # codec + CUDA/Triton kernels
pip install "turbo-attn[vllm]"          # + vLLM attention backend
pip install "turbo-attn[all]"           # + SGLang, FlashInfer, flash-attn, eval harness
```

## Quickstart

```python
import torch
from tqkv import TurboKVCodec

codec = TurboKVCodec(head_dim=128, bit_width=4)
keys = torch.randn(8, 128)

packed, norms = codec.compress_k(keys)        # 4096 B → 512 B + 16 B
recon = codec.decompress_k(packed, norms)
```

Serve a model end-to-end:

```bash
# vLLM
vllm serve Qwen/Qwen3.5-0.8B \
    --kv-cache-dtype tqkv --attention-backend custom \
    --max-model-len 250000 --trust-remote-code

# SGLang (after `import tqkv.integrations.sglang as t; t.register()`)
python -m sglang.launch_server --model-path Qwen/Qwen3.5-0.8B \
    --kv-cache-dtype tqkv --attention-backend tqkv --trust-remote-code
```

Full walkthroughs:
- [vLLM quickstart](docs/tutorials/vllm_quickstart.md)
- [SGLang quickstart](docs/tutorials/sglang_quickstart.md)
- [examples/](examples/) — runnable Python snippets
- [ARCHITECTURE.md](ARCHITECTURE.md) — codebase tour for contributors
- [docs/public_api.md](docs/public_api.md) — supported API surface

## Repo layout

The supported public surface lives at the top of each tree:

- `tqkv/` — the package (codec, kernels, runtime, vLLM/SGLang plugins, calibration pipeline).
- `docs/`, `docker/`, `scripts/`, `experiments/` — public docs, deploy recipes, helper scripts, research notes.
- Anything under an `internal/` subdirectory (`docs/internal/`, `docker/internal/`, `scripts/internal/`, `experiments/internal/`) is engineering-only and unsupported — it may move or break between releases. The wheel never ships these.

---

## Why this exists

KV cache memory grows linearly with context length and dominates GPU memory beyond ~32K tokens. Google's TurboQuant (Zandieh et al., ICLR 2026) solved this in principle: a near-optimal KV compression scheme using Walsh–Hadamard rotation and Lloyd–Max codebooks, with provable distortion bounds. Multiple open-source references have appeared since the paper — but none ship production CUDA kernels, full CUDAGraph capture, or validation on the new generation of hybrid state-space / attention architectures.

**turbo_attn is the implementation that actually ships in production.** Custom CUDA throughout. Full `FULL_AND_PIECEWISE` CUDAGraph capture on prefill and decode. Asymmetric K/V bit widths across all nine `{2,4,8}²` combinations. Per-group block pools so attention-plus-Mamba models don't waste the memory they save. Drop-in backends for both vLLM and SGLang.

## Headline results

**Qwen3.5-27B-AWQ on 2× RTX 4090 (TP=2, TQ4 KV, MTP=3):**[^27b]

| Metric | BF16 KV | turbo_attn (TQ4) |
|---|---|---|
| KV capacity | 370K tokens | **1,360K tokens (3.7×)** |
| Throughput (1 user) | 131 tok/s | 117 tok/s (89%) |
| Throughput (8 users) | — | **355 tok/s** |
| Throughput (128 users) | — | **1,393 tok/s** |
| TTFT | 58 ms | 78 ms |
| Needle @ 99K | — | **FOUND** |

[^27b]: Measured on an internal AWQ-INT4 build of Qwen3.5-27B; the artifact itself is not on the public HF Hub. The methodology, flags, and harness are public — numbers reproduce on any AWQ-quantized 27B-class model that fits the same TP=2 / 2× RTX 4090 layout.

**Qwen3.5-0.8B at 250K context (single RTX 4090):**

| Config | Prefill tok/s | Decode tok/s | Needle @ 250K | Compression |
|---|---|---|---|---|
| BF16 KV | 17,908 | 336 | OOM at 64K | 1.0× |
| **TQ4** | **17,347** | **354** | **FOUND** | **3.8×** |
| TQ3 | 17,499 | 296 | FOUND | 4.9× |
| TQ2 | 17,614 | 314 | MISS (found @ 100K) | 7.1× |

**TQ4 decode is 5% faster than BF16** at long context — compressed KV reads less memory, and the dequant overhead is smaller than the bandwidth savings.

**Perplexity (Wikitext-2, Qwen3.5-0.8B):** Measured via vLLM `prompt_logprobs` on the production prefill kernel. Additional models and context lengths in progress.

| Config | PPL | Δ vs BF16 | (Δ%) |
|---|---|---|---|
| BF16 | 24.61 | 0.000 | (0.00%) |
| TQ8 (K8V8) | 24.59 | −0.02 | (−0.08%) |
| **TQ4 (K4V4)** | **24.87** | **+0.27** | **(+1.08%)** |
| TQ3 | 25.67 | +1.07 | (+4.33%) |
| TQ2 (K2V2) | 27.87 | +3.27 | (+13.28%) |
| K4V8 | 24.62 | +0.01 | (+0.06%) |
| K8V4 | 24.81 | +0.20 | (+0.81%) |
| K2V4 | 25.50 | +0.90 | (+3.65%) |
| K6V3 | 25.52 | +0.91 | (+3.71%) |
| K8V2 | 27.42 | +2.81 | (+11.43%) |
| K4V2 | 27.40 | +2.79 | (+11.35%) |

<sub>Measured via vLLM `prompt_logprobs` on the production turbo_attn prefill kernel (not the experimental HF reference path). 8 chunks × 512 wikitext-2 tokens, subprocess-isolated. SEM ≈ ±2 PPL across chunks.</sub>

## Reproduce the headline numbers

Every number in this README is reproducible from `benchmarks/run_all.py`:

```bash
pip install turbo-attn
python benchmarks/run_all.py --profile smoke   # ~2 minutes, sanity check
python benchmarks/run_all.py --profile quick   # ~8 minutes, headline numbers
```

Output lands in `benchmarks/results/<timestamp>/` with `summary.md` (human-readable table), `summary.json` and per-suite CSVs (`perplexity.csv`, `throughput.csv`, `memory.csv`), per-subprocess logs under `logs/`, and a `system.json` stamp recording GPU, driver, CUDA, package versions, and git SHA.

Reference numbers we publish against are committed under [`benchmarks/baselines/qwen3.5-0.8b/`](benchmarks/baselines/qwen3.5-0.8b/) and [`touchstone/baselines/`](touchstone/baselines/). To stand up a local server for ad-hoc prompts, see [`examples/02_vllm_server.py`](examples/02_vllm_server.py).

### Hardware tested

| GPU class | SM | Status |
|---|---|---|
| RTX 4090 / L40 / Ada | 89 | tested in CI and on the headline benches |
| A100 / Ampere | 80, 86 | builds; not yet validated end-to-end |
| H100 / Hopper | 90 | builds; FP8 attention not yet validated |
| B200 / Blackwell | 100 | not yet ported |

## What turbo_attn is — and isn't

turbo_attn is an **attention backend** in the vLLM/SGLang sense: it owns the KV cache layout and the kernels that read it. Everything outside attention is untouched and orthogonal.

### Layers: who owns what

| Layer | Owner | turbo_attn's role |
|---|---|---|
| Tokenizer, sampler, guided decoding, LoRA | vLLM | none |
| Request scheduler, batching, chunked prefill, prefix caching | vLLM | we consume what it sends |
| Paged block allocator | vLLM | we register our per-token byte budget |
| **KV cache on-wire format** | **turbo_attn** | rotate + quantize + bitpack into paged blocks |
| **Attention compute (decode)** | **turbo_attn** | fused CUDA kernel, end-to-end |
| **Attention compute (prefill)** | **turbo_attn** | one of three paths; see below |
| CUDAGraph capture orchestration | vLLM | we declare capturable modes; vLLM captures |
| Hybrid-model block dispatch | vLLM + our per-group `BlockPool` | we ensure compressed pages don't inflate |
| Weight loading (BF16 / AWQ / GPTQ / FP8) | vLLM + quant libs | orthogonal; composes cleanly |
| TP/PP collectives | vLLM | orthogonal; we run per-rank |

### Attention compute by run mode

| Run mode | KV storage | KV load + dequant | Q·K | softmax | P·V | output |
|---|---|---|---|---|---|---|
| **Decode (fused CUDA)** | ours (compressed) | **ours** | **ours** | **ours** (online) | **ours** | **ours** |
| **Decode split-K** (batch=1) | ours | **ours** | **ours** | **ours** (2-phase) | **ours** | **ours** |
| **MTP verification (BLOCK_M 2–8)** | ours | **ours** | **ours** | **ours** | **ours** | **ours** |
| **Prefill — FA4 inline-dequant** | ours | **ours** (subclass override) | FA4 | FA4 | FA4 | FA4 |
| **Prefill — CUDA C++ (v9)** | ours | **ours** | **ours** (`mma.sync`) | **ours** | **ours** (`mma.sync`) | **ours** |
| **Prefill — decompress + stock FA** | ours → BF16 scratch | **ours** (Triton) | stock FA | stock FA | stock FA | stock FA |

Stock FlashAttention is never called on compressed bytes. When it runs, it runs unmodified on a decompressed scratch buffer.

## What's novel

**1. Our own warp-fused compress kernel.** Walsh–Hadamard rotation via warp butterfly (`__shfl_xor_sync`, five shuffle stages), quantization, and bit-packing — all in a single CUDA launch. Measured 5× faster than a Triton baseline (~7 μs vs ~35 μs per layer at batch 1 on RTX 4090), contributing ~10% end-to-end decode speedup.

**2. Unified fused decode kernel, templated on `BLOCK_M`.** Single source handles ordinary decode (`BLOCK_M=1`), speculative decoding verification (`BLOCK_M=2..8`), and asymmetric K/V bit widths (all nine `{2,4,8}²` combinations). NCU-profiled: 40 registers, zero spills, at the per-SM performance ceiling. CUDAGraph-safe. MTP verification is a native fused path, not a fallback.

**3. FA4 with inline dequantization (novel).** A CuTeDSL subclass of FlashAttention-4 that overrides the `load_K` and `load_V` stages to dequantize compressed bytes directly into register tiles during the MMA pipeline. No decompress buffer. Shared-memory pressure actually *decreases* because the staging region for packed bytes is ≤1/8 the size of the BF16 tile it replaces. Generalized over all nine `{2,4,8}²` K/V configurations. Intended as an upstream contribution to FlashAttention.

**4. Production CUDA C++ prefill kernel.** XOR shared-memory swizzle (`col ^ ((row & 7) << 3)` — 1.75× improvement), staged `cp.async.cg` loads, `ldmatrix.sync.aligned.m8n8.x4` fragment loads, `mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32` tensor cores, register-held query tiles, warp-shuffle softmax reductions, `pack_gqa` query layout. Templated on K/V bit widths and head dimensions. 128–200 registers, zero spills, 2 CTAs/SM.

**5. Full `FULL_AND_PIECEWISE` CUDAGraph capture.** Decode captured as a full graph, prefill captured piecewise per layer. Both the FA4 subclass and the CUDA C++ prefill kernel are allocation-free and capture-safe. This is the bleeding-edge vLLM CUDAGraph mode; most backends don't use it yet.

**6. Per-group `BlockPool` for hybrid models.** Attention-plus-Mamba/GatedDeltaNet/ShortConv models suffer from block-pool inflation in vLLM's default allocator, because attention pages get padded to the Mamba state page size. With TQ4 compression this waste becomes catastrophic (effective capacity drops to 2–3%). The fork carries a per-layer-group block pool that lets each layer type use its natural page size. Model-agnostic; benefits any hybrid architecture.

**7. Asymmetric K/V bit widths.** All nine combinations of `{2,4,8}²` in both decode and prefill, configured independently via `TQKV_K_BITS` and `TQKV_V_BITS`.

The **optimal asymmetric K/V configuration is model-specific**. On Qwen3.5-0.8B (hybrid attention + GatedDeltaNet), value precision matters more — K4V8 is essentially lossless (+0.06% PPL) while K8V4 is +0.81%. On other architectures with smaller head dimensions and more attention layers (e.g. Qwen3-0.6B, head_dim=64, 28 attention layers), key precision dominates — K8V4 is essentially lossless there while symmetric TQ4 diverges. The direction depends on which side of the model has wider activation distributions, not a universal rule. We recommend measuring per architecture with the included `kv_norm_profile.py` diagnostic and the `tier2_wikitext_quick.py` sweep harness.

**8. Sliding-window attention support.** Per-layer `SlidingWindowSpec` propagates window size into the decode kernel mask. Works on Gemma, Mistral variants, and any model that declares sliding-window layers.

**9. Native MTP / speculative decoding.** `BLOCK_M>1` verification uses the same unified decode kernel, amortizing the KV read across all verified tokens. No decompress-and-fallback path. MTP=3 reaches ~96% of BF16 speed on Qwen3.5-27B at TQ4 with 3.7× the KV capacity.

**10. Orthogonal to weight quantization.** Composes cleanly with AWQ, GPTQ, FP8, and bitsandbytes. Validated on Qwen3.5-27B-AWQ-4bit at 170K context, TP=2.

## Supported models

| Architecture | Type | Status | Notes |
|---|---|---|---|
| Qwen3.5 (0.8B, 27B) | Hybrid (full-attn + GatedDeltaNet) | ✅ Primary | 27B via AWQ-4bit, TP=2, 1.36M context validated |
| Qwen3.6-MoE | Hybrid MoE (full-attn + GatedDeltaNet + experts) | ✅ Validated | TP=2, 2.94× KV capacity (888k vs 302k BF16); throughput parity warm |
| LFM2-8B-A1B | Hybrid MoE (full-attn + ShortConv + 32 experts) | ✅ Validated | Needle 11/12 (matches BF16), PPL within 4% |
| Gemma, Mistral (sliding window) | Dense w/ SWA | ✅ Backend supports | Spot-tested |
| Llama-3, Mixtral, DeepSeek-V3, Command-R+ | Dense / MoE | 🗺️ Roadmap | Expected to work; not yet explicitly validated |
| Pure Mamba / SSM | State-space | N/A | Mamba has no KV to compress |

turbo_attn is architecture-neutral: any model vLLM can serve with a paged KV cache can use it. The tested list reflects what's been benchmarked, not what works.

## How TurboQuant works

1. **Rotate** each KV vector with a fast Walsh–Hadamard transform. Raw KV values have uneven distributions; after rotation every coordinate follows the same Gaussian.
2. **Normalize** — store the vector's magnitude as a single BF16 value.
3. **Quantize** each coordinate to a shared codebook (4 values for TQ2, 16 for TQ4, up to 256 for TQ8). Because the rotated distribution is uniform across coordinates, a single shared codebook achieves near-optimal distortion.

At TQ4, head_dim=128: 64 packed bytes + 4 norm bytes = 68 B vs 256 B at BF16 = **3.8× compression**. The same arithmetic extends to all supported widths: 2-bit → 7.1×, 3-bit → 4.9×, 8-bit → 1.9× (lossless).

Attention scores computed on rotated KV are bit-identical to attention on unrotated KV, provided the query is rotated by the same matrix before the dot product. We pre-rotate the query once per request and compute everything in the rotated space, never decompressing into the original coordinate system.

## Architecture

### Decode path

One fused CUDA kernel (`turbo_attn/kernels/_cuda_decode_unified.cu`) handles all decode math: unpacks nibbles from compressed KV pages, looks up centroids from a shared-memory codebook, multiplies by per-token norms, computes Q·K and P·V with online softmax, and writes outputs — in a single pass. No decompress buffer, no external kernel calls. Templated on `HEAD_DIM`, `GQA_RATIO`, `BLOCK_M`, `TQ_K_BIT_WIDTH`, and `TQ_V_BIT_WIDTH`; the compiler generates one variant per tuple. Split-K variant available for `batch=1` workloads where grid utilization is the bottleneck.

### Prefill: three paths, `FULL_AND_PIECEWISE` capture

All three prefill paths run at approximately BF16-FlashAttention speed on tested workloads.

- **Path A — FA4 inline-dequant (default).** CuTeDSL subclass of FlashAttention-4 that overrides K/V loads to dequantize directly into register tiles. No decompress buffer. CUDAGraph-captured via piecewise mode.
- **Path B — CUDA C++ prod kernel (`TQKV_PREFILL_ENGINE=cuda`).** Hand-written kernel with XOR swizzle, `cp.async`, `ldmatrix`, and `mma.sync`. Also piecewise-captured.
- **Path C — Decompress + stock FA (`=decompress`).** Triton decompress into a small scratch buffer, then stock FlashAttention. Fallback for maximal compatibility and for very long chunks where fused dequant overhead compounds. Not CUDAGraph-captured.

An adaptive dispatcher (`=adaptive`) picks Path A for short chunks and Path C for chunks above a hand-tuned crossover. Adaptive is an **explicit opt-in** — it is never auto-selected, because silent switches between engines make perf numbers incoherent and can hide fa4 regressions in benchmarks. Use `fa4` (the default) for stable measurement.

### CUDAGraph mode

The backend requests `CUDAGraphMode.FULL_AND_PIECEWISE` when the prefill engine is `fa4` (the default) or `adaptive` (explicit opt-in). Decode runs as a full captured graph; prefill runs as per-layer piecewise-compiled graphs. The FA4 subclass and the CUDA C++ prefill kernel are both allocation-free and fully compatible with this mode. Any other prefill engine (`decompress`, `triton`, `cuda`) falls back to `FULL_DECODE_ONLY`.

### Compress kernel

Our warp-fused CUDA compress kernel runs on every KV write. One warp per (token, head) pair performs the Walsh–Hadamard rotation via five `__shfl_xor_sync` butterfly stages, reduces the norm via `__shfl_down_sync`, scalar-quantizes against the codebook, bit-packs indices into bytes, and scatters directly to the block offset — all in a single kernel. ~7 μs per layer at batch 1, vs ~35 μs for a Triton baseline.

## Install and use

Full end-to-end walkthroughs for each engine:

- **[vLLM quickstart](docs/tutorials/vllm_quickstart.md)** — install the vLLM fork, serve a model, query it, validate compression.
- **[SGLang quickstart](docs/tutorials/sglang_quickstart.md)** — same flow via SGLang's attention-backend plugin.

### vLLM

For vLLM serving, install the [vllm fork](https://github.com/arbi-dev/vllm) (`turbo-attn` branch). The fork is a thin overlay on `vllm/vllm-openai:v0.19.0` that wires the `"tqkv"` `kv_cache_dtype` through vLLM's config system and adds per-group block-pool bookkeeping for hybrid models. The actual backend (~2000 lines) lives in this package, not the fork; the full per-file breakdown is in [`docker/PATCHES.md`](docker/PATCHES.md). The "Why a vLLM fork (for now)" section below explains why this fork is required today.

Two flags are required for vLLM serving with TQKV:

- `--kv-cache-dtype tqkv` — selects the compressed KV layout
- `--attention-backend custom` — routes attention through the registered TQKV backend

```bash
# 4-bit KV on Qwen3.5-0.8B, single GPU, up to 250K context
vllm serve Qwen/Qwen3.5-0.8B \
  --kv-cache-dtype tqkv \
  --attention-backend custom

# 27B on 2 GPUs, TP=2, TQ4, with MTP-3 speculative decoding
TQKV_BITS=4 vllm serve /path/to/Qwen3.5-27B-AWQ-4bit \
  --kv-cache-dtype tqkv \
  --attention-backend custom \
  --tensor-parallel-size 2 \
  --language-model-only \
  --max-model-len 170000 \
  --max-num-batched-tokens 8192 \
  --gpu-memory-utilization 0.92 \
  --speculative-config '{"method":"mtp","num_speculative_tokens":3}'

# Asymmetric K/V: 2-bit keys, 4-bit values (optimal direction is model-specific — measure with tier2_wikitext_quick.py)
TQKV_K_BITS=2 TQKV_V_BITS=4 vllm serve Qwen/Qwen3.5-0.8B \
  --kv-cache-dtype tqkv \
  --attention-backend custom
```

#### Recommended flags for production performance

TQKV's headline numbers (capacity, TTFT, throughput) assume a few non-default scheduler and compilation settings. The plugin does **not** set these for you — pass them explicitly so what you configure is what you get:

| Flag | Recommended value | Why |
|---|---|---|
| `--max-num-batched-tokens` | `16384` | TQKV's prefill bypass is chunk-size-agnostic; lifting MNBT lets long prompts land in fewer kernel launches → markedly better TTFT at ≥16k prompts |
| `--max-num-seqs` | `<N>` (auto-size) | Without this, the scheduler caps concurrent requests at the bf16 cap and hides TQKV's 3–4× capacity gain. Set so that `N × max_model_len` ≈ your KV budget |
| `--compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE"}'` | as shown | TQKV's FA4 prefill kernel is CUDAGraph-safe; capturing it removes Python launch overhead on prefill |

```bash
vllm serve Qwen/Qwen3.5-0.8B \
  --kv-cache-dtype tqkv \
  --attention-backend custom \
  --max-num-batched-tokens 16384 \
  --max-num-seqs 256 \
  --compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE"}'
```

Sizing `--max-num-seqs`: a fast estimate is `N ≈ (kv_budget_bytes) / (per_token_bytes × max_model_len)` where `per_token_bytes` for K4V4 is roughly `num_layers × num_kv_heads × head_size × 2` ÷ 4. Bench at increasing values and watch for OOM at request peak.

#### MLA models (DeepSeek V2/V3)

MLA layers need a separate wrapper backend that compresses the shared-KV slot. Set `TQKV_MLA_ENABLE=1` to enable; without it, MLA models will fail at load with "no MLA backend supports tqkv":

```bash
TQKV_MLA_ENABLE=1 vllm serve deepseek-ai/DeepSeek-V2-Lite-Chat \
  --kv-cache-dtype tqkv \
  --attention-backend custom
```

### SGLang

SGLang integration is plugin-based — no fork required. `tqkv.integrations.sglang.register()` installs a pool-factory and wires the `tqkv` attention backend into SGLang's registry. Call it once before SGLang reads its attention-backend registry:

```python
# In your launch script, before SGLang imports its backend registry:
import tqkv.integrations.sglang as tqkv_sglang
tqkv_sglang.register()

# Then launch SGLang as usual with:
#   --kv-cache-dtype tqkv --attention-backend tqkv
```

```bash
TQKV_BITS=4 python -m sglang.launch_server \
  --model-path Qwen/Qwen3.5-0.8B \
  --kv-cache-dtype tqkv \
  --attention-backend tqkv
```

### Standalone codec (no serving engine)

```python
from tqkv import TurboKVCodec

codec = TurboKVCodec(head_dim=128, bit_width=4, device="cuda")
k_packed, k_norms = codec.compress_k(key_vectors)
k_recon = codec.decompress_k(k_packed, k_norms)

# Pre-rotation trick (fused attention without decompress buffer)
q_rotated = codec.rotate_query(query, scale=1/math.sqrt(head_dim))
output = codec.unrotate_output(raw_output)
```

### HuggingFace Transformers (reference path)

```python
from tqkv.hf_cache import TQKVCache
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3.5-0.8B").cuda()
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-0.8B")
cache = TQKVCache(bit_width=4)
cache.init_from_model(model, tok)

out = model.generate(
    **tok("Explain entropy in one sentence.", return_tensors="pt").to("cuda"),
    past_key_values=cache, max_new_tokens=64,
)
```

Note: the HF path is a reference implementation used by the perplexity harness. For production serving, use vLLM or SGLang.

## Benchmarking

Accuracy and throughput claims that require running on raw weights cannot be tested through our demo API — you have to install the backend and run benchmarks locally. We ship a reproducible harness and publish our own numbers, and we strongly encourage independent verification.

**Accuracy / quality** (run by us, scripts in `benchmarks/`):

- **Perplexity** on Wikitext-2 and PG-19 (long-context PPL)
- **[RULER](https://github.com/NVIDIA/RULER)** — NVIDIA's long-context benchmark, 13 tasks (multi-key NIAH, variable tracking, aggregation, QA) at 4K–128K+. The long-context standard.
- **[LongBench-v2](https://github.com/THUDM/LongBench)** — realistic long-context QA and reasoning
- **MMLU 5-shot, GSM8K, TruthfulQA, HellaSwag** via `lm-eval-harness` — short-context retention
- **Needle-in-a-haystack** at 100K and 250K — quick sanity check (not a primary metric)

**Throughput / latency:**

- **vLLM `benchmark_serving.py`** with ShareGPT workload at batch {1, 8, 32, 128}
- **TTFT / TPOT distributions** under Poisson arrivals at QPS {1, 4, 16}
- **Max context at fixed GPU memory** — GB per 1K tokens

Reproduction scripts and exact commit hashes live in `benchmarks/`. All of them print CSV-formatted results to `benchmarks/results/<timestamp>/` so you can diff them against ours.

```bash
# Everything the README claims, reproducible end-to-end
python benchmarks/run_all.py --profile full

# Individual suites
python benchmarks/perplexity/run.py --model Qwen/Qwen3.5-0.8B --bits 4
python benchmarks/ruler/run.py      --model Qwen/Qwen3.5-0.8B --bits 4 --ctx 32768,65536,131072
python benchmarks/vllm/run_serving.py --model Qwen/Qwen3.5-0.8B --bits 4 --workload sharegpt
```

## Configuration

All runtime configuration is done through `TQKV_`-prefixed environment
variables. The full surface is below, organised by category. Anything not
listed here is internal and may change without notice.

### Bit width and calibration

| Variable | Default | Description |
|---|---|---|
| `TQKV_BITS` | `4` | Symmetric K/V bit width (2–8). Falls through to `TQKV_K_BITS`/`TQKV_V_BITS` when those are unset. |
| `TQKV_K_BITS` / `TQKV_V_BITS` | inherits `TQKV_BITS` | Asymmetric K/V override (e.g. `K2/V4`). |
| `TQKV_LAYER_BITS` | `""` | Per-layer override string (e.g. `0:8,8;5:2,4`). Layer index → `(k_bits, v_bits)`. |
| `TQKV_CALIBRATION_FILE` | `""` | Path to a calibration JSON bundle produced by `python -m tqkv.auto_calibrate`. |
| `TQKV_ALLOCATION_FILE` | `""` | Path to a per-layer bit-allocation file produced by `python -m tqkv.calibration.solve_bits`. |
| `TQKV_AUTO_CALIBRATE_MODEL` | `""` | Model path for plugin-side auto-calibration. Triggers calibration on first init when set. |
| `TQKV_CALIBRATION_CACHE` | `""` | Directory used by auto-calibration to cache intermediate artefacts. |
| `TQKV_PROFILE` | `none` | Calibration profile from the bundle: `lossless`, `balanced`, `aggressive`. |

### Engine selection

| Variable | Default | Description |
|---|---|---|
| `TQKV_ENGINE` | `""` (auto) | Decode engine: `native_tq` (CUDA SIMT), `flash_attn` (decompress + FA), or `bypass`. Empty = auto-select. |
| `TQKV_PREFILL_ENGINE` | `fa4` | Prefill path: `fa4`, `triton`, `decompress`, or `adaptive`. See "Prefill: three paths" above. |
| `TQKV_PREFILL_BYPASS` | `1` | First-chunk prefill bypass — skip codec on prompt-prefill, then re-rotate to TQ basis for decode. |
| `TQKV_BYPASS_INLINE` | `0` | Inline the bypass logic in the runtime hot path (perf experiment). |
| `TQKV_FUSE_QROT` | `""` (auto) | Fused Q-rotation prologue: `on` / `off` / `auto`. Decode-only. |
| `TQKV_O_PROJ_FOLD` | `on` | Fold `rotate_output` into the `o_proj` weights. Default-on, universal across attention layer classes. |
| `TQKV_MTP_SPLITK` | `1` | Use split-K decode kernel for MTP layers. |
| `TQKV_DECODE_SPLITS` | `""` (autotune) | Force decode-kernel split count. Empty = autotuned. |
| `TQKV_MTP_AUTOTUNE_SPLITS` | `1` | Run the autotune loop for MTP split-K. |
| `TQKV_MTP_DECODE_SPLITS` | `""` | MTP decode split count override. |
| `TQKV_KV_TILE_TOKENS` | `""` | KV tile size in tokens (autotuner override). |
| `TQKV_REFERENCE_SEQ_LEN` | `""` | Reference sequence length used by autotune heuristics. |

### Backend behaviour

| Variable | Default | Description |
|---|---|---|
| `TQKV_NO_JIT` | `0` | Fail if a kernel variant is not pre-compiled (strict-mode for production images). |
| `TQKV_K_NC` | `1` | Apply norm-correction to K reads in the dequant path. |
| `TQKV_DISABLE_PRESCALE` | `0` | Disable per-channel pre-scaling on compress upload (debug). |
| `TQKV_DISABLE_HYBRID_REVIEW` | `0` | Disable the hybrid-model (Mamba/GDN) cache-budget review pass. |
| `TQKV_STRICT_NO_SDPA` | `0` | When `1`, raise `RuntimeError` instead of taking the `D>256` SDPA Python-loop fallback (~1000× slower than FA4/cuda-v9). Recommended for Gemma 4 31B and other `head_dim>256` deployments. |
| `TQKV_SIMULATE_NO_FA4` | `0` | Test fallback path by pretending FA4 isn't available. |
| `TQKV_LOG_HOIST` | `""` | When set, log per-layer-builder hoist diagnostics (debug). |

### FA4 prefill scheduler (advanced)

| Variable | Default | Description |
|---|---|---|
| `TQKV_FA4_SPLIT_D` | auto | Head-dim splitting in the FA4 split-D kernel. `1`/`0` to override the auto-detect. |
| `TQKV_FA4_LPT` | `auto` | Longest-processing-time scheduling: `0`, `1`, `auto`. Auto = on for causal-or-local. |
| `TQKV_FA4_HEAD_SWIZZLE` | `1` | Head-axis tile swizzle for L2-locality. Most impactful for Gemma `H_kv=16`. |

### MLA (DeepSeek V2/V3/V4)

| Variable | Default | Description |
|---|---|---|
| `TQKV_MLA_ENABLE` | `0` | Master switch for the MLA backend. When `0`, MLA models route to vLLM's stock backend. |
| `TQKV_MLA_ROPE_HEAD_DIM` | `64` | RoPE head dimension for MLA latent + RoPE split. |

### Cold tier (variant 2 — H2O-style attention-weighted pool)

Off by default. See [`docs/cold_tier_design.md`](docs/cold_tier_design.md)
for current shipping status. Master flag must be set for any other
cold-tier env to take effect.

| Variable | Default | Description |
|---|---|---|
| `TQKV_COLD_ENABLE` | `0` | Master switch. When `0`, every other cold-tier env is ignored and the runtime is bit-identical to main. |
| `TQKV_COLD_HOT_PATH` | `0` | Hot-path opt-in. `1` installs the observer hook; `0` (default) installs state only. |
| `TQKV_COLD_TIERED_ATTEND` | `0` | γ\_merge: full tiered attention with LSE merge. Currently a shipping blocker — leave `0`. |
| `TQKV_COLD_PROFILE` | `""` | One of `balanced`, `aggressive`, `max_compress`. Presets `M`, `P`, `K_PCT`, `W_track`, and per-layer bits. |
| `TQKV_COLD_M` | `8192` | Age threshold: tokens older than this enter the cold tier. |
| `TQKV_COLD_P` | `16` (=block\_size) | Pool slice width. Page-aligned — each eviction frees one vLLM page. |
| `TQKV_COLD_K_PCT` | `1.0` | Percent of each pool slice retained individually as heavy hitters (ranked by pre-softmax Q·K). |
| `TQKV_COLD_W_TRACK` | `64` | Aging-window size (decode steps a token stays in the ring before eviction). Must be ≥ `TQKV_COLD_P`. |
| `TQKV_COLD_BITS_K` / `TQKV_COLD_BITS_V` | hot-tier bits | Scalar override for cold-tier K/V bit width. |
| `TQKV_COLD_LAYER_BITS` | `""` | Per-layer cold-tier override, e.g. `5:8,8;10:4,4`. |
| `TQKV_COLD_NO_POOL_LAYERS` | `"0 1"` | Space-separated layers that skip pooling entirely (L0/L1 are attention-sink layers). |

**Profiles** (`balanced` / `aggressive` / `max_compress`): set bundles
of the above. Expected compression at 1M ctx on Qwen3-4B: ~47×
(balanced/aggressive) and ~100× (max\_compress, unvalidated).

**Status:** flag-off-default. γ\_nomerge path (archive + drain, no LSE
merge) decodes coherently and saves ~5% peak\_MiB at batch=4. γ\_merge
path is gated behind `TQKV_COLD_TIERED_ATTEND=1` and is currently a
shipping blocker pending the multi-seq capacity bug fix.

## Why a vLLM fork (for now)

SGLang doesn't need a fork — its attention-backend plugin surface accepts custom pools via a call-time patch today, so the SGLang integration ships as a pure plugin.

vLLM is structurally different. `CacheDType` in `vllm/config/cache.py` is a Pydantic `Literal` that validates at class-definition time, which blocks runtime plugin registration of any new KV-cache dtype. Until that's relaxed upstream there's no plugin-only path, and we ship a fork.

The fork is a thin overlay on top of `vllm/vllm-openai:v0.19.0`. The full layout — every file we copy, what each patch does, and the v0.19.0 compat guards — lives in [`docker/PATCHES.md`](docker/PATCHES.md). At the time of writing, that's roughly 19 files copied across `config/`, `v1/attention/`, and `v1/core/`, plus one append-patch on `utils/torch_utils.py`. The shape of the patches falls into three groups:

- **Plugin-registration plumbing** (about half the files): adding `"tqkv"` to `CacheDType`, the backend enum, the platform selector, the cache-spec dispatch, and the bytes-layout calculator. This is what would collapse into a pure plugin if `CacheDType` becomes extensible.
- **Hybrid-model KV bookkeeping** (`v1/core/kv_cache_*.py`): keeps per-group page sizes for attention vs. Mamba/GDN layers so compressed attention pages don't get unified up to a Mamba-state page size. Required for LFM2 and Qwen3.5 MoE+GDN; transparent on dense models.
- **Capture-mode and OOM-fallback wiring**: `CUDAGraphMode.FULL_AND_PIECEWISE` for chunked-prefill capture, and the determine-available-memory fallback for compressed KV when the profiler returns 0.

We don't currently have an upstream PR open for the `CacheDType` relaxation — the open question is whether vLLM is willing to accept a plugin-extensible KV-cache-dtype surface; until that conversation happens, the fork is the way. If you want to track or contribute, the canonical state is `arbi-dev/vllm` and `docker/PATCHES.md`.

## Roadmap

- [ ] vLLM upstream: explore `CacheDType` relaxation so tqkv can register as a pure plugin (no PR open yet)
- [ ] FlashAttention upstream: explore upstreaming the FA4 inline-dequant variant (no PR open yet)
- [ ] Validated model matrix: Llama-3, Mixtral, DeepSeek-V3, Command-R+, Gemma
- [ ] TP > 2 validation
- [ ] RULER and LongBench-v2 at 32K/64K/128K across all bit widths
- [ ] SGLang upstream: explore a first-class pool-factory hook so `register()` can become a no-op (no PR open yet)
- [ ] Hopper and Blackwell prefill optimization (TMA, larger smem budgets)

## Citation

If turbo_attn helps your work, please cite both Google's TurboQuant paper and this implementation:

```bibtex
@misc{turbo_attn2026,
  title = {turbo\_attn: Production attention backend for TurboQuant KV cache compression},
  author = {Evseev, Dmitri},
  year = {2026},
  url = {https://github.com/arbi-dev/turbo_attn}
}

@inproceedings{zandieh2026turboquant,
  title = {TurboQuant: Near-optimal KV Cache Quantization for LLM Inference},
  author = {Zandieh, Amir and others},
  booktitle = {ICLR},
  year = {2026}
}
```

## License

Mozilla Public License 2.0 (MPL-2.0). See `LICENSE` and `NOTICE`.
