Metadata-Version: 2.4
Name: turbo-attn
Version: 0.6.1
Summary: Optimized CUDAgraph-enabled kernels and attention backend for vLLM, SGLang and more based on TurboQuant near-lossless KV cache compression. SOTA performance with Gemma 4, Qwen 3.6 and other modern LLMs.
Author-email: "Dmitri Evseev (Arbi City)" <dmitri.evseev@arbi.city>
Maintainer-email: "Dmitri Evseev (Arbi City)" <dmitri.evseev@arbi.city>
License: MPL-2.0
Project-URL: Homepage, https://github.com/arbi-dev/turbo-attn
Project-URL: Repository, https://github.com/arbi-dev/turbo-attn
Project-URL: Issues, https://github.com/arbi-dev/turbo-attn/issues
Project-URL: Changelog, https://github.com/arbi-dev/turbo-attn/blob/main/CHANGELOG.md
Keywords: kv-cache,quantization,vllm,sglang,flash-attention,llm-inference,transformers,cuda
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Mozilla Public License 2.0 (MPL 2.0)
Classifier: Operating System :: POSIX :: Linux
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
License-File: NOTICE
Requires-Dist: torch>=2.1
Provides-Extra: codebook
Requires-Dist: scipy>=1.10; extra == "codebook"
Provides-Extra: triton
Requires-Dist: triton>=3.0; extra == "triton"
Provides-Extra: flashinfer
Requires-Dist: flashinfer>=0.6; extra == "flashinfer"
Provides-Extra: fa4
Requires-Dist: nvidia-cutlass-dsl==4.4.2; extra == "fa4"
Requires-Dist: quack-kernels>=0.3.7; extra == "fa4"
Requires-Dist: apache-tvm-ffi==0.1.10; extra == "fa4"
Provides-Extra: vllm
Requires-Dist: vllm>=0.19; extra == "vllm"
Provides-Extra: flash-attn
Requires-Dist: flash-attn>=2.5; extra == "flash-attn"
Provides-Extra: eval
Requires-Dist: lm-eval>=0.4.5; extra == "eval"
Requires-Dist: ray; extra == "eval"
Requires-Dist: datasets; extra == "eval"
Requires-Dist: langdetect; extra == "eval"
Requires-Dist: immutabledict; extra == "eval"
Requires-Dist: nltk; extra == "eval"
Requires-Dist: sacrebleu; extra == "eval"
Requires-Dist: absl-py; extra == "eval"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: scipy>=1.10; extra == "dev"
Requires-Dist: triton>=3.0; extra == "dev"
Provides-Extra: all
Requires-Dist: turbo-attn[codebook,eval,fa4,flash-attn,flashinfer,triton,vllm]; extra == "all"
Dynamic: license-file

# Turbo Attention

[![CI](https://github.com/arbi-dev/turbo-attn/actions/workflows/ci.yml/badge.svg)](https://github.com/arbi-dev/turbo-attn/actions/workflows/ci.yml)
[![PyPI](https://img.shields.io/pypi/v/turbo-attn.svg)](https://pypi.org/project/turbo-attn/)
[![License](https://img.shields.io/badge/license-MPL--2.0-blue.svg)](LICENSE)

A modular attention backend for vLLM, SGLang, and HuggingFace Transformers. It provides custom CUDA and Triton kernels with full CUDAGraph capture, asymmetric K/V quantization, and hybrid-model support. It is built on FlashAttention and based on TurboQuant near-lossless KV cache compression.

PyPI: `turbo-attn` · Import: `tqkv` · License: MPL-2.0

## Install

```bash
pip install turbo-attn                  # codec + CUDA/Triton kernels
pip install "turbo-attn[vllm]"          # + vLLM attention backend
pip install "turbo-attn[all]"           # + vLLM, FlashInfer, flash-attn, FA4, Triton, codebook, eval harness
```

## Quickstart

```python
import torch
from tqkv import TurboKVCodec

codec = TurboKVCodec(head_dim=128, bit_width=4, device="cuda")
keys = torch.randn(8, 128, device="cuda")

packed, norms = codec.compress_k(keys)
recon = codec.decompress_k(packed, norms)
```

See [`examples/`](examples/) for runnable snippets and
[ARCHITECTURE.md](ARCHITECTURE.md) for a codebase tour.

## Two independently-usable pieces

`turbo-attn` ships two pieces that are presented as a stack but designed to be consumed separately:

1. **Codec → any attention backend.** `TurboKVCodec` is a pure, framework-agnostic compressor: compress with TQKV, decompress to bf16 / fp16, hand the result to vanilla `flash_attn_varlen_func`, FlashInfer, SGLang attention, anything that takes raw KV. See [`examples/06_tqkv_codec_with_third_party_attention.py`](examples/06_tqkv_codec_with_third_party_attention.py).

2. **Kernels → any KV format.** The cute-DSL prefill and split-K paged decode kernels are policy-parametric on the K/V format via the **Loader extension point**. The bundled set is `{TqkvLoader, BypassLoader}`:
   - **`TqkvLoader`** — TQKV centroid-based codec dequant (the production path).
   - **`BypassLoader`** — raw bf16 / fp16 KV, no codec. Useful for apples-to-apples ablations under an otherwise-byte-identical kernel.

   Third-party formats (fp8, int8, nvfp4, …) are **not** shipped; write a sibling Loader for your format. The Loader is the public extension surface; mainloop / scheduler / softmax / epilogue stay turbo-attn's. See [`docs/writing_a_loader.md`](docs/writing_a_loader.md) for a worked fp8 example.

## Repo layout

- `tqkv/` — the package (codec, kernels, runtime, vLLM/SGLang plugins, calibration pipeline).
- `tqkv/kernels/loaders/` — bundled cute-DSL prefill Loaders (`tqkv`, `bypass`).
- `tqkv/kernels/_decode_loader_*.cuh` — bundled decode Loaders (`TqkvDecodeLoader`, `BypassDecodeLoader`).
- `docs/`, `docker/`, `scripts/`, `examples/`, `experiments/` — public docs, deploy recipes, helper scripts, runnable examples, research notes.
- The top-level `internal/` directory is engineering-only and unsupported — design notes, internal compose files, dev harnesses. The wheel never ships it.

## Run with Docker

Three inference servers are supported: vLLM, SGLang, and [arbi-serve](https://github.com/arbi-dev/arbi-serve). Each ships a turn-key Dockerfile. Calibration files for the bit-width / model combo go in a host directory; the `TQKV_CALIBRATION_FILE` env var inside the container points to one. Examples below use `Qwen3.5-0.8B` + a K4V4 calibration.

### Layout assumed

```
/path/to/models/Qwen3.5-0.8B/...                       # HF snapshot
/path/to/calibrations/qwen3.5-0.8b_tq4_v3.json         # calibration bundle
```

All three accept the same CLI flags, `--kv-cache-dtype tqkv --attention-backend turbo-attn`, plus the environment variables `TQKV_BITS=<float>` and `TQKV_CALIBRATION_FILE=<path>`. `TQKV_BITS` is the *average* bits-per-element across K and V (e.g. `4.0`, `5.0`, `6.0`); a per-layer Lagrangian solver turns that target into a per-layer `(k_bits, v_bits)` allocation that lives in the calibration bundle. `TQKV_BITS=4.0` does *not* mean "K and V both at 4 bits": it means an average of 4 bits-per-element under the solver's per-layer allocation, so individual layers may sit above or below 4 bits.
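To make the "average, not uniform" point concrete, here is a minimal sketch. The four-layer allocation is invented for illustration and is not taken from any real calibration bundle; the averaging also assumes every layer stores the same number of K and V elements, which is a simplification made for this example.

```python
# Hypothetical per-layer (k_bits, v_bits) allocation. Real values come from
# the calibration bundle; these numbers are made up for illustration.
allocation = [
    (6, 4),  # layer 0: keys get more bits than values
    (4, 4),  # layer 1
    (4, 3),  # layer 2
    (4, 3),  # layer 3
]


def average_bits(alloc):
    """Mean bits-per-element over K and V.

    Assumes each layer holds the same number of K and V elements
    (an assumption of this sketch, not a statement about the solver).
    """
    total = sum(k_bits + v_bits for k_bits, v_bits in alloc)
    return total / (2 * len(alloc))


avg = average_bits(allocation)
print(f"average bits-per-element: {avg}")
```

Here the target of 4.0 is met even though only one of the four layers actually stores both K and V at 4 bits.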

### Sibling-checkout layout

All three Dockerfiles `COPY` from a sibling-repo layout. Clone the relevant repos as siblings of `turbo-attn/`:

```
GIT/
├── turbo-attn/        # this repo
├── vllm-fork/         # arbi-dev/vllm        (only needed for vLLM image)
├── sglang-fork/       # arbi-dev/sglang      (only needed for SGLang image)
└── arbi-serve/        # arbi-dev/arbi-serve  (only needed for arbi-serve image)
```

```bash
mkdir -p ~/GIT && cd ~/GIT
git clone https://github.com/arbi-dev/turbo-attn
git clone https://github.com/arbi-dev/vllm       vllm-fork    # for vLLM
git clone https://github.com/arbi-dev/sglang     sglang-fork  # for SGLang
git clone https://github.com/arbi-dev/arbi-serve              # for arbi-serve
```

### vLLM

Uses our [`vllm-fork`](https://github.com/arbi-dev/vllm) rebased onto upstream `v0.20.1` (small overlay — `CacheDType` Literal relaxation, per-group block-pool bookkeeping, named `TURBO_ATTN` slot in `AttentionBackendEnum`; full layout in [`docker/PATCHES.md`](docker/PATCHES.md)).

```bash
cd ~/GIT/turbo-attn/docker

# build + run
TQKV_MODELS_ROOT=/path/to/models \
  docker compose -f compose.vllm.yaml up -d --build

docker compose -f compose.vllm.yaml logs -f

# serve a request once "Application startup complete" appears:
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "/models/Qwen3.5-0.8B", "prompt": "The capital of France is", "max_tokens": 12, "temperature": 0}'
```
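The same request can be issued from Python. The sketch below uses only the standard library and mirrors the `curl` call above; the helper names `build_completion_request` and `complete` are illustrative and not part of `tqkv`, and the response parsing assumes the usual OpenAI-style `choices[0].text` layout.

```python
import json
import urllib.request


def build_completion_request(base_url="http://localhost:8000",
                             model="/models/Qwen3.5-0.8B",
                             prompt="The capital of France is",
                             max_tokens=12):
    """Build (but do not send) a POST to the OpenAI-compatible endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0,
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(**kwargs):
    """Send the request and return the first completion's text.

    Assumes an OpenAI-style response body with a `choices` list.
    """
    request = build_completion_request(**kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

Call `complete()` once the server logs "Application startup complete"; pass `base_url=` if you changed `TQKV_PORT`.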

Optional env (override on the `docker compose` command line):

| Variable | Default | Purpose |
|---|---|---|
| `TQKV_MODEL` | `/models/Qwen3.5-0.8B` | Container-side model path |
| `TQKV_BITS` | `4.0` | Average bits-per-element target |
| `TQKV_CALIBRATION_FILE` | unset | Path to a v4 calibration bundle. Unset → uniform K4V4 fallback (tests/CI only; production needs a bundle) |
| `TQKV_PORT` | `8000` | Host port |
| `TQKV_GPU_DEVICE` | `0` | `NVIDIA_VISIBLE_DEVICES` |
| `TQKV_MAX_MODEL_LEN` | `2048` | Max context length |

### SGLang

Uses our [`sglang-fork`](https://github.com/arbi-dev/sglang) rebased onto upstream `v0.5.11` (small overlay — plugin registries for KV-cache dtypes and attention backends; full layout in [`docker/PATCHES.md`](docker/PATCHES.md)).

```bash
cd ~/GIT/turbo-attn/docker

TQKV_MODELS_ROOT=/path/to/models \
  docker compose -f compose.sglang.yaml up -d --build

docker compose -f compose.sglang.yaml logs -f

curl http://localhost:30000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "/models/Qwen3.5-0.8B", "prompt": "The capital of France is", "max_tokens": 12, "temperature": 0}'
```

The env-var contract matches vLLM's, except that SGLang replaces `TQKV_MAX_MODEL_LEN` with `TQKV_CONTEXT_LEN` and adds `TQKV_MEM_FRAC` (passed as `--mem-fraction-static`, default `0.45`).

### arbi-serve

Standalone OpenAI-compatible server with TQKV backends as a first-class citizen. Lives in [arbi-dev/arbi-serve](https://github.com/arbi-dev/arbi-serve).

```bash
cd ~/GIT/arbi-serve

ARBI_MODELS_ROOT=/path/to/models \
  docker compose up -d --build

docker compose logs -f

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "/models/Qwen3.5-0.8B", "prompt": "The capital of France is", "max_tokens": 12, "temperature": 0}'
```

### MLA models

For DeepSeek V2/V3/V4, additionally set `-e TQKV_MLA_ENABLE=1` on whichever container you run.

### Calibration

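Pre-built calibration bundles for common models live on Hugging Face at [`arbi-dev/turbo-attn-calibrations`](https://huggingface.co/arbi-dev/turbo-attn-calibrations). To roll your own, start with `calibrate_centroids`; the Configuration section lists the full pipeline (`calibrate_centroids` → `optimize_quant` → `migrate_v3_to_v4`) that produces a schema-v4 bundle: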

```bash
python -m tqkv.calibration.calibrate_centroids \
    --model Qwen/Qwen3.5-0.8B \
    --output qwen3.5-0.8b_tq4_v3.json \
    --bits 4
```

## How it works

1. **Rotate** each KV vector with a fast Walsh–Hadamard transform.
2. **Normalize** — store the magnitude as a single BF16 value.
3. **Quantize** each rotated coordinate to a shared codebook.

Because the rotation is orthogonal, attention scores on rotated KV equal those on unrotated KV (up to floating-point rounding) when the query is rotated by the same matrix; we pre-rotate Q once per request and compute everything in the rotated space.
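The three steps, and the fact that rotating Q and K by the same matrix preserves their dot product, can be sketched in pure Python. This is an illustration of the idea only, not the `tqkv` codec: the function names are hypothetical, and the uniform 16-entry codebook is a stand-in for the calibrated centroids a real bundle would supply.

```python
import math
import random


def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    y = list(x)
    n = len(y)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)  # makes the transform orthogonal
    return [v * scale for v in y]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def compress(vec, codebook):
    """Rotate, split off the norm, snap each coordinate to its nearest centroid."""
    rotated = fwht(vec)
    norm = math.sqrt(dot(rotated, rotated))
    unit = [v / norm for v in rotated]
    indices = [min(range(len(codebook)), key=lambda c: abs(codebook[c] - u))
               for u in unit]
    return indices, norm


def decompress(indices, norm, codebook):
    """Reconstruct in the rotated basis; no inverse rotation is applied."""
    return [codebook[i] * norm for i in indices]


random.seed(0)
d = 128
q = [random.gauss(0, 1) for _ in range(d)]
k = [random.gauss(0, 1) for _ in range(d)]

exact = dot(q, k)
rotated = dot(fwht(q), fwht(k))  # same score, up to rounding

# Hypothetical 4-bit codebook: 16 uniform levels spanning about +/-3 standard
# deviations of a unit vector's coordinates (a stand-in for real centroids).
codebook = [(-3.0 + 6.0 * i / 15) / math.sqrt(d) for i in range(16)]
indices, norm = compress(k, codebook)
approx = dot(fwht(q), decompress(indices, norm, codebook))

print(f"exact={exact:.4f} rotated={rotated:.4f} quantized={approx:.4f}")
```

The rotated score matches the exact one to rounding error, while the quantized score differs by the codebook's approximation error. Working in the rotated basis means the key never has to be rotated back.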

The decode path is one fused CUDA kernel that unpacks, dequantizes, and runs Q·K, online softmax, and P·V in a single pass — no decompress buffer. Prefill has three paths: an FA4 CuTeDSL subclass that dequantizes inline during the MMA pipeline (default), a hand-written CUDA C++ kernel, and a decompress + stock FlashAttention fallback.
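For readers unfamiliar with online softmax, the following pure-Python sketch shows how Q·K, softmax, and P·V can be folded into one pass over the keys with a running maximum and a running normaliser. It is a scalar reference for the general technique, not a description of the CUDA kernel's actual structure; dequantisation is omitted and the function names are illustrative.

```python
import math
import random


def attention_two_pass(q, keys, values):
    """Reference: materialise all scores, then softmax, then weight the values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / z for j in range(dim)]


def attention_online(q, keys, values):
    """Single pass: keep a running max, running sum, and running output."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for k, v in zip(keys, values):
        s = scale * sum(a * b for a, b in zip(q, k))
        new_max = max(running_max, s)
        # Rescale what has been accumulated so far to the new maximum.
        correction = math.exp(running_max - new_max)  # exp(-inf) == 0.0 at step 0
        p = math.exp(s - new_max)
        running_sum = running_sum * correction + p
        acc = [a * correction + p * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


random.seed(1)
head_dim, seq_len = 8, 32
q = [random.gauss(0, 1) for _ in range(head_dim)]
keys = [[random.gauss(0, 1) for _ in range(head_dim)] for _ in range(seq_len)]
values = [[random.gauss(0, 1) for _ in range(head_dim)] for _ in range(seq_len)]

out_ref = attention_two_pass(q, keys, values)
out_online = attention_online(q, keys, values)
print(max(abs(a - b) for a, b in zip(out_ref, out_online)))
```

Because each key/value pair is consumed once and then discarded, a kernel built this way can dequantise K and V on the fly instead of writing them to a decompress buffer.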

## Configuration

All runtime configuration is via `TQKV_`-prefixed environment variables. The supported surface is below; anything unlisted is internal and may change.
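When debugging a deployment it helps to see exactly which `TQKV_` variables a process has inherited. A small standard-library sketch (the helper name `tqkv_env` is hypothetical and not part of the package):

```python
import os


def tqkv_env(environ=None):
    """Return the TQKV_-prefixed variables from an environment mapping, sorted by name."""
    environ = os.environ if environ is None else environ
    return {name: value
            for name, value in sorted(environ.items())
            if name.startswith("TQKV_")}


for name, value in tqkv_env().items():
    print(f"{name}={value}")
```

Variables that do not appear in the output are unset, so the defaults in the tables below apply.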

### Bit width and calibration

| Variable | Default | Description |
|---|---|---|
| `TQKV_BITS` | `4.0` | *Average* bits-per-element target across K and V (float in `[2.0, 8.0]`). The runtime looks up the calibration bundle's `byte_budget_table[<TQKV_BITS>]` for the per-layer `(k_bits, v_bits)` allocation. Hard error if the entry is missing — no silent fallback. |
| `TQKV_CALIBRATION_FILE` | `""` | Path to a schema-v4 calibration bundle (centroids + per-channel scales + `byte_budget_table`). Required for production; bundle generation: `calibrate_centroids` → `optimize_quant` → `migrate_v3_to_v4`. When unset, the plugin falls back to uniform K4V4 (tests/CI only). |
| `TQKV_AUTO_CALIBRATE_MODEL` | `""` | Model path for plugin-side auto-calibration when `TQKV_CALIBRATION_FILE` doesn't exist on first init. |
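The "hard error, no silent fallback" behaviour for `TQKV_BITS` can be pictured as follows. This is a sketch under loud assumptions: the bundle is modelled as a plain dict, the key format (`"4.0"`) and the per-layer list shape are guesses made for illustration, and `lookup_allocation` is a hypothetical helper rather than a `tqkv` function. Consult the bundle schema for the real layout.

```python
def lookup_allocation(bundle, bits):
    """Return the per-layer (k_bits, v_bits) list for a TQKV_BITS target.

    Raises instead of falling back when the target is out of range or the
    bundle has no entry for it.
    """
    if not 2.0 <= bits <= 8.0:
        raise ValueError(f"TQKV_BITS={bits} is outside [2.0, 8.0]")
    table = bundle["byte_budget_table"]
    key = f"{bits:.1f}"  # ASSUMPTION: keys are strings such as "4.0"
    if key not in table:
        raise KeyError(
            f"no byte_budget_table entry for TQKV_BITS={key}; "
            f"available: {sorted(table)}"
        )
    return table[key]


# Hypothetical bundle fragment; values are invented for illustration.
bundle = {
    "byte_budget_table": {
        "4.0": [[6, 4], [4, 4], [4, 3], [4, 3]],
    }
}

print(lookup_allocation(bundle, 4.0))

try:
    lookup_allocation(bundle, 5.0)
except KeyError as err:
    print("rejected:", err)
```

Failing loudly here means a typo in `TQKV_BITS` cannot silently serve a different compression level than the one requested.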

### Engine selection

| Variable | Default | Description |
|---|---|---|
| `TQKV_ENGINE` | `""` (auto) | Decode engine: `native_tq`, `flash_attn`, or `bypass`. |
| `TQKV_PREFILL_ENGINE` | `fa4` | Prefill path: `fa4`, `adaptive`, or `decomp_fa_main_only` (bench-only — main-token prefill through decomp+FA; decode + MTP verify untouched). |
| `TQKV_PREFILL_BYPASS` | `1` | First-chunk prefill bypass — skip codec on prompt-prefill, then re-rotate to TQ basis for decode. |
| `TQKV_FUSE_QROT` | `""` (auto) | Fused Q-rotation prologue. Decode-only. |
| `TQKV_O_PROJ_FOLD` | `on` | Fold `rotate_output` into `o_proj` weights. |
| `TQKV_MTP_SPLITK` | `1` | Use split-K decode kernel for MTP layers. |
| `TQKV_DECODE_SPLITS` | `""` (autotune) | Force decode-kernel split count. |

### Backend behaviour

| Variable | Default | Description |
|---|---|---|
| `TQKV_NO_JIT` | `0` | Fail if a kernel variant is not pre-compiled. |
| `TQKV_K_NC` | `1` | Apply norm-correction to K reads in the dequant path. |
| `TQKV_DISABLE_PRESCALE` | `0` | Disable per-channel pre-scaling on compress upload. |
| `TQKV_STRICT_NO_SDPA` | `0` | Raise instead of taking the `head_dim>256` SDPA fallback. Recommended for `head_dim>256` deployments. |

### MLA (DeepSeek V2/V3/V4)

| Variable | Default | Description |
|---|---|---|
| `TQKV_MLA_ENABLE` | `0` | Master switch for the MLA backend. |
| `TQKV_MLA_ROPE_HEAD_DIM` | `64` | RoPE head dimension for MLA latent + RoPE split. |

## Why a vLLM fork (for now)

`CacheDType` in `vllm/config/cache.py` is a Pydantic `Literal` validated at class-definition time, which blocks runtime registration of new KV-cache dtypes. Until that's relaxed upstream, we ship a fork. The fork is a thin overlay; full layout in [`docker/PATCHES.md`](docker/PATCHES.md). The SGLang image also builds from a fork (see the SGLang section above), whose overlay is limited to plugin registries for KV-cache dtypes and attention backends.
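The problem can be reproduced in miniature with the standard library alone. The sketch below is a plain-Python stand-in, not vLLM's code and not Pydantic: the class name, the validation logic, and the two allowed values are illustrative. It only shows why a set of allowed strings captured when a class is defined cannot be extended by a plugin that loads later.

```python
from typing import Literal, get_args

# Illustrative subset of allowed values, not copied from vLLM.
CacheDType = Literal["auto", "fp8"]


class CacheConfig:
    # The allowed set is read once, while the class body executes.
    _allowed = get_args(CacheDType)

    def __init__(self, cache_dtype):
        if cache_dtype not in self._allowed:
            raise ValueError(
                f"cache_dtype must be one of {self._allowed}, got {cache_dtype!r}"
            )
        self.cache_dtype = cache_dtype


def is_accepted(name):
    """True if CacheConfig accepts the dtype name, False if it is rejected."""
    try:
        CacheConfig(name)
        return True
    except ValueError:
        return False


print(is_accepted("auto"))  # a built-in value
print(is_accepted("tqkv"))  # a plugin-provided value
```

A plugin has no supported hook to add `"tqkv"` to such a closed set, which is why the allowed values have to be changed in the source itself.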

## Citation

If Turbo Attention helps your work, please cite both the underlying TurboQuant paper and this implementation:

```bibtex
@misc{turbo_attention2026,
  title = {Turbo Attention: Production attention backend for TurboQuant KV cache compression},
  author = {Evseev, Dmitri},
  year = {2026},
  url = {https://github.com/arbi-dev/turbo-attn}
}

@inproceedings{zandieh2026turboquant,
  title = {TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate},
  author = {Zandieh, Amir and Daliri, Majid and Hadian, Majid and Mirrokni, Vahab},
  booktitle = {ICLR},
  year = {2026}
}
```

## License

Mozilla Public License 2.0 (MPL-2.0). See `LICENSE` and `NOTICE`.
