Metadata-Version: 2.4
Name: turbo-attn
Version: 0.1.2
Summary: Optimized CUDAGraph-enabled kernels and attention backend for vLLM, SGLang and more based on TurboQuant near-lossless KV cache compression. SOTA performance with Gemma 4, Qwen 3.6 and other modern LLMs.
Author-email: "Dmitri Evseev (Arbi City)" <dmitri.evseev@arbi.city>
Maintainer-email: "Dmitri Evseev (Arbi City)" <dmitri.evseev@arbi.city>
License: MPL-2.0
Project-URL: Homepage, https://github.com/arbi-dev/turbo_attn
Project-URL: Repository, https://github.com/arbi-dev/turbo_attn
Project-URL: Issues, https://github.com/arbi-dev/turbo_attn/issues
Project-URL: Changelog, https://github.com/arbi-dev/turbo_attn/blob/main/CHANGELOG.md
Keywords: kv-cache,quantization,vllm,sglang,flash-attention,llm-inference,transformers,cuda
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Mozilla Public License 2.0 (MPL 2.0)
Classifier: Operating System :: POSIX :: Linux
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
License-File: NOTICE
Requires-Dist: torch>=2.1
Provides-Extra: codebook
Requires-Dist: scipy>=1.10; extra == "codebook"
Provides-Extra: triton
Requires-Dist: triton>=3.0; extra == "triton"
Provides-Extra: flashinfer
Requires-Dist: flashinfer>=0.6; extra == "flashinfer"
Provides-Extra: vllm
Requires-Dist: vllm>=0.19; extra == "vllm"
Provides-Extra: flash-attn
Requires-Dist: flash-attn>=2.5; extra == "flash-attn"
Provides-Extra: eval
Requires-Dist: lm-eval>=0.4.5; extra == "eval"
Requires-Dist: ray; extra == "eval"
Requires-Dist: datasets; extra == "eval"
Requires-Dist: langdetect; extra == "eval"
Requires-Dist: immutabledict; extra == "eval"
Requires-Dist: nltk; extra == "eval"
Requires-Dist: sacrebleu; extra == "eval"
Requires-Dist: absl-py; extra == "eval"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: scipy>=1.10; extra == "dev"
Requires-Dist: triton>=3.0; extra == "dev"
Provides-Extra: all
Requires-Dist: turbo-attn[codebook,eval,flash-attn,flashinfer,triton,vllm]; extra == "all"
Dynamic: license-file

# Turbo Attention

[![CI](https://github.com/arbi-dev/turbo_attn/actions/workflows/ci.yml/badge.svg)](https://github.com/arbi-dev/turbo_attn/actions/workflows/ci.yml)
[![PyPI](https://img.shields.io/pypi/v/turbo-attn.svg)](https://pypi.org/project/turbo-attn/)
[![License](https://img.shields.io/badge/license-MPL--2.0-blue.svg)](LICENSE)

A modular attention backend for vLLM, SGLang, and HuggingFace Transformers. It provides custom CUDA and Triton kernels with full CUDAGraph capture, asymmetric K/V quantization, and hybrid-model support. It is built on FlashAttention and based on TurboQuant near-lossless KV cache compression.

PyPI: `turbo-attn` · Import: `tqkv` · License: MPL-2.0

## Install

```bash
pip install turbo-attn                  # codec + CUDA/Triton kernels
pip install "turbo-attn[vllm]"          # + vLLM attention backend
pip install "turbo-attn[all]"           # every extra: codebook, triton, flashinfer, flash-attn, vllm, eval
```

## Quickstart

```python
import torch
from tqkv import TurboKVCodec

codec = TurboKVCodec(head_dim=128, bit_width=4, device="cuda")
keys = torch.randn(8, 128, device="cuda")

packed, norms = codec.compress_k(keys)
recon = codec.decompress_k(packed, norms)
```

See [`examples/`](examples/) for runnable snippets and
[ARCHITECTURE.md](ARCHITECTURE.md) for a codebase tour.
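
As a rough size check for the quickstart above, the sketch below estimates the bytes needed per compressed vector. It assumes the packed codes take exactly `head_dim * bit_width` bits and that the only side data is one 2-byte BF16 norm per vector; the real layout may add padding or metadata. `compressed_bytes` and `bf16_bytes` are hypothetical helpers, not part of `tqkv`.

```python
def compressed_bytes(head_dim: int, bit_width: int, norm_bytes: int = 2) -> int:
    """Estimated bytes for one compressed vector: packed codes plus one norm."""
    code_bits = head_dim * bit_width
    code_bytes = (code_bits + 7) // 8  # round up to whole bytes
    return code_bytes + norm_bytes


def bf16_bytes(head_dim: int) -> int:
    """Bytes for one uncompressed BF16 vector (2 bytes per element)."""
    return 2 * head_dim


head_dim, bit_width = 128, 4
small = compressed_bytes(head_dim, bit_width)
dense = bf16_bytes(head_dim)
print(f"{small} B vs {dense} B, ratio {dense / small:.2f}x")
# prints "66 B vs 256 B, ratio 3.88x"
```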

## Repo layout

- `tqkv/` — the package (codec, kernels, runtime, vLLM/SGLang plugins, calibration pipeline).
- `docs/`, `docker/`, `scripts/`, `examples/`, `experiments/` — public docs, deploy recipes, helper scripts, runnable examples, research notes.
- Anything under an `internal/` subdirectory is engineering-only and unsupported. The wheel never ships these.

## vLLM

vLLM serving currently requires the [vLLM fork](https://github.com/arbi-dev/vllm) (`turbo-attn` branch), which wires `"tqkv"` through vLLM's `CacheDType` Literal and adds per-group block-pool bookkeeping for hybrid models. The patch layout is documented in [`docker/PATCHES.md`](docker/PATCHES.md).

```bash
vllm serve Qwen/Qwen3.5-0.8B \
  --kv-cache-dtype tqkv \
  --attention-backend custom
```

For MLA models (DeepSeek V2/V3) set `TQKV_MLA_ENABLE=1`.

## SGLang

No fork required — `tqkv.integrations.sglang.register()` installs a pool factory and wires the `tqkv` attention backend.

```python
import tqkv.integrations.sglang as tqkv_sglang
tqkv_sglang.register()
```

```bash
python -m sglang.launch_server \
  --model-path Qwen/Qwen3.5-0.8B \
  --kv-cache-dtype tqkv \
  --attention-backend tqkv
```

## How it works

1. **Rotate** each KV vector with a fast Walsh–Hadamard transform.
2. **Normalize** — store the magnitude as a single BF16 value.
3. **Quantize** each rotated coordinate to a shared codebook.
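
The three steps can be sketched in pure Python. This is a toy illustration, not the package's codec: the evenly spaced codebook is a placeholder assumption (the real codebook is presumably tuned to the distribution of rotated coordinates), and `fwht`, `uniform_codebook`, `compress`, and `decompress` are hypothetical names.

```python
import math


def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    y = list(x)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = y[i], y[i + h]
                y[i], y[i + h] = a + b, a - b
        h *= 2
    return [v / math.sqrt(n) for v in y]


def uniform_codebook(bit_width):
    """Placeholder codebook: 2**bit_width evenly spaced levels in [-1, 1]."""
    levels = 2 ** bit_width
    return [-1.0 + 2.0 * i / (levels - 1) for i in range(levels)]


def nearest(codebook, value):
    """Index of the codebook entry closest to value."""
    return min(range(len(codebook)), key=lambda j: abs(codebook[j] - value))


def compress(x, codebook):
    rotated = fwht(x)                              # 1. rotate
    norm = math.sqrt(sum(v * v for v in rotated))  # 2. normalize (one scalar)
    unit = [v / norm if norm else 0.0 for v in rotated]
    codes = [nearest(codebook, u) for u in unit]   # 3. quantize each coordinate
    return codes, norm


def decompress(codes, norm, codebook):
    rotated = [codebook[c] * norm for c in codes]
    return fwht(rotated)  # the orthonormal WHT is its own inverse


codebook = uniform_codebook(4)
x = [0.5, -1.25, 2.0, 0.0, 3.5, -0.75, 1.0, -2.0]
codes, norm = compress(x, codebook)
recon = decompress(codes, norm, codebook)
```

Each vector is stored as one small integer per coordinate plus a single scalar norm, which is where the memory saving comes from.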

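Because the rotation is orthogonal, attention scores computed on rotated K are mathematically identical to scores on unrotated K when the query is rotated by the same matrix (in floating point they agree up to rounding error rather than bit for bit). We pre-rotate Q once per request and compute everything in the rotated space.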
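
A quick numerical check of that invariance, as a sketch: rotating both `q` and `k` with the same orthonormal Walsh-Hadamard transform leaves their dot product unchanged up to rounding. `hadamard_rotate` is a hypothetical pure-Python helper, not the package's kernel. The check covers only the Q·K scores; the P·V output stays in the rotated basis and has to be rotated back somewhere downstream.

```python
import math
import random


def hadamard_rotate(x):
    """Orthonormal Walsh-Hadamard transform (length must be a power of two)."""
    y, h, n = list(x), 1, len(x)
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                y[i], y[i + h] = y[i] + y[i + h], y[i] - y[i + h]
        h *= 2
    return [v / math.sqrt(n) for v in y]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


rng = random.Random(0)
q = [rng.gauss(0, 1) for _ in range(64)]
k = [rng.gauss(0, 1) for _ in range(64)]

plain = dot(q, k)
rotated = dot(hadamard_rotate(q), hadamard_rotate(k))
print(abs(plain - rotated))  # tiny, but not necessarily exactly 0.0
```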

The decode path is one fused CUDA kernel that unpacks, dequantizes, and runs Q·K, online softmax, and P·V in a single pass — no decompress buffer. Prefill has three paths: an FA4 CuTeDSL subclass that dequantizes inline during the MMA pipeline (default), a hand-written CUDA C++ kernel, and a decompress + stock FlashAttention fallback.
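
For readers unfamiliar with online softmax, here is a minimal single-query sketch of the "single pass" idea in plain Python. It is the generic textbook technique, not the CUDA kernel itself; `attend_one_query` is a hypothetical name, and the keys and values are taken as already dequantized (in a fused kernel the unpack and dequantize steps would happen inside the loop).

```python
import math


def attend_one_query(q, keys, values):
    """Single-pass attention for one query using online softmax."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = -math.inf
    denom = 0.0
    acc = [0.0] * len(values[0])
    for k, v in zip(keys, values):
        # A fused kernel would unpack and dequantize k and v here (not shown).
        s = scale * sum(a * b for a, b in zip(q, k))
        new_max = max(running_max, s)
        correction = math.exp(running_max - new_max)  # rescale earlier state
        p = math.exp(s - new_max)
        denom = denom * correction + p
        acc = [a * correction + p * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / denom for a in acc]
```

Because the running maximum and denominator are corrected as each key arrives, the full score vector never has to be materialized.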

## Configuration

All runtime configuration is via `TQKV_`-prefixed environment variables. The supported surface is below; anything unlisted is internal and may change.
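
Since configuration is read from the environment, one way to set it from Python is a small helper like the one below. This is a sketch: `apply_tqkv_env` is a hypothetical name, and it assumes the variables are read when the backend initializes, so it should run before `tqkv` is imported or the server is launched.

```python
import os


def apply_tqkv_env(settings: dict[str, str]) -> None:
    """Export TQKV_ settings into the process environment."""
    for name, value in settings.items():
        if not name.startswith("TQKV_"):
            raise ValueError(f"not a TQKV_ variable: {name}")
        os.environ[name] = str(value)


# Values taken from the defaults documented in this README.
apply_tqkv_env({"TQKV_BITS": "4", "TQKV_PREFILL_ENGINE": "fa4"})
```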

### Bit width and calibration

| Variable | Default | Description |
|---|---|---|
| `TQKV_BITS` | `4` | Bit width (2–8) used for both K and V. `TQKV_K_BITS`/`TQKV_V_BITS` override it when set. |
| `TQKV_K_BITS` / `TQKV_V_BITS` | inherits `TQKV_BITS` | Asymmetric K/V override. |
| `TQKV_LAYER_BITS` | `""` | Per-layer override string (e.g. `0:8,8;5:2,4`). |
| `TQKV_CALIBRATION_FILE` | `""` | Path to a calibration JSON bundle from `python -m tqkv.calibration.calibrate_model`. |
| `TQKV_ALLOCATION_FILE` | `""` | Path to a per-layer bit-allocation file from `python -m tqkv.calibration.solve_bits`. |
| `TQKV_AUTO_CALIBRATE_MODEL` | `""` | Model path for plugin-side auto-calibration on first init. |
| `TQKV_PROFILE` | `none` | Calibration profile from the bundle: `lossless`, `balanced`, `aggressive`. |
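
The precedence between these variables can be sketched as follows. Two things here are assumptions, not documented behaviour: that each `TQKV_LAYER_BITS` entry has the form `layer:k_bits,v_bits` with entries separated by `;`, and that a per-layer entry wins over the global settings. `resolve_bits` is a hypothetical helper, not part of `tqkv`.

```python
import os


def resolve_bits(layer, env=None):
    """Return (k_bits, v_bits) for a layer under the assumed precedence."""
    env = os.environ if env is None else env
    base = int(env.get("TQKV_BITS") or 4)         # documented default is 4
    k_bits = int(env.get("TQKV_K_BITS") or base)  # inherits TQKV_BITS
    v_bits = int(env.get("TQKV_V_BITS") or base)
    # Assumed format: "layer:k_bits,v_bits" entries separated by ";".
    for entry in (env.get("TQKV_LAYER_BITS") or "").split(";"):
        entry = entry.strip()
        if not entry:
            continue
        idx, pair = entry.split(":")
        if int(idx) == layer:
            k_str, v_str = pair.split(",")
            k_bits, v_bits = int(k_str), int(v_str)
    for bits in (k_bits, v_bits):
        if not 2 <= bits <= 8:
            raise ValueError(f"bit width out of range 2..8: {bits}")
    return k_bits, v_bits


example = {"TQKV_BITS": "4", "TQKV_K_BITS": "8", "TQKV_LAYER_BITS": "0:8,8;5:2,4"}
print(resolve_bits(5, example))  # (2, 4)
print(resolve_bits(1, example))  # (8, 4)
```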

### Engine selection

| Variable | Default | Description |
|---|---|---|
| `TQKV_ENGINE` | `""` (auto) | Decode engine: `native_tq`, `flash_attn`, or `bypass`. |
| `TQKV_PREFILL_ENGINE` | `fa4` | Prefill path: `fa4` or `adaptive`. |
| `TQKV_PREFILL_BYPASS` | `1` | First-chunk prefill bypass — skip codec on prompt-prefill, then re-rotate to TQ basis for decode. |
| `TQKV_FUSE_QROT` | `""` (auto) | Fused Q-rotation prologue. Decode-only. |
| `TQKV_O_PROJ_FOLD` | `on` | Fold `rotate_output` into `o_proj` weights. |
| `TQKV_MTP_SPLITK` | `1` | Use split-K decode kernel for MTP layers. |
| `TQKV_DECODE_SPLITS` | `""` (autotune) | Force decode-kernel split count. |
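
For context on what a split count means, the sketch below shows the generic split-K idea: each split attends over a slice of the KV cache and returns its local maximum, softmax denominator, and unnormalized output, and the partials are then merged. This is the standard technique in plain Python, not a description of how the package's kernel is written; `partial_attention` and `merge_partials` are hypothetical names.

```python
import math


def partial_attention(q, keys, values):
    """Return (max_score, sum_exp, unnormalized_output) for one KV slice."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    out = [sum(w * v[d] for w, v in zip(weights, values))
           for d in range(len(values[0]))]
    return m, sum(weights), out


def merge_partials(parts):
    """Combine per-split partials into the final attention output."""
    m_all = max(m for m, _, _ in parts)
    denom = 0.0
    out = [0.0] * len(parts[0][2])
    for m, l, o in parts:
        w = math.exp(m - m_all)  # bring each split onto the global max
        denom += w * l
        out = [a + w * b for a, b in zip(out, o)]
    return [a / denom for a in out]
```

Merging the partials from any number of splits gives the same result as one pass over the whole cache, up to rounding, so the split count is a performance knob rather than an accuracy one.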

### Backend behaviour

| Variable | Default | Description |
|---|---|---|
| `TQKV_NO_JIT` | `0` | Fail if a kernel variant is not pre-compiled. |
| `TQKV_K_NC` | `1` | Apply norm-correction to K reads in the dequant path. |
| `TQKV_DISABLE_PRESCALE` | `0` | Disable per-channel pre-scaling on compress upload. |
| `TQKV_STRICT_NO_SDPA` | `0` | Raise instead of taking the `head_dim>256` SDPA fallback. Recommended for `head_dim>256` deployments. |

### MLA (DeepSeek V2/V3/V4)

| Variable | Default | Description |
|---|---|---|
| `TQKV_MLA_ENABLE` | `0` | Master switch for the MLA backend. |
| `TQKV_MLA_ROPE_HEAD_DIM` | `64` | RoPE head dimension for MLA latent + RoPE split. |

## Why a vLLM fork (for now)

`CacheDType` in `vllm/config/cache.py` is a `typing.Literal` that Pydantic validates against a fixed set of strings, so new KV-cache dtypes cannot be registered at runtime. Until that's relaxed upstream, we ship a fork. The fork is a thin overlay; full layout in [`docker/PATCHES.md`](docker/PATCHES.md). SGLang does not need a fork.
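
The constraint can be illustrated with the standard library alone. In this sketch the `Literal` values are an illustrative subset, not vLLM's actual list, and `check_cache_dtype` is a hypothetical stand-in for the validation step.

```python
from typing import Literal, get_args

# Illustrative subset only; not vLLM's actual list of values.
CacheDType = Literal["auto", "fp8"]


def check_cache_dtype(value: str) -> str:
    """Reject any value that is not baked into the Literal type."""
    allowed = get_args(CacheDType)
    if value not in allowed:
        raise ValueError(f"{value!r} is not one of {allowed}")
    return value


check_cache_dtype("auto")    # accepted
# check_cache_dtype("tqkv")  # would raise ValueError
```

A `Literal` has no registration hook: the allowed strings are part of the type, so adding `"tqkv"` means editing the definition itself.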

## Citation

If Turbo Attention helps your work, please cite both the underlying TurboQuant paper and this implementation:

```bibtex
@misc{turbo_attention2026,
  title = {Turbo Attention: Production attention backend for TurboQuant KV cache compression},
  author = {Evseev, Dmitri},
  year = {2026},
  url = {https://github.com/arbi-dev/turbo_attn}
}

@inproceedings{zandieh2026turboquant,
  title = {TurboQuant: Near-optimal KV Cache Quantization for LLM Inference},
  author = {Zandieh, Amir and others},
  booktitle = {ICLR},
  year = {2026}
}
```

## License

Mozilla Public License 2.0 (MPL-2.0). See `LICENSE` and `NOTICE`.
