Metadata-Version: 2.4
Name: tqf
Version: 0.1.0
Summary: TurboQuant — PolarQuant KV cache compression for LLM inference
License: Apache-2.0
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: torch>=2.1
Provides-Extra: codebook
Requires-Dist: scipy>=1.10; extra == "codebook"
Provides-Extra: triton
Requires-Dist: triton>=3.0; extra == "triton"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: scipy>=1.10; extra == "dev"
Requires-Dist: triton>=3.0; extra == "dev"

# TurboKV

Compressed KV cache for LLM serving. 4-8x memory reduction with minimal quality loss.

## The Problem

When serving an LLM, every token the model has seen is stored in a **KV cache** that grows with conversation length. Each token stores two vectors (Key and Value) per layer in bf16 (2 bytes per number).

For a 27B model at 250K context, the KV cache alone can consume **40+ GB of GPU memory**. This is the main bottleneck for long conversations, large batches, and smaller GPUs.

## How It Works

TurboKV compresses KV cache values from 16 bits down to 2-4 bits — a **4-8x memory reduction**.

The compression pipeline:

1. **Rotate** the vector with a fast Hadamard transform (spreads information evenly across all dimensions)
2. **Normalize** (store the vector's magnitude separately as a single float)
3. **Quantize** each dimension to one of a small set of values (4 values for TQ2, 16 for TQ4)

The rotation is the key insight. Raw KV values have uneven distributions — some dimensions carry more information than others. After rotation, every dimension follows the same Gaussian distribution, so a single shared codebook achieves near-optimal quantization.

## Repository Architecture

The system spans three repositories, each handling a different concern:

### turbokv (this repo) — The Compression Library

Framework-agnostic core. Doesn't know about vLLM or FlashInfer.

- **Codec** — compress/decompress API (`TurboKVCodec`)
- **Codebook computation** — optimal quantization levels for Gaussian data via Lloyd's algorithm
- **CUDA and Triton kernels** — fast compress/decompress with auto-selection (CUDA > Triton > PyTorch)
- **Calibration** — per-layer centroid optimization from real model data (for models that deviate from Gaussian)
- **Header-only C++ library** — `include/turbokv/*.cuh` can be included in any CUDA project

### [flashinfer-fork](https://github.com/arbi-dev/flashinfer) — Attention Kernels

FlashInfer is a GPU kernel library for attention computation. The fork adds:

- **Fused decode kernel** — reads compressed KV cache directly with no decompression step. Unpacks bits, looks up centroids, and computes attention in one pass. Runs at 771 GB/s (near hardware ceiling on RTX 5090).
- **TQ prefill loader** — decompresses KV inline during FlashInfer's existing optimized prefill kernel. No custom prefill attention needed.

### [vllm-fork](https://github.com/arbi-dev/vllm) — Serving Engine Integration

vLLM is the production serving engine. The fork adds:

- **TurboKV attention backend** — manages compressed KV cache pages
- **CUDAGraph decode** — zero CPU overhead, fully graph-captured (385 tok/s on RTX 5090)
- **Three engine paths** — `native_tq` (fused), `flash_attn` (decompress+FA), `flashinfer` (decompress+FI)

## Key Architecture Decisions

### Why three repos?

Each has a different rate of change and different maintainers. Keeping them separate means a vLLM upgrade doesn't break the kernels, and a kernel optimization doesn't require re-testing the serving engine. Thin forwarding shims (`shims/`) let the forks import from turbokv without code duplication.

### Why a fused decode kernel?

Decode is **memory-bound** — the GPU mostly waits for data, not computes. Decompressing to a buffer first means 3x the memory traffic (read compressed, write decompressed, read again for attention). The fused kernel reads compressed data once and computes in-place. With TQ4, it reads 4x less data than bf16. Result: **2x faster decode**.

### Why NOT a fused prefill kernel?

Prefill is **compute-bound** — the GPU is busy doing matrix multiplication on tensor cores, not waiting for memory. Reading 4x less data doesn't help when compute is the bottleneck. We tested a standalone tensor-core prefill kernel and it was 3-10x slower than FlashInfer at realistic lengths. Instead, we plug TQ decompression into FlashInfer's prefill as a custom data loader — FlashInfer handles tiling and pipelining, we just provide decompressed values.

### Why are the codebooks Gaussian?

The fast Hadamard transform makes post-rotation KV values converge to a Gaussian distribution (central limit theorem). We verified this empirically — on Qwen3.5, the excess kurtosis is -0.07 (nearly perfect Gaussian). Lloyd's algorithm on Gaussian data gives fixed optimal centroids, so no per-model calibration is needed. The calibration infrastructure exists for models that might deviate, but Qwen3.5 doesn't need it.

## Performance

On RTX 5090 (24GB) with Qwen3.5-0.8B, TQ4:

| Metric | Value |
|--------|-------|
| Decode throughput | 385 tok/s (CUDAGraph) |
| Decode bandwidth | 771 GB/s (43% of peak) |
| Prefill throughput | 15K+ tok/s (FlashInfer inline dequant) |
| KV cache compression | 4x (TQ4) to 8x (TQ2) |
| Max context tested | 250K tokens (needle-in-haystack passing) |
| Quality (TQ4) | cosine similarity > 0.999 vs bf16 |

## Quick Start

```bash
pip install -e .
```

```python
from turbokv import TurboKVCodec

codec = TurboKVCodec(head_dim=128, bit_width=4, device="cuda")

# Compress
k_packed, k_norms = codec.compress_k(key_vectors)   # bf16 -> 4-bit packed
v_packed, v_norms = codec.compress_v(value_vectors)

# Decompress
k_recon = codec.decompress_k(k_packed, k_norms)     # 4-bit packed -> bf16
```

For serving with vLLM, see the [vllm-fork README](https://github.com/arbi-dev/vllm/tree/turbokv-kv-cache).

## Project Structure

```
turbokv/
    codec.py                 # TurboKVCodec API
    core.py                  # Codebook, rotation, bit packing primitives
    calibrate_online.py      # Lloyd's per-layer optimization
    calibrate_static.py      # Scale/shift calibration
    page_metadata.py         # CUDAGraph-safe page metadata (CUDA + Triton)
    kernels/
        triton_kernels.py    # Triton compress/decompress (2-8 bit)
        cuda/
            include/turbokv/   # Header-only C++ library
            standalone_kernels/   # Fused decode kernels (warpspec, wmma)
shims/
    flashinfer/              # Forwarding headers for flashinfer-fork
    vllm/                    # Forwarding imports for vllm-fork
scripts/
    calibrate_model.py       # Generate per-layer calibration from model weights
    gen_codebooks.py         # Pre-compute codebook tables
tests/
    test_codec.py            # Roundtrip, rotation orthogonality
    test_kernels.py          # CUDA + Triton kernel correctness
```

## Supported Configurations

| Bit Width | Centroids | Compression | Use Case |
|-----------|-----------|-------------|----------|
| TQ2 | 4 | 8x | Maximum compression, slight quality trade-off |
| TQ3 | 8 | 5.3x | Balanced |
| TQ4 | 16 | 4x | Recommended default, negligible quality loss |

Supported head dimensions: 64, 128, 256, 512 (zero-padded to next power of 2 for WHT).

## License

Apache 2.0
