Metadata-Version: 2.4
Name: turboquant-py
Version: 0.1.0
Summary: Python implementation of the TurboQuant and QJL vector quantization algorithms
Author: msilverblatt
License-Expression: Apache-2.0
Project-URL: Homepage, https://github.com/msilverblatt/turboquant-py
Project-URL: Repository, https://github.com/msilverblatt/turboquant-py
Project-URL: Issues, https://github.com/msilverblatt/turboquant-py/issues
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Typing :: Typed
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.24
Requires-Dist: scipy>=1.10
Provides-Extra: torch
Requires-Dist: torch>=2.0; extra == "torch"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-benchmark>=4.0; extra == "dev"
Requires-Dist: ruff>=0.4; extra == "dev"
Provides-Extra: examples
Requires-Dist: scikit-learn>=1.7; extra == "examples"
Requires-Dist: sentence-transformers>=5.0; extra == "examples"
Dynamic: license-file

# turboquant-py

**Python implementation of the TurboQuant and QJL vector quantization algorithms.**

turboquant-py implements the TurboQuant and QJL vector quantization algorithms from Google Research (ICLR 2026 / AISTATS 2026). It compresses high-dimensional floating-point vectors to 1-4 bits per coordinate while preserving inner products and distances with provably near-optimal distortion. The library offers two quantization modes — MSE mode for reconstruction fidelity and inner-product mode for unbiased similarity search — plus a standalone 1-bit QJL quantizer, all built on a NumPy-first core with optional PyTorch acceleration.

## Installation

```bash
pip install turboquant-py
# With PyTorch acceleration (optional — supports CUDA and Apple Silicon MPS)
pip install "turboquant-py[torch]"

# From source
pip install -e .
pip install -e ".[torch]"
```

## Quick Start

### TurboQuant MSE mode

```python
import numpy as np
from turboquant import TurboQuant

vectors = np.random.randn(1000, 384)  # 1000 vectors, dim=384
tq = TurboQuant(dim=384, bit_width=2, mode="mse", seed=42)

compressed = tq.quantize(vectors)
reconstructed = tq.dequantize(compressed)

mse = float(np.mean((vectors - reconstructed) ** 2))
print(f"Reconstruction MSE: {mse:.6f}")

# Save and reload
compressed.save("my_index")
from turboquant import CompressedVectors
reloaded = CompressedVectors.load("my_index")
```

### TurboQuant inner product mode

```python
import numpy as np
from turboquant import TurboQuant

db = np.random.randn(10000, 768)
query = np.random.randn(768)

tq = TurboQuant(dim=768, bit_width=3, seed=42)  # inner_product mode is the default
compressed = tq.quantize(db)

# Estimate inner products against all compressed database vectors
scores = tq.inner_product(query, compressed)
top10 = np.argsort(scores)[::-1][:10]
print(f"Top-10 indices: {top10}")
```

### QJL 1-bit quantization

```python
import numpy as np
from turboquant import QJL

db = np.random.randn(10000, 1536)
query = np.random.randn(1536)

qjl = QJL(dim=1536, seed=42)
compressed = qjl.quantize(db)

scores = qjl.inner_product(query, compressed)
top10 = np.argsort(scores)[::-1][:10]
print(f"Top-10 indices: {top10}")
```

## API Reference

### `TurboQuant(dim, bit_width, mode, seed, outlier_channels, outlier_bit_width)`

Multi-bit vector quantizer supporting MSE and inner-product modes.

| Parameter | Type | Default | Description |
|---|---|---|---|
| `dim` | `int` | required | Input vector dimensionality |
| `bit_width` | `int` | required | Bits per coordinate (1-4; inner-product mode requires 2+) |
| `mode` | `str` | `"inner_product"` | `"mse"` for reconstruction quality; `"inner_product"` for unbiased similarity search |
| `seed` | `int \| None` | `None` | Random seed for the rotation matrix and QJL projection |
| `outlier_channels` | `int` | `0` | Number of high-magnitude channels to quantize at higher precision |
| `outlier_bit_width` | `int \| None` | `None` | Bit-width for outlier channels |

**Methods:**
- `quantize(vectors)` — compress an `(n, dim)` array; returns `CompressedVectors`
- `dequantize(compressed)` — reconstruct approximate originals; returns `(n, dim)` array
- `inner_product(query, compressed)` — estimate inner products; returns `(n,)` scores
- `quantize_batched(vectors, batch_size, output_path, entropy_encode)` — stream-quantize large collections to disk

---

### `QJL(dim, projection_dim, seed)`

1-bit quantizer using the Quantized Johnson-Lindenstrauss transform. Applies a random Gaussian projection followed by sign-bit quantization to produce an unbiased inner-product estimator with no reconstruction capability.

| Parameter | Type | Default | Description |
|---|---|---|---|
| `dim` | `int` | required | Input vector dimensionality |
| `projection_dim` | `int \| None` | `dim` | Projection dimension (must be <= dim); higher values improve accuracy at the cost of storage |
| `seed` | `int \| None` | `None` | Random seed for the projection matrix |

**Methods:**
- `quantize(vectors)` — compress to 1-bit sign representation; returns `CompressedVectors`
- `inner_product(query, compressed)` — estimate inner products; returns `(n,)` scores

---

### `CompressedVectors`

In-memory container for quantized vectors. Holds the bit-packed quantization indices, per-vector L2 norms, and any auxiliary arrays (QJL signs, residual norms for inner-product mode). Supports slicing with `[start:end]` and merging with `CompressedVectors.concatenate(parts)`.

**Save/load:**
```python
compressed.save("path/to/dir")                          # bit-packed on disk
compressed.save("path/to/dir", entropy_encode=True)     # Huffman-encoded (~5% smaller at 4-bit)
loaded = CompressedVectors.load("path/to/dir")
loaded = CompressedVectors.load("path/to/dir", mmap_mode="r")  # memory-mapped
```

**Key attributes:** `indices`, `norms`, `dim`, `bit_width`, `num_vectors`, `metadata`, `extra_arrays`

---

### `CompressedStore`

On-disk vector store backed by memory-mapped arrays. Reconstructs the original quantizer from saved metadata (using the stored seed to regenerate rotation/projection matrices) and supports brute-force top-k search without loading all vectors into RAM.

```python
store = CompressedStore.load("path/to/dir")
results = store.search(query, k=10)  # returns list[tuple[int, float]], sorted descending by score
```

Supports all modes: MSE, inner-product, QJL, and outlier configurations are all reconstructed automatically from the saved metadata.

**Properties:** `dim`, `num_vectors`, `bit_width`, `mode`, `vectors`

---

### `compute_codebook(dim, bit_width)` / `get_codebook(dim, bit_width)`

`compute_codebook` runs Lloyd-Max optimization on the Beta distribution that describes coordinates of randomly rotated unit vectors, returning `(centroids, boundaries)` arrays of sizes `2^bit_width` and `2^bit_width + 1`.

`get_codebook` is an `lru_cache`-wrapped convenience wrapper around `compute_codebook`. Precomputed codebooks are shipped for common dimensions (64-4096); unsupported dimensions are computed on-the-fly.

---

### `compute_theoretical_savings(dim, bit_width)`

Compute the theoretical entropy and Huffman coding savings for a given configuration, as described in Section 3.1 of the TurboQuant paper.

```python
from turboquant import compute_theoretical_savings
savings = compute_theoretical_savings(dim=256, bit_width=4)
# {'entropy': 3.742, 'avg_bits_huffman': 3.779, 'savings_pct': 5.5}
```

## Storage

On-disk format uses a directory containing:

| File | Contents |
|------|----------|
| `meta.json` | Dimensions, bit-width, mode, seed, outlier config, encoding flags |
| `indices.npy` | Bit-packed quantization indices (or `indices.huffman` if entropy-encoded) |
| `norms.npy` | Per-vector L2 norms |
| `huffman_table.json` | Huffman coding table (only when entropy-encoded) |
| Additional `.npy` files | QJL sign vectors, residual norms, outlier indices (mode-dependent) |

Rotation and projection matrices are not stored — they are reconstructed deterministically from the saved seed, reducing storage overhead.

### Entropy encoding

Indices can optionally be Huffman-encoded when saving, providing lossless compression. Savings are measured vs fixed-width encoding (not vs bit-packing):

| Bit-width | Shannon entropy | Huffman avg bits | Savings |
|-----------|----------------|-----------------|---------|
| 1 | 1.000 | 1.000 | 0.0% |
| 2 | 1.911 | 1.989 | 0.5% |
| 3 | 2.819 | 2.876 | 4.1% |
| 4 | 3.742 | 3.779 | 5.5% |

```python
compressed.save("my_index", entropy_encode=True)
# Loading is automatic — detects encoding from metadata
loaded = CompressedVectors.load("my_index")
```

## Supported Bit-Widths

Results on synthetic unit vectors (dim=768, n=1000).

| Bit-width | Compression ratio | Reconstruction MSE | Recall@1 (MSE mode) | Recall@10 (MSE mode) |
|---|---|---|---|---|
| 1 | 32x | 0.000473 | 0.19 | 0.65 |
| 2 | 16x | 0.000153 | 0.46 | 0.91 |
| 3 | 10.7x | 0.000045 | 0.67 | 0.99 |
| 4 | 8x | 0.000013 | 0.73 | 1.00 |

## Benchmarks

All results use `all-MiniLM-L6-v2` embeddings (dim=384, n=2000 database, n=200 queries, seed=42).

### MSE distortion and recall on neural embeddings

| Method | Bit-width | MSE | Recall@1 | Recall@10 |
|---|---|---|---|---|
| NaiveUniform | 2 | 0.001079 | 0.675 | 0.721 |
| NaiveUniform | 3 | 0.000195 | 0.830 | 0.850 |
| NaiveUniform | 4 | 0.000042 | 0.895 | 0.913 |
| TurboQuant-mse | 2 | 0.000305 | 0.755 | 0.817 |
| TurboQuant-mse | 3 | 0.000090 | 0.895 | 0.878 |
| TurboQuant-mse | 4 | 0.000025 | 0.895 | 0.918 |
| TurboQuant-inner_product | 2 | 0.000946 | 0.605 | 0.713 |
| TurboQuant-inner_product | 3 | 0.000305 | 0.760 | 0.820 |
| TurboQuant-inner_product | 4 | 0.000090 | 0.895 | 0.881 |
| QJL (1-bit) | 1-4 | — | 0.580 | 0.703 |

TurboQuant-mse achieves 3.5x lower MSE than naive uniform quantization at 2 bits. At 4 bits, TurboQuant-inner_product matches QJL recall while storing structured residuals that enable better approximation.

### Comparison vs. uniform quantization at 2 bits (dim=384)

| Method | MSE | Recall@1 | Recall@10 |
|---|---|---|---|
| NaiveUniform | 0.000856 | 0.30 | 0.80 |
| PerChannelUniform | 0.001032 | 0.29 | 0.82 |
| RandProj+Uniform | 0.001027 | 0.28 | 0.76 |
| TurboQuant-mse | 0.000304 | 0.44 | 0.95 |

TurboQuant-mse reduces MSE by 2.8x over NaiveUniform and improves Recall@10 from 80% to 95% at the same 2-bit budget.

## Acceleration

All core operations are implemented in NumPy. When PyTorch is installed, matrix operations (rotation, projection, batch inner products) dispatch to PyTorch tensors automatically.

| Backend | Device detection | Notes |
|---------|-----------------|-------|
| NumPy (default) | Always available | Fastest on CPU for most workloads |
| PyTorch + CUDA | Auto-detected | Best for large-scale GPU workloads |
| PyTorch + MPS | Auto-detected | Apple Silicon GPU; uses float32 internally (MPS does not support float64) |

On CPU, NumPy and PyTorch perform similarly. GPU acceleration benefits primarily the matrix operations (rotation, projection); the scalar quantization is already vectorized in NumPy.

## How It Works

**TurboQuant MSE mode** applies a random orthogonal rotation to each input vector before scalar quantization. Because coordinates of a randomly rotated unit vector follow a known Beta distribution, a precomputed Lloyd-Max codebook can be derived analytically for that distribution rather than estimated from data. Lloyd-Max quantization is provably optimal for a fixed scalar quantizer, and the rotation ensures coordinates match the distribution the codebook was designed for. Rotating back after dequantization reconstructs the original vector with near-optimal MSE.

**TurboQuant inner-product mode** extends MSE quantization to produce an unbiased inner-product estimator. It quantizes each vector at `(b-1)` bits using the MSE codebook, computes the residual between the original and the MSE reconstruction, and then applies the QJL transform (random Gaussian projection followed by sign extraction) to that residual. The stored representation consists of the MSE quantization indices plus the sign bits and L2 norm of the residual. At query time, the inner product estimate combines the MSE dot product and a QJL correction term scaled by `sqrt(pi/2) / d`, which exactly cancels the bias introduced by the MSE quantizer. Using one bit for QJL on the residual and `(b-1)` bits for MSE thus achieves better inner-product accuracy than spending all `b` bits on MSE quantization alone.

**QJL (Quantized Johnson-Lindenstrauss)** is a 1-bit scheme that projects each key vector through a random Gaussian matrix `S` and stores only the sign vector `sign(S·k)` along with the vector norm. For a query `q`, the inner product is estimated as `sqrt(pi/2) / m * ||k|| * <S·q, sign(S·k)>`, where `m` is the projection dimension. This estimator is unbiased and requires only 1 bit per projected coordinate, making it suitable for extreme compression where reconstruction is not needed.

## References

- **TurboQuant:** "TurboQuant: Redefining AI Efficiency with Extreme Compression" — [arXiv:2504.19874](https://arxiv.org/abs/2504.19874)
- **QJL:** Zandieh et al., "QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead" (AISTATS 2026) — [arXiv:2406.03482](https://arxiv.org/abs/2406.03482)
- **PolarQuant:** [arXiv:2502.02617](https://arxiv.org/abs/2502.02617)
