Metadata-Version: 2.4
Name: mlx-optiq
Version: 0.0.3
Summary: Mixed-precision quantization optimizer for MLX models on Apple Silicon
Author: Thin Signal
License: MIT
Project-URL: Models, https://huggingface.co/collections/mlx-community
Keywords: mlx,quantization,mixed-precision,apple-silicon,llm,kv-cache
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: MacOS
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.11
Description-Content-Type: text/markdown
Requires-Dist: mlx>=0.20
Requires-Dist: mlx-lm>=0.20
Requires-Dist: numpy
Requires-Dist: scipy
Requires-Dist: huggingface-hub
Provides-Extra: convert
Requires-Dist: torch>=2.0; extra == "convert"
Requires-Dist: transformers>=4.40; extra == "convert"
Requires-Dist: safetensors; extra == "convert"
Requires-Dist: tqdm; extra == "convert"
Requires-Dist: datasets; extra == "convert"
Provides-Extra: yolo
Requires-Dist: yolo-mlx>=0.2; extra == "yolo"
Requires-Dist: pillow; extra == "yolo"
Provides-Extra: vlm
Requires-Dist: mlx-vlm>=0.3; extra == "vlm"
Requires-Dist: pillow; extra == "vlm"
Provides-Extra: audio
Requires-Dist: mlx-whisper>=0.4; extra == "audio"
Provides-Extra: cli
Requires-Dist: click>=8.0; extra == "cli"
Requires-Dist: psutil; extra == "cli"
Provides-Extra: all
Requires-Dist: torch>=2.0; extra == "all"
Requires-Dist: transformers>=4.40; extra == "all"
Requires-Dist: safetensors; extra == "all"
Requires-Dist: tqdm; extra == "all"
Requires-Dist: datasets; extra == "all"
Requires-Dist: mlx-vlm>=0.3; extra == "all"
Requires-Dist: mlx-whisper>=0.4; extra == "all"
Requires-Dist: click>=8.0; extra == "all"
Requires-Dist: psutil; extra == "all"
Requires-Dist: pillow; extra == "all"

# mlx-optiq

Mixed-precision quantization optimizer for MLX models on Apple Silicon.

OptiQ produces **better quantized models** that [mlx-lm](https://github.com/ml-explore/mlx-examples/tree/main/llms/mlx_lm) loads natively. It also provides **TurboQuant KV cache** — rotation-based vector quantization that preserves attention inner products better than standard affine quantization.

## Install

```bash
pip install mlx-optiq
```

## What It Does

### 1. Mixed-Precision Weight Quantization

Instead of uniform quantization (all layers at 4-bit), OptiQ measures each layer's sensitivity via KL divergence and assigns optimal per-layer bit-widths.
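
The assignment is recorded in the model's config so `mlx-lm` can rebuild the exact layout at load time. Roughly what the result looks like, written here as a Python dict mirroring the `quantization` section of a saved `config.json` (layer names and values are illustrative, not from a real model):

```python
# Hypothetical per-layer assignment: a 4-bit default, with layers the
# sensitivity scan flags as fragile promoted to 8-bit.
quantization = {
    "group_size": 64,
    "bits": 4,
    "model.layers.0.self_attn.v_proj": {"group_size": 64, "bits": 8},
    "model.layers.23.mlp.down_proj": {"group_size": 64, "bits": 8},
}
```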

**Qwen3.5-0.8B** (GSM8K, 200 samples):

| Model | GSM8K | Size |
|---|---|---|
| **OptiQ mixed (4.5 BPW)** | **27.0%** | 570 MB |
| Uniform 4-bit | 11.5% | 404 MB |

2.3x the accuracy for a ~40% larger file. Models work with standard `mlx-lm` — no special code needed:

```python
from mlx_lm import load, generate
model, tokenizer = load("mlx-community/Qwen3.5-0.8B-OptiQ-4bit")
response = generate(model, tokenizer, prompt="Hello", max_tokens=100)
```

### 2. TurboQuant KV Cache

Implements rotation-based vector quantization from [TurboQuant](https://arxiv.org/abs/2504.19874) (ICLR 2026) for KV cache compression. A random orthogonal rotation before scalar quantization preserves the inner products that attention's Q·K^T computation needs: an orthogonal R leaves every inner product unchanged ((Rq)·(Rk) = q·k) while spreading each vector's energy evenly across coordinates, so per-coordinate quantization error stays small and uniform.

**Rotated-space attention** eliminates per-key rotation overhead — only the query and output are rotated once each (O(d²)), while all stored keys/values are accessed via cheap centroid lookups (O(d)). Result: **near-zero speed overhead**.
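
A minimal single-head sketch of the idea (shapes `[tokens, head_dim]`; `dequant`, a centroid-table lookup, is a hypothetical helper, not this package's API):

```python
import mlx.core as mx

def rotated_attention(q, k_codes, v_codes, R, dequant, scale):
    """Sketch of rotated-space attention. Because R is orthogonal,
    (q @ R) @ (k @ R).T == q @ k.T, so scores match the unrotated
    computation up to quantization error."""
    q_rot = q @ R               # rotate the query once per step: O(d^2)
    K = dequant(k_codes)        # stored keys: cheap centroid lookup, O(d) each
    V = dequant(v_codes)        # values also stay in rotated space
    attn = mx.softmax((q_rot @ K.T) * scale, axis=-1)
    return (attn @ V) @ R.T     # rotate the output back once: O(d^2)
```

The rotation cost is paid only twice per step, independent of context length, which is why the measured overhead stays near zero.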

**Qwen3.5-0.8B** (6 self-attention layers):

| Method | PPL | Needle Retrieval | Speed vs FP16 |
|---|---|---|---|
| FP16 KV (reference) | 22.50 | 73% | baseline |
| Affine 4-bit KV | 22.98 | 80% | -0% |
| **TurboQuant 4-bit KV** | **22.87** | **100%** | **-2%** |
| **TurboQuant 3-bit KV** | **23.66** | **100%** | **+4%** |

- TurboQuant 4-bit beats affine on both PPL (+0.37 vs +0.48 over FP16) and needle retrieval (100% vs 80%)
- TurboQuant enables a 3-bit KV cache where affine quantization can't be used (head_dim=256 is incompatible with its 3-bit packing)
- GSM8K reasoning is preserved: TurboQuant 4-bit scores 32% vs FP16's 30% (50-sample test)

```python
from mlx_lm import load, generate
from optiq.core.turbo_kv_cache import TurboQuantKVCache, patch_attention

model, tokenizer = load("mlx-community/Qwen3.5-0.8B-OptiQ-4bit")
patch_attention()  # Install rotated-space attention (once)

# Replace self-attention KV caches with TurboQuant
cache = model.make_cache()
for i, layer in enumerate(model.layers):
    if hasattr(layer, "self_attn"):
        cache[i] = TurboQuantKVCache(
            head_dim=layer.self_attn.head_dim, bits=4, seed=42+i
        )

# Use as normal
response = generate(model, tokenizer, prompt="Hello", max_tokens=100, prompt_cache=cache)
```

## Pre-built Models

Available on HuggingFace (work with standard `mlx-lm`):

- [Qwen3.5-0.8B-OptiQ-4bit](https://huggingface.co/mlx-community/Qwen3.5-0.8B-OptiQ-4bit)
- [Qwen3.5-2B-OptiQ-4bit](https://huggingface.co/mlx-community/Qwen3.5-2B-OptiQ-4bit)
- [Qwen3.5-4B-OptiQ-4bit](https://huggingface.co/mlx-community/Qwen3.5-4B-OptiQ-4bit)
- [Qwen3.5-9B-OptiQ-4bit](https://huggingface.co/mlx-community/Qwen3.5-9B-OptiQ-4bit)

## Convert Your Own Models

```bash
pip install "mlx-optiq[convert]"

# Mixed-precision quantization
optiq convert Qwen/Qwen3-0.6B-base --target-bpw 4.5 --candidate-bits 4,8

# Evaluate
optiq eval ./optiq_model --task gsm8k --baseline ./uniform_4bit
```

## How It Works

**Weight quantization pipeline:**
1. Load PyTorch model from HuggingFace
2. Per-layer KL divergence sensitivity analysis on calibration data
3. Greedy knapsack optimization to assign bit-widths within the BPW budget (see the sketch after this list)
4. MLX conversion via mlx-lm with custom per-layer quantization config
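
A condensed sketch of steps 2–3 (function and inputs are illustrative, not the package's API): start every layer at the low bit-width, then greedily promote the layers whose measured KL sensitivity per parameter is highest until the BPW budget runs out.

```python
def assign_bits(sensitivity, sizes, target_bpw, candidate_bits=(4, 8)):
    """Greedy knapsack over per-layer bit-widths.

    sensitivity: {layer: KL divergence when that layer is quantized low}
    sizes:       {layer: parameter count}
    """
    lo, hi = min(candidate_bits), max(candidate_bits)
    total = sum(sizes.values())
    bits = {name: lo for name in sizes}             # everything starts low
    spent, budget = lo * total, target_bpw * total  # bits used vs. bit budget

    # Most KL damage per parameter gets promoted first.
    for name in sorted(sizes, key=lambda n: sensitivity[n] / sizes[n], reverse=True):
        cost = (hi - lo) * sizes[name]              # extra bits to go lo -> hi
        if spent + cost <= budget:
            bits[name] = hi
            spent += cost
    return bits
```

The two-level case here corresponds to the `--candidate-bits 4,8` flag in the CLI example above.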

**TurboQuant KV cache:**
1. Random orthogonal rotation makes vector coordinates near-independent
2. Optimal Lloyd-Max scalar quantization per coordinate (sketched after this list)
3. Rotated-space attention: pre-rotate Q, compute SDPA in centroid space, post-rotate output
4. Incremental quantization: only new tokens are processed each step
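
A NumPy sketch of steps 1–2, assuming a precomputed Lloyd-Max centroid table for the near-Gaussian rotated coordinates (helper names are illustrative):

```python
import numpy as np

def random_orthogonal(d, seed):
    """Haar-random orthogonal matrix via QR of a Gaussian matrix."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))    # sign fix so the distribution is uniform

def quantize(x, R, centroids):
    """Rotate (step 1), then snap each coordinate to its nearest centroid
    (step 2). Dequantize with centroids[codes] (rotated space) or
    centroids[codes] @ R.T (original space)."""
    x_rot = x @ R                     # coordinates become near-i.i.d. Gaussian
    codes = np.abs(x_rot[:, None] - centroids[None, :]).argmin(axis=1)
    return codes.astype(np.uint8)     # e.g. 16 centroids -> 4 bits per coordinate
```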

## Architecture Note (Hybrid Models)

Qwen3.5 uses 18 GatedDeltaNet layers (recurrent state) + 6 standard self-attention layers (KV cache). TurboQuant is applied to the KV cache layers only. The recurrent state uses a read-modify-write pattern where quantization errors accumulate — keeping it at FP16 is recommended for generation tasks.

## Article

[Not All Layers Are Equal: Mixed-Precision Quantization for Weights and KV Cache on Apple Silicon](https://x.com/thin_signal/status/2028412948167942334)

## Requirements

- Python >= 3.11
- Apple Silicon Mac (for MLX)
- `mlx >= 0.20`
- `mlx-lm >= 0.20`
