Metadata-Version: 2.4
Name: mlx-optiq
Version: 0.0.7
Summary: Mixed-precision quantization optimizer for MLX models on Apple Silicon
Author: Thin Signal
License: MIT
Project-URL: Models, https://huggingface.co/collections/mlx-community
Keywords: mlx,quantization,mixed-precision,apple-silicon,llm,kv-cache
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: MacOS
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.11
Description-Content-Type: text/markdown
Requires-Dist: mlx>=0.20
Requires-Dist: mlx-lm>=0.20
Requires-Dist: numpy
Requires-Dist: scipy
Requires-Dist: huggingface-hub
Provides-Extra: convert
Requires-Dist: torch>=2.0; extra == "convert"
Requires-Dist: transformers>=4.40; extra == "convert"
Requires-Dist: safetensors; extra == "convert"
Requires-Dist: tqdm; extra == "convert"
Requires-Dist: datasets; extra == "convert"
Provides-Extra: yolo
Requires-Dist: yolo-mlx>=0.2; extra == "yolo"
Requires-Dist: pillow; extra == "yolo"
Provides-Extra: vlm
Requires-Dist: mlx-vlm>=0.3; extra == "vlm"
Requires-Dist: pillow; extra == "vlm"
Provides-Extra: audio
Requires-Dist: mlx-whisper>=0.4; extra == "audio"
Provides-Extra: cli
Requires-Dist: click>=8.0; extra == "cli"
Requires-Dist: psutil; extra == "cli"
Provides-Extra: all
Requires-Dist: torch>=2.0; extra == "all"
Requires-Dist: transformers>=4.40; extra == "all"
Requires-Dist: safetensors; extra == "all"
Requires-Dist: tqdm; extra == "all"
Requires-Dist: datasets; extra == "all"
Requires-Dist: mlx-vlm>=0.3; extra == "all"
Requires-Dist: mlx-whisper>=0.4; extra == "all"
Requires-Dist: click>=8.0; extra == "all"
Requires-Dist: psutil; extra == "all"
Requires-Dist: pillow; extra == "all"

# mlx-optiq

Mixed-precision quantization optimizer for MLX models on Apple Silicon.

**Website:** https://mlx-optiq.pages.dev/ &nbsp;|&nbsp; **PyPI:** https://pypi.org/project/mlx-optiq/

OptIQ turns "uniform 4-bit" into a data-driven, per-layer budget. Sensitive layers stay at 8-bit; the rest get 4-bit. The same per-layer sensitivity signal runs across the full deployment stack:

- **weight quantization** (`optiq convert`)
- **KV-cache quantization** at serving time (`optiq kv-cache`, `optiq serve`)
- **TurboQuant** rotated-space state compression (`optiq.core.turbo_kv_cache`)
- **unused-component stripping** — for multi-modal base models, drop vision / audio metadata when the target is text-only
- **sensitivity-aware LoRA fine-tuning** (new in v0.0.7) — layers OptIQ identified as sensitive get higher adapter rank than robust ones, since they benefit more from adaptation capacity
- **OpenAI-compatible server** with optional LoRA adapter loading direct from a HuggingFace repo id

Everything ships behind `optiq *` subcommands and drops into stock `mlx-lm` at serve time.

## Install

```bash
pip install mlx-optiq
```

## What you get

| | What it does | Where |
|---|---|---|
| **`optiq convert`** | Per-layer sensitivity analysis + mixed-precision weight quantization. For multi-modal base models, auto-strips unused vision/audio metadata to produce a clean text-only OptIQ variant. | [Experiments &rarr;](https://mlx-optiq.pages.dev/experiments.html) |
| **`optiq kv-cache`** | Per-layer KV-cache sensitivity. Writes `kv_config.json` with per-layer bit-widths. | [Experiments &rarr;](https://mlx-optiq.pages.dev/experiments.html) |
| **`optiq serve`** | OpenAI-compatible HTTP server with mixed-precision KV + `--adapter` flag that accepts a HuggingFace LoRA repo id directly. Drop-in `mlx_lm.server` replacement. | [Results &rarr;](https://mlx-optiq.pages.dev/results.html) |
| **`optiq lora train`** _(new in v0.0.7)_ | Sensitivity-aware LoRA fine-tuning on OptIQ models. Reads per-layer bit assignments from `optiq_metadata.json` and scales LoRA rank by layer sensitivity. PEFT-compatible adapter output. | README ↓ |
| **`optiq.core.turbo_kv_cache`** | TurboQuant rotated-space KV (library). Research path for attention-inner-product-preserving quantization. | [Experiments &rarr;](https://mlx-optiq.pages.dev/experiments.html) |

Pre-built OptIQ-quantized models on HuggingFace: [Models &rarr;](https://mlx-optiq.pages.dev/models.html)

## Quickstart

**Use a pre-built model (stock mlx-lm, no OptIQ code required):**

```python
from mlx_lm import load, generate
model, tok = load("mlx-community/Qwen3.5-9B-OptiQ-4bit")
out = generate(model, tok, prompt="Hello", max_tokens=100)
```

**Serve with mixed-precision KV (new in v0.0.5):**

```bash
# One-time sensitivity analysis
optiq kv-cache mlx-community/Qwen3.5-9B-OptiQ-4bit \
  --target-bits 4.5 -o optiq_output/kv_cache/qwen35_9b

# OpenAI-compatible server on :8080
optiq serve \
  --kv-config optiq_output/kv_cache/qwen35_9b/kv_config.json \
  --model mlx-community/Qwen3.5-9B-OptiQ-4bit \
  --max-tokens 32768 --temp 0.6 --top-p 0.95 --top-k 20
```

**Convert a new model:**

```bash
pip install "mlx-optiq[convert]"
optiq convert Qwen/Qwen3-0.6B-base --target-bpw 4.5 --candidate-bits 4,8
optiq eval ./optiq_model --task gsm8k --baseline ./uniform_4bit
```

**Sensitivity-aware LoRA fine-tuning (new in v0.0.7):**

```bash
# Train a LoRA adapter with per-layer rank derived from OptIQ's
# sensitivity measurements (by_bits scaling: 8-bit layers get 2× rank)
optiq lora train mlx-community/Qwen3.5-9B-OptiQ-4bit \
    --data ./my_training_data \
    --rank 8 --rank-scaling by_bits \
    --iters 1000 -o ./my_adapter

# Inspect what was adapted
optiq lora info ./my_adapter

# Serve with the adapter applied
optiq serve --model mlx-community/Qwen3.5-9B-OptiQ-4bit \
            --adapter ./my_adapter \
            --kv-config optiq_output/kv_cache/qwen35_9b/kv_config.json

# Or serve with a community adapter direct from HF (auto-downloads)
optiq serve --model mlx-community/Qwen3.5-9B-OptiQ-4bit \
            --adapter codelion/my-agent-lora
```

The adapter is saved in PEFT-compatible format (`adapter_config.json` + `adapters.safetensors`) plus an OptIQ sidecar (`optiq_lora_config.json`) that records the per-layer rank distribution. You can load it with any tool that speaks PEFT.
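
For a rough picture of how that per-layer rank distribution could be derived under `by_bits` scaling (the metadata filename comes from the feature table above, but the key layout inside it is a guess, not the actual schema):

```python
import json

def lora_ranks(metadata_path="optiq_metadata.json", base_rank=8):
    """by_bits scaling sketch: layers OptIQ kept at 8-bit are treated as more
    sensitive and get 2x the base adapter rank; 4-bit layers keep the base rank.
    The "layer_bits" key below is an assumed layout, not the real schema.
    """
    with open(metadata_path) as f:
        layer_bits = json.load(f)["layer_bits"]  # e.g. {"model.layers.3.self_attn.q_proj": 8, ...}
    return {name: base_rank * 2 if bits == 8 else base_rank
            for name, bits in layer_bits.items()}
```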

## Headline numbers

**Weight quantization** — GSM8K vs uniform 4-bit:
- Qwen3.5-0.8B: 27.0% vs 11.5% (**+15.5pp**)
- gemma-4-e4b-it: 55.5% vs 23.5% (**+32.0pp**)

**KV-cache serving** — decode tok/s at 64k context (Apple M3 Max 36GB):
- Qwen3.5-2B: 41.8 vs fp16 27.9 (**+50%**)
- Qwen3.5-4B: 13.1 vs fp16 8.1 (**+62%**)
- Qwen3.5-9B: 27.1 vs fp16 20.7 (**+31%**)

Full tables, methodology, and per-layer configs on the [Results page](https://mlx-optiq.pages.dev/results.html).

## How it works

**Weight quantization pipeline:**
1. Load PyTorch model from HuggingFace.
2. Per-layer KL-divergence sensitivity on calibration data (WikiText-2 for LLMs).
3. Greedy knapsack: start every layer at the minimum bit-width, then upgrade layers in order of KL reduction per extra bit until the BPW budget is spent (sketched below). Sensitive layers such as `lm_head`, `embed_tokens`, and the first/last blocks are protected at the maximum bit-width.
4. MLX conversion via `mlx-lm` with per-layer `quant_predicate`.
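
To make step 3 concrete, here is a minimal sketch of a greedy budget allocator with invented layer names and sensitivity numbers; the actual cost model and protected-layer list live inside `optiq convert`:

```python
def allocate_bits(kl_gain, sizes, target_bpw, low=4, high=8, protected=()):
    """Greedy knapsack sketch: every layer starts at `low` bits, then the
    remaining bit budget is spent on the layers with the largest KL reduction
    per extra bit. kl_gain[name] is the KL reduction from upgrading that layer
    low -> high; sizes[name] is its parameter count.
    """
    bits = {n: (high if n in protected else low) for n in kl_gain}
    total = sum(sizes.values())
    budget = target_bpw * total - sum(bits[n] * sizes[n] for n in bits)

    # Best bang for the buck first: KL reduction per extra bit spent.
    for n in sorted((n for n in kl_gain if bits[n] == low),
                    key=lambda n: kl_gain[n] / ((high - low) * sizes[n]),
                    reverse=True):
        cost = (high - low) * sizes[n]
        if cost <= budget:
            bits[n], budget = high, budget - cost
    return bits

# Toy example with made-up sensitivities (KL reduction) and layer sizes.
bits = allocate_bits(
    kl_gain={"layers.0.mlp": 0.02, "layers.1.mlp": 0.30, "layers.2.mlp": 0.05},
    sizes={"layers.0.mlp": 1_000_000, "layers.1.mlp": 1_000_000, "layers.2.mlp": 1_000_000},
    target_bpw=5.5,
)
print(bits)  # the most sensitive layer gets 8-bit, the rest stay at 4-bit
```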

**KV-cache serving pipeline:**
1. `optiq kv-cache` runs the same sensitivity analysis, but on KV quantization: for each full-attention layer, it measures the KL divergence when that layer's KV is quantized to each candidate bit-width.
2. `optiq serve` monkey-patches `mlx_lm.server`'s generation loop to use `mlx_lm.models.cache.QuantizedKVCache` at per-layer bit-widths (via `maybe_quantize_kv_cache`). The patched hook replaces mlx-lm's uniform `kv_bits` with a per-layer dict (sketched below).
3. At SDPA time, `mx.quantized_matmul` reads packed KV directly — no fp16 materialization. On Apple Silicon, the 8-bit kernel path is faster than the 4-bit one, so protecting a single sensitive layer at 8-bit gives both quality preservation *and* a decode speedup.
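
For intuition, a rough sketch of the per-layer hook from step 2, replacing mlx-lm's single `kv_bits` value with a `{layer_index: bits}` dict. The cache attributes and the `to_quantized(...)` call below reflect my reading of `mlx_lm.models.cache` and should be treated as assumptions rather than the exact patched code:

```python
from mlx_lm.models.cache import KVCache, QuantizedKVCache

def quantize_kv_per_layer(prompt_cache, bits_by_layer, group_size=64, start=1024):
    """Swap each eligible layer's fp16 KVCache for a QuantizedKVCache at that
    layer's assigned bit-width once the cache has grown past `start` tokens.
    Layers missing from bits_by_layer (e.g. linear-attention layers with no
    KV cache) are left untouched.
    """
    for i, c in enumerate(prompt_cache):
        bits = bits_by_layer.get(i)
        if bits is None or isinstance(c, QuantizedKVCache):
            continue
        if isinstance(c, KVCache) and c.offset >= start:
            prompt_cache[i] = c.to_quantized(group_size=group_size, bits=bits)
```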

**TurboQuant KV** (research path — not default in `optiq serve`):
1. Random orthogonal rotation makes coordinates near-independent.
2. Optimal Lloyd-Max scalar quantization per coordinate.
3. Rotated-space attention: rotate Q once and the output once, and work in centroid space in between. O(d²) fixed cost vs O(seq_len × d²) for naïve rotate-then-dequantize. (Steps 1–2 are sketched below.)
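
A self-contained numpy illustration of steps 1–2, with plain uniform scalar quantization standing in for the Lloyd-Max quantizer and random data standing in for real keys; it is only a sketch of the idea, not the `turbo_kv_cache` implementation:

```python
import numpy as np

def random_rotation(d, seed=0):
    # QR of a Gaussian matrix yields a uniformly random orthogonal matrix.
    q, r = np.linalg.qr(np.random.default_rng(seed).standard_normal((d, d)))
    return q * np.sign(np.diag(r))  # sign fix keeps the distribution uniform

def quantize(x, bits=4):
    # Per-coordinate uniform scalar quantization (Lloyd-Max stand-in).
    lo, hi = x.min(axis=0), x.max(axis=0)
    scale = np.maximum(hi - lo, 1e-8) / (2 ** bits - 1)
    return np.round((x - lo) / scale).astype(np.uint8), scale, lo

d = 128
R = random_rotation(d)
keys = np.random.default_rng(1).standard_normal((1000, d)).astype(np.float32)
rot = keys @ R                        # 1. rotate: coordinates become near-independent
codes, scale, lo = quantize(rot)      # 2. scalar-quantize per coordinate
approx = (codes * scale + lo) @ R.T   # dequantize + rotate back (check only;
                                      #    step 3 avoids this per-token work)
print("max abs error:", float(np.abs(approx - keys).max()))
```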

## Hybrid-attention note

Qwen3.5 interleaves **linear-attention** (GatedDeltaNet) and **full-attention** layers in a 3:1 ratio, so only 1 in 4 layers has a KV cache. `optiq kv-cache` automatically skips the linear layers. On Qwen3.5-4B/9B, this means 8 of 32 layers get per-layer KV bit assignments; the typical output is `7 @ 4-bit + 1 @ 8-bit`, protecting layer 3 (the first full-attention layer).
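
For illustration only, the per-layer assignment described above might look like the dict below (the actual `kv_config.json` schema may differ):

```python
# Hypothetical result for a 32-layer Qwen3.5 model: only the 8 full-attention
# layers carry a KV cache; layer 3, the first full-attention layer, is
# protected at 8-bit and the remaining 7 stay at 4-bit.
kv_bits_by_layer = {3: 8, 7: 4, 11: 4, 15: 4, 19: 4, 23: 4, 27: 4, 31: 4}
```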

## Status / roadmap

- ✅ Weight quantization: production
- ✅ KV cache serving (Qwen3.5): production in v0.0.5
- 🚧 Gemma-4 KV serving: blocked on upstream `mlx-lm` shared-KV attention not supporting `QuantizedKVCache`
- 🔬 TurboQuant serving with a fused Metal kernel: research

## Article

[Not All Layers Are Equal: Mixed-Precision Quantization for Weights and KV Cache on Apple Silicon](https://x.com/thin_signal/status/2028412948167942334)

## Requirements

- Python ≥ 3.11
- Apple Silicon Mac (for MLX)
- `mlx-lm ≥ 0.20`

## License

MIT
