Metadata-Version: 2.4
Name: mlx-optiq
Version: 0.0.10
Summary: Mixed-precision quantization optimizer for MLX models on Apple Silicon
Author: Thin Signal
License: MIT
Project-URL: Models, https://huggingface.co/collections/mlx-community
Keywords: mlx,quantization,mixed-precision,apple-silicon,llm,kv-cache
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: MacOS
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.11
Description-Content-Type: text/markdown
Requires-Dist: mlx>=0.20
Requires-Dist: mlx-lm>=0.20
Requires-Dist: numpy
Requires-Dist: scipy
Requires-Dist: huggingface-hub
Provides-Extra: convert
Requires-Dist: torch>=2.0; extra == "convert"
Requires-Dist: transformers>=4.40; extra == "convert"
Requires-Dist: safetensors; extra == "convert"
Requires-Dist: tqdm; extra == "convert"
Requires-Dist: datasets; extra == "convert"
Provides-Extra: yolo
Requires-Dist: yolo-mlx>=0.2; extra == "yolo"
Requires-Dist: pillow; extra == "yolo"
Provides-Extra: vlm
Requires-Dist: mlx-vlm>=0.3; extra == "vlm"
Requires-Dist: pillow; extra == "vlm"
Provides-Extra: audio
Requires-Dist: mlx-whisper>=0.4; extra == "audio"
Provides-Extra: cli
Requires-Dist: click>=8.0; extra == "cli"
Requires-Dist: psutil; extra == "cli"
Provides-Extra: all
Requires-Dist: torch>=2.0; extra == "all"
Requires-Dist: transformers>=4.40; extra == "all"
Requires-Dist: safetensors; extra == "all"
Requires-Dist: tqdm; extra == "all"
Requires-Dist: datasets; extra == "all"
Requires-Dist: mlx-vlm>=0.3; extra == "all"
Requires-Dist: mlx-whisper>=0.4; extra == "all"
Requires-Dist: click>=8.0; extra == "all"
Requires-Dist: psutil; extra == "all"
Requires-Dist: pillow; extra == "all"

# mlx-optiq

**Optimized deployment of LLMs, VLMs, and vision models on Apple Silicon.**

**Website:** https://mlx-optiq.pages.dev/ &nbsp;|&nbsp; **PyPI:** https://pypi.org/project/mlx-optiq/ &nbsp;|&nbsp; **Models:** https://huggingface.co/models?other=optiq

OptIQ is an optimizing compiler and runtime for MLX. It takes a full-precision model and turns it into the best-performing variant it can for a given memory/latency budget on your Mac, using per-layer sensitivity measurements instead of "uniform 4-bit everywhere". The same sensitivity signal drives every part of the stack: weights, KV cache, LoRA fine-tuning, and runtime adapter swapping.

```bash
pip install mlx-optiq
```

## Why mlx-optiq

Stock mlx-lm treats every layer of a quantized model the same. OptIQ doesn't:

- **Some layers are 50× more sensitive to quantization than others.** OptIQ measures this once per model and assigns bits per-layer, holding the same average bits-per-weight while cutting quality loss. On GSM8K, this recovers **+15–32 percentage points** over uniform 4-bit on the same model, same quant budget ([results](https://mlx-optiq.pages.dev/results.html)).
- **The same is true of the KV cache.** Quantizing some attention layers' KV is catastrophic (layer 0's KV is ~56× more sensitive than the median), while others are essentially lossless at 4-bit. `optiq serve` runs a per-layer KV quant pipeline that preserves quality while cutting decode memory: **up to +62% decode tok/s at 64k context vs fp16 KV** on M3 Max.
- **LoRA fine-tuning should reuse that sensitivity signal too.** `optiq lora train` assigns higher adapter rank to layers OptIQ identified as sensitive, and lower rank to robust ones — so your adapter budget goes where it helps most.
- **Multi-adapter serving shouldn't reload the base model every time.** `optiq serve` implements reversible mounted LoRA: mount multiple adapters on one base, switch per-request via a ContextVar-isolated activation gate, all without touching the frozen base weights.

Plus everything a deployment framework actually needs: vision-stripping for pure-text variants of VLMs, TurboQuant rotated-space KV compression (research path), YOLO26 quantization for object detection, and a roofline latency model calibrated to Apple Silicon bandwidth.

## The full stack at a glance

| Feature | CLI | Description |
|---|---|---|
| Weight quantization | `optiq convert` | Per-layer sensitivity + greedy knapsack. Auto-strips vision/audio metadata when quantizing multi-modal base models for text-only use. |
| KV cache quantization | `optiq kv-cache` | Writes per-layer `kv_config.json`. Same sensitivity method applied to the attention cache. |
| TurboQuant compression | `optiq.core.turbo_quant` | Rotation + optimal Lloyd-Max scalar quantization. Library API for research/custom pipelines. |
| OpenAI-compatible server | `optiq serve` | Drop-in `mlx_lm.server` replacement with mixed-precision KV, mounted LoRA adapters, and `--adapter <HF repo id>` auto-download. |
| Sensitivity-aware LoRA | `optiq lora train` | Per-layer rank scaling from OptIQ's bit assignments. PEFT-compatible output + OptIQ sidecar metadata. |
| Mounted hot-swap adapters | `optiq.adapters.mount` | Reversible per-request adapter activation via ContextVar. N adapters co-resident with one base. |
| VLM → text-only | `optiq convert --strip-unused-modalities` | Drops vision/audio weights + config cleanup. Output routes through `gemma4_text` / `qwen3_5_text` instead of the VLM wrapper. |
| YOLO26 quantization | `optiq convert --model-type yolo` | Full pipeline including per-layer detection-output KL sensitivity. Outputs a `yolo-mlx` compatible model. |
| Latency prediction | `optiq latency` | Roofline model calibrated to Apple Silicon memory bandwidth. Predicts decode tok/s for a given model + bit layout before running it. |
| Benchmarking | `optiq benchmark` / `optiq eval` | GSM8K, AI2D, WER, perplexity. Side-by-side vs baselines. |
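
On the `optiq latency` row: autoregressive decode is memory-bandwidth bound, so a roofline prediction is roughly usable bandwidth divided by the bytes streamed per generated token (quantized weights plus the KV-cache read). A back-of-the-envelope sketch of that estimate follows; the helper name and constants are illustrative, not OptIQ's actual calibration:

```python
# Hypothetical roofline estimate (not optiq's implementation): decode is
# bandwidth-bound, so tok/s ~= usable bandwidth / bytes streamed per token.
def roofline_decode_tok_s(n_params, avg_bpw, kv_bytes_per_token, context_len,
                          bandwidth_gb_s=300.0, efficiency=0.7):
    weight_bytes = n_params * avg_bpw / 8        # all quantized weights read once per token
    kv_bytes = kv_bytes_per_token * context_len  # KV read grows linearly with context
    return bandwidth_gb_s * 1e9 * efficiency / (weight_bytes + kv_bytes)

# e.g. a 9B-parameter model at 4.5 bpw, ~50 kB of KV per token, 8k context
print(round(roofline_decode_tok_s(9e9, 4.5, 50_000, 8192), 1))
```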

## Quickstart

### Running a pre-built OptIQ model

Every [OptIQ-tagged model on HuggingFace](https://huggingface.co/models?other=optiq) works out of the box with stock `mlx-lm`:

```python
from mlx_lm import load, generate
model, tok = load("mlx-community/Qwen3.5-9B-OptiQ-4bit")
print(generate(model, tok, prompt="Hello", max_tokens=50))
```

**Installing mlx-optiq unlocks the rest** — mixed-precision KV serving, LoRA fine-tuning, runtime hot-swap adapters. Bit-identical inference quality either way.

### Serving with mixed-precision KV

```bash
# One-time per-layer KV sensitivity analysis
optiq kv-cache mlx-community/Qwen3.5-9B-OptiQ-4bit \
  --target-bits 4.5 -o optiq_output/kv_cache/qwen35_9b

# OpenAI-compatible server on :8080
optiq serve \
  --kv-config optiq_output/kv_cache/qwen35_9b/kv_config.json \
  --model mlx-community/Qwen3.5-9B-OptiQ-4bit \
  --max-tokens 32768 --temp 0.6 --top-p 0.95 --top-k 20
```

### Fine-tuning with sensitivity-aware LoRA

```bash
# Train a LoRA adapter. --rank-scaling by_bits assigns rank proportional
# to OptIQ's per-layer bit assignments: 8-bit layers get 2× the rank of
# 4-bit layers, at the same average.
optiq lora train mlx-community/Qwen3.5-9B-OptiQ-4bit \
    --data ./my_training_data \
    --rank 8 --rank-scaling by_bits \
    --iters 1000 -o ./my_adapter

# Inspect what was adapted (shows per-layer rank distribution)
optiq lora info ./my_adapter
```

Adapter output is **PEFT-compatible** (`adapter_config.json` + `adapters.safetensors`) plus an OptIQ sidecar (`optiq_lora_config.json`) that records per-layer rank. Loads with any PEFT tool.

### Serving with hot-swap adapters

```bash
# Preload an adapter at startup (HF repo id or local path)
optiq serve --model mlx-community/Qwen3.5-9B-OptiQ-4bit \
            --adapter ./my_adapter

# ...or use the mount API directly for multi-adapter serving:
#   mount N adapters on one base, switch per request in the same Python
#   process without reloading the model. See optiq/adapters/mount.py.
```
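
In a single Python process, multi-adapter serving looks roughly like the following. The names (`prepare_model_for_mounted_lora`, `mount_adapter_on_model`, `AdapterActivation`) are the ones described under "Mounted LoRA hot-swap" below, but exact signatures may differ, so treat this as a sketch rather than reference API docs:

```python
from mlx_lm import load, generate
from optiq.adapters.mount import (
    prepare_model_for_mounted_lora,
    mount_adapter_on_model,
    AdapterActivation,
)

model, tok = load("mlx-community/Qwen3.5-9B-OptiQ-4bit")
prepare_model_for_mounted_lora(model)                 # wrap target linears once

mount_adapter_on_model(model, "support", "./support_adapter")
mount_adapter_on_model(model, "codegen", "./codegen_adapter")

with AdapterActivation("support"):                    # active only inside this block
    print(generate(model, tok, prompt="Hello", max_tokens=50))

print(generate(model, tok, prompt="Hello", max_tokens=50))   # base model, no adapter
```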

### Converting a fresh model

```bash
pip install "mlx-optiq[convert]"
optiq convert Qwen/Qwen3-0.6B-base --target-bpw 4.5 --candidate-bits 4,8
optiq eval ./optiq_model --task gsm8k --baseline ./uniform_4bit
```

For multi-modal base models quantized for text-only deployment, OptIQ auto-strips vision/audio metadata by default. Pass `--keep-unused-modalities` to disable.

### YOLO26

```python
from optiq.models.yolo import run_yolo_pipeline
results = run_yolo_pipeline(
    "optiq_output/yolo26n.safetensors",
    "optiq_output/yolo26n_optiq",
)
```

Full pipeline: per-layer sensitivity on detection outputs → greedy knapsack → `yolo-mlx`-compatible output.

## Headline numbers

**Weight quantization on GSM8K** — OptIQ vs uniform 4-bit (same avg BPW):

| Model | Uniform-4b | **OptIQ-4b** | Δ |
|---|---:|---:|---:|
| Qwen3.5-0.8B | 11.5% | **27.0%** | +15.5pp |
| gemma-4-e4b-it | 23.5% | **55.5%** | +32.0pp |

**Decode throughput at 64k context** — `optiq serve` mixed-precision KV vs fp16 on M3 Max 36GB:

| Model | fp16 tok/s | **OptIQ tok/s** | Δ |
|---|---:|---:|---:|
| Qwen3.5-2B | 27.9 | **41.8** | +50% |
| Qwen3.5-4B | 8.1 | **13.1** | +62% |
| Qwen3.5-9B | 20.7 | **27.1** | +31% |

Full tables + methodology: [Results page](https://mlx-optiq.pages.dev/results.html).

## How it works

**Weight quantization pipeline** (`optiq convert`):

1. Load the base model from HuggingFace via PyTorch.
2. For each linear layer × each candidate bit-width, simulate quantization and measure KL divergence between full-precision and quantized logits on a calibration set (WikiText-2 for LLMs, COCO-captions for VLMs).
3. Greedy knapsack: start every layer at the minimum bits, upgrade the layer with the best "KL-reduction-per-bit" ratio each step until the target BPW budget is spent (sketched after this list). Protected layers (`lm_head`, `embed_tokens`, first/last transformer blocks) always get the max bit-width.
4. MLX conversion via `mlx-lm.convert()` with the per-layer `quant_predicate` from step 3.
5. For multi-modal base models, strip vision/audio metadata from config.json (auto; opt out with `--keep-unused-modalities`).
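
Step 3 is simple enough to sketch. Here `kl[layer][bits]` stands in for the measured KL divergence when a layer is quantized at a given bit-width and `n_params[layer]` is its weight count; the real `optiq convert` implementation differs in the details:

```python
# Illustrative greedy knapsack over per-layer sensitivities (not the exact
# optiq implementation), with two candidate bit-widths for simplicity.
def assign_bits(kl, n_params, candidate_bits=(4, 8), target_bpw=4.5, protected=()):
    lo, hi = min(candidate_bits), max(candidate_bits)
    bits = {l: hi if l in protected else lo for l in kl}
    total = sum(n_params.values())
    budget = target_bpw * total - sum(bits[l] * n_params[l] for l in kl)

    while budget > 0:
        # Upgrade the layer with the best KL reduction per extra bit spent.
        best, best_gain = None, 0.0
        for l in kl:
            if bits[l] >= hi:
                continue
            cost = (hi - bits[l]) * n_params[l]
            gain = (kl[l][bits[l]] - kl[l][hi]) / cost
            if cost <= budget and gain > best_gain:
                best, best_gain = l, gain
        if best is None:
            break
        budget -= (hi - bits[best]) * n_params[best]
        bits[best] = hi
    return bits
```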

**KV cache pipeline** (`optiq kv-cache` + `optiq serve`):

1. Same sensitivity measurement but applied to the KV cache: for each full-attention layer, replace that layer's KV with a quantized copy and measure KL on held-out prompts.
2. `optiq serve` monkey-patches `mlx_lm.server.stream_generate` to use `mlx_lm.models.cache.QuantizedKVCache` at per-layer bit-widths (via a patched `maybe_quantize_kv_cache` hook).
3. At attention time, `mx.quantized_matmul` reads packed KV directly — no fp16 materialization. On Apple Silicon, the 8-bit kernel is faster than the 4-bit kernel, so protecting one sensitive layer at 8-bit gives both quality AND a throughput bump.
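
The serve-time hook in step 2 is essentially a per-layer version of mlx-lm's own KV quantization helper. A rough sketch of the idea, assuming a `kv_config` dict mapping layer index to bits (the real hook, the `kv_config.json` schema, and the exact mlx-lm cache API may differ across versions):

```python
from mlx_lm.models.cache import KVCache

# kv_config: hypothetical {layer_index: bits} map loaded from kv_config.json,
# e.g. {3: 8, 7: 4, 11: 4, ...}
def quantize_kv_per_layer(prompt_cache, kv_config, group_size=64, start_tokens=0):
    for i, layer_cache in enumerate(prompt_cache):
        # Skip linear-attention layers (different cache type), already-quantized
        # layers, and layers not listed in the config.
        if not isinstance(layer_cache, KVCache) or i not in kv_config:
            continue
        if layer_cache.offset <= start_tokens:
            continue
        # Re-pack this layer's KV at its assigned bit-width.
        prompt_cache[i] = layer_cache.to_quantized(
            group_size=group_size, bits=kv_config[i]
        )
```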

**TurboQuant KV** (`optiq.core.turbo_kv_cache` — research path):

1. Random orthogonal rotation makes coordinates near-independent → better-conditioned quantization.
2. Optimal Lloyd-Max scalar quantization per coordinate (1/2/3/4-bit centroid tables).
3. Rotated-space attention: rotate Q once and output once, work in centroid space in between. Attention cost stays O(seq × d); rotation is O(d²) fixed.
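
Steps 1 and 2 are standard enough to sketch with NumPy (already a dependency). This illustrates the math only, not the `optiq.core.turbo_quant` API, which works on MLX arrays with precomputed centroid tables:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(d):
    # Random orthogonal matrix via QR of a Gaussian (good enough for a sketch).
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))            # fix column signs

def lloyd_max(x, bits, iters=20):
    # 1-D Lloyd-Max: alternate nearest-centroid assignment and centroid update
    # (k-means on scalars). Returns centroids and each sample's index.
    k = 2 ** bits
    c = np.quantile(x, np.linspace(0, 1, k))  # init centroids from quantiles
    for _ in range(iters):
        idx = np.abs(x[:, None] - c[None, :]).argmin(axis=1)
        for j in range(k):
            if np.any(idx == j):
                c[j] = x[idx == j].mean()
    return c, idx

d = 64
Q = random_rotation(d)
kv = rng.standard_normal((1024, d))           # stand-in KV rows
rotated = kv @ Q                              # step 1: rotate into near-independent coords
centroids, codes = lloyd_max(rotated[:, 0], bits=2)   # step 2: per-coordinate quantization
```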

**Sensitivity-aware LoRA** (`optiq lora train`):

1. Read `optiq_metadata.json.per_layer` — OptIQ's per-layer bit assignment.
2. Per target linear, derive rank: `rank_scaling=by_bits` gives `r = base_rank × (bits / 4)`, so 8-bit layers get 2× the rank of 4-bit at the same base (see the sketch below).
3. Apply mounted LoRA across all target blocks with the per-layer rank.
4. Train via `mlx_lm.tuner.trainer.train`, with the `mx.compile` decorator monkey-patched out to avoid a known Metal OOM on 9B-class models.
5. Save in PEFT-compatible format + OptIQ sidecar recording the per-layer rank distribution.
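
Step 2 in code, assuming the `optiq_metadata.json` sidecar with a `per_layer` map of layer name to assigned bits as described above (treat the exact schema as illustrative):

```python
import json

# rank_scaling="by_bits": r = base_rank * (bits / 4), so an 8-bit layer gets
# twice the adapter rank of a 4-bit layer at the same base rank.
def per_layer_ranks(metadata_path, base_rank=8):
    per_layer = json.load(open(metadata_path))["per_layer"]   # {layer_name: bits}
    return {layer: round(base_rank * bits / 4) for layer, bits in per_layer.items()}

print(per_layer_ranks("optiq_metadata.json", base_rank=8))
```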

**Mounted LoRA hot-swap** (`optiq.adapters.mount`):

1. `prepare_model_for_mounted_lora` walks every transformer block and wraps each target linear (`q_proj`, `v_proj` by default) in a `MountedLoRALinear` that holds a dict of `{adapter_id: (A, B, scale)}` plus the frozen base linear.
2. `mount_adapter_on_model(model, adapter_id, adapter_dir)` loads adapter weights off disk and registers them on every MountedLoRALinear.
3. At inference time, a `ContextVar` decides which adapter is active. `None` → base only. `with AdapterActivation("A"):` → forward pass adds adapter A's residual.
4. ContextVar semantics mean concurrent asyncio tasks / threads with different active adapters don't step on each other.
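
Stripped to its essentials, the activation gate is a ContextVar consulted in the wrapped linear's forward pass. A minimal sketch (class internals and attribute names are illustrative; the real implementation lives in `optiq/adapters/mount.py`):

```python
from contextvars import ContextVar
import mlx.core as mx
import mlx.nn as nn

_active_adapter = ContextVar("active_adapter", default=None)

class AdapterActivation:
    def __init__(self, adapter_id):
        self.adapter_id = adapter_id
    def __enter__(self):
        self._token = _active_adapter.set(self.adapter_id)
    def __exit__(self, *exc):
        _active_adapter.reset(self._token)

class MountedLoRALinear(nn.Module):
    def __init__(self, base: nn.Module):
        super().__init__()
        self.base = base                      # frozen base linear, never modified
        self.adapters = {}                    # adapter_id -> (A, B, scale)

    def mount(self, adapter_id, A, B, scale):
        self.adapters[adapter_id] = (A, B, scale)

    def __call__(self, x):
        y = self.base(x)
        active = _active_adapter.get()        # per-task / per-request selection
        if active in self.adapters:
            A, B, scale = self.adapters[active]
            y = y + scale * ((x @ A) @ B)     # low-rank residual of the active adapter
        return y
```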

## When you need mlx-optiq vs bare mlx-lm

| Scenario | Bare `mlx-lm` | `mlx-optiq` |
|---|:---:|:---:|
| Load + generate from an OptIQ HF model | ✅ | ✅ |
| Mixed-precision KV cache at serve time | — | ✅ |
| LoRA fine-tuning that uses OptIQ's sensitivity data | — | ✅ |
| Hot-swappable adapters in one serving process | — | ✅ |
| Fresh conversion of a new base model with OptIQ | — | ✅ |
| TurboQuant research pipelines | — | ✅ |
| YOLO26 quantization | — | ✅ |

For pure inference on published OptIQ models: bare mlx-lm is enough and gets bit-identical output. For everything else, install mlx-optiq.

## Hybrid-attention note

Qwen3.5 interleaves **linear-attention** (GatedDeltaNet) and **full-attention** layers, with only every fourth layer using full attention, so only 1 in 4 layers has a KV cache. `optiq kv-cache` skips the linear-attention layers automatically. On Qwen3.5-4B/9B, you end up with 8 of 32 layers getting per-layer KV bit assignments, typically `7 @ 4-bit + 1 @ 8-bit`, protecting layer 3 (the first full-attention layer).

## Status / roadmap

- ✅ Weight quantization: production (Qwen3.5 family 0.8B–27B + Qwen3.6-27B since v0.0.10)
- ✅ KV-cache serving (Qwen3.5): production since v0.0.5
- ✅ Sensitivity-aware LoRA + mounted hot-swap: production since v0.0.8
- ✅ VLM-to-text metadata stripping: production since v0.0.8
- ✅ Anthropic `/v1/messages` endpoint shim (`optiq.anthropic_server`): opt-in since v0.0.10
- ✅ YOLO26 pipeline: production
- 🚧 Gemma-4 KV serving: blocked on upstream mlx-lm shared-KV attention not supporting `QuantizedKVCache`
- 🚧 Per-request adapter routing in the HTTP layer: mount/swap API is production; HTTP `X-OptIQ-Adapter` header plumbing is next
- 🔬 Long-context training kernels (`optiq.ops`, since v0.0.10): flash attention forward+backward, chunked CE, fused SwiGLU. Opt-in only; vanilla mlx-lm + `grad_checkpoint=True` is faster up to sequence length T = 12 000 on 36 GB.
- 🔬 TurboQuant serving path with a fused Metal kernel: research

## Article

[Not All Layers Are Equal: Mixed-Precision Quantization for Weights and KV Cache on Apple Silicon](https://x.com/thin_signal/status/2028412948167942334)

## Requirements

- Python ≥ 3.11
- Apple Silicon Mac (for MLX)
- `mlx-lm ≥ 0.30`

## License

MIT
