Metadata-Version: 2.4
Name: mlx-optiq
Version: 0.2.3
Summary: Mixed-precision quantization optimizer for LLMs on Apple Silicon (MLX)
Author: mlx-optiq
License: MIT
Project-URL: Models, https://huggingface.co/collections/mlx-community
Keywords: mlx,quantization,mixed-precision,apple-silicon,llm,kv-cache
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: MacOS
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.11
Description-Content-Type: text/markdown
Requires-Dist: mlx>=0.20
Requires-Dist: mlx-lm>=0.31.3
Requires-Dist: numpy
Requires-Dist: scipy
Requires-Dist: huggingface-hub
Provides-Extra: convert
Requires-Dist: torch>=2.0; extra == "convert"
Requires-Dist: transformers>=4.40; extra == "convert"
Requires-Dist: safetensors; extra == "convert"
Requires-Dist: tqdm; extra == "convert"
Requires-Dist: datasets; extra == "convert"
Requires-Dist: psutil; extra == "convert"
Provides-Extra: vlm
Requires-Dist: mlx-vlm>=0.3; extra == "vlm"
Requires-Dist: pillow; extra == "vlm"
Provides-Extra: audio
Requires-Dist: mlx-whisper>=0.4; extra == "audio"
Provides-Extra: cli
Requires-Dist: click>=8.0; extra == "cli"
Requires-Dist: psutil; extra == "cli"
Provides-Extra: lab
Requires-Dist: flask>=3.0; extra == "lab"
Requires-Dist: argon2-cffi>=23.0; extra == "lab"
Requires-Dist: pyjwt>=2.8; extra == "lab"
Requires-Dist: cryptography>=42.0; extra == "lab"
Requires-Dist: data-designer; extra == "lab"
Requires-Dist: ddgs>=9.0; extra == "lab"
Requires-Dist: html2text>=2024.0; extra == "lab"
Requires-Dist: pypdf>=4.0; extra == "lab"
Requires-Dist: docx2txt>=0.8; extra == "lab"
Provides-Extra: all
Requires-Dist: torch>=2.0; extra == "all"
Requires-Dist: transformers>=4.40; extra == "all"
Requires-Dist: safetensors; extra == "all"
Requires-Dist: tqdm; extra == "all"
Requires-Dist: datasets; extra == "all"
Requires-Dist: mlx-vlm>=0.3; extra == "all"
Requires-Dist: mlx-whisper>=0.4; extra == "all"
Requires-Dist: click>=8.0; extra == "all"
Requires-Dist: psutil; extra == "all"
Requires-Dist: pillow; extra == "all"
Requires-Dist: flask>=3.0; extra == "all"
Requires-Dist: argon2-cffi>=23.0; extra == "all"
Requires-Dist: pyjwt>=2.8; extra == "all"
Requires-Dist: cryptography>=42.0; extra == "all"
Requires-Dist: data-designer; extra == "all"
Requires-Dist: ddgs>=9.0; extra == "all"
Requires-Dist: html2text>=2024.0; extra == "all"
Requires-Dist: pypdf>=4.0; extra == "all"
Requires-Dist: docx2txt>=0.8; extra == "all"

# mlx-optiq

**Quantize, fine-tune and serve LLMs entirely on Apple Silicon.**

**Website:** https://mlx-optiq.com &nbsp;|&nbsp; **Docs:** https://mlx-optiq.com/docs/ &nbsp;|&nbsp; **Models:** https://mlx-optiq.com/models &nbsp;|&nbsp; **PyPI:** https://pypi.org/project/mlx-optiq/ &nbsp;|&nbsp; **HF org:** https://huggingface.co/mlx-community

mlx-optiq is an optimizing compiler and runtime for MLX. It takes a full-precision model and turns it into the best version for a given memory/latency budget on your Mac, using per-layer sensitivity measurements instead of "uniform 4-bit everywhere". The same sensitivity signal drives every layer of the stack: weights, KV cache, LoRA fine-tuning, runtime adapter swapping.

```bash
pip install mlx-optiq
```

## What's in v0.1.0

- **Twelve pre-built OptIQ quants** on Hugging Face. Qwen3.5, Qwen3.6, Gemma-4 families. **Every one beats stock uniform 4-bit on the six-metric Capability Score.**
- **Six-benchmark Capability Score** on every model card: MMLU + GSM8K + IFEval + BFCL + HumanEval + HashHop (long-context retrieval). [Methodology](https://mlx-optiq.com/blog/eval-framework).
- **Speculative decoding everywhere.** Qwen3.5 / Qwen3.6 quants ship a bundled MTP head (`mtp.safetensors`) for ~1.4× decode via `optiq serve --mtp`. Gemma-4 quants pair with a small `-assistant-bf16` drafter via `optiq serve --drafter`.
- **DPO fine-tuning** via `optiq lora train --method dpo`. Same LoRA infrastructure, with a reference forward pass through the adapter at scale=0 (no second model load).
- **OptIQ Lab** (`pip install "mlx-optiq[lab]"` → `optiq lab`). Local web UI for chat, quantize, fine-tune, dataset construction, and live model swap. See screenshots below.

## Why mlx-optiq

Stock mlx-lm treats every layer of a quantized model the same. mlx-optiq doesn't:

- **Some layers are 50× more sensitive to quantization than others.** mlx-optiq measures this once per model and assigns bits per-layer at the same average bits-per-weight. Every shipped quant beats stock uniform 4-bit on the six-metric Capability Score; the gains range from +0.17 (Qwen3.5-27B) to +13.57 (gemma-4-e4b-it).
- **The same is true of the KV cache.** Some attention layers' KV are catastrophic to quantize (layer 0 KV is ~56× more sensitive than the median), others are essentially lossless at 4-bit. `optiq serve` runs a per-layer KV quant pipeline that keeps quality while cutting decode memory: **up to +62 % decode tok/s at 64 k context vs fp16 KV** on M3 Max.
- **LoRA fine-tuning should reuse that sensitivity signal too.** `optiq lora train` assigns higher adapter rank to sensitive layers and lower rank to robust ones, so adapter capacity goes where it helps most. Supports both SFT and DPO.
- **One server, two protocols.** `optiq serve` exposes both the OpenAI `/v1/chat/completions` and Anthropic `/v1/messages` endpoints from the same process. Point Claude Code, the OpenAI SDK, the Anthropic SDK, or plain `curl` at the same local URL.
- **Multi-adapter serving shouldn't reload the base model every time.** `optiq serve` implements reversible mounted LoRA: keep N adapters resident on one base, switch per request via a ContextVar-isolated activation gate.

Plus a roofline latency model calibrated to Apple Silicon bandwidth for hardware-aware optimization.

## The full stack at a glance

| Feature | CLI | Docs |
|---|---|---|
| Weight quantization | `optiq convert` | [How sensitivity works](https://mlx-optiq.com/docs/sensitivity) · [CLI](https://mlx-optiq.com/docs/cli#convert) |
| KV cache quantization | `optiq kv-cache` | [KV-quant serving](https://mlx-optiq.com/docs/serve) · [CLI](https://mlx-optiq.com/docs/cli#kv-cache) |
| OpenAI + Anthropic server | `optiq serve` | [KV-quant serving](https://mlx-optiq.com/docs/serve) · [Anthropic API](https://mlx-optiq.com/docs/serve#anthropic) |
| MTP / drafter spec decoding | `optiq serve --mtp` · `--drafter` | [MTP guide](https://mlx-optiq.com/docs/mtp) |
| Sensitivity-aware LoRA (SFT + DPO) | `optiq lora train` | [LoRA fine-tuning](https://mlx-optiq.com/docs/finetune) · [Blog](https://mlx-optiq.com/blog/sensitivity-aware-lora) |
| Mounted hot-swap adapters | `optiq.adapters.mount` | [Adapters](https://mlx-optiq.com/docs/serve#adapters) |
| Eval suite (6 benchmarks + score) | `optiq eval --task all --score` | [Eval framework](https://mlx-optiq.com/blog/eval-framework) |
| Local web UI | `optiq lab` | [OptIQ Lab](https://mlx-optiq.com/docs/lab) |
| Latency prediction | `optiq latency` | [CLI](https://mlx-optiq.com/docs/cli#latency) |

**Per-family getting-started guides:** [Qwen3.5](https://mlx-optiq.com/docs/qwen3.5) · [Qwen3.6](https://mlx-optiq.com/docs/qwen3.6) · [Gemma-4](https://mlx-optiq.com/docs/gemma-4)

## Quickstart

### Running a pre-built mlx-optiq model

Every [mlx-optiq quant on Hugging Face](https://huggingface.co/mlx-community) works out of the box with stock `mlx-lm`:

```python
from mlx_lm import load, generate
model, tok = load("mlx-community/Qwen3.5-9B-OptiQ-4bit")
print(generate(model, tok, prompt="Hello", max_tokens=50))
```

**Installing mlx-optiq unlocks the rest:** mixed-precision KV serving, LoRA fine-tuning (SFT + DPO), runtime hot-swap adapters, the dual-protocol server, MTP / drafter speculative decoding, and the local web UI. Bit-identical inference quality either way.

### Serving with speculative decoding

```bash
# Qwen: the bundled MTP head gives ~1.4× decode via OptiqEngine
optiq serve --model mlx-community/Qwen3.5-9B-OptiQ-4bit --mtp

# Gemma-4: pair with the matching -assistant-bf16 drafter
optiq serve --model mlx-community/gemma-4-31B-it-OptiQ-4bit \
            --drafter mlx-community/gemma-4-31B-it-assistant-bf16
```

`--mtp` and `--drafter` are mutually exclusive. Both expose the same OpenAI + Anthropic endpoints.

### Serving with mixed-precision KV (and Claude Code)

```bash
# One-time per-layer KV sensitivity analysis
optiq kv-cache mlx-community/Qwen3.5-9B-OptiQ-4bit \
  --target-bits 4.5 -o ./kv/qwen35_9b

# Server on :8080. Speaks BOTH OpenAI and Anthropic APIs.
optiq serve --model mlx-community/Qwen3.5-9B-OptiQ-4bit \
  --kv-config ./kv/qwen35_9b/kv_config.json \
  --max-tokens 32768 --temp 0.6 --top-p 0.95
```

Drive it from Claude Code by setting two env vars:
```bash
export ANTHROPIC_BASE_URL="http://localhost:8080"
export ANTHROPIC_API_KEY="not-used"
claude    # now driven by your local quant
```

Or any OpenAI-compatible client:
```python
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-used")
```

### Fine-tuning with sensitivity-aware LoRA

**Supervised fine-tuning (SFT) — the default:**

```bash
# --rank-scaling by_bits assigns rank proportional to mlx-optiq's per-layer
# bit assignments: 8-bit layers get 2× the rank of 4-bit layers at the
# same average parameter budget.
optiq lora train mlx-community/Qwen3.5-9B-OptiQ-4bit \
    --data ./my_training_data \
    --rank 8 --rank-scaling by_bits \
    --iters 1000 -o ./my_adapter

optiq lora info ./my_adapter   # per-layer rank distribution
```

**DPO from preference pairs:**

```bash
# Data shape: one {"prompt": ..., "chosen": ..., "rejected": ...} per line.
# Reference forward pass runs through the same adapter with scale=0 — no
# second model load, no extra memory.
optiq lora train mlx-community/Qwen3.5-0.8B-OptiQ-4bit \
    --data ./dpo_pairs --method dpo --dpo-beta 0.1 \
    --preset small --iters 200 -o ./my_dpo_adapter
```

Adapter output is **PEFT-compatible** (`adapter_config.json` + `adapters.safetensors`) plus an mlx-optiq sidecar (`optiq_lora_config.json`) recording the per-layer rank distribution. Loads with any PEFT tool.

### Serving with hot-swap adapters

```bash
# Preload an adapter at startup (HF repo id or local path)
optiq serve --model mlx-community/Qwen3.5-9B-OptiQ-4bit \
            --adapter ./my_adapter
```

### Converting a fresh model

```bash
pip install "mlx-optiq[convert]"
optiq convert Qwen/Qwen3.5-9B --target-bpw 5.0 --candidate-bits 4,8
optiq eval ./optiq_output/Qwen3.5-9B/optiq_mixed --task all --score
```

For models too large to fit in RAM as bf16, the default `--reference auto` builds a uniform-4-bit baseline first, then streams bf16 weights off disk one layer at a time to compute the calibration-driven sensitivity. This lets 27 B+ models still get a meaningful per-layer signal on a 36 GB Mac.

## OptIQ Lab — local web UI

`pip install "mlx-optiq[lab]"` then `optiq lab`. Four workflow surfaces, one Flask process, password-gated, localhost-only.

**Chat** — streaming playground with web search, Python and bash tools in a three-tier sandbox.

![OptIQ Lab — chat tab](https://mlx-optiq.com/assets/lab-screens/chat.png)

**Quantize** — paste an HF model id, slide a target-BPW dial, watch live sensitivity + knapsack progress, one-click push to your HF account.

![OptIQ Lab — quantize wizard](https://mlx-optiq.com/assets/lab-screens/quantize.png)

**Fine-tune** — SFT or DPO on any OptIQ quant, live train-loss sparkline, save + push to HF. DPO beta and method selection in the hyperparams step.

![OptIQ Lab — fine-tune with DPO selected](https://mlx-optiq.com/assets/lab-screens/finetune-dpo.png)

**Build dataset** — twelve templates: SFT from QA pairs, DPO preference pairs, style transfer, code completion, self-instruct expansion, RAG QA, multi-turn chat, and more. Outputs JSONL the fine-tune wizard reads directly.

![OptIQ Lab — build dataset](https://mlx-optiq.com/assets/lab-screens/build-dataset.png)

**Model server** — load any local or HF OptIQ quant, toggle MTP, auto-suggest the matching Gemma-4 `-assistant` drafter, edit sampling overrides.

![OptIQ Lab — server settings with drafter auto-suggest](https://mlx-optiq.com/assets/lab-screens/settings-drafter.png)

Full guide: [OptIQ Lab docs](https://mlx-optiq.com/docs/lab).

## Headline numbers

**Capability Score (6-benchmark mean of MMLU + GSM8K + IFEval + BFCL + HumanEval + HashHop), mlx-optiq vs uniform 4-bit:**

| Model | Uniform-4 | **mlx-optiq** | Δ |
|---|---:|---:|---:|
| Qwen3.5-0.8B | 31.73 | **36.00** | +4.27 |
| Qwen3.5-2B | 45.54 | **47.66** | +2.12 |
| Qwen3.5-4B | 63.86 | **65.76** | +1.90 |
| Qwen3.5-9B | 66.58 | **66.77** | +0.19 |
| Qwen3.5-27B | 78.88 | **79.05** | +0.17 |
| Qwen3.5-35B-A3B | 73.75 | **74.17** | +0.42 |
| Qwen3.6-27B | 82.50 | **82.96** | +0.46 |
| Qwen3.6-35B-A3B | 75.67 | **76.78** | +1.12 |
| gemma-4-e2b-it | 51.09 | **53.21** | +2.12 |
| gemma-4-e4b-it | 52.28 | **65.84** | **+13.57** |
| gemma-4-26B-A4B-it | 69.62 | **72.68** | +3.06 |
| gemma-4-31B-it | 76.23 | **79.69** | +3.47 |

**12 of 12 quants beat uniform 4-bit on the 6-metric Capability Score.** Full per-benchmark breakdowns: each model card on [mlx-community](https://huggingface.co/mlx-community). Methodology: [eval-framework write-up](https://mlx-optiq.com/blog/eval-framework).

**Decode throughput at 64 k context**, `optiq serve` mixed-precision KV vs fp16 on M3 Max 36 GB:

| Model | fp16 tok/s | **mlx-optiq tok/s** | Δ |
|---|---:|---:|---:|
| Qwen3.5-2B | 27.9 | **41.8** | +50 % |
| Qwen3.5-4B | 8.1 | **13.1** | +62 % |
| Qwen3.5-9B | 20.7 | **27.1** | +31 % |

## How it works

**Weight quantization pipeline** (`optiq convert`):

1. Resolve the base model (Hugging Face snapshot, or a local path).
2. Build the running model. Use bf16 if it fits in RAM, else a uniform-4-bit MLX baseline with bf16 weights streamed from disk per probe.
3. For each linear layer × each candidate bit-width, simulate quantization and measure KL divergence between reference and quantized logits on the bundled `optiq.jsonl` calibration mix (40 samples across prose, reasoning, code, agent, tool-call, and constraint-bearing instruction domains; chat-template auto-applied).
4. Greedy knapsack: start every layer at the minimum bits, upgrade the layer with the best "KL-reduction-per-bit" ratio each step until the target BPW budget is spent. Protected layers (`lm_head`, `embed_tokens`, first/last transformer blocks) always get the max bit-width.
5. MLX conversion via `mlx_lm.convert()` with the per-layer `quant_predicate` from step 4.

**KV cache pipeline** (`optiq kv-cache` + `optiq serve`):

1. Same sensitivity measurement applied to the KV cache: for each full-attention layer, replace that layer's KV with a quantized copy and measure KL on held-out prompts.
2. `optiq serve` patches `mlx_lm.server` to use `mlx_lm.models.cache.QuantizedKVCache` at per-layer bit-widths.
3. At attention time, `mx.quantized_matmul` reads packed KV directly. No fp16 materialization. On Apple Silicon, the 8-bit kernel is faster than the 4-bit kernel, so protecting one sensitive layer at 8-bit gives both quality AND a throughput bump.

**Sensitivity-aware LoRA** (`optiq lora train`):

1. Read `optiq_metadata.json.per_layer`, mlx-optiq's per-layer bit assignment.
2. Per target linear, derive rank: `rank_scaling=by_bits` gives `r = base_rank × (bits / 4)`, so 8-bit layers get 2× the rank of 4-bit at the same base.
3. SFT path: train via `mlx_lm.tuner.trainer.train` with rank scaled per layer.
4. DPO path: standalone loop with reference logprobs from the same model at adapter scale=0. Standard DPO loss with `dpo_beta` controlling the KL penalty.
5. Save in PEFT-compatible format + mlx-optiq sidecar recording the per-layer rank distribution.

**Speculative decoding** (`optiq serve --mtp` / `--drafter`):

- **MTP (Qwen3.5 / 3.6):** `optiq convert` preserves DeepSeek-V3-style MTP head tensors as a `mtp.safetensors` sidecar registered in `config.json`. `--mtp` enables ~1.4× decode via the bundled head; depth 1 is optimal on Apple Silicon.
- **Drafter (Gemma-4):** `--drafter mlx-community/<name>-assistant-bf16` loads a small companion model. γ=1 greedy speculation through `optiq.runtime.spec`; tokens are bit-identical to non-spec decode.

**Mounted LoRA hot-swap** (`optiq.adapters.mount`):

1. `prepare_model_for_mounted_lora` walks every transformer block and wraps each target linear (`q_proj`, `v_proj` by default) in a `MountedLoRALinear` that holds a dict of `{adapter_id: (A, B, scale)}` plus the frozen base linear.
2. `mount_adapter_on_model(model, adapter_id, adapter_dir)` loads adapter weights off disk and registers them on every MountedLoRALinear.
3. At inference, a `ContextVar` decides which adapter is active. `None` → base only. `with AdapterActivation("A"):` → forward pass adds adapter A's residual.
4. ContextVar semantics mean concurrent asyncio tasks / threads with different active adapters don't step on each other.

## When you need mlx-optiq vs bare mlx-lm

| Scenario | Bare `mlx-lm` | `mlx-optiq` |
|---|:---:|:---:|
| Load + generate from an mlx-optiq HF model | ✅ | ✅ |
| Mixed-precision KV cache at serve time |  | ✅ |
| Anthropic-API endpoint (`/v1/messages`) for Claude Code |  | ✅ |
| MTP / drafter speculative decoding |  | ✅ |
| LoRA fine-tuning that uses the per-layer sensitivity data |  | ✅ |
| DPO fine-tuning (preference learning) |  | ✅ |
| Hot-swappable adapters in one serving process |  | ✅ |
| Local web UI for chat / quantize / fine-tune / dataset |  | ✅ |
| Six-benchmark eval suite + Capability Score |  | ✅ |
| Fresh conversion of a new base model |  | ✅ |

For pure inference on published mlx-optiq models, bare mlx-lm is enough and gets bit-identical output. For everything else, install mlx-optiq.

## Hybrid-attention note

Qwen3.5 interleaves **linear-attention** (GatedDeltaNet) and **full-attention** layers on a 4:1 ratio. Only 1 in 4 layers has a KV cache. `optiq kv-cache` skips the linear layers automatically. On Qwen3.5-4B/9B, you end up with 8 of 32 layers getting per-layer KV bit assignments, typically `7 @ 4-bit + 1 @ 8-bit` protecting layer 3 (the first full-attention layer).

## Resources

**Site:** [mlx-optiq.com](https://mlx-optiq.com): overview, models, docs, blog.

**Documentation:**
- [Installation](https://mlx-optiq.com/docs/install) · [Using mlx-optiq quants](https://mlx-optiq.com/docs/quants) · [OptIQ Lab](https://mlx-optiq.com/docs/lab)
- [How sensitivity works](https://mlx-optiq.com/docs/sensitivity) · [LoRA fine-tuning](https://mlx-optiq.com/docs/finetune) · [KV-quant serving](https://mlx-optiq.com/docs/serve) · [MTP speculation](https://mlx-optiq.com/docs/mtp)
- Family guides: [Qwen3.5](https://mlx-optiq.com/docs/qwen3.5) · [Qwen3.6](https://mlx-optiq.com/docs/qwen3.6) · [Gemma-4](https://mlx-optiq.com/docs/gemma-4)
- [CLI reference](https://mlx-optiq.com/docs/cli) · [Models index](https://mlx-optiq.com/models) · [FAQ](https://mlx-optiq.com/docs/faq)

**For agents and IDEs:** [llms.txt](https://mlx-optiq.com/llms.txt). The entire library reference in one Markdown file you can drop into Claude Code, Cursor, or any agent context.

**Blog:**
- [The eval framework that drives every quant we ship](https://mlx-optiq.com/blog/eval-framework)
- [The calibration mix and what it caught](https://mlx-optiq.com/blog/calibration-mix)
- [MTP speculative decoding on Apple Silicon](https://mlx-optiq.com/blog/mtp-on-apple-silicon)
- [Gemma-4 speculative decoding with the assistant drafter](https://mlx-optiq.com/blog/gemma-spec-decoding)
- [Why u4 KV cache OOMs harder than fp16](https://mlx-optiq.com/blog/tight-ram-kv-quant)
- [Sensitivity-aware LoRA: fine-tuning that respects the bit budget](https://mlx-optiq.com/blog/sensitivity-aware-lora)
- [Lab chat tools: sandboxed Python, bash, web search](https://mlx-optiq.com/blog/lab-chat-tools)
- [Gemma-4 lands on mlx-optiq](https://mlx-optiq.com/blog/gemma-4-support)
- [Not All Layers Are Equal: the research foundation](https://mlx-optiq.com/blog/not-all-layers-are-equal)

## Requirements

- Python ≥ 3.11
- Apple Silicon Mac (for MLX)
- `mlx-lm ≥ 0.30`

## License

MIT
