Metadata-Version: 2.4
Name: rapid-mlx
Version: 0.7.31
Summary: Rapid-MLX — AI inference for Apple Silicon. Drop-in OpenAI API, 2-4x faster than Ollama.
Author: Rapid-MLX contributors
License: Apache-2.0
Project-URL: Homepage, https://github.com/raullenchai/Rapid-MLX
Project-URL: Documentation, https://github.com/raullenchai/Rapid-MLX#readme
Project-URL: Repository, https://github.com/raullenchai/Rapid-MLX
Keywords: llm,mlx,apple-silicon,inference,transformers
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: MacOS
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: mlx>=0.29.0
Requires-Dist: mlx-lm>=0.31.0
Requires-Dist: transformers>=5.0.0
Requires-Dist: tokenizers>=0.19.0
Requires-Dist: huggingface-hub>=0.23.0
Requires-Dist: numpy>=1.24.0
Requires-Dist: tqdm>=4.66.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: requests>=2.28.0
Requires-Dist: tabulate>=0.9.0
Requires-Dist: psutil>=5.9.0
Requires-Dist: fastapi>=0.100.0
Requires-Dist: uvicorn>=0.23.0
Requires-Dist: mcp>=1.0.0
Requires-Dist: jsonschema>=4.0.0
Requires-Dist: argcomplete>=3.6
Requires-Dist: websockets>=12.0
Requires-Dist: openai-harmony>=0.0.8
Provides-Extra: vision
Requires-Dist: mlx-vlm>=0.6.3; extra == "vision"
Requires-Dist: opencv-python>=4.8.0; extra == "vision"
Requires-Dist: torch>=2.3.0; extra == "vision"
Requires-Dist: torchvision>=0.18.0; extra == "vision"
Requires-Dist: pillow>=10.0.0; extra == "vision"
Provides-Extra: dflash
Requires-Dist: mlx-vlm>=0.6.3; extra == "dflash"
Provides-Extra: embeddings
Requires-Dist: mlx-embeddings>=0.1.0; extra == "embeddings"
Provides-Extra: chat
Requires-Dist: gradio>=4.0.0; extra == "chat"
Requires-Dist: pytz>=2024.1; extra == "chat"
Provides-Extra: all
Requires-Dist: mlx-vlm>=0.6.3; extra == "all"
Requires-Dist: opencv-python>=4.8.0; extra == "all"
Requires-Dist: torch>=2.3.0; extra == "all"
Requires-Dist: torchvision>=0.18.0; extra == "all"
Requires-Dist: pillow>=10.0.0; extra == "all"
Requires-Dist: gradio>=4.0.0; extra == "all"
Requires-Dist: pytz>=2024.1; extra == "all"
Requires-Dist: mlx-embeddings>=0.1.0; extra == "all"
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-asyncio>=0.21.0; extra == "dev"
Requires-Dist: black>=23.0.0; extra == "dev"
Requires-Dist: ruff>=0.1.0; extra == "dev"
Requires-Dist: mypy>=1.0.0; extra == "dev"
Provides-Extra: vllm
Requires-Dist: vllm>=0.4.0; extra == "vllm"
Provides-Extra: guided
Requires-Dist: outlines[mlxlm]>=1.0.0; extra == "guided"
Provides-Extra: audio
Requires-Dist: mlx-audio>=0.2.9; extra == "audio"
Requires-Dist: sounddevice>=0.4.0; extra == "audio"
Requires-Dist: soundfile>=0.12.0; extra == "audio"
Requires-Dist: scipy>=1.10.0; extra == "audio"
Requires-Dist: numba>=0.57.0; extra == "audio"
Requires-Dist: tiktoken>=0.5.0; extra == "audio"
Requires-Dist: misaki[ja,zh]>=0.5.0; extra == "audio"
Requires-Dist: spacy>=3.7.0; extra == "audio"
Requires-Dist: num2words>=0.5.0; extra == "audio"
Requires-Dist: loguru>=0.7.0; extra == "audio"
Requires-Dist: phonemizer>=3.2.0; extra == "audio"
Requires-Dist: ordered_set>=4.1.0; extra == "audio"
Requires-Dist: cn2an>=0.5.0; extra == "audio"
Requires-Dist: fugashi>=1.3.0; extra == "audio"
Requires-Dist: unidic-lite>=1.0.0; extra == "audio"
Requires-Dist: jieba>=0.42.0; extra == "audio"
Dynamic: license-file

<img width="1600" height="800" alt="banner" src="https://github.com/user-attachments/assets/f3743bb7-7287-4b24-ac97-a7037974396f" />
<p align="center">

<h1 align="center">Rapid-MLX</h1>

<p align="center">
  <strong>Run AI on your Mac. Faster than anything else.</strong>
</p>

<p align="center">
  <a href="LICENSE"><img src="https://img.shields.io/badge/License-Apache_2.0-blue.svg" alt="License"></a>
  <a href="https://www.python.org/downloads/"><img src="https://img.shields.io/badge/python-3.10+-blue.svg" alt="Python 3.10+"></a>
  <a href="tests/"><img src="https://img.shields.io/badge/tests-3300%2B-brightgreen.svg" alt="Tests"></a>
  <a href="https://support.apple.com/en-us/HT211814"><img src="https://img.shields.io/badge/Apple_Silicon-M1%20|%20M2%20|%20M3%20|%20M4-black.svg?logo=apple" alt="Apple Silicon"></a>
  <a href="https://github.com/raullenchai/Rapid-MLX/stargazers"><img src="https://img.shields.io/github/stars/raullenchai/Rapid-MLX?style=social" alt="GitHub stars"></a>
</p>

<p align="center">
  Run local AI models on your Mac — no cloud, no API costs. Works with Cursor, Claude Code, and any OpenAI-compatible app.
</p>

<p align="center">
  <sub>
    <a href="https://rapidmlx.com"><b>rapidmlx.com</b></a> ·
    <a href="https://rapidmlx.com/desktop">Desktop app</a> ·
    <a href="https://rapidmlx.com/performance/">Community benchmarks</a> ·
    <a href="https://models.rapidmlx.com/">Model mirror</a>
  </sub>
</p>

<p align="center">
  <img src="https://raw.githubusercontent.com/raullenchai/Rapid-MLX/main/docs/assets/demo.gif" alt="Rapid-MLX demo — install, serve Gemma 4, chat, tool calling" width="700">
  <br>
  <em>pip install → serve Gemma 4 26B → chat + tool calling → works with PydanticAI, LangChain, Aider, and more.</em>
</p>

| | Your Mac | Model | Speed (tok/s) | What works |
|:---|:---:|:---:|:---:|:---:|
| **16 GB** MacBook Air | Qwen3.5-4B | 147 tok/s | Chat, coding, tools |
| **24 GB** MacBook Pro | Qwen3.5-9B | 101 tok/s | Great all-rounder |
| **32+ GB** Mac Mini / Studio | 🆕 Gemma 4 12B | 64 tok/s | Vision-capable + tools |
| **32+ GB** Mac Mini / Studio | GPT-OSS 20B | 119 tok/s | Harmony-native, 100% tools |
| **32+ GB** Mac Mini / Studio | Qwen3.6-35B-A3B | 93 tok/s | 256 MoE experts, 262K context |
| **48+ GB** Mac Mini / Studio | Qwen3.5-35B-A3B 8bit | 80 tok/s | Best balance of smart + fast |
| **96+ GB** Mac Studio / Pro | Qwen3.5-122B | 57 tok/s¹ | Frontier-level intelligence |
| **128+ GB** Mac Studio Ultra | DeepSeek V4 Flash 158B-A13B | 31-56 tok/s¹ | Day-0 frontier MoE, 1M context |

<sub>Single-user end-to-end throughput (B=1: one request at a time, 256 max output tokens, `output_tokens / wall-clock` incl. first-token latency), median of 3 rounds. `chat_template_kwargs.enable_thinking=False` passed where the engine honours it. Tested on M3 Ultra 256 GB / rapid-mlx v0.6.83 (fused top-p sampler). ¹ carried over from 2026-04 bench — disk-constrained on this refresh.</sub>

<details>
<summary><b>New to local AI? Quick glossary</b></summary>

- **tok/s** (tokens per second) — roughly how many words the AI generates per second. Higher = faster.
- **4bit / 8bit** — compression levels for models. 4bit uses less memory (recommended); 8bit is higher quality.
- **TTFT** (Time To First Token) — how long before the AI starts responding.
- **Tool calling** — the AI can call functions in your code. Used by Cursor, Claude Code, and coding assistants.
- **OpenAI API compatible** — Rapid-MLX speaks the same language as ChatGPT's API, so any app that works with ChatGPT can work with Rapid-MLX by just changing the server address.
- **Ollama / llama.cpp** — other popular tools for running local AI. The only **apples-to-apples** row in our table is GPT-OSS 20B (identical weights both sides) — Rapid-MLX runs it **2.3x faster than Ollama** under B=4 concurrent load. On the **Qwen3 closest-tag** rows (Qwen3.5/3.6 DeltaNet isn't on llama.cpp yet, so we compare against `qwen3:Nb`) Rapid-MLX leads 1.7–2.4x. The **Gemma 4 row** is tied at parity with Ollama's Gemma 3 (different architectures, 1.0x). Against `mlx-lm serve` (same MLX weights) Rapid-MLX is **1.2–1.5x faster**. Full caveats in [Benchmarks](#benchmarks).

</details>

---

## Quick Start

**Step 1 — Install** (pick one):

```bash
# uv (recommended — one command, isolated env, auto-manages Python)
uv tool install rapid-mlx@latest
# Don't have uv yet? Install it first: curl -LsSf https://astral.sh/uv/install.sh | sh

# Or one-liner with auto-setup (installs Python if needed)
curl -fsSL https://raullenchai.github.io/Rapid-MLX/install.sh | bash

# Homebrew (Mac-native — needs tap + trust before install on Homebrew 4.x)
brew tap raullenchai/rapid-mlx
brew trust raullenchai/rapid-mlx
brew install rapid-mlx

# pip (requires Python 3.10+ — macOS ships 3.9, so install Python first if needed)
pip install rapid-mlx
```

Upgrade later: `uv tool upgrade rapid-mlx` / `brew upgrade rapid-mlx` / `pip install -U rapid-mlx`.

> **Vision/multimodal models** (Gemma 4, Qwen-VL, etc.) need extras: `pip install 'rapid-mlx[vision]'`. Text-only install is ~460 MB; vision adds ~322 MB. See [Optional Extras](#optional-extras) for the full list.

> **"No matching distribution" error?** Your Python is too old. Run `python3 --version` — if it says 3.9, install a newer Python: `brew install python@3.12` then `python3.12 -m pip install rapid-mlx`

> **`Refusing to load formula ... from untrusted tap`?** Homebrew 4.x requires third-party taps to be explicitly trusted before install. The `brew trust raullenchai/rapid-mlx` line above is what marks the tap as trusted — without it, even after `brew tap`, the install is refused. Trust is per-machine and persists across upgrades.

> **`Tapping homebrew/core` / `Operation not permitted` during `brew install`?** Brew 5.x's install sandbox can't auto-tap `homebrew/core` mid-install. Pre-tap it once, then retry:
> ```bash
> brew tap homebrew/core --force   # ~1.3 GB, one-time
> brew tap raullenchai/rapid-mlx
> brew trust raullenchai/rapid-mlx
> brew install rapid-mlx
> ```

**Step 2 — Talk to a model right now** (one command, no second terminal):
```bash
rapid-mlx chat
```
Defaults to `qwen3.5-4b-4bit`. First run downloads the model (~2.5 GB) — you'll see a progress bar. Drops you into a REPL when it's ready. Type `/help` for slash commands, `/exit` to quit. Pass `--think` to surface chain-of-thought.

**Step 2b — Or serve a model for use from other apps:**
```bash
rapid-mlx serve qwen3.5-4b-4bit
```
Same model, same download — but this starts an OpenAI-compatible HTTP server instead of a REPL. Wait for `Ready: http://localhost:8000/v1`.

> Want vision? `pip install 'rapid-mlx[vision]'` then `rapid-mlx serve gemma-4-26b-4bit` (~14 GB).

**Step 3 — Hit the API** (from a second terminal tab):
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"default","messages":[{"role":"user","content":"Say hello"}]}'
```

That's it — you now have an OpenAI-compatible AI server on `localhost:8000`. Point any app at `http://localhost:8000/v1` and it just works.

**Step 4 — Share it publicly** (optional — get a `https://` URL anyone can hit):
```bash
rapid-mlx share qwen3.6-27b-8bit
```
This spawns the same local serve and tunnels it through `rapidserver.quicksilverpro.io` over a WebSocket. Your terminal prints a public OpenAI-compatible endpoint plus a bearer key — point any chat UI or OpenAI SDK at it. Bearer auth, a locked-down CORS allowlist, and a default 120 RPM rate-limit are wired on the spawned child; closing the terminal tears the tunnel down.

The default chat surface is our hosted Big-AGI fork (tool calling, personas, voice — no signup); any OpenAI-compatible client also works, e.g. `OPENAI_API_BASE_URL=<share-url>/v1 OPENAI_API_KEY=<bearer> open-webui serve`.

> Pick a 27B-class model or larger for a usable share experience — 4B is fine for local dev but too small for live chat (`rapid-mlx models` lists all aliases).

> **Want a Claude Code-like TUI?** Rapid-MLX is the *backend* — pair it with an open-source agent CLI like [OpenCode](https://github.com/sst/opencode) or [codex](https://github.com/openai/codex) for the full slash-commands / tool-use / multi-turn experience. Run `rapid-mlx agents opencode --setup` (or `codex --setup`) to wire it up automatically.

> **Tip:** Run `rapid-mlx models` to see all available model aliases. For a smaller/faster model, try `rapid-mlx serve qwen3.5-9b-4bit` (~5 GB).

<details>
<summary>More install options</summary>

**From source** (for development):
```bash
git clone https://github.com/raullenchai/Rapid-MLX.git
cd Rapid-MLX && pip install -e .
```

**Vision models** (adds mlx-vlm + opencv + torch, ~322 MB extra):
```bash
pip install 'rapid-mlx[vision]'
```

**Audio** (TTS/STT via mlx-audio):
```bash
pip install 'rapid-mlx[audio]'
```
</details>

> **Not into the terminal?** [**Rapid-MLX Desktop**](https://rapidmlx.com) is a Mac app that bundles the same `rapid-mlx` engine inside a one-click GUI — drag to Applications, pick a model, chat. No Python, no `pip`, no `brew`. The CLI here is still the source of truth for serving and scripting; the desktop app is the friendlier on-ramp.

**Try it with Python** (make sure the server is running, then `pip install openai`):

```python
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")  # any value works, no real key needed

response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Say hello"}],
)
print(response.choices[0].message.content)
```

---

## Works With

### Agent Harnesses (MHI-tested)

| Harness | Type | Notes |
|---------|------|-------|
| [Hermes Agent](https://github.com/NousResearch/hermes-agent) | Agent | 62 tools, multi-turn ([test](tests/integrations/test_hermes.py)) |
| [PydanticAI](https://ai.pydantic.dev) | Framework | Typed agents, structured output ([test](tests/integrations/test_pydantic_ai_full.py)) |
| [LangChain](https://langchain.com) | Framework | `ChatOpenAI`, tools, streaming ([test](tests/integrations/test_langchain.py)) |
| [smolagents](https://github.com/huggingface/smolagents) | Framework | CodeAgent + ToolCallingAgent ([test](tests/integrations/test_smolagents_full.py)) |
| [OpenClaude](https://github.com/Gitlawb/openclaude) (Anthropic SDK) | Agent | `CLAUDE_CODE_USE_OPENAI=1` ([test](tests/integrations/test_anthropic_sdk.py)) |
| [Aider](https://aider.chat) | Agent | CLI edit-and-commit, architect mode ([test](tests/integrations/test_aider.sh)) |
| [Goose](https://github.com/block/goose) | Agent | Ollama provider via `OLLAMA_HOST` |
| [OpenCode](https://github.com/sst/opencode) | TUI Agent | Claude Code-like terminal UX, OpenAI-compat provider |
| [Codex CLI](https://github.com/openai/codex) | Agent | OpenAI's official Rust agent — `/v1/responses` shim, verified end-to-end against codex 0.136.0 on Qwen3.5-9B / Qwen3.6-27B (chat + file read/write + shell + multi-step + source analysis); release gauntlet G7b probes the codex-shape SSE on every tag ([guide](docs/guides/codex-cli.md), [release gate](docs/development/releasing.md)) |
| [Claude Code](https://www.anthropic.com/claude-code) | Agent | Anthropic SDK via `/v1/messages` — `ANTHROPIC_BASE_URL=http://localhost:8000` |
| [Claw Code](https://github.com/ultraworkers/claw-code) | Agent | OpenAI & Anthropic endpoints |

### UI / IDE Clients

| Client | Status | Setup |
|--------|--------|-------|
| [Cursor](https://cursor.com) | Compatible | Settings → OpenAI Base URL |
| [Claude Code](https://claude.ai/code) | Tested | One command ([see below](#claude-code)) |
| [Continue.dev](https://continue.dev) | Compatible | VS Code / JetBrains extension |
| [LibreChat](https://librechat.ai) | Tested | Docker ([test](tests/integrations/test_librechat_docker.py)) |
| [Open WebUI](https://github.com/open-webui/open-webui) | Tested | Docker ([test](tests/integrations/test_openwebui.py)) |
| Any OpenAI-compatible app | Compatible | Point at `http://localhost:8000/v1` |

### Claude Code

**Claude Code** (Anthropic's web-based code editor) works with Rapid-MLX via the Anthropic Messages API endpoint (`/v1/messages`).

**Terminal 1 — Start the Rapid-MLX server:**
```bash
rapid-mlx serve qwen3.5-9b-4bit
```
Wait for: `Ready: http://localhost:8000/v1`

**Terminal 2 — Launch Claude Code pointing to your local server:**
```bash
export ANTHROPIC_BASE_URL="http://localhost:8000"
export ANTHROPIC_API_KEY="not-needed"
claude --model claude-opus-4-5
```

The server accepts any `claude-*` or `gpt-*` model name in requests and routes them to the loaded engine (configured on the server). The response always reflects the actual loaded model, not the client-requested name. This means:
- Claude Code can use `--model claude-opus-4-5` or any other alias
- The server runs whatever model you specified with `rapid-mlx serve <model>`
- Tool calling and streaming work out of the box with Qwen3.5 / Qwen3.6 models

**Tip:** For the best Claude Code experience on a 24 GB MacBook Pro, use `qwen3.5-9b-4bit` — it's smart enough for coding tasks while staying responsive.

### Model-Harness Index (MHI)

MHI measures how well a model works with a specific agent harness. It combines three dimensions:

| Dimension | Weight | What it measures | Source |
|---|---|---|---|
| **Tool Calling** | 50% | Can the model+harness execute function calls correctly? | `rapid-mlx agents --test` |
| **HumanEval** | 30% | Can the model generate correct code? | [HumanEval](https://github.com/openai/human-eval) (10 tasks) |
| **MMLU** | 20% | Does the harness degrade base knowledge? | [tinyMMLU](https://huggingface.co/datasets/tinyBenchmarks/tinyMMLU) (10 tasks) |

**MHI = 0.50 × ToolCalling + 0.30 × HumanEval + 0.20 × MMLU** (scale 0-100)

| Model | Best MHI | Best Harness | Tool Calling |
|---|---|---|---|
| **Qwopus 27B** | **92** | All (Hermes, PydanticAI, LangChain, smolagents) | 100% |
| **Qwen3.5 27B** | **82** | Hermes / PydanticAI / LangChain | 100% |
| **Llama 3.3 70B** | **83** | smolagents (text-based) | 100% |
| **Nemotron Nano 30B** | **59** | PydanticAI / LangChain | 91-93% |
| **Gemma 4 26B** | **62** | Hermes / smolagents | 100% |

<details>
<summary>Full MHI table (25 model-harness combinations) + methodology</summary>

**MHI = 0.50 × ToolCalling + 0.30 × HumanEval + 0.20 × MMLU** (scale 0-100)

Run `rapid-mlx agents` to see all supported agents and `python3 scripts/mhi_eval.py` to compute MHI on your own setup.

| Model + Harness | Tool Calling | HumanEval | MMLU | **MHI** |
|---|---|---|---|---|
| **Qwopus 27B** + Hermes | 100% | 80% | 90% | **92** |
| **Qwopus 27B** + PydanticAI | 100% | 80% | 90% | **92** |
| **Qwen3.5 27B** + Hermes | 100% | 40% | 100% | **82** |
| **Llama 3.3 70B** + smolagents | 100% | 50% | 90% | **83** |
| **DeepSeek-R1 32B** + smolagents | 100% | 30% | 100% | **79** |
| **Gemma 4 26B** + Hermes | 100% | 0% | 60% | **62** |
| **Nemotron Nano 30B** + PydanticAI | 93% | 0% | 60% | **59** |

</details>

**Quick setup for popular apps:**

**Cursor:** Settings → Models → Add Model:
```
OpenAI API Base:  http://localhost:8000/v1
API Key:          not-needed
Model name:       default          (or qwen3.5-9b-4bit — either works)
```
Cursor's agent/composer mode uses tool calls automatically — Rapid-MLX handles them natively with Qwen3.5 models, no extra flags needed.

**Claw Code:**
```bash
export OPENAI_BASE_URL=http://localhost:8000/v1
export OPENAI_API_KEY=not-needed
claw --model "openai/default" prompt "summarize this repo"
```

**OpenClaude:**
```bash
CLAUDE_CODE_USE_OPENAI=1 OPENAI_BASE_URL=http://localhost:8000/v1 \
OPENAI_API_KEY=not-needed OPENAI_MODEL=default openclaude -p "hello"
```

**Hermes Agent** (`~/.hermes/config.yaml`):
```yaml
model:
  provider: "custom"
  default: "default"
  base_url: "http://localhost:8000/v1"
  context_length: 32768
```

**Goose:**
```bash
GOOSE_PROVIDER=ollama OLLAMA_HOST=http://localhost:8000 \
GOOSE_MODEL=default goose run --text "hello"
```

**Claude Code:**
```bash
OPENAI_BASE_URL=http://localhost:8000/v1 claude
```

<details>
<summary><strong>More client setup instructions</strong></summary>

**Continue.dev** (`~/.continue/config.yaml`):
```yaml
models:
  - name: rapid-mlx
    provider: openai
    model: default
    apiBase: http://localhost:8000/v1
    apiKey: not-needed
```

**Aider:**
```bash
aider --openai-api-base http://localhost:8000/v1 --openai-api-key not-needed
```

**Swival** (`~/.swival/config.toml`):
```toml
[profiles.rapidmlx]
provider = "generic"
base_url = "http://127.0.0.1:8000"
model = "default"
```

Run with:
```bash
swival --profile rapidmlx "summarize this repo"
```

**Open WebUI** (Docker one-liner):
```bash
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -e ENABLE_OLLAMA_API=False \
  -e OPENAI_API_BASE_URL=http://host.docker.internal:8000/v1 \
  -e OPENAI_API_KEY=not-needed \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main
```

**OpenCode** (`opencode.json` in your project root):
```json
{
  "provider": {
    "openai": {
      "api": "http://localhost:8000/v1",
      "models": {
        "default": {
          "name": "rapid-mlx local",
          "limit": { "context": 32768, "output": 8192 }
        }
      },
      "options": { "apiKey": "not-needed" }
    }
  }
}
```

**Codex CLI** (`~/.codex/config.toml` — or run `rapid-mlx agents codex --setup` to write this for you):
```toml
model = "default"
model_provider = "rapid-mlx"

[model_providers.rapid-mlx]
name = "Rapid-MLX (local)"
base_url = "http://localhost:8000/v1"
# If rapid-mlx was started with --api-key, add env_key = "RAPID_MLX_API_KEY"
# and `export RAPID_MLX_API_KEY=...`. Don't use api_key = "..." — Codex
# CLI's --strict-config rejects inline literals.
```

Then `codex` (or `codex exec '<query>'`) talks to the local model via `/v1/responses`. See the [Codex CLI guide](docs/guides/codex-cli.md) for the full setup.

**PydanticAI** (`pip install pydantic-ai`):
```python
from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIChatModel
from pydantic_ai.providers.openai import OpenAIProvider

model = OpenAIChatModel(
    model_name="default",
    provider=OpenAIProvider(
        base_url="http://localhost:8000/v1",
        api_key="not-needed",
    ),
)
agent = Agent(model)
print(agent.run_sync("What is 2+2?").output)
```

**smolagents** (`pip install smolagents`):
```python
from smolagents import CodeAgent, OpenAIServerModel

model = OpenAIServerModel(
    model_id="default",
    api_base="http://localhost:8000/v1",
    api_key="not-needed",
)
agent = CodeAgent(tools=[], model=model)
agent.run("What is 5 multiplied by 7?")
```

**LibreChat** (`librechat.yaml`, under `endpoints.custom`):
```yaml
- name: "Rapid-MLX"
  apiKey: "rapid-mlx"
  baseURL: "http://localhost:8000/v1/"
  models:
    default: ["default"]
    fetch: true
  titleConvo: true
  titleModel: "current_model"
  modelDisplayLabel: "Rapid-MLX"
```

**Anthropic SDK** (`pip install anthropic`):
```python
from anthropic import Anthropic
client = Anthropic(base_url="http://localhost:8000", api_key="not-needed")

message = client.messages.create(
    model="default",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Say hello"}],
)
print(message.content[0].text)
```

</details>

---

## Choose Your Model

### What fits my Mac?

The model has to fit in your Mac's RAM. If your Mac slows down or Activity Monitor shows red memory pressure, pick a smaller model from the table below.

> **Browse the full catalog** at [**models.rapidmlx.com**](https://models.rapidmlx.com/) — 80+ MLX-quantised models on a free R2 mirror with resumable downloads, no HuggingFace rate limits. `rapid-mlx pull <alias>` fetches from there automatically.

| Your Mac | Best Model | RAM Used | Speed (B=1) | Quality |
|----------|-----------|---------|-------|---------|
| **16 GB** MacBook Air/Pro | [Qwen3.5-4B 4bit](https://huggingface.co/mlx-community/Qwen3.5-4B-MLX-4bit) | 2.4 GB | 147 tok/s | Good for chat and simple tasks |
| **24 GB** MacBook Pro | [Qwen3.5-9B 4bit](https://huggingface.co/mlx-community/Qwen3.5-9B-4bit) | 5.1 GB | 101 tok/s | Great all-rounder |
| **32 GB** Mac Mini / Studio | [Qwen3.5-27B 4bit](https://huggingface.co/mlx-community/Qwen3.5-27B-4bit) | 15.3 GB | 37 tok/s | Solid coding model |
| **32 GB** Mac Mini / Studio | 🆕 [Gemma 4 12B 4bit](https://huggingface.co/mlx-community/gemma-4-12B-it-4bit) | 7 GB | 64 tok/s | Vision-capable + tool calling |
| **32 GB** Mac Mini / Studio | [GPT-OSS 20B MXFP4](https://huggingface.co/mlx-community/gpt-oss-20b-MXFP4-Q8) | 11 GB | 119 tok/s | Harmony-native, 100% tools |
| **32 GB** Mac Mini / Studio | [Qwen3.6-35B-A3B 4bit](https://huggingface.co/mlx-community/Qwen3.6-35B-A3B-4bit) | 20 GB | 93 tok/s | 256 MoE experts, 262K context |
| **36 GB** MacBook Pro M3/M4 Pro | [Qwen3.5-27B 4bit](https://huggingface.co/mlx-community/Qwen3.5-27B-4bit) | 15.3 GB | 37 tok/s | Same as 32 GB — extra headroom for long contexts |
| **48 GB** Mac Mini / Studio | [Qwen3.5-35B-A3B 8bit](https://huggingface.co/mlx-community/Qwen3.5-35B-A3B-8bit) | 37 GB | 80 tok/s | **Sweet spot** — smart + fast |
| **64 GB** Mac Mini / Studio | [Qwen3.5-35B-A3B 8bit](https://huggingface.co/mlx-community/Qwen3.5-35B-A3B-8bit) | 37 GB | 80 tok/s | Same model, more room for KV cache |
| **96 GB** Mac Studio / Pro | [Qwen3.5-122B mxfp4](https://huggingface.co/nightmedia/Qwen3.5-122B-A10B-Text-mxfp4-mlx) | 65 GB | 57 tok/s¹ | Best model, fits comfortably |
| **128 GB** Mac Studio / Pro | 🆕 [DeepSeek V4 Flash 2-bit DQ](https://huggingface.co/mlx-community/DeepSeek-V4-Flash-2bit-DQ) | 91 GB | 56 tok/s¹ | 158B-A13B frontier MoE, day-0 (chat only) |
| **192 GB** Mac Studio / Pro | [Qwen3.5-122B 8bit](https://huggingface.co/mlx-community/Qwen3.5-122B-A10B-8bit) | 130 GB | 44 tok/s¹ | Maximum quality |
| **256 GB** Mac Studio Ultra | 🆕 [DeepSeek V4 Flash 8-bit](https://huggingface.co/mlx-community/DeepSeek-V4-Flash-8bit) | 136 GB | 31 tok/s¹ | 158B-A13B frontier MoE, 1M context (chat only) |

<sub>Speed = single-user end-to-end throughput (B=1: one request, 256 max output tokens, output_tokens / wall-clock including first-token latency), median of 3 rounds. rapid-mlx v0.6.83 (fused top-p sampler) on M3 Ultra 256 GB, 2026-06-09. ¹ Carried over from prior bench (disk-constrained on this refresh).</sub>

> **4bit vs 8bit:** 4bit models are compressed to use less memory (recommended for most users). 8bit models are higher quality but need more RAM. "mxfp4" is a high-quality 4bit format.

### Naming convention

Every alias follows the same template so you can read off the model family, parameter count, training technique, and quantization at a glance:

`<family>-<version>-<params>-<modality?>-<technique?>-<quant>`

| Segment | Meaning | Examples |
|---|---|---|
| **family** | Model family | `gemma`, `qwen`, `llama`, `mistral`, `deepseek`, `phi` |
| **version** | Major version | `-4`, `3.5`, `3.6`, `-r1`, `-v4-flash` |
| **params** | Parameter count (MoE includes the active count) | `12b`, `27b`, `35b-a3b` (35B total / 3B active) |
| **modality** *(optional)* | Non-text variants | `-vl` (vision), `-coder` (code) |
| **technique** *(optional)* | Training-time modifier | `-qat` (Quantization-Aware Training), `-distill`, `-thinking` |
| **quant** *(mandatory)* | Quantization tier (see below) | `-4bit`, `-8bit`, `-mxfp4`, `-qat-8bit`, … |

The **quantization suffix is mandatory on every alias** — `qwen3.5-4b-4bit` not `qwen3.5-4b`, `gemma-4-12b-qat-8bit` not `gemma-4-12b-qat`. This mirrors LM Studio's `…-MLX-4bit` / `…-MLX-8bit` HuggingFace convention so you never have to guess the bit width.

| Suffix | Meaning |
|---|---|
| `-4bit` | Standard MLX 4-bit (most common) |
| `-8bit` | Standard MLX 8-bit (higher quality, ~2× RAM) |
| `-2bit`, `-3bit`, `-6bit` | Other bit widths |
| `-mxfp4` | Microscaling FP4 (high-quality 4-bit) |
| `-mxfp4-q8` | MXFP4 weights + Q8 head (GPT-OSS style) |
| `-dwq` | Dynamic Weight Quantization (mlx-community) |
| `-ud` | Unsloth Dynamic (mixed-precision per-layer) |
| `-unpacked` | Original FP16 / BF16 weights, no quantization |

`-qat` is a *technique* suffix, not a quant — it stacks before the quant. So a QAT-trained Gemma 4 12B in 4-bit is `gemma-4-12b-qat-4bit`, and the 8-bit variant is `gemma-4-12b-qat-8bit`.

Decoded examples:

- `gemma-4-12b-qat-4bit` = Gemma 4 · 12B params · QAT-trained · 4-bit quant
- `qwen3.5-35b-8bit` = Qwen 3.5 · 35B params (3B active MoE) · 8-bit quant
- `gpt-oss-20b-mxfp4-q8` = GPT-OSS · 20B params · MXFP4 weights + Q8 head
- `bonsai-1.7b-unpacked` = Bonsai · 1.7B params · no quantization

### Full model lineup

81 explicit aliases across 13 families ship today. Run `rapid-mlx models` for the live list with parser, hybrid / MoE flags, and DFlash eligibility.

<details>
<summary><strong>Show all 81 aliases by family</strong></summary>

| Family | Aliases | Notable |
|---|---|---|
| **Qwen3.5** | `qwen3.5-4b-4bit`, `-4b-8bit`, `-9b-4bit`, `-9b-8bit`, `-27b-4bit`, `-27b-8bit` ✨, `-35b-4bit`, `-35b-8bit`, `-122b-mxfp4`, `-122b-8bit` | DeltaNet hybrid; **27b-8bit DFlash-eligible** |
| **Qwen3.6** | `qwen3.6-27b-4bit`, `-27b-8bit` ✨, `-27b-ud`, `-35b-4bit`, `-35b-6bit`, `-35b-8bit`, `-35b-dwq`, `-35b-ud` | 262K ctx, 256 MoE experts; **27b-8bit DFlash-eligible** |
| **Qwen3** | `qwen3-0.6b-8bit`, `-4b-8bit`, `-8b-8bit`, `qwen3-coder-4bit`, `qwen3-coder-30b-4bit`, `qwen3-vl-4b-4bit`, `-8b-4bit`, `-30b-4bit` | Coding + vision |
| **Qwopus** | `qwopus-9b-4bit`, `qwopus-27b-4bit`, `qwopus-27b-8bit` | 92 MHI on tool calling |
| **DeepSeek** | `deepseek-r1-8b-4bit`, `-32b-4bit`, `deepseek-v4-flash-2bit`, `-4bit`, `-8bit` | R1 reasoning + V4 Flash 158B-A13B day-0 |
| **Gemma** | `gemma-3n-e4b-4bit`, `gemma-4-12b-4bit`, `-12b-qat-4bit`, `-12b-qat-8bit`, `-26b-4bit`, `-26b-qat-4bit`, `-31b-4bit`, `-31b-8bit`, `-31b-qat-4bit`, `-31b-qat-8bit`, `gemma3-1b-4bit`, `-12b-4bit`, `-27b-4bit` | Vision-capable; QAT variants |
| **Llama / Hermes** | `llama3-1b-4bit`, `-3b-4bit`, `llama-3.1-8b-8bit`, `hermes3-8b-4bit`, `hermes4-70b-4bit` | |
| **GLM** | `glm4.5-air-4bit`, `glm4.7-9b-4bit` | |
| **GPT-OSS** | `gpt-oss-20b-mxfp4-q8` | Harmony native |
| **MiniMax** | `minimax-m2.5-4bit`, `minimax-m2.7-mxfp4` | |
| **Mistral / Devstral** | `mistral-24b-4bit`, `devstral-24b-4bit`, `devstral-v2-24b-4bit`, `ministral-3b-4bit` | |
| **Other** | `phi-4-14b-4bit`, `phi-4-mini-4bit`, `smollm3-3b-4bit`, `nemotron-30b-4bit`, `bonsai-1.7b-unpacked`, `-4b-unpacked`, `-8b-unpacked`, `granite4-tiny-4bit` | |
| 🆕 **Text-Diffusion** | `diffusion-gemma-26b-4bit`, `diffusion-gemma-26b-8bit` | Non-autoregressive (block denoising); same `/v1/chat/completions` API |

✨ = DFlash speculative decoding supported (opt in with `--enable-dflash`). `rapid-mlx info <alias>` shows per-alias capabilities.

</details>

### Copy-paste commands

Pick the one that matches your Mac. Run `rapid-mlx models` to see all available aliases.

```bash
# 16 GB — lightweight, fast
rapid-mlx serve qwen3.5-4b-4bit --port 8000

# 24 GB — best small model
rapid-mlx serve qwen3.5-9b-4bit --port 8000

# 32 GB — solid coding model
rapid-mlx serve qwen3.5-27b-4bit --port 8000

# 32 GB — Gemma 4 12B (vision-capable, 64 tok/s)
rapid-mlx serve gemma-4-12b-4bit --port 8000

# 32 GB — GPT-OSS 20B (harmony-native, 100% tool calling, 119 tok/s)
rapid-mlx serve gpt-oss-20b-mxfp4-q8 --port 8000

# 32+ GB — Qwen 3.6 35B-A3B (256 experts, 262K context, 93 tok/s)
rapid-mlx serve qwen3.6-35b-4bit --port 8000

# 48+ GB — sweet spot (Qwen3.5-35B-A3B 8bit, 80 tok/s)
rapid-mlx serve qwen3.5-35b-8bit --prefill-step-size 8192 --port 8000  # faster first response

# 96+ GB — frontier (Qwen3.5-122B mxfp4)
rapid-mlx serve qwen3.5-122b-mxfp4 --prefill-step-size 8192 --port 8000

# Coding agent — fast MoE, great for Claude Code / Cursor
rapid-mlx serve qwen3-coder-4bit --prefill-step-size 8192 --port 8000  # MoE = only uses part of the model, so it's fast

# Vision — image understanding (see note below)
rapid-mlx serve qwen3-vl-4b-4bit --mllm --port 8000

# 🆕 Text-diffusion — DiffusionGemma 26B-A4B (block denoising, not autoregressive)
rapid-mlx serve diffusion-gemma-26b-4bit --port 8000  # needs [vision] extras for mlx-vlm 0.6.3+
```

> **Vision deps:** Install into the same environment where rapid-mlx lives:
> - `install.sh` users: `~/.rapid-mlx/bin/pip install 'rapid-mlx[vision]'`
> - `pip` users: `pip install 'rapid-mlx[vision]'` (in the same venv)
> - `brew` users: `$(brew --prefix)/opt/rapid-mlx/libexec/bin/pip install 'rapid-mlx[vision]'`

### 🆕 Text-Diffusion (DiffusionGemma 26B-A4B)

DiffusionGemma is a **non-autoregressive** language model — instead of emitting one token at a time, it denoises whole blocks of tokens in parallel via a diffusion process. Rapid-MLX wraps it behind the standard OpenAI Chat Completions API, so any client (chat UIs, agent harnesses, your own scripts) talks to it the same way it talks to Qwen / Gemma / GPT-OSS.

```bash
pip install 'rapid-mlx[vision]'       # mlx-vlm 0.6.3+ provides the diffusion runtime
rapid-mlx serve diffusion-gemma-26b-4bit --port 8000
```

**B=1 single-user benchmark** (M3 Ultra 256 GB, mlx-community/diffusiongemma-26B-A4B-it-4bit, median of 3 runs + 1 warmup):

| `max_tokens` | TTFT | E2E | Aggregate tok/s |
|---:|---:|---:|---:|
| 64 | 1.47s | 1.47s | 43 |
| 256 | 6.00s | 6.00s | 43 |
| 1024 | 5.71s | 19.58s | 37 |

> Diffusion models emit tokens in **whole denoising blocks**, so the conventional `decode_tok/s = tokens / (e2e − ttft)` metric isn't meaningful here (ttft ≈ e2e for short outputs). The table reports **aggregate** throughput — `tokens / total_wall_time` — i.e. how many tokens actually land in the chat window per second. Throughput climbs with output length because the per-step denoising cost amortizes across more emitted tokens.

Reproduce the table: `python3.12 scripts/bench_diffusion_gemma.py --port 8000`.

<details>
<summary><strong>Parser auto-detection & manual overrides</strong></summary>

Parsers are **auto-detected from the model name** — you don't need to specify `--tool-call-parser` or `--reasoning-parser` for supported families. Explicit flags always override auto-detection.

| Model Family | Auto-detected `--tool-call-parser` | Auto-detected `--reasoning-parser` | Notes |
|-------------|---------------------|---------------------|-------|
| Qwen3.5 (all sizes) | `hermes` | `qwen3` | **Recommended** — 100% tool calling |
| 🆕 Qwen3.6 | `qwen3_coder_xml` | `qwen3` | XML tool format, 262K context |
| Qwen3-Coder-Next | `hermes` | *(none)* | Fast coding, non-thinking mode |
| DeepSeek R1-0528 / V3.1 | `deepseek_v31` | `deepseek_r1` | Dedicated V3.1 parser |
| DeepSeek R1 (older) | `deepseek` | `deepseek_r1` | With reasoning |
| DeepSeek V3 / V2.5 | `deepseek` | *(none)* | No reasoning parser |
| GLM-4.7 | `glm47` | *(none)* | 100% tool calling |
| MiniMax-M2.5 | `minimax` | `minimax` | XML tool format |
| GPT-OSS | `harmony` | `harmony` | Native format |
| Kimi-Linear | `kimi` | *(none)* | Kimi tool format |
| Llama 3.x | `llama` | *(none)* | JSON tool format |
| Mistral / Devstral | `hermes` | *(none)* | Hermes-compatible |
| Gemma | `hermes` | *(none)* | Hermes-compatible |
| Phi-3/4 | `hermes` | *(none)* | Hermes-compatible |

All 17 parsers include automatic recovery — if a quantized model outputs broken tool calls as text, they're auto-converted back to structured format.

</details>

---

## Benchmarks

Tested on **Mac Studio M3 Ultra (256 GB)**, 2026-06-06. Workload is **B=4 sustained concurrent streaming** (four parallel chat requests, 256 max output tokens each), median of 3 measured rounds after one warmup discard. Engines were swapped sequentially with an 8 s Metal cooldown so contention never crossed engine boundaries.

`chat_template_kwargs.enable_thinking=False` is passed to all engines that honour it (rapid-mlx, mlx-lm, mlx-vlm). Ollama 0.24 ignores that hook for Qwen3 and keeps streaming reasoning chunks — those decode at the same model rate as content tokens, so we count them, and the Qwen3 Ollama numbers reflect chain-of-thought-on throughput in practice. Token counts come from the streaming `usage` chunk (authoritative), not from counting SSE frames.

Versions: rapid-mlx **v0.6.80**, mlx-lm **0.31.3**, Ollama **0.24.0** (latest stable).

Aggregate throughput = sum of output tokens across all four streams ÷ wall-clock seconds — the metric that matters for a server fronting multiple users or a TUI firing parallel sub-agents. Per-user decode is roughly aggregate ÷ 4 on a true batching engine; on Ollama 0.24 (no in-flight batching) the four streams effectively serialize, so the per-stream decode-only rate (`output_tokens / (e2e − ttft)`, recorded as `median_per_stream_tps` in the raw JSON) sits at or slightly above the aggregate. That is expected and not a metric mismatch — decode-only excludes prefill while aggregate spans the entire wall-clock window. The Ollama daemon also caches the previously loaded model in memory across rows; `OllamaEngine.stop()` only unloads the row's own tag, so cross-row Metal residency effects are possible — `ollama ps` between rows shows what's actually resident.

| Model (rapid-mlx alias) | rapid-mlx (B=4) | mlx-lm serve | Ollama tag (closest) | Ollama (B=4) | vs mlx-lm | vs Ollama |
|---|---:|---:|---|---:|:-:|:-:|
| **Qwen3.5-4B** | **261** tok/s | 173 | `qwen3:4b`¹ | 120 | **1.51x** | **2.18x** |
| **Qwen3.5-9B** | **180** tok/s | 136 | `qwen3:8b`¹ | 84 | **1.32x** | **2.14x** |
| **Qwen3.5-27B** | **66** tok/s | 55 | `qwen3:32b`² | 27 | **1.20x** | **2.43x** |
| **Gemma 4 12B** | **55** tok/s | crash³ | `gemma3:12b`⁴ | 56 | — | 1.00x |
| **GPT-OSS 20B** | **221** tok/s | 162 | `gpt-oss:20b` ✅ | 97 | **1.36x** | **2.29x** |
| **Qwen3.6-35B-A3B** (4-bit) | **176** tok/s | 129 | `qwen3:30b-a3b`⁵ | 87 | **1.37x** | **2.02x** |
| **Qwen3.5-35B-A3B** (8-bit) | **151** tok/s | 112 | `qwen3:30b-a3b`⁵ | 87 | **1.35x** | **1.74x** |

✅ Direct apples-to-apples: identical weights both sides.

<sub>¹ Ollama Qwen3 base, not Qwen3.5 — DeltaNet hybrid arch isn't on llama.cpp yet. ² Closest dense Qwen3; Unsloth Qwen3.6-27B GGUF fails to load on Ollama 0.24. ³ mlx-lm 0.31.3 has no Gemma 4 loader (it lives in mlx-vlm). ⁴ Gemma 4 not yet on llama.cpp — Gemma 3 is the closest. ⁵ Closest MoE A3B available; Qwen3.5/3.6-35B-A3B don't have a llama.cpp build yet.</sub>

> **Different Mac?** Numbers above are one M3 Ultra. See community-submitted runs across M1/M2/M3/M4 Apple Silicon at [**rapidmlx.com/performance**](https://rapidmlx.com/performance/) — sortable by chip × model × version. Submit your own with `rapid-mlx bench <alias> --submit`.

*Full benchmark data with all models, TTFT tables, DeltaNet snapshots, and engine comparison below.*

Reproduce the throughput table:

```bash
python3.12 scripts/bench_readme_refresh.py \
  --models qwen3.5-4b-4bit,qwen3.5-9b-4bit,qwen3.5-27b-4bit,gemma-4-12b-4bit,gpt-oss-20b-mxfp4-q8,qwen3.6-35b-4bit,qwen3.5-35b-8bit \
  --engines rapid-mlx,mlx-lm,ollama
```

Raw JSON per round + per-stream tok/s land in `reports/benchmarks/readme-refresh/`.

<details>
<summary><strong>TTFT — Prompt Cache Advantage</strong></summary>

Prompt cache keeps multi-turn conversations fast. For standard transformers, KV cache trimming gives sub-100ms TTFT. For hybrid RNN models (Qwen3.5 DeltaNet), we use state snapshots — the first technique to bring prompt cache to non-trimmable architectures on MLX.

<sub>Numbers below were last verified 2026-04 — the prefix-cache code path has not changed since. The 2026-06 throughput refresh focused on decode tok/s under concurrent load; a TTFT refresh is tracked separately.</sub>

**Pure KV cache (transformers):**

| Model | Rapid-MLX (cached) | mlx-lm serve | Speedup |
|-------|-------------------|-------------------|---------|
| Kimi-Linear-48B | **0.08s** | — | — |
| Llama 3.2 3B | **0.10s** | — | — |
| Hermes-3-Llama 8B | **0.10s** | 0.18s | 1.8x |
| Phi-4 Mini 14B | **0.13s** | 0.15s | 1.2x |
| Devstral-Small-2 24B | **0.13s** | 0.38s | 2.9x |
| Mistral Small 24B | **0.13s** | 0.38s | 2.9x |
| GLM-4.7-Flash 9B | **0.13s** | 0.23s | 1.8x |
| GLM-4.5-Air | **0.14s** | 0.47s | 3.4x |
| Qwen3-Coder-Next 80B | **0.16s** | 0.27s | 1.7x |
| GPT-OSS 20B | **0.16s** | 0.27s | 1.7x |
| Qwen3.5-9B | **0.22s** | 0.26s | 1.2x |
| Gemma 4 E4B | **0.25s** | — (day-0) | — |
| Gemma 4 26B-A4B | **0.25s** | — (day-0) | — |
| Gemma 4 31B | **0.34s** | 0.57s (mlx-vlm bf16) | **1.7x** |

**DeltaNet state snapshots (hybrid RNN + attention):**

Qwen3.5 uses Gated DeltaNet (75% RNN) + full attention (25% KV). Other engines recreate the entire cache from scratch every request — we snapshot the RNN state at the system prompt boundary, restoring in ~0.1ms instead of re-running hundreds of tokens through the recurrent layers.

| Model | Cold TTFT | Snapshot TTFT | Speedup |
|-------|-----------|---------------|---------|
| Qwen3-Coder-Next 6bit (48L) | 0.66s | **0.16s** | **4.3x** |
| Qwen3.5-35B-A3B 8bit (40L) | 0.49s | **0.19s** | **2.6x** |
| Qwen3.5-27B 4bit (40L) | 0.58s | **0.27s** | **2.1x** |
| Qwen3.5-9B 4bit (40L) | 0.27s | **0.22s** | **1.2x** |
| Qwen3.5-4B 4bit (32L) | 0.24s | **0.16s** | **1.5x** |

</details>

<details>
<summary><strong>Capability Comparison</strong></summary>

| Feature | Rapid-MLX | oMLX | Ollama | llama.cpp | mlx-lm serve |
|---------|-----------|------|--------|-----------|-------------|
| **Tool calling** | 100% (Qwen/GLM/GPT-OSS/Kimi) | N/A | 100% (Qwen) | 80% (Phi-4) | N/A |
| **Tool call recovery** | 100% | N/A | 100% | 100% | N/A |
| **Tool injection fallback** | Yes | No | No | No | No |
| **Think-tag leak** | 0% | N/A | 0% | 0% | N/A |
| **Prompt cache** | KV + DeltaNet | No | No | No | No |
| **Vision** | Yes | Yes | Yes | No | No |
| **Audio (STT/TTS)** | Yes | No | No | No | No |
| **17 tool parsers** | Yes | No | No | No | No |
| **Cloud routing** | Yes | No | No | No | No |
| **Streaming** | Yes | Yes | Yes | Yes | Yes |
| **OpenAI API** | Yes | Yes | Yes | Yes | Yes |

</details>

<details>
<summary><strong>Optimization Techniques Per Model</strong></summary>

| Technique | What it does | Models |
|-----------|-------------|--------|
| **KV prompt cache** | Trim KV cache to common prefix, skip re-prefill | All transformer models |
| **DeltaNet state snapshots** | Deep-copy RNN state at prefix boundary, restore in ~0.1ms | Qwen3.5 (4B, 9B, 27B, 35B, 122B), Qwen3-Coder-Next |
| **Hybrid cache sync** | Keep trimmable KV + non-trimmable RNN layers in sync | Qwen3.5 (Gated DeltaNet + attention) |
| **Tool logits bias** | Jump-forward decoding — bias logits toward structured tokens | All models with `--enable-tool-logits-bias` |
| **Auto tool recovery** | Detect broken text-format tool calls, convert to structured | All 17 parser formats (incl. Gemma 4) |
| **TurboQuant V-cache** | Rotate + Lloyd-Max compress V cache (86% savings on dense models) | All models with `--kv-cache-turboquant` |
| **KV cache quantization** | Quantize prefix cache entries to reduce memory | All models with `--kv-cache-quantization` |
| **DFlash speculative decoding** | Block-diffusion drafter, parallel draft + verify | `qwen3.5-27b-8bit`, `qwen3.6-27b-8bit` (single-user) |
| **SuffixDecoding** | Drafter-free, statistical n-gram lookup speculative decoding | All BatchedEngine models with `--suffix-decoding` |
| **Prefill chunking** | Configurable step size for large-prompt throughput | All models |
| **Cloud routing** | Offload high-token requests to cloud LLM when local is slow | All models with `--cloud-model` |

</details>

<details>
<summary><strong>Eval benchmarks (20 models, 4 suites)</strong></summary>

Tool calling (30 scenarios), coding (HumanEval+), reasoning (MATH-500), general knowledge (MMLU-Pro). Top models:

| Model | Decode (B=1) | Tools | Code | Reason | General | Avg |
|-------|--------|-------|------|--------|---------|-----|
| Qwen3.5-122B 8bit | 44 t/s¹ | 87% | 90% | 90% | 90% | **89%** |
| Qwen3.5-35B 8bit | 59 t/s | 90% | 90% | 80% | 80% | **85%** |
| Qwen3-Coder-Next 4bit | 74 t/s¹ | 90% | 90% | 70% | 70% | **80%** |
| Qwen3.5-27B 4bit | 33 t/s | 83% | 90% | 50% | 80% | **76%** |
| Qwen3.5-9B 4bit | 100 t/s | 83% | 70% | 60% | 70% | **71%** |

<sub>Decode = single-user end-to-end throughput refreshed 2026-06-06 against rapid-mlx v0.6.80. ¹ Carried over from the 2026-04 bench (not re-measured this round).</sub>

Run your own: `bash evals/run_all_models.sh` runs the full quality suite (tool calling, coding, reasoning, general) across every alias and emits a fresh `evals/SCORECARD.md`. The `Decode` column above is the throughput rapid-mlx achieves on each row — see the [Benchmarks](#benchmarks) section for the cross-engine throughput reproduction command.

</details>

---

## Features

### Tool Calling

Full OpenAI-compatible tool calling with 17 parser formats and **automatic recovery when quantized models break**. Models at 4-bit degrade after multiple tool rounds — Rapid-MLX auto-detects broken output and converts it back to structured `tool_calls`.

### Reasoning Separation

Models with chain-of-thought (Qwen3, DeepSeek-R1) output reasoning in a separate `reasoning_content` field — cleanly separated from `content` in streaming mode. Works with Qwen3, DeepSeek-R1, MiniMax, and GPT-OSS reasoning formats.

### Prompt Cache

Persistent cache across requests — only new tokens are prefilled on each turn. For standard transformers, KV cache trimming. For hybrid models (Qwen3.5 DeltaNet), RNN state snapshots restore non-trimmable layers from memory instead of re-computing. 2-5x faster TTFT on all architectures. Always on, no flags needed.

### Smart Cloud Routing

Large-context requests auto-route to a cloud LLM (GPT-5, Claude, etc.) when local prefill would be slow. Routing based on new tokens after cache hit. `--cloud-model openai/gpt-5 --cloud-threshold 20000`

### Multimodal

Vision, audio (STT/TTS), video understanding, and text embeddings — all through the same OpenAI-compatible API.

### PFlash Prefill Acceleration

Long prompts are slow to *start* — the first token waits on the whole context to prefill, and on Apple Silicon that prefill is the bottleneck. PFlash scores a long prompt and prefills only the tokens that matter (the attention sink, the recent tail, and the query-relevant middle), cutting **cold-start TTFT by 3.87–8.5×** on 32K+ prompts with full needle-in-a-haystack recall. Pure-Python, no extra dependencies, and **on by default** for verified models.

```bash
rapid-mlx serve qwen3.5-9b-4bit          # PFlash auto-on for verified aliases
rapid-mlx serve <model> --pflash always  # force it on for any model
```

PFlash speeds up the prompt going *in*; **DFlash** speeds up the tokens coming *out*.

### DFlash Speculative Decoding (single-user)

z-lab's block-diffusion drafter (via mlx-vlm) accelerates single-stream generation on validated Qwen3.5/3.6 27B aliases. Opt in with `--enable-dflash`:

| Alias | Drafter | Avg speedup | Min / Max |
|---|---|---|---|
| `qwen3.6-27b-8bit` | `z-lab/Qwen3.6-27B-DFlash` | **1.49×** | 1.06× / 2.07× |
| `qwen3.5-27b-8bit` | `z-lab/Qwen3.5-27B-DFlash` | **1.31×** | 0.59× / 2.15× |

```bash
pip install 'rapid-mlx[dflash]'
rapid-mlx info qwen3.5-27b-8bit       # check per-gate eligibility
rapid-mlx serve qwen3.5-27b-8bit --enable-dflash
```

**Best on coding, math, and summarization** (typically **1.5–2.7×**); open-ended creative writing can dip below 1×, so DFlash is gated to the aliases where it reliably wins.

**v1 limitations**: DFlash mode runs a dedicated single-user server (mlx-vlm doesn't expose a batched DFlash kernel yet). Tool calling, MCP, and embeddings aren't available in DFlash mode — restart without `--enable-dflash` for those.

Also: a fused single-kernel sampler (faster than mlx-vlm's, identical sampling math), logprobs API, structured JSON output (`response_format`), continuous batching, KV cache quantization (`--kv-cache-quantization`), and [3300+ tests](tests/).

---

<details>
<summary><strong>Server Flags Reference</strong></summary>

> You don't need any flags to get started — the defaults work for most setups. These are for advanced tuning.

### Core

| Flag | Description | Default |
|------|-------------|---------|
| `<model>` | HuggingFace model name, local path, or alias (positional arg) | *(required)* |
| `--host` | Host to bind to | `0.0.0.0` |
| `--port` | Port to bind to | `8000` |
| `--max-tokens` | Default max tokens for generation | `32768` |

### Tool Calling & Reasoning

| Flag | Description | Default |
|------|-------------|---------|
| `--tool-call-parser` | Parser: `hermes`, `minimax`, `qwen`, `llama`, `deepseek`, etc. | *(auto-detected)* |
| `--reasoning-parser` | Parser: `qwen3`, `deepseek_r1`, `minimax`, `gpt_oss`, `harmony`, `glm4`, `gemma4` | *(auto-detected)* |
| `--enable-tool-logits-bias` | Jump-forward decoding for faster tool calls | off |

### Performance

| Flag | Description | Default |
|------|-------------|---------|
| `--prefill-step-size` | Tokens per prefill chunk | `2048` |
| `--kv-cache-turboquant` | TurboQuant V-cache compression (3-4 bit, 86% savings on dense models) | off |
| `--kv-cache-quantization` | Quantize prefix cache entries for memory savings | off |
| `--enable-prefix-cache` / `--disable-prefix-cache` | Cache common prefixes across requests | on |
| `--enable-dflash` | DFlash speculative decoding (single-user; `qwen3.5-27b-8bit` / `qwen3.6-27b-8bit`) | off |
| `--suffix-decoding` | Drafter-free n-gram speculative decoding (BatchedEngine path) | off |
| `--enable-mtp` | MTP head speculative decoding (requires MTP-trained model) | off |
| `--gpu-memory-utilization` | Fraction of device memory to use (0.0-1.0) | `0.90` |

### Cloud Routing

| Flag | Description | Default |
|------|-------------|---------|
| `--cloud-model` | litellm model string (e.g. `openai/gpt-5`) | *(disabled)* |
| `--cloud-threshold` | New token threshold to trigger cloud routing | `20000` |

### Security & Other

| Flag | Description | Default |
|------|-------------|---------|
| `--api-key` | API key for authentication | *(no auth)* |
| `--rate-limit` | Requests per minute per client | *(unlimited)* |
| `--timeout` | Request timeout in seconds | `1800` |
| `--mllm` / `--no-mllm` | Force / disable multimodal (vision) mode | auto-detect |
| `--force-openai-harmony-streaming` | Force the openai-harmony streaming router on (escape hatch — debug-only, raises on non-harmony tokenizers) | auto-detect |
| `--no-openai-harmony-streaming` | Disable the openai-harmony streaming router; fall back to the legacy state machine | auto-detect |
| `--mcp-config` | MCP configuration file for tool integration | *(none)* |
| `--embedding-model` | Pre-load embedding model at startup | *(none)* |

</details>

<details>
<summary><strong>Common Issues</strong></summary>

**"parameters not found in model" warnings at startup** — Normal for VLMs. Vision weights are auto-skipped.

**Out of memory / very slow (<5 tok/s)** — Model too big. Check [What fits my Mac?](#what-fits-my-mac) Try a smaller quantization (4bit) or smaller model.

**Empty responses** — Remove `--reasoning-parser` for non-thinking models.

**Tool calls as plain text** — Set the correct `--tool-call-parser` for your model. Even without it, Rapid-MLX auto-recovers most cases.

**Other issues?** Run `rapid-mlx doctor` for self-diagnostics.

**Slow first response** — Two different causes: (1) Qwen3.5 models reason before answering — add `--no-thinking` to skip reasoning for faster responses, or (2) cold start on long prompts — add `--prefill-step-size 8192` to speed up processing. Subsequent turns hit prompt cache and are 10-30x faster.

</details>

---

## Optional Extras

The base `pip install rapid-mlx` is ~460 MB and covers all text-only models. Vision, audio, and other features ship as opt-in extras:

| Extra | Install | Adds | What it unlocks |
|---|---|---|---|
| `vision` | `pip install 'rapid-mlx[vision]'` | ~322 MB | Gemma 4, Qwen-VL, video understanding (mlx-vlm + opencv + torch) |
| `audio` | `pip install 'rapid-mlx[audio]'` | ~600 MB | TTS / STT (mlx-audio + spacy + scipy) |
| `embeddings` | `pip install 'rapid-mlx[embeddings]'` | ~50 MB | `/v1/embeddings` endpoint (mlx-embeddings) |
| `chat` | `pip install 'rapid-mlx[chat]'` | ~150 MB | Built-in Gradio chat UI |
| `guided` | `pip install 'rapid-mlx[guided]'` | ~80 MB | Schema-constrained JSON generation (outlines) |
| `all` | `pip install 'rapid-mlx[all]'` | ~1.1 GB | Vision + audio + chat + embeddings |

If you installed via Homebrew and want vision/audio support, use `pip install 'rapid-mlx[vision]'` (or `[audio]`) inside your own Python 3.10+ venv — that gives you the full feature set without rebuilding the brew formula.

---

## Troubleshooting

Run the built-in environment-health probe (works from `pip install`, no dev tools needed, no model load, ≤5 s):

```bash
rapid-mlx doctor
```

```
┌─────────────────────────────────────────────────────────┐
│                  🩺 Rapid-MLX Doctor                    │
└─────────────────────────────────────────────────────────┘

◆ System
  ✓ Apple Silicon (Apple M3 Pro, 36 GB)
  ✓ macOS 14.3 (Darwin 23.3.0)
  ✓ Free disk: 162 GB
◆ Python
  ✓ Python 3.12.13
◆ Required Packages
  ✓ mlx 0.29.x
  ✓ mlx-lm 0.31.x
  ✓ transformers 5.x
  ✓ fastapi 0.x
  ✓ uvicorn 0.x
  ✓ rapid-mlx 0.7.22
◆ HuggingFace Cache / Network / Shell Integration / Optional Tools
  ...
────────────────────────────────────────
Summary: 18 ok, 4 warnings, 0 issues
```

Use `rapid-mlx doctor --verbose` to see the underlying probe detail (exact path, version, response code) for each check.

Want to validate model inference instead? That lives at `rapid-mlx bench <model> --tier {smoke,check,full,benchmark}` — doctor is purely env-health now.

---

## Telemetry

Rapid-MLX **can** send anonymous usage data to help us prioritise the right models and catch regressions. **It is off by default and never starts collecting without your explicit opt-in.**

### What we collect (only if you opt in)

- Subcommand names (`serve` / `chat` / `agents` / `bench` / `doctor`)
- Model alias names (`qwen3.5-9b-4bit`) or canonical HF repo IDs (`mlx-community/...`) — local paths are redacted to `<local>`
- Bucketed counts: prompt/completion tokens, TTFT, tokens/sec — never exact values
- Error categories + a hash fingerprint of the failure site (exception class name + per-frame `file:function:lineno` only — never the message text or absolute paths)
- OS, arch, Apple chip name, RAM (rounded to GB), Python major.minor

### What we never collect

- Prompts, completions, tool-call arguments, file contents, or any user-generated text
- Local file paths, working directory, or model paths beyond their HF repo ID
- IPs or hostnames (Phase 2 will route through a Cloudflare Worker that strips IPs before forwarding to the aggregator; Phase 1 ships no transport at all)
- API keys, environment variable values, auth headers
- Stack trace messages or argument values

### Manage it

```bash
rapid-mlx telemetry status     # show current state and why
rapid-mlx telemetry preview    # print the exact JSON payload that would be sent
rapid-mlx telemetry enable     # opt in
rapid-mlx telemetry disable    # opt out
rapid-mlx telemetry reset      # delete consent + client-id files (re-prompts on next run)
```

### Force-disable in scripts / CI

Either of these always wins, regardless of stored consent:

```bash
RAPID_MLX_TELEMETRY=0 rapid-mlx serve qwen3.5-9b-4bit
rapid-mlx --no-telemetry serve qwen3.5-9b-4bit
```

There is intentionally **no env-var equivalent for force-on** — opting in must be an explicit one-time `rapid-mlx telemetry enable`. CI agents will never silently contribute.

### Where the code lives

Everything is in [`vllm_mlx/telemetry/`](vllm_mlx/telemetry/) — read it. Phase 1 (this release) ships the consent mechanism and CLI surface; **no network code is in the codebase yet**. Phase 2 will add the transport behind the same opt-in gate; the schema is documented in [`vllm_mlx/telemetry/schema.py`](vllm_mlx/telemetry/schema.py). Tracking issue: [#236](https://github.com/raullenchai/Rapid-MLX/issues/236).

---

## Development

### Quick start

```bash
git clone https://github.com/raullenchai/Rapid-MLX.git
cd Rapid-MLX
pip install -e ".[dev]"
```

### Testing

Two layers: **user-facing doctor** (ships with pip) and **dev test suite** (source checkout only).

#### Dev test commands

| Command | What | Time | Needs server? |
|---------|------|------|---------------|
| `make lint` | ruff lint | ~10s | No |
| `make test` | pytest unit suite (3300+ tests) | ~30s | No |
| `make smoke` | lint + unit | ~1 min | No |
| `make stress` | 8-scenario stress test | ~5 min | Yes |
| `make soak` | 10-min agent soak test | 10 min | Yes |

For stress/soak, start a server first:
```bash
rapid-mlx serve mlx-community/Qwen3.5-4B-MLX-4bit --enable-auto-tool-choice --tool-call-parser hermes
# In another terminal:
make stress
```

Or use the script directly for more options:
```bash
python scripts/dev_test.py smoke              # lint + unit
python scripts/dev_test.py stress --port 8000 # custom port
python scripts/dev_test.py full               # everything
```

#### Regression harness (multi-model)

```bash
make check              # 1 model (~10 min, auto starts server)
make full               # 3 models + 12 agent profiles (~1 hr)
make benchmark          # all local models (overnight)
```

### Architecture

```
vllm_mlx/
  server.py              # App factory + model loading + CLI entry
  config/                # ServerConfig singleton
  service/
    helpers.py           # Shared request helpers
    postprocessor.py     # Streaming pipeline (100% test coverage)
  routes/
    chat.py              # /v1/chat/completions
    completions.py       # /v1/completions
    anthropic.py         # /v1/messages (Anthropic API)
    health.py, models.py, embeddings.py, audio.py, mcp_routes.py
  engine/                # BatchedEngine (continuous batching)
  reasoning/             # 7 reasoning parsers (Qwen3, DeepSeek, MiniMax, ...)
  tool_parsers/          # 17 tool call parsers
  speculative/           # DFlash, SuffixDecoding, MTP drafters
  agents/                # 12 agent profiles (YAML)
  runtime/               # Model registry, cache persistence
  doctor/                # Environment-health probe (rapid-mlx doctor)
  bench/                 # Model-validation tier runner (rapid-mlx bench --tier ...)
scripts/                 # Dev-only (NOT shipped with pip)
  dev_test.py            # Unified test entry point
  stress_test.py         # 8-scenario stress test
  agent_soak_test.py     # 10-min agent soak test
  mhi_eval.py            # Compute MHI scores against a running server
tests/                   # pytest unit tests (3300+)
harness/                 # Regression baselines + thresholds
```

---

## Roadmap

| Technique | Expected Gain | Status |
|-----------|---------------|--------|
| [DFlash](https://arxiv.org/abs/2602.06036) — block-diffusion drafter, single-user | 1.3-2× decode | **Shipping** (qwen3.5-27b-8bit, qwen3.6-27b-8bit) |
| [SuffixDecoding](https://arxiv.org/abs/2411.04975) — drafter-free n-gram speculative | 1.1-1.5× decode | Shipping (`--suffix-decoding`, per-model tier sweep ongoing) |
| MTP — Multi-Token Prediction head | 1.4-1.7× decode | Experimental (requires MTP-trained checkpoint) |
| [EAGLE-3](https://arxiv.org/abs/2503.01840) — feature-level draft on Metal | 3-6.5× decode | Not started |
| [ReDrafter](https://arxiv.org/abs/2403.09919) — Apple's RNN draft head | 1.4-1.5× decode | Not started |

---

## Contributing

We welcome contributions of all sizes! See [CONTRIBUTING.md](CONTRIBUTING.md) for setup and guidelines.

**Easy first contributions** (no model download needed):
- [Add a model alias](https://github.com/raullenchai/Rapid-MLX/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22) — map a short name to a HuggingFace model ID
- [Request model support](https://github.com/raullenchai/Rapid-MLX/issues/new?template=model_support.yml) — tell us which model you want

**Testing contributions** (needs a Mac with Apple Silicon):
- **Share your hardware's benchmark numbers** — one command:
  ```bash
  rapid-mlx bench qwen3.5-9b-4bit --submit
  ```
  Runs the standardized B=1 bench (greedy, 128 + 512 token buckets, 5 rounds each), shows you the JSON payload, asks for consent, and opens the PR for you via `gh`. If you don't have `gh`, it prints the JSON path + a deep-link to GitHub's compare page so you can open the PR in your browser. Submitted rows land in [community-benchmarks/submissions/](community-benchmarks/submissions/) and show up on https://rapidmlx.com once merged.
- Test with your favorite AI client (Cursor, Aider, LangChain, etc.)
- [Report a bug](https://github.com/raullenchai/Rapid-MLX/issues/new?template=bug_report.yml)

### Contributors

<a href="https://github.com/raullenchai/Rapid-MLX/graphs/contributors">
  <img src="https://contrib.rocks/image?repo=raullenchai/Rapid-MLX" />
</a>

## Star History

<a href="https://star-history.com/#raullenchai/Rapid-MLX&Date">
  <picture>
    <source media="(prefers-color-scheme: dark)" srcset="https://api.star-history.com/svg?repos=raullenchai/Rapid-MLX&type=Date&theme=dark" />
    <source media="(prefers-color-scheme: light)" srcset="https://api.star-history.com/svg?repos=raullenchai/Rapid-MLX&type=Date" />
    <img alt="Star History Chart" src="https://api.star-history.com/svg?repos=raullenchai/Rapid-MLX&type=Date" />
  </picture>
</a>

## License

Apache 2.0 — see [LICENSE](LICENSE).
