Metadata-Version: 2.4
Name: vllm-swift
Version: 0.4.0
Summary: vLLM Metal plugin powered by mlx-swift — high-performance LLM inference on Apple Silicon
Author: Tom Turney (TheTom)
License: Apache-2.0
Project-URL: Homepage, https://github.com/TheTom/vllm-swift
Project-URL: Repository, https://github.com/TheTom/vllm-swift
Project-URL: Issues, https://github.com/TheTom/vllm-swift/issues
Keywords: vllm,mlx,swift,metal,apple-silicon,llm,inference,kv-cache,turboquant
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: MacOS :: MacOS X
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Provides-Extra: download
Requires-Dist: huggingface-hub>=0.20; extra == "download"
Dynamic: license-file

<p align="center">
  <img src="assets/logo.png" alt="vllm-swift" width="400">
</p>

<p align="center">
  A native Swift/Metal backend for <a href="https://github.com/vllm-project/vllm">vLLM</a> on Apple Silicon.<br>
  <b>No Python in the inference hot path.</b>
</p>

<p align="center">
  Run vLLM workloads on Apple Silicon with a native Swift/Metal hot path.<br>
  OpenAI-compatible API. Up to 2.6× faster short-context decode.
</p>

## Quick Start

### 1. Install

**Homebrew** (recommended for Mac power users):

```bash
brew tap TheTom/tap && brew install vllm-swift
```

**pip** (everyone else, including dev containers and non-brew Macs):

```bash
pip install vllm-swift
```

The pip wheel bundles the prebuilt Swift bridge dylib and Metal kernel library, so no compile or Homebrew step is required. Requires Apple Silicon, Python 3.10+, and macOS 14+.
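
A quick post-install sanity check, sketched below, is to import the plugin package and list the bundled dylibs (the exact layout inside the wheel is an assumption; an empty list means the bridge was not bundled):

```python
# Post-install sanity check: the plugin package should import cleanly and
# the bundled Swift bridge dylib should ship inside the wheel.
import pathlib
import vllm_swift

pkg_dir = pathlib.Path(vllm_swift.__file__).parent
print(pkg_dir)
print(sorted(p.name for p in pkg_dir.rglob("*.dylib")))  # expect libVLLMBridge.dylib if bundled
```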

**From source:**

```bash
git clone https://github.com/TheTom/vllm-swift.git && cd vllm-swift
./scripts/install.sh       # builds Swift bridge, installs plugin, creates activate.sh
source activate.sh         # sets DYLD_LIBRARY_PATH (generated by install.sh)
```

### 2. Run

```bash
vllm-swift download mlx-community/Qwen3-4B-4bit
vllm-swift serve ~/models/Qwen3-4B-4bit --max-model-len 4096  # increase as needed, max 40960
```

> Homebrew users don't need `activate.sh` — `vllm-swift serve` handles everything.

Server running at `http://localhost:8000` (OpenAI-compatible API).

> Drop-in replacement for vLLM on Apple Silicon. All `vllm serve` flags work unchanged.
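
Because the API is OpenAI-compatible, any OpenAI client can talk to it. A minimal sketch with the official `openai` Python package (the model name below assumes `--served-model-name qwen3-4b`; without that flag, use the path you served):

```python
from openai import OpenAI

# Point the standard OpenAI client at the local vllm-swift server.
# The API key is unused locally, but the client requires a value.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="local")

resp = client.chat.completions.create(
    model="qwen3-4b",  # or the model path, if --served-model-name wasn't set
    messages=[{"role": "user", "content": "Say hello in five words."}],
    max_tokens=32,
)
print(resp.choices[0].message.content)
```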

## Performance (M5 Max 128GB)

Decode throughput in tok/s. Prompt = 18 tokens, generation = 50 tokens, greedy sampling (temperature 0). Both engines were measured with an offline benchmark (no HTTP overhead): **vllm-swift** uses the Swift/Metal engine via ctypes, while **vllm-metal** uses the Python/MLX engine via vLLM's offline `LLM` API.

### Qwen3-0.6B

| | Single | 8 concurrent | 32 concurrent | 64 concurrent |
|---|:---:|:---:|:---:|:---:|
| **vllm-swift** | **364** | **1,527** | **2,859** | **3,425** |
| vllm-metal (Python/MLX) | 111 | 652 | 2,047 | 2,620 |

### Qwen3-4B

| | Single | 8 concurrent | 32 concurrent | 64 concurrent |
|---|:---:|:---:|:---:|:---:|
| **vllm-swift** | **147** | **477** | **1,194** | **1,518** |
| vllm-metal (Python/MLX) | 104 | 396 | 1,065 | 1,375 |

> Full matrix, methodology, and long-context cells in [docs/PERFORMANCE.md](docs/PERFORMANCE.md).
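
For a sense of how the Python/MLX baseline was driven, a decode-throughput measurement through vLLM's offline `LLM` API looks roughly like the sketch below (illustrative only, not the project's benchmark script; the model path and prompt are placeholders, and the printed figure aggregates prefill + decode rather than isolating decode as the tables do):

```python
import os
import time

from vllm import LLM, SamplingParams

# Greedy decode, 50 generated tokens per request, 32 concurrent prompts.
llm = LLM(model=os.path.expanduser("~/models/Qwen3-0.6B-4bit"), max_model_len=4096)
params = SamplingParams(temperature=0.0, max_tokens=50)
prompts = ["Write a one-sentence weather report for San Francisco."] * 32

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.0f} generated tok/s (prefill + decode)")
```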

### [TurboQuant+](https://github.com/TheTom/turboquant_plus) KV Cache Compression

[TurboQuant+](https://github.com/TheTom/turboquant_plus) compresses KV cache to fit longer context with modest throughput cost.

**Qwen3.5 2B (4-bit weights)**

| KV Cache | Compression | Prefill @1K | Decode @1K | Prefill @4K | Decode @4K |
|----------|:-----------:|:----------:|:----------:|:----------:|:----------:|
| FP16 | 1.0× | 1,252 tok/s | 259 tok/s | 1,215 tok/s | 249 tok/s |
| turbo4v2 | 3.0× | 1,331 tok/s | 245 tok/s | 1,245 tok/s | 240 tok/s |
| turbo3 | 4.6× | 1,346 tok/s | 174 tok/s | 1,276 tok/s | 241 tok/s |

## Architecture

The entire forward pass runs in Swift/Metal. Python is used only for orchestration.

```
Python (vLLM API, tokenization, scheduling)  ← github.com/vllm-project/vllm
  ↓ ctypes FFI
C bridge (bridge.h)
  ↓ @_cdecl
Swift (mlx-swift-lm, BatchedKVCache, batched decode)
  ↓
Metal GPU
```
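
The Python↔Swift boundary is plain ctypes: the plugin loads `libVLLMBridge.dylib` and calls the `@_cdecl` symbols declared in `bridge.h`. The binding follows the standard pattern sketched below, shown against libc so the snippet runs anywhere (the real symbol names and signatures live in `bridge.h`, not here):

```python
import ctypes
import ctypes.util

# Same declare-then-call pattern the plugin uses against libVLLMBridge.dylib:
# load the library, declare argtypes/restype for each exported symbol, call it.
libc = ctypes.CDLL(ctypes.util.find_library("c"))
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

print(libc.strlen(b"hello"))  # 5
```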

## Features

- OpenAI-compatible API (`/v1/completions`, `/v1/chat/completions`)
- Streaming (SSE) responses (see the sketch after this list)
- Chat templates (applied by vLLM, model-specific)
- Batched concurrent decode with `BatchedKVCache` (fully batched projections + attention)
- Per-request temperature sampling in batched path
- Auto model download from HuggingFace Hub
- [TurboQuant+](https://github.com/TheTom/turboquant_plus) KV cache compression (`turbo3`, `turbo4v2`) via mlx-swift-lm
- Decode and prompt logprobs
- Greedy and temperature sampling
- EOS / stop token detection (vLLM scheduler)
- VLM (vision-language model) support (experimental)
- Works with [Hermes](https://github.com/nousresearch/hermes-agent), [OpenCode](https://github.com/anomalyco/opencode), and any OpenAI-compatible client
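
Streaming uses the standard OpenAI SSE protocol, so `stream=True` works with any OpenAI client. A minimal sketch with the `openai` Python package (model name again assumes `--served-model-name qwen3-4b`):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="local")

# stream=True yields chunks as tokens are decoded; print each delta as it arrives.
stream = client.chat.completions.create(
    model="qwen3-4b",
    messages=[{"role": "user", "content": "Count from 1 to 10."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```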

## Use with AI tools

```bash
# Start server with tool calling enabled
vllm-swift serve ~/models/Qwen3-4B-4bit --max-model-len 40960 \
  --served-model-name qwen3-4b \
  --enable-auto-tool-choice --tool-call-parser hermes
```

Then point your tool at it:

```bash
# Hermes — set in ~/.hermes/config.yaml:
#   base_url: http://localhost:8000/v1
#   model: qwen3-4b

# OpenCode
OPENAI_API_BASE=http://localhost:8000/v1 OPENAI_API_KEY=local opencode

# Any OpenAI-compatible client
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen3-4b","messages":[{"role":"user","content":"Hello"}]}'
```
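
With `--enable-auto-tool-choice --tool-call-parser hermes`, tool calls come back in the standard OpenAI `tool_calls` shape, so they can also be driven programmatically. A minimal sketch (the `get_weather` tool is invented for illustration):

```python
import json

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="local")

# An invented tool definition, purely for illustration.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="qwen3-4b",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools,
)
for call in resp.choices[0].message.tool_calls or []:
    print(call.function.name, json.loads(call.function.arguments))
```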

## Configuration

`vllm-swift serve` is a thin wrapper around `vllm serve` — all standard vLLM flags work. Here are the common setups:

### Basic serving

```bash
vllm-swift serve ~/models/Qwen3-4B-4bit \
  --served-model-name qwen3-4b \
  --max-model-len 40960
```

### Agent / tool calling (Hermes, OpenCode, etc.)

```bash
vllm-swift serve ~/models/Qwen3-4B-4bit \
  --served-model-name qwen3-4b \
  --max-model-len 40960 \
  --enable-auto-tool-choice --tool-call-parser hermes
```

### Chain-of-thought models (strip `<think>` tags)

```bash
vllm-swift serve ~/models/Qwen3-4B-4bit \
  --served-model-name qwen3-4b \
  --max-model-len 40960 \
  --enable-reasoning --reasoning-parser deepseek_r1
```
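
With the reasoning parser enabled, the chain of thought is stripped from `content` and returned separately; in vLLM's OpenAI server the thinking text comes back as `reasoning_content` on the message. A sketch reading both fields from the raw JSON response (assuming the server above is running):

```python
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "qwen3-4b",
        "messages": [{"role": "user", "content": "Is 97 prime?"}],
    },
    timeout=120,
)
msg = resp.json()["choices"][0]["message"]

print("reasoning:", msg.get("reasoning_content"))  # the stripped <think> text
print("answer:", msg["content"])                   # final reply without the tags
```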

### Long context with [TurboQuant+](https://github.com/TheTom/turboquant_plus)

Compress KV cache 3-5× to fit longer context with modest throughput cost:

```bash
vllm-swift serve ~/models/Qwen3-4B-4bit \
  --served-model-name qwen3-4b \
  --max-model-len 40960 \
  --additional-config '{"kv_scheme": "turbo4v2", "kv_bits": 4}'
```

| Scheme | Compression | Best for |
|--------|:-----------:|----------|
| `turbo4v2` | ~3× | Recommended — best quality/compression balance |
| `turbo3` | ~4.6× | Maximum compression, higher PPL trade-off |
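
The reason compression translates into context is that KV cache memory grows linearly with sequence length: per token it is roughly 2 × layers × kv_heads × head_dim × bytes_per_element (K and V). A back-of-the-envelope sketch, using placeholder model dimensions rather than the real Qwen3-4B config:

```python
# Rough KV cache sizing under hypothetical model dimensions (placeholders, not real config).
layers, kv_heads, head_dim = 36, 8, 128
fp16_bytes = 2

per_token = 2 * layers * kv_heads * head_dim * fp16_bytes  # K and V, bytes per token
budget_gb = 8

for scheme, ratio in [("FP16", 1.0), ("turbo4v2", 3.0), ("turbo3", 4.6)]:
    max_tokens = int(budget_gb * 1024**3 / (per_token / ratio))
    print(f"{scheme:9s} fits ~{max_tokens:,} tokens in {budget_gb} GB of KV cache")
```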

### Full setup (agent + reasoning + TurboQuant+)

```bash
vllm-swift serve ~/models/Qwen3-4B-4bit \
  --served-model-name qwen3-4b \
  --max-model-len 40960 \
  --enable-auto-tool-choice --tool-call-parser hermes \
  --enable-reasoning --reasoning-parser deepseek_r1 \
  --additional-config '{"kv_scheme": "turbo4v2", "kv_bits": 4}'
```

### All flags

```bash
vllm-swift serve <model> [options]

  --served-model-name NAME   Clean model name for API clients (recommended)
  --max-model-len N          Max sequence length (default: model config)
  --port PORT                API server port (default: 8000)
  --gpu-memory-utilization F Memory fraction 0.0-1.0 (default: 0.9)
  --dtype float16            Model dtype (default: float16)
  --enable-auto-tool-choice  Enable tool/function calling
  --tool-call-parser NAME    Tool call format (hermes, llama3, mistral, etc.)
  --enable-reasoning         Enable chain-of-thought parsing
  --reasoning-parser NAME    Reasoning format (deepseek_r1, etc.)
  --additional-config JSON   Extra config (kv_scheme, kv_bits)
```

All standard [vLLM flags](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html) work — these are just the most common ones.

## Documentation

| Doc | What's in it |
|---|---|
| [docs/PERFORMANCE.md](docs/PERFORMANCE.md) | Full perf matrix vs vllm-metal, methodology, long-context cells |
| [docs/MODEL_COMPATIBILITY.md](docs/MODEL_COMPATIBILITY.md) | Empirical pass / soft-fail / hard-fail results across local MLX models, with root-cause classification (model-intrinsic, vLLM upstream, env-missing) |
| [docs/TROUBLESHOOTING.md](docs/TROUBLESHOOTING.md) | Symptom → diagnostic → fix for known failure patterns (parser mismatch, reasoning consuming the turn, Gemma-4 boot failure, etc.) |
| [CHANGELOG.md](CHANGELOG.md) | Release history |

## Changelog

See [CHANGELOG.md](CHANGELOG.md) for release history.

## Known Limitations (early development)

- **LoRA** not supported (Swift engine limitation)
- **Chunked prefill** disabled (Swift engine handles full sequences)
- **top_p sampling** not supported in batched decode path (temperature works)
- Only **Qwen3** models use the fully batched decode path; other architectures fall back to sequential decode (still functional, just slower at high concurrency)
- Requires macOS on Apple Silicon (no Linux/CUDA)

## Install

### Homebrew

```bash
brew tap TheTom/tap && brew install vllm-swift
```

Prebuilt bottle — no Swift toolchain needed. First run of `vllm-swift serve` sets up a managed Python environment automatically.

To update to the latest version:

```bash
vllm-swift update

# Or via standard Homebrew (works from any version):
brew update && brew upgrade vllm-swift
```

### From source

```bash
git clone https://github.com/TheTom/vllm-swift.git
cd vllm-swift
./scripts/install.sh       # builds Swift, installs plugin, creates activate.sh
source activate.sh         # sets DYLD_LIBRARY_PATH
vllm serve ~/models/Qwen3-4B-4bit --max-model-len 4096
```

### Manual (full control)

```bash
git clone https://github.com/TheTom/vllm-swift.git && cd vllm-swift
cd swift && swift build -c release && cd ..
pip install -e .
DYLD_LIBRARY_PATH=swift/.build/arm64-apple-macosx/release \
  vllm serve ~/models/Qwen3-4B-4bit --max-model-len 4096
```

### Troubleshooting

**Homebrew checksum error on reinstall:**
```bash
brew uninstall vllm-swift && brew untap TheTom/tap
rm -rf $(brew --cache)/downloads/*vllm*
brew tap TheTom/tap && brew install vllm-swift
```

**"No module named vllm" or plugin not loading after brew install:**
```bash
brew uninstall vllm-swift && brew untap TheTom/tap
rm -rf $(brew --cache)/downloads/*vllm* ~/.vllm-swift
brew tap TheTom/tap && brew install vllm-swift
vllm-swift setup
```

**vLLM build error (Apple Clang parentheses):** Our install script and brew wrapper handle this automatically. If you're on an older bottle or installing vLLM manually:
```bash
# Brew users: get the latest bottle first
brew uninstall vllm-swift && brew untap TheTom/tap
rm -rf $(brew --cache)/downloads/*vllm* ~/.vllm-swift/venv
brew tap TheTom/tap && brew install vllm-swift && vllm-swift setup

# Or install vLLM manually with the fix
CFLAGS="-Wno-parentheses" pip install vllm
```

**activate.sh not found:** Make sure you run `./install.sh` (or `./scripts/install.sh`) first — it generates `activate.sh` in the project root.

**Metal kernel not found (GDN/TurboFlash models):** The `mlx.metallib` file must be in the same directory as `libVLLMBridge.dylib`. For manual installs, copy it:
```bash
cp swift/.build/arm64-apple-macosx/release/mlx.metallib \
   "$(echo $DYLD_LIBRARY_PATH | cut -d: -f1)/"
```

### Download a model

```bash
vllm-swift download mlx-community/Qwen3-4B-4bit

# Or manually:
huggingface-cli download mlx-community/Qwen3-4B-4bit --local-dir ~/models/Qwen3-4B-4bit

# Already have models in HuggingFace cache? Point directly at them:
vllm-swift serve ~/.cache/huggingface/hub/models--mlx-community--Qwen3-4B-4bit/snapshots/<commit-hash>
```

## Project Structure

```
vllm_swift/           Python plugin (vLLM WorkerBase)
swift/
  Sources/VLLMBridge/       Swift bridge (@_cdecl C exports)
  bridge.h                  C API (prefill, decode, batched decode)
scripts/
  install.sh                One-step build + install
  build_bottle.sh           Build + upload Homebrew bottle
  integration_test.sh       End-to-end smoke test
homebrew/
  vllm-swift.rb             Homebrew formula
tests/                      84 tests, 97% coverage
```

## Requirements

- macOS 14+ on Apple Silicon
- Xcode 15+ or Swift 6.0+ (for building from source; Homebrew bottle skips this)
- Python 3.10+
- [vLLM](https://github.com/vllm-project/vllm) 0.19+
- [mlx-swift-lm](https://github.com/TheTom/mlx-swift-lm/tree/vllm-swift-stable) (pulled automatically by Swift Package Manager)

## License

Apache-2.0
