Metadata-Version: 2.4
Name: mlx-qwen3-asr
Version: 0.3.5
Summary: Qwen3-ASR speech recognition on Apple Silicon via MLX
Author: dmoon
License-Expression: Apache-2.0
Project-URL: Homepage, https://github.com/moona3k/mlx-qwen3-asr
Project-URL: Repository, https://github.com/moona3k/mlx-qwen3-asr
Project-URL: Issues, https://github.com/moona3k/mlx-qwen3-asr/issues
Keywords: speech-recognition,asr,qwen,mlx,apple-silicon,transcription
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Operating System :: MacOS
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: mlx>=0.18.0
Requires-Dist: numpy
Requires-Dist: regex
Requires-Dist: huggingface-hub>=0.20.0
Provides-Extra: aligner
Requires-Dist: nagisa; extra == "aligner"
Requires-Dist: soynlp; extra == "aligner"
Provides-Extra: diarize
Requires-Dist: pyannote.audio<5,>=4.0.4; extra == "diarize"
Provides-Extra: mic
Requires-Dist: sounddevice>=0.4.6; extra == "mic"
Provides-Extra: serve
Requires-Dist: fastapi>=0.100.0; extra == "serve"
Requires-Dist: uvicorn>=0.20.0; extra == "serve"
Requires-Dist: python-multipart>=0.0.5; extra == "serve"
Provides-Extra: dev
Requires-Dist: pytest; extra == "dev"
Requires-Dist: ruff; extra == "dev"
Requires-Dist: mypy; extra == "dev"
Provides-Extra: eval
Requires-Dist: soundfile>=0.12.1; extra == "eval"
Requires-Dist: pyarrow>=18.0.0; extra == "eval"
Dynamic: license-file

# mlx-qwen3-asr

[![PyPI version](https://img.shields.io/pypi/v/mlx-qwen3-asr.svg)](https://pypi.org/project/mlx-qwen3-asr/)
[![Python 3.10+](https://img.shields.io/badge/python-3.10%2B-blue.svg)](https://pypi.org/project/mlx-qwen3-asr/)
[![License: Apache 2.0](https://img.shields.io/badge/License-Apache_2.0-green.svg)](https://opensource.org/licenses/Apache-2.0)

Run [Qwen3-ASR](https://huggingface.co/collections/Qwen/qwen3-asr) — one of the strongest open-source speech recognition models — natively on Apple Silicon.

A ground-up reimplementation of the [official PyTorch model](https://github.com/QwenLM/Qwen3-ASR) using Apple's [MLX](https://github.com/ml-explore/mlx) framework. Same weights, benchmarked against official/reference outputs and ground-truth eval sets, optimized for Mac GPUs via Metal. No PyTorch dependency for core transcription.

## Why this exists

[Qwen3-ASR](https://github.com/QwenLM/Qwen3-ASR) is one of the strongest open-source ASR models available, with benchmark results exceeding Whisper-large-v3 across multiple languages and datasets. It supports 30 languages plus 22 Chinese dialects. But the official implementation is **PyTorch + NVIDIA CUDA** — it doesn't use Apple GPUs.

This project rewrites every layer for MLX so the same model runs natively on M1/M2/M3/M4 hardware. Not a wrapper — a full reimplementation with correct interleaved MRoPE, per-chunk windowed encoder attention, and all the architectural details that matter for output quality.

### What's included

- **Full encoder-decoder pipeline** — audio encoder (Conv2d stem + windowed transformer) and text decoder (Qwen3-style with interleaved MRoPE), reimplemented from scratch for MLX
- **Whisper-compatible mel frontend** — native log-mel spectrogram computation with cached filterbank and Hann window
- **Both model sizes** — 0.6B (fast, default) and 1.7B (higher accuracy)
- **Long audio support** — energy-based chunking up to 20 minutes per chunk, no 30-second feature truncation
- **Word-level timestamps** — native MLX forced aligner (default, 2.6x faster than PyTorch alternative) with O(n log n) LIS-based timestamp correction
- **Speaker diarization (optional)** — offline speaker-labeled outputs via `pyannote` integration (`--diarize`)
- **4-bit and 8-bit quantization** — up to 4.7x speedup with measured quality reporting on 100 speaker-balanced samples
- **Multiple output formats** — txt, json, srt, vtt, tsv
- **Built-in HTTP server** — `mlx-qwen3-asr serve` exposes the pipeline over HTTP with async jobs, OpenAI API compatibility, and Bearer token auth
- **Session API** — explicit model/tokenizer ownership with no hidden global state
- **Speculative decoding** — experimental opt-in path (0.6B drafts for 1.7B target), parity-verified
- **Streaming** — KV-cache streaming with linear complexity, context trimming, and tail refinement
- **Native WAV fast-path** — custom binary WAV parser bypasses ffmpeg for PCM/float WAV files
- **462 tests** — every optimization is benchmark-gated with committed JSON artifacts
- **Minimal dependencies** — mlx, numpy, regex, huggingface-hub

## Requirements

- **Apple Silicon Mac** (M1/M2/M3/M4) — this is an MLX project, Metal GPU required
- **Python 3.10+**
- **ffmpeg** — required for non-WAV audio formats (mp3, m4a, flac, mp4, etc.). WAV files work without ffmpeg via the native fast-path loader
- **~1.2 GB memory** for 0.6B model (fp16), **~3.4 GB** for 1.7B

## Installation

Install from PyPI:

```bash
pip install mlx-qwen3-asr
```

For video and most non-WAV audio formats, install `ffmpeg` on your system:

```bash
brew install ffmpeg
```

Install with optional timestamp alignment extras (for Japanese/Korean tokenization parity):

```bash
pip install "mlx-qwen3-asr[aligner]"
```

Install with optional microphone capture support:

```bash
pip install "mlx-qwen3-asr[mic]"
```

Install with HTTP server support:

```bash
pip install "mlx-qwen3-asr[serve]"
```

Install with diarization extras:

```bash
pip install "mlx-qwen3-asr[diarize]"
```

Note: `--diarize` uses pyannote.audio 4.x and defaults to
`pyannote/speaker-diarization-community-1`. Accept the model terms on
Hugging Face and set a token:

```bash
export PYANNOTE_AUTH_TOKEN=hf_...
```

Core ASR does not require any Hugging Face token.

For development:

```bash
git clone https://github.com/moona3k/mlx-qwen3-asr.git
cd mlx-qwen3-asr
pip install -e ".[dev]"
```

## Quick start

### Python API

```python
from mlx_qwen3_asr import transcribe

result = transcribe("audio.wav")
print(result.text)
print(result.language)
```

By default, `transcribe()` uses `Qwen/Qwen3-ASR-0.6B` for fast local usage on Mac. Use `Qwen/Qwen3-ASR-1.7B` when you want higher accuracy and can afford higher latency/memory.

With options:

```python
result = transcribe(
    "meeting.mp3",
    model="Qwen/Qwen3-ASR-1.7B",
    language="English",
    return_chunks=True,
    on_progress=lambda e: print(e["event"], e.get("progress", 0.0)),
    verbose=True,
)
print(result.text)
print(result.chunks)
```

### Session API (recommended for repeated calls)

The `Session` object owns model and tokenizer state explicitly — no hidden globals, no cache surprises:

```python
from mlx_qwen3_asr import Session

session = Session(model="Qwen/Qwen3-ASR-0.6B")
audio_files = ["first.wav", "second.wav"]  # placeholder paths

# Fast repeated transcription — model stays loaded
for audio_file in audio_files:
    result = session.transcribe(audio_file)
    print(result.text)
```

### Loading models explicitly

```python
from mlx_qwen3_asr import load_model, load_audio, transcribe

model, config = load_model("Qwen/Qwen3-ASR-0.6B")
audio = load_audio("speech.wav")
result = transcribe(audio, model=model)
```

### CLI

```bash
mlx-qwen3-asr audio.wav
```

Specify model, language, and output format:

```bash
mlx-qwen3-asr recording.mp3 --model Qwen/Qwen3-ASR-0.6B --language English -f srt -o output/
```

Word-level timestamps:

```bash
mlx-qwen3-asr audio.wav --timestamps
```

Speaker-labeled output (experimental, offline):

```bash
mlx-qwen3-asr meeting.wav --diarize --num-speakers 2 -f json
```

Multiple files with all output formats:

```bash
mlx-qwen3-asr *.wav -f all -o transcripts/ --verbose
```

Stdout/file behavior:

```bash
mlx-qwen3-asr audio.wav --stdout-only        # print only (no output file)
mlx-qwen3-asr audio.wav --quiet -o out/      # write files only (no stdout text)
```

Language discovery:

```bash
mlx-qwen3-asr --list-languages
```

Environment diagnostics (ffmpeg, optional diarization deps, token status):

```bash
mlx-qwen3-asr --doctor
```

Run `mlx-qwen3-asr --help` for the full list of options.

### HTTP server

Serve transcriptions over HTTP. Two endpoint styles: an async job API and an OpenAI-compatible synchronous endpoint.

```bash
pip install "mlx-qwen3-asr[serve]"
mlx-qwen3-asr serve --api-key $(openssl rand -hex 16)
```

Submit audio and poll for results:

```bash
# Submit
curl -X POST http://localhost:8765/transcribe \
  -H "Authorization: Bearer YOUR_KEY" \
  -F "audio=@recording.wav"

# Poll
curl http://localhost:8765/jobs/JOB_ID \
  -H "Authorization: Bearer YOUR_KEY"
```

Or use the OpenAI-compatible endpoint with existing SDK code:

```python
from openai import OpenAI

client = OpenAI(api_key="YOUR_KEY", base_url="http://localhost:8765/v1")
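# Open the file in a context manager so the handle is closed after upload
with open("recording.wav", "rb") as audio_file:
    result = client.audio.transcriptions.create(
        model="Qwen/Qwen3-ASR-0.6B",
        file=audio_file,
    )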
print(result.text)
```

The async API is better for long audio (no HTTP timeout risk). The OpenAI endpoint blocks until done — simpler for short clips and SDK integration.
The server also implements `/v1/models` for SDK clients that perform model discovery.

See [docs/server/](docs/server/) for the full API spec, deployment guide, and architecture decision record.
See [examples/](examples/) for copy-paste workflows covering the OpenAI-compatible server, subtitles, meetings, scanner/noisy audio, and batch folders.

## Performance on Apple Silicon

Measured on Apple M4 Pro, macOS 26.2. All numbers from controlled runs with benchmark JSON artifacts committed to the repo. See [docs/BENCHMARKS.md](docs/BENCHMARKS.md) for the full breakdown.

### Latency (0.6B)

| Configuration | Short clip (~2.5s) | 10s clip | Real-time factor | vs fp16 |
|---|---|---|---|---|
| **fp16** (baseline) | 0.46s | 0.83s | 0.08x | — |
| **8-bit** (q8, group 64) | 0.11s | 0.27s | 0.03x | 3.11x faster |
| **4-bit** (q4, group 64) | 0.13s | 0.18s | 0.02x | **4.68x faster** |
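
The real-time factor is processing time divided by audio duration, so values below 1.0 mean faster than real time. A minimal sketch of that arithmetic, assuming the column is derived from the 10s clip latencies (the helper name is illustrative and not part of the package):

```python
def real_time_factor(latency_sec: float, audio_sec: float) -> float:
    """Processing time per second of audio; below 1.0 is faster than real time."""
    if audio_sec <= 0:
        raise ValueError("audio_sec must be positive")
    return latency_sec / audio_sec


# 10s-clip latencies copied from the table above.
latencies = {"fp16": 0.83, "8-bit": 0.27, "4-bit": 0.18}
for name, latency in latencies.items():
    print(f"{name}: RTF {real_time_factor(latency, 10.0):.2f}x")
```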

### English quality refresh (LibriSpeech, 100 speaker-balanced samples per subset)

| Model | Subset | WER | CER | Mean eval latency | RTF |
|---|---|---|---|---|---|
| 0.6B | test-clean | 2.29% | 0.59% | 0.86s | 0.0957 |
| 0.6B | test-other | 4.20% | 2.09% | 0.71s | 0.0985 |
| 1.7B | test-clean | 1.99% | 0.61% | 2.43s | 0.2708 |
| 1.7B | test-other | 3.45% | 1.42% | 2.02s | 0.2814 |

Artifacts: `docs/benchmarks/2026-02-15-librispeech-test-clean-100.json`, `docs/benchmarks/2026-02-15-librispeech-test-other-100.json`, `docs/benchmarks/2026-02-15-librispeech-test-clean-100-1p7b.json`, `docs/benchmarks/2026-02-15-librispeech-test-other-100-1p7b.json`.
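
For readers unfamiliar with the WER and CER columns: both are an edit distance divided by the reference length, computed over words and characters respectively. The sketch below is a minimal illustration, not the project's evaluation code, which may normalize text differently before scoring:

```python
def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance between two token sequences (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# One dropped word out of six reference words.
print(f"{wer('the cat sat on the mat', 'the cat sat on mat'):.3f}")
```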

### Quantization quality (0.6B, LibriSpeech test-clean, 100 speaker-balanced samples)

| Configuration | WER | CER | Mean eval latency | vs fp16 Speed |
|---|---|---|---|---|
| fp16 | 2.29% | 0.59% | 1.09s | — |
| 8-bit (g64) | 2.33% | 0.59% | 0.34s | 3.11x |
| 4-bit (g64) | 2.72% | 0.88% | 0.30s | 4.68x |

8-bit is near-fp16 quality (+0.04pp WER). 4-bit trades +0.43pp WER for maximum speed.

On the harder `test-other` lane (`n=100`, speaker-balanced), 8-bit remains near-fp16
(-0.05pp WER) while 4-bit shows a larger quality tradeoff (+1.38pp WER). Speedups
remain strong (3.66x for 8-bit, 4.37x for 4-bit on the 10s benchmark clip).

Artifact: `docs/benchmarks/2026-02-15-quant-matrix-test-other-speaker100.md`.

### Multilingual quality (FLEURS, 10 languages x 10 samples)

| Model | Primary Error Rate | Mean Latency | Best Languages | Weakest |
|---|---|---|---|---|
| **0.6B** fp16 | 9.37% | 1.44s | Spanish 3.0%, Chinese 4.4%, English 4.6% | Hindi 16.7%, French 18.2%, Arabic 21.5% |
| **1.7B** fp16 | **6.70%** | 4.12s | Spanish 0.7%, Japanese 3.6%, French 4.1% | Chinese 8.5%, Arabic 16.5%, Hindi 17.4% |

The 1.7B delivers a 28% relative improvement, with the biggest gains on French (-14.1pp), Japanese (-4.9pp), and Arabic (-5.0pp). The 1.7B runs ~2.86x slower.

Artifacts: `docs/benchmarks/2026-02-15-manifest-quality-multilingual100-0p6b-refresh.json`, `docs/benchmarks/2026-02-15-manifest-quality-multilingual100-1p7b-refresh.json`.

### MLX vs PyTorch quality (0.6B, Multilingual-100)

| Metric | MLX | PyTorch | Delta |
|---|---:|---:|---:|
| Primary error rate | 9.54% | 10.34% | -0.81pp |
| WER | 16.00% | 16.69% | -0.70pp |
| CER | 5.43% | 5.64% | -0.21pp |

67% of samples produce identical text output. Remaining differences are minor lexical shifts, numeric surface forms (`10,000` vs `zehntausend`), or punctuation — not quality regressions.

On long-form audio (75-90s clips), **MLX is 4.19x faster** than PyTorch on the same machine.

On an expanded real-world mixed lane (AMI IHM meetings + Earnings22 chunked,
`n=200`), MLX remains near parity with PyTorch (**23.23%** vs **23.04%** WER,
`+0.19pp` delta) while staying **3.27x faster** on the same machine
(`1.34s` vs `4.39s` mean latency).

### Optimizations applied

- **Preallocated KV cache** with in-place slice writes and rollback-safe trimming
- **Direct grouped-query fused attention** via `mx.fast.scaled_dot_product_attention` (no explicit K/V head expansion)
- **Hybrid encoder windowing** — dense block-diagonal mask for short audio, segmented per-window execution for long contexts (up to 4.2x faster on long audio)
- **Cached mel filterbank and Hann window** — computed once, reused across calls
- **Native WAV fast-path** — custom binary parser bypasses ffmpeg process startup for PCM/float WAV files (up to 25% faster on quantized short clips)
- **Native in-repo BPE tokenizer** — no `transformers` dependency in runtime transcription path
- **Cached model and tokenizer instances** — repeated `transcribe()` calls skip reload overhead
- **4-bit / 8-bit quantization** — up to 4.7x speed gain with explicit per-profile quality reporting
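
To make the "dense block-diagonal mask" idea concrete: each encoder frame may attend only to frames inside its own window. A plain-Python sketch (the function name and the tiny window size are illustrative; the real encoder works on MLX arrays):

```python
def block_diagonal_mask(num_frames: int, window: int) -> list[list[bool]]:
    """mask[q][k] is True when query frame q may attend to key frame k."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [
        [(q // window) == (k // window) for k in range(num_frames)]
        for q in range(num_frames)
    ]


mask = block_diagonal_mask(num_frames=5, window=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
# xx...
# xx...
# ..xx.
# ..xx.
# ....x
```

The mask grows quadratically with the number of frames, which is one reason to run each window as a separate segment on long audio rather than materializing the full mask.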

Full benchmark report: `docs/BENCHMARKS.md`. Latest refresh snapshot: `docs/benchmarks/2026-02-15-quality-matrix-refresh.md`. All benchmark artifacts are committed under `docs/benchmarks/` for reproducibility.

## Model quality

Word error rates from the [Qwen3-ASR technical report](https://arxiv.org/abs/2601.21337) compared against current open-source and proprietary leaders (lower is better):

### English benchmarks

| Benchmark | GPT-4o-Transcribe | Parakeet-TDT-0.6B | Whisper-large-v3 | Qwen3-ASR-0.6B | **Qwen3-ASR-1.7B** |
|---|---:|---:|---:|---:|---:|
| LibriSpeech test-clean | **1.39** | 1.93 | 1.51 | 2.11 | 1.63 |
| LibriSpeech test-other | 3.75 | 3.59 | 3.97 | 4.55 | **3.38** |
| FLEURS-en | **2.40** | 4.85 | 4.08 | 4.39 | 3.35 |
| GigaSpeech | 25.50 | — | 9.76 | 8.88 | **8.45** |

### Chinese + multilingual benchmarks

| Benchmark | GPT-4o-Transcribe | Whisper-large-v3 | Qwen3-ASR-0.6B | **Qwen3-ASR-1.7B** |
|---|---:|---:|---:|---:|
| WenetSpeech test-net | 15.30 | 9.86 | 5.97 | **4.97** |
| AISHELL-2 test | 4.24 | 5.06 | 3.15 | **2.71** |
| FLEURS (12-lang avg) | — | 5.27 | 7.57 | **4.90** |
| CommonVoice | — | 10.77 | 12.75 | **9.18** |

### Robustness benchmarks

| Benchmark | GPT-4o-Transcribe | Whisper-large-v3 | Qwen3-ASR-0.6B | **Qwen3-ASR-1.7B** |
|---|---:|---:|---:|---:|
| Accented English | 28.56 | 21.30 | 16.62 | **16.07** |
| Extreme Noise | 36.11 | 63.17 | 17.88 | **16.17** |
| Elders & Kids (Mandarin) | 14.27 | 10.61 | 4.48 | **3.81** |

GPT-4o-Transcribe leads on clean English read speech (1.39 WER). Parakeet-TDT-0.6B is strong on English. But Qwen3-ASR dominates on Chinese, multilingual, noisy, and accented speech — and is the only open-source model competitive across all categories.

*Parakeet numbers from [model card](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3). All other numbers from the Qwen3-ASR paper. Robustness benchmarks are Qwen3-ASR internal test sets.*

### Correctness validation

This implementation is validated against the official PyTorch model via multiple parity gates:

- **MLX vs PyTorch head-to-head** — on the current multilingual-100 artifact, MLX shows lower aggregate primary error than PyTorch (9.54% vs 10.34%)
- **Token-level greedy parity** — current multilingual-100 parity artifact shows 67% exact text match and 64% exact token match across 10 languages; remaining diffs are mostly lexical/numeric surface-form differences
- **Expanded parity suite** — tested across LibriSpeech test-clean, test-other, synthetic long mixes, and noise variants (SNR 10dB, 5dB)
- **Long-form parity** — 10 multilingual clips (75-90s each) transcribed correctly with no chunking artifacts, 4.19x faster than PyTorch
- **Mel spectrogram parity** — custom MLX mel matches HuggingFace WhisperFeatureExtractor with MAE < 3e-7
- **Native aligner parity** — MLX forced aligner matches official `qwen-asr` backend with 100% text match rate, <6ms timing MAE, and 2.64x speed advantage on 50 LibriSpeech samples

## Model variants

| | Qwen3-ASR-0.6B (default) | Qwen3-ASR-1.7B |
|---|---|---|
| **Parameters** | 0.6B | 1.7B |
| **Audio encoder layers** | 18 | 24 |
| **Audio encoder dim** | 896 | 1024 |
| **Text decoder layers** | 28 | 28 |
| **Text hidden size** | 1024 | 2048 |
| **Text attention (Q/KV heads)** | GQA (16/8) | GQA (16/8) |
| **RoPE theta** | 1,000,000 | 1,000,000 |
| **HuggingFace** | `Qwen/Qwen3-ASR-0.6B` | `Qwen/Qwen3-ASR-1.7B` |

Both models use interleaved Multi-dimensional RoPE (MRoPE) with sections [24, 20, 20], 128-bin mel spectrograms, and the same tokenizer (vocabulary size 151,936).

```python
# Default: 0.6B (fast, ~1.2 GB memory)
result = transcribe("audio.wav")

# Accuracy-first: 1.7B (~3.4 GB memory)
result = transcribe("audio.wav", model="Qwen/Qwen3-ASR-1.7B")
```

## Timestamps

Word-level timestamps via forced alignment using a dedicated aligner model (`Qwen/Qwen3-ForcedAligner-0.6B`). This path is native MLX (no PyTorch backend bridge):

```bash
mlx-qwen3-asr audio.wav --timestamps
```

```python
result = transcribe("audio.wav", return_timestamps=True)
for segment in result.segments:
    print(f"{segment['start']:.2f}s - {segment['end']:.2f}s: {segment['text']}")
```

SRT/VTT outputs are grouped into subtitle-friendly phrase segments (not one word per cue).
When `-f srt` or `-f vtt` is requested in offline mode, timestamps are auto-enabled.
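
As an illustration of how word-level segments could be grouped into subtitle cues, here is a sketch that uses the documented segment shape (`text`, `start`, `end`). The grouping rule (`max_words`, `max_gap_sec`) is a hypothetical heuristic, not the package's actual algorithm; only the `HH:MM:SS,mmm` timestamp layout is fixed by the SRT format.

```python
def srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp, e.g. 3661.25 -> 01:01:01,250."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def group_words(words, max_words=7, max_gap_sec=0.6):
    """Start a new cue after max_words words or a pause longer than max_gap_sec."""
    cues, current = [], []
    for word in words:
        gap = word["start"] - current[-1]["end"] if current else 0.0
        if current and (len(current) >= max_words or gap > max_gap_sec):
            cues.append(current)
            current = []
        current.append(word)
    if current:
        cues.append(current)
    return cues


def to_srt(words) -> str:
    """Render grouped cues as numbered SRT blocks."""
    blocks = []
    for index, cue in enumerate(group_words(words), start=1):
        text = " ".join(w["text"] for w in cue)  # space-joined; fine for English
        start, end = srt_time(cue[0]["start"]), srt_time(cue[-1]["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{text}\n")
    return "\n".join(blocks)


words = [
    {"text": "hello", "start": 0.5, "end": 0.8},
    {"text": "world", "start": 0.85, "end": 1.2},
    {"text": "again", "start": 2.5, "end": 2.9},
]
print(to_srt(words))
```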

**Measured parity** (LibriSpeech test-clean, `n=50`):

| Metric | Value |
|---|---|
| Text match rate (MLX vs official) | 100% |
| Timing MAE (all word boundaries) | 5.69 ms |
| MLX aligner mean latency | 0.21s |
| Official backend mean latency | 0.56s |
| Relative speed | **2.64x faster** |

The aligner uses O(n log n) LIS-based timestamp correction (Fenwick tree) for monotonicity repair, validated against the legacy O(n^2) implementation via randomized parity tests.
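
The idea behind LIS-based repair is to keep the longest non-decreasing run of predicted times as trusted anchors and re-derive the outliers from their neighbors. The sketch below uses `bisect` (patience sorting), which is also O(n log n), instead of a Fenwick tree, and fills outliers by linear interpolation. Both choices are illustrative assumptions rather than the aligner's exact code.

```python
from bisect import bisect_right


def longest_nondecreasing_indices(values):
    """Indices of one longest non-decreasing subsequence, in O(n log n)."""
    tails = []     # tails[k]: smallest tail value of any run of length k + 1
    tail_idx = []  # position in `values` of that tail
    parent = [-1] * len(values)
    for i, v in enumerate(values):
        k = bisect_right(tails, v)  # bisect_right lets equal values extend a run
        if k == len(tails):
            tails.append(v)
            tail_idx.append(i)
        else:
            tails[k] = v
            tail_idx[k] = i
        parent[i] = tail_idx[k - 1] if k > 0 else -1
    keep = []
    i = tail_idx[-1] if tail_idx else -1
    while i != -1:  # walk parent links back from the end of the longest run
        keep.append(i)
        i = parent[i]
    return keep[::-1]


def repair_monotonic(times):
    """Replace out-of-order timestamps so the sequence never decreases."""
    repaired = list(times)
    keep = longest_nondecreasing_indices(repaired)
    if not keep:
        return repaired
    # Outliers before the first or after the last anchor clamp to that anchor.
    for i in range(keep[0]):
        repaired[i] = repaired[keep[0]]
    for i in range(keep[-1] + 1, len(repaired)):
        repaired[i] = repaired[keep[-1]]
    # Outliers between two anchors are linearly interpolated.
    for left, right in zip(keep, keep[1:]):
        for i in range(left + 1, right):
            frac = (i - left) / (right - left)
            repaired[i] = repaired[left] + frac * (repaired[right] - repaired[left])
    return repaired


# The 5.0 is an outlier; it is replaced by a value between 0.4 and 0.9.
print(repair_monotonic([0.0, 0.4, 5.0, 0.9, 1.3]))
```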

For Japanese/Korean timestamp alignment, install the `[aligner]` extra so `nagisa`/`soynlp` tokenization matches the official path.

## Speaker diarization (optional)

Speaker attribution is available as an offline optional path powered by
`pyannote.audio`:

```python
result = transcribe("meeting.wav", diarize=True)
print(result.speaker_segments)
```

```bash
mlx-qwen3-asr meeting.wav --diarize -f json
```

Current status:
- The public API/CLI and output schema are stable.
- The diarization backend is pyannote.audio 4.x (installed via `[diarize]` extra).
- The default model is `pyannote/speaker-diarization-community-1`; accept its
  Hugging Face terms and configure `PYANNOTE_AUTH_TOKEN` (or `HF_TOKEN`).
- `PYANNOTE_MODEL_ID` can point to another pyannote pipeline or a local
  offline clone.
- `--diarize` auto-enables timestamps and is not supported in `--streaming`/`--mic` mode.
- Migration note (2026-02-15): legacy diarization `window/hop` controls were
  removed (`diarization_window_sec`, `diarization_hop_sec`,
  `--diarization-window-sec`, `--diarization-hop-sec`). Speaker-count controls
  remain (`--num-speakers`, `--min-speakers`, `--max-speakers`).

### Diarization setup troubleshooting

1. Install optional diarization dependencies:
   ```bash
   pip install "mlx-qwen3-asr[diarize]"
   ```
2. Accept the default `pyannote/speaker-diarization-community-1` model terms
   on Hugging Face and set a token:
   ```bash
   export PYANNOTE_AUTH_TOKEN=hf_...
   ```
3. Run a quick smoke test:
   ```bash
   mlx-qwen3-asr meeting.wav --diarize -f json
   ```

Common errors and fixes:
- `requires optional dependency 'pyannote.audio'`: install `[diarize]` extra.
- `requires PyTorch via pyannote dependencies`: reinstall `[diarize]` extra in the active environment.
- `Failed to initialize pyannote pipeline ...`: accept model terms on Hugging Face, set `PYANNOTE_AUTH_TOKEN` (or `HF_TOKEN`), and inspect the `Root cause:` details.
- `--streaming does not support --diarize` / `--mic does not support --diarize`: use offline file transcription mode for diarization.

## Quantization

Convert and run a quantized model:

```bash
python scripts/convert.py \
  --model Qwen/Qwen3-ASR-0.6B \
  --quantize 4 --group-size 64 \
  --output-dir ./qwen3-asr-4bit

mlx-qwen3-asr audio.wav --model ./qwen3-asr-4bit
```

Recommended profiles:
- **Speed-first**: 4-bit, group_size=64 — 4.68x faster / +0.43 WER (`test-clean`), 4.37x faster / +1.38 WER (`test-other`)
- **Quality-first**: 8-bit, group_size=64 — 3.11x faster / +0.04 WER (`test-clean`), 3.66x faster / -0.05 WER (`test-other`)
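
To give a feel for what the bit width and `group_size` control, here is a plain-Python sketch of group-wise affine quantization: each group of weights shares one scale and one offset, and every weight is stored as a small integer code. It is a conceptual illustration only; MLX's actual kernels and bit packing differ in detail.

```python
import math


def quantize_group(values, bits):
    """Map one group of floats to integer codes in [0, 2**bits - 1]."""
    levels = (1 << bits) - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels or 1.0  # constant group: any non-zero scale works
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize_group(codes, scale, offset):
    """Recover approximate floats from the codes."""
    return [offset + c * scale for c in codes]


def quantize(weights, bits=4, group_size=64):
    """Quantize a flat weight list group by group."""
    return [
        quantize_group(weights[start:start + group_size], bits)
        for start in range(0, len(weights), group_size)
    ]


def max_abs_error(weights, bits, group_size=64):
    """Largest round-trip error after quantize + dequantize."""
    restored = []
    for codes, scale, offset in quantize(weights, bits, group_size):
        restored.extend(dequantize_group(codes, scale, offset))
    return max(abs(a - b) for a, b in zip(weights, restored))


# Synthetic weights, purely for demonstration.
weights = [math.sin(0.37 * i) for i in range(256)]
print("4-bit max error:", max_abs_error(weights, bits=4))
print("8-bit max error:", max_abs_error(weights, bits=8))
```

Fewer bits mean coarser steps inside each group, which is the mechanism behind the larger WER delta of the 4-bit profile.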

Publish quantized models to HuggingFace:

```bash
HF_TOKEN=... python scripts/publish_quantized.py \
  --source-model Qwen/Qwen3-ASR-0.6B \
  --repo-id YOUR_USER/mlx-qwen3-asr-0.6b-4bit \
  --bits 4
```

## Output formats

```bash
mlx-qwen3-asr audio.wav -f txt           # plain text
mlx-qwen3-asr audio.wav -f srt -o out/   # SRT subtitles
mlx-qwen3-asr audio.wav -f json          # structured JSON
mlx-qwen3-asr audio.wav -f vtt -o out/   # WebVTT
mlx-qwen3-asr *.wav -f all -o out/       # all formats at once
```

Supported: `txt`, `json`, `srt`, `vtt`, `tsv`.
Subtitle formats (`srt`/`vtt`) require timestamp segments and are only supported in offline mode.

## Supported languages

Qwen3-ASR officially lists 30 core languages:

| | | | |
|---|---|---|---|
| Arabic | Cantonese | Chinese | Czech |
| Danish | Dutch | English | Filipino |
| Finnish | French | German | Greek |
| Hindi | Hungarian | Indonesian | Italian |
| Japanese | Korean | Macedonian | Malay |
| Persian | Polish | Portuguese | Romanian |
| Russian | Spanish | Swedish | Thai |
| Turkish | Vietnamese | | |

Plus 22 Chinese dialects (Sichuan, Shanghai, Cantonese, and others), for 52 total language/dialect variants.

Print CLI-accepted aliases/codes:

```bash
mlx-qwen3-asr --list-languages
```

## Experimental features

### Speculative decoding

Uses the 0.6B model as a draft to accelerate 1.7B inference. Currently parity-safe but slower on tested workloads due to draft audio encoder overhead:

```bash
mlx-qwen3-asr audio.wav \
  --model Qwen/Qwen3-ASR-1.7B \
  --draft-model Qwen/Qwen3-ASR-0.6B \
  --num-draft-tokens 4
```

```python
result = transcribe(
    "audio.wav",
    model="Qwen/Qwen3-ASR-1.7B",
    draft_model="Qwen/Qwen3-ASR-0.6B",
    num_draft_tokens=4,
)
```

Status: greedy parity is verified, but the speculative path currently runs at only 0.53-0.55x baseline speed on short/10s clips, i.e. it is slower than plain decoding. It stays opt-in until benchmark evidence shows a net speed win.

### Domain vocabulary context

When transcribing specialized audio — earnings calls, medical dictation, legal
proceedings — the model can confuse rare terms with more common homophones.
The `context` parameter lets you provide a hint: a string of domain-specific
words or phrases that gets injected into the system prompt, nudging the decoder
toward the correct vocabulary.

This matches the official Qwen3-ASR `context` API. The format is
space-separated terms:

```python
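from mlx_qwen3_asr import transcribe
from mlx_qwen3_asr.streaming import init_streaming

# Finance: steers decoding toward "EBITDA" rather than "e-bit-da", "FX" rather than "effects", etc.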
result = transcribe("earnings-call.wav", context="EBITDA non-GAAP FX hedging")

# Medical
result = transcribe("consult.wav", context="metformin HbA1c nephropathy")

# Also works with streaming
state = init_streaming(context="EBITDA non-GAAP FX hedging")
```

```bash
mlx-qwen3-asr earnings-call.wav --context "EBITDA non-GAAP FX hedging"
```

For batch transcription, pass a list of per-audio context strings:

```python
results = transcribe_batch(
    [audio_en, audio_zh],
    context=["EBITDA non-GAAP", "交易 停滞"],
)
```

When omitted, the system prompt is empty (matching the official default) — no
domain bias is applied.

### Streaming

Rolling decode implementation for near-real-time transcription:

```python
from mlx_qwen3_asr.streaming import (
    init_streaming,
    feed_audio,
    finish_streaming,
    streaming_metrics,
)

state = init_streaming(chunk_size_sec=2.0, max_context_sec=30.0)
for chunk in audio_chunks:
    state = feed_audio(chunk, state)
    print(state.text)
state = finish_streaming(state)
print(streaming_metrics(state))
```

CLI:

```bash
mlx-qwen3-asr --streaming --stream-finalization-mode accuracy audio.wav
# Optional: speech-aware boundary selection near chunk edges
mlx-qwen3-asr --streaming --stream-endpointing-mode energy audio.wav
```

Live microphone transcription:

```bash
mlx-qwen3-asr --mic
mlx-qwen3-asr --mic --language Japanese
```

Optional microphone flags: `--mic-device`, `--mic-duration-sec`, `--mic-sample-rate`.

- Ingests small PCM chunks (default 2s)
- Incremental decoder KV-cache reuse across chunk turns (avoids O(n²) re-transcription)
- Bounded context window (default 30s) for stable memory/runtime
- Prefix rollback controls (`unfixed_chunk_num`, `unfixed_token_num`)
- `stable_text` is monotonic by design: corrections that would shorten already-stable
  prefix text are intentionally not applied to the stable prefix (favoring stability
  over maximal editability in partial output)
- Optional speech-aware endpointing (`endpointing_mode="energy"`) that selects
  low-energy boundaries near chunk edges
- Configurable finalization policy: `finalization_mode="accuracy"` (default) or `"latency"`
- Backward-compatible override: `enable_tail_refine=True|False`
- Input validation: handles int16 PCM normalization, non-1D arrays, empty input
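
The energy-based endpointing idea can be sketched in a few lines: near a nominal chunk edge, look for the quietest short frame and cut there instead of mid-word. Everything here (function name, frame length, search radius) is a hypothetical illustration, not the package's implementation.

```python
import math


def quietest_boundary(samples, nominal, search_radius, frame_len=160):
    """Return a sample index near `nominal` where short-time energy is lowest."""
    lo = max(0, nominal - search_radius)
    hi = min(len(samples) - frame_len, nominal + search_radius)
    if hi < lo:
        return nominal  # not enough audio to search; keep the nominal cut
    best_start, best_energy = lo, float("inf")
    for start in range(lo, hi + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        if energy < best_energy:
            best_start, best_energy = start, energy
    return best_start + frame_len // 2  # cut in the middle of the quiet frame


# Synthetic 16 kHz signal: 1 s tone, 0.1 s silence, 1 s tone.
sample_rate = 16_000
tone = [0.5 * math.sin(2 * math.pi * 220 * n / sample_rate) for n in range(sample_rate)]
signal = tone + [0.0] * 1600 + tone

cut = quietest_boundary(signal, nominal=15_000, search_radius=4_000)
print(cut, "falls inside the silent gap:", 16_000 <= cut < 17_600)
```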

## API reference

### `transcribe(audio, *, model, draft_model, context, language, return_timestamps, diarize, diarization_num_speakers, diarization_min_speakers, diarization_max_speakers, return_chunks, forced_aligner, dtype, max_new_tokens, num_draft_tokens, verbose, on_progress)`

Transcribe audio to text. Accepts a file path, numpy array, `mx.array`, or `(array, sample_rate)` tuple. Returns a `TranscriptionResult`.

`max_new_tokens=None` (default) uses a duration-aware per-chunk decode budget to
avoid runaway generation on noisy inputs that do not emit EOS. Pass an integer
to override the cap explicitly. If you use unusually long custom chunks and see
`truncated=True`, pass a larger explicit value for that workload.

Additional Python entry points:
- `transcribe_batch(audios, ...)` and `transcribe_batch_async(audios, ...)`
- `transcribe_async(audio, ...)`

### `Session(model, *, dtype, tokenizer_model)`

Explicit transcription session. Owns model and tokenizer state with no hidden globals.
- Offline: `session.transcribe(audio, ...)` with the same parameters as top-level `transcribe`.
- Async: `await session.transcribe_async(audio, ...)`.
- Streaming: `session.init_streaming(...)`, `session.feed_audio(pcm, state)`, `session.finish_streaming(state)`.
- Introspection: `session.model_info` (model id/path, dtype, vocab size, model-declared language codes).

### `streaming_metrics(state)`

Return streaming diagnostics for a session state:
- `partial_stability`
- `rewrite_rate`
- `finalization_delta_chars`

### `load_model(name_or_path, *, dtype)`

Load a Qwen3-ASR model and config from HuggingFace or local path. Returns `(model, config)`.

### `load_audio(path_or_url)`

Load and resample audio to mono 16 kHz. Returns an `mx.array`.

### `ForcedAligner(model_path, *, dtype, backend)`

Word-level forced aligner. Native backend: `mlx` (default).

### `TranscriptionResult`

Frozen dataclass:
- `text` (str) — transcribed text
- `language` (str) — detected or forced language (canonicalized names, e.g. `English`)
- `segments` (list[dict] | None) — word-level timestamps when requested: `[{"text": "hello", "start": 0.5, "end": 0.8}, ...]`
- `chunks` (list[dict] | None) — chunk-level transcript and generation metadata when `return_chunks=True`
- `speaker_segments` (list[dict] | None) — speaker-attributed spans when `diarize=True`: `[{"speaker": "SPEAKER_00", "start": 0.0, "end": 2.0, "text": "..."}, ...]`
- `finish_reason` (str | None) — aggregate decode stop reason: `eos`, `repetition`, `length`, or `mixed`
- `truncated` (bool) — true when any chunk exhausted its token budget before EOS/repetition
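
One plausible reading of how the aggregate fields relate to per-chunk results, written as a sketch. The list of per-chunk stop reasons and the aggregation rule are assumptions inferred from the field descriptions above, not taken from the package source.

```python
def aggregate_finish(chunk_reasons):
    """Collapse per-chunk stop reasons into (finish_reason, truncated)."""
    if not chunk_reasons:
        return None, False
    unique = set(chunk_reasons)
    # A single shared reason is reported as-is; anything else is "mixed".
    finish_reason = unique.pop() if len(unique) == 1 else "mixed"
    # "length" means a chunk exhausted its token budget before stopping.
    truncated = "length" in chunk_reasons
    return finish_reason, truncated


print(aggregate_finish(["eos", "eos"]))     # ('eos', False)
print(aggregate_finish(["eos", "length"]))  # ('mixed', True)
```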

## Quality gates

This project enforces parity with the official PyTorch implementation. No optimization lands without passing quality gates and committing benchmark artifacts.

```bash
# Unit tests (462 tests)
pytest -q

# Fast quality gate
python scripts/quality_gate.py --mode fast

# Release gate with token-level parity (downloads model weights)
RUN_REFERENCE_PARITY=1 python scripts/quality_gate.py --mode release

# Speaker-balanced WER evaluation (100 samples)
python scripts/eval_librispeech.py --subset test-clean --samples 100 --sampling speaker_round_robin

# Latency benchmark
python scripts/benchmark_asr.py tests/fixtures/test_speech.wav \
  --model Qwen/Qwen3-ASR-0.6B --runs 5 \
  --json-output docs/benchmarks/latest.json
```

Additional quality lanes available:
- **Aligner parity**: `RUN_ALIGNER_PARITY=1` — validates MLX aligner against official backend
- **Expanded parity suite**: `RUN_REFERENCE_PARITY_SUITE=1` — test-clean, test-other, long mixes, noise variants with Unicode-safe text comparison
- **Multilingual parity**: manifest-driven workflow via `scripts/build_multilingual_manifest.py` for cross-language validation
- **Streaming manifest quality**: `RUN_STREAMING_MANIFEST_QUALITY_EVAL=1` with `STREAMING_MANIFEST_QUALITY_EVAL_JSONL=...` — multi-file streaming stability/rewrite/finalization lane via `scripts/eval_streaming_manifest.py`
- **Real-world long-form quality**: `RUN_REALWORLD_LONGFORM_EVAL=1` on full-recording Earnings22 manifests
- **Diarization quality**: `RUN_DIARIZATION_QUALITY_EVAL=1` with `DIARIZATION_QUALITY_EVAL_JSONL=...` — DER/JER lane via `scripts/eval_diarization.py`

See `docs/QUALITY_GATE.md` for full documentation.
Evaluation coverage status and prioritized gaps are tracked in `docs/EVAL_GAPS.md`.

## Architecture overview

```
Audio (16kHz mono)
  → 128-bin log-mel spectrogram (native MLX, Whisper-compatible)
  → Conv2d stem (3 layers, stride 2 each → 8x downsample)
  → Sinusoidal position embeddings
  → Windowed transformer encoder (18 or 24 layers, hybrid dense/segmented attention)
  → LayerNorm + GELU projection → audio features

Chat-template prompt (context is optional domain vocabulary, empty by default):
  <|im_start|>system\n{context}<|im_end|>
  <|im_start|>user\n<|audio_start|><|audio_pad|>*N<|audio_end|><|im_end|>
  <|im_start|>assistant\n

  → Token embedding (151,936 vocab)
  → Replace audio_pad positions with encoded audio features
  → Qwen3 text decoder (28 layers, interleaved MRoPE, SwiGLU, RMSNorm)
  → Autoregressive decode with preallocated KV cache
  → Parse output: "language English<asr_text>transcribed text here"
```
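
The final parsing step can be illustrated with a small sketch based on the raw output shape shown above (`language <name><asr_text><text>`). The function name and the fallback for output without the marker are assumptions, not the package's tokenizer code.

```python
def parse_asr_output(raw: str):
    """Split 'language English<asr_text>hello' into ('English', 'hello')."""
    marker = "<asr_text>"
    header, sep, text = raw.partition(marker)
    if not sep:
        # No marker found: treat everything as text, language unknown.
        return None, raw.strip()
    header = header.strip()
    prefix = "language "
    language = header[len(prefix):].strip() if header.startswith(prefix) else None
    return language or None, text.strip()


print(parse_asr_output("language English<asr_text>transcribed text here"))
# ('English', 'transcribed text here')
```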

Key architectural details:
- **Interleaved MRoPE** — sections [24, 20, 20] with stride-3 frequency assignment across temporal, height, and width dimensions. This is the detail other MLX ports get wrong (using standard RoPE or chunked assignment).
- **Audio encoder uses LayerNorm + bias** — different from the text decoder which uses RMSNorm without bias.
- **Q/K norms** — RMSNorm applied per-head on queries and keys before attention (Qwen3 innovation).
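
As a rough illustration of what "interleaved" means for sections [24, 20, 20], the sketch below assigns each of the 64 rotary frequency slots to one of three position axes. It assumes one common interleaved layout (height and width take every third slot starting at offsets 1 and 2, and the temporal axis keeps the rest); treat the exact slot order as an assumption and check it against `mrope.py`. The chunked variant is shown only for contrast.

```python
def interleaved_assignment(sections=(24, 20, 20)):
    """Axis id (0=temporal, 1=height, 2=width) for each rotary frequency slot."""
    axes = [0] * sum(sections)  # temporal by default
    for axis in (1, 2):
        # Stride-3 walk: axis 1 takes slots 1, 4, 7, ...; axis 2 takes 2, 5, 8, ...
        for slot in range(axis, sections[axis] * 3, 3):
            axes[slot] = axis
    return axes


def chunked_assignment(sections=(24, 20, 20)):
    """Contiguous blocks per axis: the non-interleaved layout."""
    axes = []
    for axis, count in enumerate(sections):
        axes.extend([axis] * count)
    return axes


inter = interleaved_assignment()
print(inter[:9])                           # [0, 1, 2, 0, 1, 2, 0, 1, 2]
print(inter[-4:])                          # [0, 0, 0, 0]
print([inter.count(a) for a in range(3)])  # [24, 20, 20]
print(chunked_assignment()[:9])            # [0, 0, 0, 0, 0, 0, 0, 0, 0]
```

Both layouts give each axis the same number of slots; they differ in which frequencies each axis receives, so the two are not interchangeable when loading the same weights.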

## Project structure

```
mlx_qwen3_asr/           # 7,602 lines of source
├── transcribe.py         # High-level pipeline + batch/async + diarization (739 lines)
├── cli.py                # CLI entry point and UX guardrails (664 lines)
├── streaming.py          # KV-cache streaming + context trimming (624 lines)
├── tokenizer.py          # Native BPE tokenizer + output parsing (607 lines)
├── diarization.py        # Optional pyannote integration + attribution helpers
├── audio.py              # Mel spectrogram + audio I/O (526 lines)
├── encoder.py            # Audio encoder (512 lines)
├── decoder.py            # Text decoder + KV cache (464 lines)
├── forced_aligner.py     # Forced alignment + LIS correction (439 lines)
├── model.py              # Top-level model + audio-text fusion (372 lines)
├── generate.py           # Autoregressive + speculative decode (350 lines)
├── load_models.py        # Model loading + caching (256 lines)
├── config.py             # Dataclass configs (228 lines)
├── server.py             # HTTP server + OpenAI compat (697 lines)
├── session.py            # Session API (224 lines)
├── writers.py            # Output format writers (221 lines)
├── mrope.py              # Interleaved MRoPE (167 lines)
├── chunking.py           # Long audio splitting (104 lines)
├── attention.py          # Attention utilities (67 lines)
├── convert.py            # Weight remapping (67 lines)
├── eval_metrics.py       # WER/CER/BERTScore helpers (65 lines)
└── cache_utils.py        # KV cache utilities (57 lines)

tests/                    # 7,391 lines, 462 tests
scripts/                  # Benchmarks, evaluation, conversion, publishing
docs/                     # Architecture, decisions, benchmarks, roadmap
docs/benchmarks/          # 160+ committed artifacts for reproducibility
```

## Development

```bash
git clone https://github.com/moona3k/mlx-qwen3-asr.git
cd mlx-qwen3-asr
pip install -e ".[dev]"
pytest -q                 # 462 tests
```

## Acknowledgments

- [Qwen team](https://github.com/QwenLM) at Alibaba for the Qwen3-ASR model
- [Apple MLX team](https://github.com/ml-explore/mlx) for the MLX framework
- [mlx-whisper](https://github.com/ml-explore/mlx-examples) for architecture patterns and inspiration

## License

Apache 2.0. See [LICENSE](LICENSE) for details.
