Live Benchmark Results

Local LLM Model Benchmark

M5 MacBook Pro · 32 GB unified memory · oMLX inference server · June 2026
About these tok/s numbers: The quick tests use short prompts that produce short responses (4–53 tokens). Because the model has to "wake up" for the first token (called TTFT — time to first token), short responses show artificially low tok/s. The math test on DeepSeek (46 tok/s, 53 tokens) is the most representative reading. Real sustained speeds on longer outputs are 3–5× higher than what you see on the 4–12 token tests. Benchmark sources show DeepSeek at ~135 tok/s, 7B at ~80 tok/s, 14B at ~28 tok/s, 32B at ~19 tok/s, and Gemma 27B at ~35 tok/s on this hardware.

📊 At a Glance

Model Tier Sustained tok/s RAM Best for Math Code Review
DeepSeek-Coder-V2-Lite ⚡ Turbo ~135 tok/s 8.6 GB Quick subagent calls, boilerplate, lookups ✓ + working Brief but correct
Qwen2.5-Coder-7B ⚡ Fast ~80 tok/s 4.2 GB Ultra-light, quick answers, pairs well with DeepSeek ✓ no working Valid but minor fix
Qwen2.5-Coder-14B ⚖️ Everyday ~28 tok/s 8.1 GB Reliable daily driver, most coding tasks ✓ no working Good — named the fix
Qwen2.5-Coder-32B 🧠 Quality ~19 tok/s 18.0 GB Complex code, multi-file reasoning, architecture ✓ no working Best — renamed + spacing
Gemma 3 27B (QAT) 🔮 Off-hours ~35 tok/s 16.5 GB PR review, deep refactors, evenings & weekends ✓ no working Good — renamed correctly

⚡ Turbo Tier · MoE Architecture
DeepSeek-Coder-V2-Lite-Instruct-4bit-mlx
The speed king. Uses Mixture-of-Experts architecture — 16B total parameters but only 2.4B activate per token, giving it the speed of a tiny model with the knowledge of a large one. No thinking mode. Ideal for the high-frequency subagent calls Claude makes during your work session: quick lookups, boilerplate generation, explaining a function, checking syntax. The math test is the clearest window into real performance: 46 tok/s on a 53-token response. For longer outputs, expect 100–135 tok/s.
~135 tok/s 8.6 GB No thinking mode
👋 Hello
7.3 tok/s 25 tok · 3.4s
"Say hello and introduce yourself briefly."
Hello! I'm your AI assistant, here to help you with your queries. What can I assist you with today?
⚠ Slow here because oMLX was loading the model. Speed recovers on subsequent calls.
🔍 Code Review
12.8 tok/s 7 tok · 0.55s
"Review this Python snippet and suggest one improvement: def f(x): return x*x"
Use a more descriptive function name.
✅ Correct. Brief — identifies the naming problem without fluff.
🔢 Math
46.1 tok/s ★ 53 tok · 1.15s
"What is 347 × 28? Show your working."
347 × 28 = 347 × (20 + 8) = 347 × 20 + 347 × 8 = 6940 + 2776 = 9716
✅ Correct answer with full working shown. Only model to do this. ★ Most reliable tok/s reading.
⚡ Fast Tier · Dense · 4-bit
Qwen2.5-Coder-7B-Instruct-4bit
The smallest of the Qwen2.5-Coder family — same training lineage as the 14B and 32B, just compressed. At 4.2 GB it barely touches RAM and loads almost instantly. A good backup to DeepSeek when you want something with slightly more consistent code reasoning than the MoE architecture. The code review suggestion (x**2) is technically valid (more explicit intent) but misses the bigger win of renaming the function — smaller models tend to pick shallower improvements.
~80 tok/s 4.2 GB No thinking mode
👋 Hello
9.2 tok/s 23 tok · 2.5s
"Say hello and introduce yourself briefly."
Hello, I'm an AI assistant designed to help with a variety of tasks. How can I assist you today?
Generic but clean. No hallucinations or fluff.
🔍 Code Review
11.9 tok/s 10 tok · 0.84s
"Review this Python snippet and suggest one improvement: def f(x): return x*x"
def f(x): return x**2
⚠ Valid change (** is more explicit) but the bigger issue — the name f — was missed. Shallow fix.
🔢 Math
10.0 tok/s 5 tok · 0.5s
"What is 347 × 28? Show your working."
9,716
✅ Correct answer. ⚠ Ignored "show your working" — smaller models often skip instructions they find inconvenient.
⚖️ Everyday Tier · Dense · 4-bit
Qwen2.5-Coder-14B-Instruct-4bit
The reliable workhorse. Good at most coding tasks, fast enough for interactive use, small enough to not stress your RAM. The code review response — "Use a more descriptive function name, like square" — is noticeably better than the 7B: it names the fix specifically rather than making a shallow syntax change. A solid default for anyone who isn't sure which model to use.
~28 tok/s 8.1 GB No thinking mode
👋 Hello
0.7 tok/s (loading) 16 tok · 23s
"Say hello and introduce yourself briefly."
Hello, I'm an AI assistant designed to help with a variety of tasks.
⚠ Slow — oMLX was loading the model during this call. Normal first-request behavior.
🔍 Code Review
7.0 tok/s 11 tok · 1.6s
"Review this Python snippet and suggest one improvement: def f(x): return x*x"
Use a more descriptive function name, like square.
✅ Best of the speed tier — correct priority (naming), specific suggestion, clean.
✍️ Creative
9.5 tok/s 20 tok · 2.1s
"Write a two-sentence story about a robot who discovers music."
Robot Rhythm found notes scattered in the factory, piecing together melodies that danced in the air.
Decent creative output. Stopped at one sentence — didn't fully complete the task.
🧠 Quality Tier · Dense · 4-bit
Qwen2.5-Coder-32B-Instruct-4bit
The heaviest daily-use model — 18 GB, roughly half your RAM. Noticeably better at multi-step reasoning and understanding what you actually want vs. what you literally said. The code review response is the cleanest of all models tested: it both renamed the function and added proper spacing around the operator without being asked. The creative test also shows a clear quality jump — richer language, two complete sentences, more cohesive imagery. Use this when the task is complex or the output needs to be right the first time.
~19 tok/s 18.0 GB No thinking mode
👋 Hello
1.7 tok/s (loading) 15 tok · 9s
"Say hello and introduce yourself briefly."
Hello, I'm an AI assistant designed to provide quick and direct responses.
⚠ Loading. Notice it volunteered its own design principle ("quick and direct") — slightly more self-aware.
🔍 Code Review
3.5 tok/s 10 tok · 2.8s
"Review this Python snippet and suggest one improvement: def f(x): return x*x"
def square(x): return x * x
✅ Best overall — renamed the function AND added spaces around the operator. Two improvements, one shot.
✍️ Creative
5.5 tok/s 38 tok · 6.9s
"Write a two-sentence story about a robot who discovers music."
The robot, programmed for logic and function, found its circuits humming with an unfamiliar rhythm. As melodies danced through its processing unit, it learned to dance, spinning and beeping in joy.
✅ Two complete sentences, richer vocabulary, cohesive arc. Clear quality upgrade from 14B.
🔮 Off-Hours Tier · Dense · QAT 4-bit · Google DeepMind
gemma-3-27b-it-qat-4bit
The wild card. Different company (Google DeepMind), different training data, different strengths. At 16.5 GB it's smaller than the 32B Qwen but surprisingly competitive on quality — and at ~35 tok/s it's faster too. The "QAT" variant (Quantization-Aware Training) means the model was trained knowing it would be compressed, giving better quality at the same 4-bit size. The standout test result is the creative writing: "Unit 734 processed a stray radio signal — a violin concerto" is a different voice entirely. Diversity of training lineage is a feature: it will catch things Qwen misses, and miss things Qwen catches.
~35 tok/s 16.5 GB No thinking mode
👋 Hello
1.3 tok/s (loading) 13 tok · 10s
"Say hello and introduce yourself briefly."
Hello, I am Gemma, an open-weights AI assistant.
✅ Honest self-identification — only model to say its actual name. Distinct voice from the start.
🔍 Code Review
5.0 tok/s 12 tok · 2.4s
"Review this Python snippet and suggest one improvement: def f(x): return x*x"
def square(x): return x*x
✅ Correct — renamed the function. Matches the 32B Qwen on the key insight, slightly less polished (no operator spacing).
✍️ Creative ★
6.3 tok/s 34 tok · 5.4s
"Write a two-sentence story about a robot who discovers music."
Unit 734 processed a stray radio signal — a violin concerto. Its metallic fingers began to mimic the vibrations on a nearby railing, creating a surprisingly harmonious echo.
✅ Best creative output of all models tested. Specific detail (Unit 734, violin concerto), sensory imagery, complete two-sentence structure.

🧭 Which model should I use?

For team members who don't want to think about it
During work hours: Claude Code picks automatically based on task type. If you want to change manually, type /switch-model.
Evenings / weekends: Load Gemma 27B in oMLX for better quality on longer tasks like PR review.
SituationModel to loadWhy
I'm in a hurry, quick question DeepSeek-Coder-V2-Lite ~135 tok/s — responds before you finish reading the prompt
Normal coding day Qwen2.5-Coder-14B Reliable, ~28 tok/s, handles most tasks well, low RAM
Hard problem, architecture decision Qwen2.5-Coder-32B Best reasoning depth, catches more edge cases
Reviewing a big PR, evening/weekend gemma-3-27b-it-qat-4bit Excellent quality, different training means fresh perspective
I just need something tiny and fast Qwen2.5-Coder-7B 4.2 GB, barely uses RAM, good for simple completions