About these tok/s numbers: The quick tests use short prompts that produce short responses (4–53 tokens). Because the model has to "wake up" for the first token (called TTFT — time to first token), short responses show artificially low tok/s. The math test on DeepSeek (46 tok/s, 53 tokens) is the most representative reading. Real sustained speeds on longer outputs are 3–5× higher than what you see on the 4–12 token tests. Benchmark sources show DeepSeek at ~135 tok/s, 7B at ~80 tok/s, 14B at ~28 tok/s, 32B at ~19 tok/s, and Gemma 27B at ~35 tok/s on this hardware.
📊 At a Glance
| Model |
Tier |
Sustained tok/s |
RAM |
Best for |
Math |
Code Review |
| DeepSeek-Coder-V2-Lite |
⚡ Turbo |
~135 tok/s |
8.6 GB |
Quick subagent calls, boilerplate, lookups |
✓ + working |
Brief but correct |
| Qwen2.5-Coder-7B |
⚡ Fast |
~80 tok/s |
4.2 GB |
Ultra-light, quick answers, pairs well with DeepSeek |
✓ no working |
Valid but minor fix |
| Qwen2.5-Coder-14B |
⚖️ Everyday |
~28 tok/s |
8.1 GB |
Reliable daily driver, most coding tasks |
✓ no working |
Good — named the fix |
| Qwen2.5-Coder-32B |
🧠 Quality |
~19 tok/s |
18.0 GB |
Complex code, multi-file reasoning, architecture |
✓ no working |
Best — renamed + spacing |
| Gemma 3 27B (QAT) |
🔮 Off-hours |
~35 tok/s |
16.5 GB |
PR review, deep refactors, evenings & weekends |
✓ no working |
Good — renamed correctly |
"Say hello and introduce yourself briefly."
Hello! I'm your AI assistant, here to help you with your queries. What can I assist you with today?
⚠ Slow here because oMLX was loading the model. Speed recovers on subsequent calls.
"Review this Python snippet and suggest one improvement: def f(x): return x*x"
Use a more descriptive function name.
✅ Correct. Brief — identifies the naming problem without fluff.
"What is 347 × 28? Show your working."
347 × 28 = 347 × (20 + 8) = 347 × 20 + 347 × 8 = 6940 + 2776 = 9716
✅ Correct answer with full working shown. Only model to do this. ★ Most reliable tok/s reading.
"Say hello and introduce yourself briefly."
Hello, I'm an AI assistant designed to help with a variety of tasks. How can I assist you today?
Generic but clean. No hallucinations or fluff.
"Review this Python snippet and suggest one improvement: def f(x): return x*x"
def f(x): return x**2
⚠ Valid change (** is more explicit) but the bigger issue — the name f — was missed. Shallow fix.
"What is 347 × 28? Show your working."
9,716
✅ Correct answer. ⚠ Ignored "show your working" — smaller models often skip instructions they find inconvenient.
"Say hello and introduce yourself briefly."
Hello, I'm an AI assistant designed to help with a variety of tasks.
⚠ Slow — oMLX was loading the model during this call. Normal first-request behavior.
"Review this Python snippet and suggest one improvement: def f(x): return x*x"
Use a more descriptive function name, like square.
✅ Best of the speed tier — correct priority (naming), specific suggestion, clean.
"Write a two-sentence story about a robot who discovers music."
Robot Rhythm found notes scattered in the factory, piecing together melodies that danced in the air.
Decent creative output. Stopped at one sentence — didn't fully complete the task.
"Say hello and introduce yourself briefly."
Hello, I'm an AI assistant designed to provide quick and direct responses.
⚠ Loading. Notice it volunteered its own design principle ("quick and direct") — slightly more self-aware.
"Review this Python snippet and suggest one improvement: def f(x): return x*x"
def square(x): return x * x
✅ Best overall — renamed the function AND added spaces around the operator. Two improvements, one shot.
"Write a two-sentence story about a robot who discovers music."
The robot, programmed for logic and function, found its circuits humming with an unfamiliar rhythm. As melodies danced through its processing unit, it learned to dance, spinning and beeping in joy.
✅ Two complete sentences, richer vocabulary, cohesive arc. Clear quality upgrade from 14B.
"Say hello and introduce yourself briefly."
Hello, I am Gemma, an open-weights AI assistant.
✅ Honest self-identification — only model to say its actual name. Distinct voice from the start.
"Review this Python snippet and suggest one improvement: def f(x): return x*x"
def square(x): return x*x
✅ Correct — renamed the function. Matches the 32B Qwen on the key insight, slightly less polished (no operator spacing).
"Write a two-sentence story about a robot who discovers music."
Unit 734 processed a stray radio signal — a violin concerto. Its metallic fingers began to mimic the vibrations on a nearby railing, creating a surprisingly harmonious echo.
✅ Best creative output of all models tested. Specific detail (Unit 734, violin concerto), sensory imagery, complete two-sentence structure.
🧭 Which model should I use?
For team members who don't want to think about it
During work hours: Claude Code picks automatically based on task type. If you want to change manually, type
/switch-model.
Evenings / weekends: Load Gemma 27B in oMLX for better quality on longer tasks like PR review.
| Situation | Model to load | Why |
| I'm in a hurry, quick question |
DeepSeek-Coder-V2-Lite |
~135 tok/s — responds before you finish reading the prompt |
| Normal coding day |
Qwen2.5-Coder-14B |
Reliable, ~28 tok/s, handles most tasks well, low RAM |
| Hard problem, architecture decision |
Qwen2.5-Coder-32B |
Best reasoning depth, catches more edge cases |
| Reviewing a big PR, evening/weekend |
gemma-3-27b-it-qat-4bit |
Excellent quality, different training means fresh perspective |
| I just need something tiny and fast |
Qwen2.5-Coder-7B |
4.2 GB, barely uses RAM, good for simple completions |