Local LLMs: What Your Team Needs to Know
A plain-English guide to running AI models on your own hardware — no cloud required, no data leaving your machine. Written for engineers who haven't dug into this before.
🧠 What is a local LLM?
An LLM (Large Language Model) is the AI engine that powers tools like ChatGPT and Claude. Normally these run on massive server clusters owned by companies like OpenAI or Anthropic — you send your prompt over the internet, their servers think about it, and send back a response.
A local LLM runs that same kind of model directly on your laptop or workstation. No internet required. Your prompts never leave the machine.
The tradeoff is that local models are generally smaller than cloud models — you're limited by how much RAM your machine has. A laptop that can comfortably run a 32-billion-parameter model is excellent; GPT-4 is rumored to be around 1.7 trillion parameters. So we're working with capable but not frontier-sized models.
That said, for everyday coding tasks — reviewing diffs, generating boilerplate, explaining error messages, writing tests — a well-chosen local model is genuinely useful. That's the point of this setup.
📐 Model sizes: what "7B" and "32B" mean
When you see Qwen2.5-Coder-14B-Instruct, the 14B stands for 14 billion parameters.
The neuron analogy
Think of parameters like individual connection strengths in a brain. A human brain has roughly 100 trillion synapses. A 14B model has 14 billion learned numerical weights — each one encoding a tiny piece of knowledge about language, syntax, logic, or code.
More parameters = more capacity to store patterns = generally smarter output. But also: more memory needed, and more computation per token generated.
Parameters are set during training (which you don't do — that's already done by the researchers at Alibaba, Meta, Mistral, etc.). When you download a model, you're downloading those billions of numbers as a file. Running the model is just: feed it your prompt, multiply through all those weights, get a prediction for the next token, repeat.
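The "predict, append, repeat" loop described above can be sketched in a few lines. This is a toy illustration only: `TOY_WEIGHTS`, `predict_next`, and `generate` are made-up names, and a small lookup table stands in for the billions of weights a real model multiplies through. The shape of the loop is the point, not the model.

```python
# Toy stand-in for a real model: maps the previous token to scores for each
# candidate next token. A real LLM computes these scores from billions of
# weights, but the loop wrapped around it looks much the same.
TOY_WEIGHTS = {
    "def": {"add": 0.9, "(": 0.1},
    "add": {"(": 1.0},
    "(": {"a": 0.8, ")": 0.2},
    "a": {")": 1.0},
    ")": {"<eos>": 1.0},
}


def predict_next(tokens):
    """Return the highest-scoring next token given the tokens so far (greedy)."""
    scores = TOY_WEIGHTS.get(tokens[-1], {"<eos>": 1.0})
    return max(scores, key=scores.get)


def generate(prompt_tokens, max_new_tokens=10):
    """Feed the prompt in, then repeatedly predict and append one token."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = predict_next(tokens)
        if next_token == "<eos>":  # the model signals it is done
            break
        tokens.append(next_token)
    return tokens


print(generate(["def"]))  # ['def', 'add', '(', 'a', ')']
```

Each pass through the loop produces exactly one token, which is why generation speed is quoted in tokens per second.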
Size guide
| Size | Example | Capability | Fits on… |
|---|---|---|---|
| 1B–3B | Qwen2.5-Coder-1.5B | Simple autocomplete, single-function edits | Any modern laptop (2–3 GB) |
| 7B–9B | Llama-3.1-8B, Mistral-7B | Basic Q&A, short code generation | 8 GB RAM minimum |
| 14B | Qwen2.5-Coder-14B | Solid everyday coding, most tasks | 16 GB RAM comfortable |
| 32B | Qwen2.5-Coder-32B | Complex code, architecture reasoning | 32 GB RAM (your M5 Mac) |
| 70B+ | Llama-3.3-70B, Qwen2.5-72B | Near-frontier quality | 64–80 GB RAM (Mac Studio / server) |
🗜️ Quantization: 4-bit, 6-bit, 8-bit
Even a 14B model at full precision (16-bit, also called "float16" or "bf16") would need about 28 GB of RAM. That doesn't fit on most laptops. Quantization is how we compress it.
Every parameter in the model is a number. At full precision, each number uses 16 bits of storage. Quantization rounds those numbers to fewer bits — 8, 6, 4, or even 3.
Think of it like image compression. A lossless PNG and a compressed JPEG of the same photo look almost identical to the eye, but the JPEG can be around 10× smaller. You lose a tiny bit of quality you can barely notice. Quantizing a model to 4-bit is similar: you're making it dramatically smaller and faster, with a small quality hit that often doesn't matter in practice.
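To make the rounding concrete, here is a minimal sketch of symmetric round-to-nearest quantization on four example weights. It is a simplification under stated assumptions: the `quantize` and `dequantize` helpers are illustrative names, and real formats (including the MLX ones) typically store a separate scale per small group of weights rather than one scale for the whole list.

```python
def quantize(weights, bits=4):
    """Round each weight to a signed integer that fits in `bits` bits."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax or 1.0  # avoid dividing by zero
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate weights from the small integers."""
    return [q * scale for q in quantized]


weights = [0.70, -0.33, 0.12, 0.02]
q, scale = quantize(weights, bits=4)
restored = dequantize(q, scale)

print(q)  # [7, -3, 1, 0]
for original, approx in zip(weights, restored):
    print(f"{original:+.2f} -> {approx:+.2f}")
```

The integers are what gets stored (4 bits each instead of 16), and the small differences between the original and restored values are the "quality hit" the analogy refers to.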
The speed vs. quality tradeoff
Benchmarks below are for Qwen2.5-Coder-32B on the M5 MacBook Pro with 32 GB RAM.
Quick reference
| Format | VRAM (32B model) | Speed on M5 | Quality | Best for |
|---|---|---|---|---|
| 4-bit | ~18 GB | ~19 tok/s | Good | Everyday subagent tasks |
| 6-bit | ~25 GB | ~16 tok/s | Very good | Quality + fits in 32 GB ← our pick |
| 8-bit | ~36 GB | ~12 tok/s | Excellent | Doesn't fit on 32 GB Mac |
| 16-bit | ~64 GB | ~6 tok/s | Full precision | Research / high-end hardware only |
🏗️ Dense models vs. Mixture of Experts (MoE)
This is one of the most confusing parts of reading model names, and it matters a lot for performance. Two models can both say "35B parameters" but behave completely differently.
Dense models — all hands on deck
In a traditional (dense) model, every single parameter is used for every single token you generate. If the model is 14B parameters, all 14 billion of those weights participate in computing each output token.
Input token: "def"
Layer 1: [████████████████████████████████] 14B params active
Layer 2: [████████████████████████████████] 14B params active
Layer 3: [████████████████████████████████] 14B params active
...
Layer 40: [████████████████████████████████] 14B params active
Output: next token prediction
Dense models are predictable: their memory footprint ≈ parameters × bits-per-weight ÷ 8 (in bytes). Their speed scales inversely with parameter count: more parameters means slower, but more capable.
Mixture of Experts (MoE) — specialists, not generalists
MoE models are architecturally different. Instead of one big network, they contain many smaller "expert" sub-networks. A routing layer looks at each incoming token and decides: which experts are relevant here? Only a fraction of the total experts activate for any given token.
Input token: "def"
Router decides which experts to use...
Expert 1 (Python): [████████] active
Expert 2 (Math): [········] skipped
Expert 3 (Syntax): [████████] active
Expert 4 (Logic): [········] skipped
Expert 5 (Prose): [········] skipped
Expert 6 (Debug): [████████] active
... (many more experts, most skipped)
Only ~3B params activated, despite 35B total
Output: next token prediction
This is why you see labels like 35B-A3B: the model has 35 billion total parameters, but only activates 3 billion per token. This makes it dramatically faster than a dense 14B model while having the knowledge capacity of a 35B model.
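The routing step can be sketched as top-k selection over router scores. This is a hedged illustration, not how any particular Qwen model is implemented: `route`, `moe_layer`, the six toy experts, and the hard-coded `router_scores` are all assumptions made for the example. In a real model the router computes the scores from the token's hidden state, and each expert is a neural sub-network rather than a one-line function.

```python
import math


def softmax(scores):
    """Turn raw router scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route(router_scores, k=2):
    """Pick the k highest-scoring experts and renormalise their weights."""
    probs = softmax(router_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}


def moe_layer(x, experts, router_scores, k=2):
    """Run only the chosen experts and blend their outputs."""
    chosen = route(router_scores, k)
    # Experts that were not chosen are never evaluated for this token,
    # which is where the speed advantage comes from.
    return sum(weight * experts[i](x) for i, weight in chosen.items())


# Six toy "experts": each just scales its input by a different factor.
experts = [lambda x, s=s: x * s for s in (1.0, 2.0, 3.0, 4.0, 5.0, 6.0)]
router_scores = [2.0, -1.0, 1.5, -0.5, -2.0, 0.5]  # hypothetical scores for one token

print(sorted(route(router_scores, k=2)))  # [0, 2]
print(moe_layer(1.0, experts, router_scores, k=2))
```

Note that all six experts still have to exist in memory even though only two run, which matches the memory row in the comparison table.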
When to use which
| | Dense | MoE |
|---|---|---|
| Speed | Proportional to total params | Proportional to active params (much faster) |
| Memory | Proportional to total params | Proportional to total params (must fit all experts) |
| Quality | Consistent and reliable | High ceiling, but routing can miss on unusual inputs |
| Thinking mode | Usually optional or absent | Qwen3 MoE always thinks — strips output, wastes tokens |
| Best for | Coding subagents, reliable output | Complex reasoning when you have RAM to spare |
Qwen3 MoE models (e.g. Qwen3.6-35B-A3B) have thinking mode permanently baked in. They generate hundreds of tokens of internal reasoning before answering. Our oMLX backend strips the <think> XML tags but leaves the thinking text. We handle this in the MCP server, but it costs tokens and time. For coding subagent use, we default to the Qwen2.5-Coder dense family instead.
💾 Memory requirements
You can estimate how much RAM a model needs with a simple formula:
RAM (GB) ≈ (parameters in billions) × (bits per weight) ÷ 8 + ~2 GB overhead
Examples:
Qwen2.5-Coder-14B at 4-bit: 14 × 4 ÷ 8 = 7 GB + 2 = ~9 GB
Qwen2.5-Coder-32B at 4-bit: 32 × 4 ÷ 8 = 16 GB + 2 = ~18 GB
Qwen2.5-Coder-32B at 6-bit: 32 × 6 ÷ 8 = 24 GB + 2 = ~26 GB
Qwen2.5-Coder-32B at 8-bit: 32 × 8 ÷ 8 = 32 GB + 2 = ~34 GB ← won't fit in 32 GB
Qwen2.5-72B at 4-bit: 72 × 4 ÷ 8 = 36 GB + 2 = ~38 GB ← won't fit in 32 GB
For MoE models, the memory requirement is based on total parameters (not active), because all the experts' weights need to be resident in memory even if only a few are active per token.
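The formula above is easy to wrap in a helper for checking whether a model will fit before downloading it. This is a rough sketch: `estimate_ram_gb` and `fits` are illustrative names, the 2 GB overhead is the same approximation used in the formula, and in practice the operating system and other apps also need some of the machine's RAM.

```python
def estimate_ram_gb(params_billions, bits_per_weight, overhead_gb=2.0):
    """RAM (GB) ≈ parameters (billions) × bits per weight ÷ 8 + overhead."""
    return params_billions * bits_per_weight / 8 + overhead_gb


def fits(params_billions, bits_per_weight, machine_ram_gb):
    """True if the estimate is within the machine's total RAM."""
    return estimate_ram_gb(params_billions, bits_per_weight) <= machine_ram_gb


candidates = [
    ("Qwen2.5-Coder-14B", 14, 4),
    ("Qwen2.5-Coder-32B", 32, 4),
    ("Qwen2.5-Coder-32B", 32, 6),
    ("Qwen2.5-Coder-32B", 32, 8),
    ("35B-A3B MoE (total params)", 35, 4),  # MoE: use total, not active, params
]

for name, params, bits in candidates:
    need = estimate_ram_gb(params, bits)
    verdict = "fits" if fits(params, bits, machine_ram_gb=32) else "won't fit"
    print(f"{name} at {bits}-bit: ~{need:.1f} GB ({verdict} in 32 GB)")
```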
⚡ Speed: what "tok/s" means and why it matters
tok/s means "tokens per second" — how fast the model generates output. A token is roughly 0.75 words, so 20 tok/s ≈ 15 words per second, which feels instant. 5 tok/s ≈ 4 words per second, which starts to feel slow for interactive use.
| Speed | Feel | Good for |
|---|---|---|
| 30+ tok/s | Instant — feels like autocomplete | Interactive, short responses |
| 18–30 tok/s | Fast — comfortable for reading | Code generation, our target range |
| 8–17 tok/s | Moderate — noticeable lag on long output | Batch tasks, not interactive |
| <8 tok/s | Slow — 200 tokens takes 25+ seconds | Background jobs only |
Speed is primarily determined by memory bandwidth — how fast the CPU/GPU can read the model weights from RAM. This is why Apple Silicon is exceptional for local inference: unified memory means the GPU and CPU share the same RAM pool with extremely high bandwidth.
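The words-per-second and wait-time figures in this section come from simple arithmetic, sketched below. The helper names are illustrative, and 0.75 words per token is the same rough rule of thumb used in the text; the real ratio varies with the tokenizer and with how much of the output is code.

```python
WORDS_PER_TOKEN = 0.75  # rough rule of thumb, not an exact constant


def words_per_second(tok_per_s):
    """Convert generation speed from tokens/s to approximate words/s."""
    return tok_per_s * WORDS_PER_TOKEN


def seconds_to_generate(n_tokens, tok_per_s):
    """How long a response of n_tokens takes at a given speed."""
    return n_tokens / tok_per_s


print(words_per_second(20))          # 15.0
print(seconds_to_generate(200, 8))   # 25.0
print(seconds_to_generate(200, 19))  # about 10.5 seconds
```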
🍎 Why Apple Silicon is great for this
Most machines have separate CPU RAM and GPU VRAM. A GPU with 16 GB VRAM can only load a model that fits in those 16 GB — even if your machine has 64 GB of system RAM, the GPU can't use it.
Apple Silicon (M1, M2, M3, M4, M5) uses unified memory architecture (UMA): the CPU, GPU, and Neural Engine all share the same physical RAM pool. This means:
- A MacBook Pro with 32 GB RAM can load a model that uses all 32 GB
- The GPU has full bandwidth to that entire pool
- No expensive PCIe bus transfer between CPU and GPU — it's all on one chip
| Chip | Memory Bandwidth | 32B 4-bit speed | Notes |
|---|---|---|---|
| M1 Pro | 200 GB/s | ~10 tok/s | First gen UMA |
| M2 Pro | 200 GB/s | ~12 tok/s | Modest improvement |
| M3 Pro | 150 GB/s | ~10 tok/s | Lower bandwidth than M2 Pro |
| M4 Pro | 273 GB/s | ~16 tok/s | Big jump |
| M5 (base) | 153 GB/s | ~19 tok/s | Our machine |
| M5 Pro / Max | 273–500 GB/s | ~25–40 tok/s | Team hardware future |
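Bandwidth alone does not explain every row, though. The base M5's 153 GB/s is roughly 2× the base M1's 68 GB/s but lower than the M1 Pro's 200 GB/s, yet the same 32B model that ran at ~10 tok/s on the M1 Pro runs at ~19 tok/s on your machine. Newer chips presumably also benefit from faster GPU compute, so treat the tok/s column as approximate rather than something you can derive from bandwidth.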
🗂️ Models we use
We use the Qwen2.5-Coder family, built by Alibaba's Qwen team. It's specifically trained on code — not just general text — making it significantly better at our use cases than general-purpose models of the same size.
Switching models
You can switch models at runtime — no Claude Code restart needed:
- Interactive: Type /switch-model in any Claude Code session. You'll get a numbered menu with descriptions and tok/s estimates.
- Direct: Call the set_model MCP tool with a name or fragment (e.g. set_model("32b")).
- The choice persists across restarts via ~/.config/mlx-mcp/active_model.
🧭 Decision guide
Use this when picking which model to load for a task:
When to use which model
- Quick autocomplete / single-function edits → 14B 4-bit (fastest)
- Everyday coding tasks → 14B 8-bit (balanced)
- Complex refactoring, architecture review → 32B 4-bit or 6-bit
- Best possible quality, no speed pressure → 32B 6-bit
- Multiple models loaded → oMLX can only load one at a time; switch via /switch-model
- Model isn't responding / connection drops → it's probably loading a new model; retry in 10–20s
- Getting garbage output → run quick_test hello to sanity-check; may need to switch models
- Weird "Thinking Process:" text in output → you're on a Qwen3 model; switch to Qwen2.5-Coder
📖 Glossary
… chat, set_model, and quick_test.
… mlx-community/<model-name>.