Team Knowledge Base · Draft

Local LLMs: What Your Team Needs to Know

A plain-English guide to running AI models on your own hardware — no cloud required, no data leaving your machine. Written for engineers who haven't dug into this before.

🧠 What is a local LLM?

An LLM (Large Language Model) is the AI engine that powers tools like ChatGPT and Claude. Normally these run on massive server clusters owned by companies like OpenAI or Anthropic — you send your prompt over the internet, their servers think about it, and send back a response.

A local LLM runs that same kind of model directly on your laptop or workstation. No internet required. Your prompts never leave the machine.

Why does this matter for us?
Three reasons: privacy (code never leaves the machine), cost (no per-token API bill once the hardware is there), and latency (no network hop — the model responds in milliseconds, not seconds).

The tradeoff is that local models are generally smaller than cloud models — you're limited by how much RAM your machine has. A laptop that can comfortably run a 32-billion-parameter model is excellent; GPT-4 is rumored to be around 1.7 trillion parameters. So we're working with capable but not frontier-sized models.

That said, for everyday coding tasks — reviewing diffs, generating boilerplate, explaining error messages, writing tests — a well-chosen local model is genuinely useful. That's the point of this setup.

📐 Model sizes: what "7B" and "32B" mean

When you see Qwen2.5-Coder-14B-Instruct, the 14B stands for 14 billion parameters.

The neuron analogy

Analogy

Think of parameters like individual connection strengths in a brain. A human brain has roughly 100 trillion synapses. A 14B model has 14 billion learned numerical weights — each one encoding a tiny piece of knowledge about language, syntax, logic, or code.

More parameters = more capacity to store patterns = generally smarter output. But also: more memory needed, and more computation per token generated.

Parameters are set during training (which you don't do — that's already done by the researchers at Alibaba, Meta, Mistral, etc.). When you download a model, you're downloading those billions of numbers as a file. Running the model is just: feed it your prompt, multiply through all those weights, get a prediction for the next token, repeat.

Size guide

Size Example Capability Fits on…
1B–3B Qwen2.5-Coder-1.5B Simple autocomplete, single-function edits Any modern laptop (2–3 GB)
7B–9B Llama-3.2-8B, Mistral-7B Basic Q&A, short code generation 8 GB RAM minimum
14B Qwen2.5-Coder-14B Solid everyday coding, most tasks 16 GB RAM comfortable
32B Qwen2.5-Coder-32B Complex code, architecture reasoning 32 GB RAM (your M5 Mac)
70B+ Llama-3.3-70B, Qwen2.5-72B Near-frontier quality 64–80 GB RAM (Mac Studio / server)
Bigger ≠ always better for your use case
A 14B model that answers in 1 second is more useful as a coding subagent than a 70B model that takes 10 seconds. Match the model to the task — use small/fast for quick lookups, large/slow for deep reasoning.

🗜️ Quantization: 4-bit, 6-bit, 8-bit

Even a 14B model at full precision (16-bit, also called "float16" or "bf16") would need about 28 GB of RAM. That doesn't fit on most laptops. Quantization is how we compress it.

Every parameter in the model is a number. At full precision, each number uses 16 bits of storage. Quantization rounds those numbers to fewer bits — 8, 6, 4, or even 3.

Analogy

Think of it like image compression. A raw photo (PNG) and a compressed JPEG look almost identical to the eye, but the JPEG is 10× smaller. You lose a tiny bit of quality you can barely notice. Quantizing a model to 4-bit is similar — you're making it dramatically smaller and faster, with a small quality hit that often doesn't matter in practice.

The speed vs. quality tradeoff

4-bit speed
~19 tok/s
4-bit quality
Good

6-bit speed
~16 tok/s
6-bit quality
Very good

8-bit speed
~12 tok/s
8-bit quality
Excellent

16-bit speed
~6 tok/s
16-bit quality
Full precision

Benchmarks above are for Qwen2.5-Coder-32B on the M5 MacBook Pro with 32 GB RAM.

Quick reference

Format VRAM (32B model) Speed on M5 Quality Best for
4-bit ~18 GB ~19 tok/s Good Everyday subagent tasks
6-bit ~25 GB ~16 tok/s Very good Quality + fits in 32 GB ← our pick
8-bit ~36 GB ~12 tok/s Excellent Doesn't fit on 32 GB Mac
16-bit ~64 GB ~6 tok/s Full precision Research / high-end hardware only
Rule of thumb
For a team running models on 32 GB laptops: 4-bit for speed, 6-bit for quality. Both fit comfortably. 8-bit and above require more RAM than the machine has.

🏗️ Dense models vs. Mixture of Experts (MoE)

This is one of the most confusing parts of reading model names, and it matters a lot for performance. Two models can both say "35B parameters" but behave completely differently.

Dense models — all hands on deck

In a traditional (dense) model, every single parameter is used for every single token you generate. If the model is 14B parameters, all 14 billion of those weights participate in computing the output for every word.

Dense model (14B)

Input token: "def"

Layer 1: [████████████████████████████████] 14B params active
Layer 2: [████████████████████████████████] 14B params active
Layer 3: [████████████████████████████████] 14B params active
...
Layer 40: [████████████████████████████████] 14B params active

Output: next token prediction

Dense models are predictable: their memory footprint = parameters × bits-per-weight. Their speed scales directly with parameter count. More parameters = slower, but more capable.

Mixture of Experts (MoE) — specialists, not generalists

MoE models are architecturally different. Instead of one big network, they contain many smaller "expert" sub-networks. A routing layer looks at each incoming token and decides: which experts are relevant here? Only a fraction of the total experts activate for any given token.

MoE model (35B total, 3B active — "A3B")

Input token: "def"

Router decides which experts to use...

Expert 1 (Python): [████████] active
Expert 2 (Math): [········] skipped
Expert 3 (Syntax): [████████] active
Expert 4 (Logic): [········] skipped
Expert 5 (Prose): [········] skipped
Expert 6 (Debug): [████████] active
... (many more experts, most skipped)

Only ~3B params activated, despite 35B total

Output: next token prediction

This is why you see labels like 35B-A3B: the model has 35 billion total parameters, but only activates 3 billion per token. This makes it dramatically faster than a dense 14B model while having the knowledge capacity of a 35B model.

This is why the Qwen3-35B ran at 48 tok/s
A dense 35B model would run at maybe 6 tok/s on this hardware. The MoE version activates only 3B parameters per token — so it runs at the speed of a ~3B model with the quality ceiling of a 35B. That's the magic of MoE architecture.

When to use which

Dense MoE
Speed Proportional to total params Proportional to active params (much faster)
Memory Proportional to total params Proportional to total params (must fit all experts)
Quality Consistent and reliable High ceiling, but routing can miss on unusual inputs
Thinking mode Usually optional or absent Qwen3 MoE always thinks — strips output, wastes tokens
Best for Coding subagents, reliable output Complex reasoning when you have RAM to spare
MoE gotcha on our stack
The Qwen3 MoE models (like Qwen3.6-35B-A3B) have thinking mode permanently baked in. They generate hundreds of tokens of internal reasoning before answering. Our oMLX backend strips the <think> XML tags — but leaves the thinking text. We handle this in the MCP server, but it costs tokens and time. For coding subagent use, we default to the Qwen2.5-Coder dense family instead.

💾 Memory requirements

You can estimate how much RAM a model needs with a simple formula:

Memory estimate formula

RAM (GB) ≈ (parameters in billions) × (bits per weight) ÷ 8 + ~2 GB overhead

Examples:
Qwen2.5-Coder-14B at 4-bit: 14 × 4 ÷ 8 = 7 GB + 2 = ~9 GB
Qwen2.5-Coder-32B at 4-bit: 32 × 4 ÷ 8 = 16 GB + 2 = ~18 GB
Qwen2.5-Coder-32B at 6-bit: 32 × 6 ÷ 8 = 24 GB + 2 = ~26 GB
Qwen2.5-Coder-32B at 8-bit: 32 × 8 ÷ 8 = 32 GB + 2 = ~34 GB ← won't fit in 32 GB
Qwen2.5-Coder-72B at 4-bit: 72 × 4 ÷ 8 = 36 GB + 2 = ~38 GB ← won't fit in 32 GB
Leave headroom for the OS
macOS takes 4–6 GB of RAM on its own, plus whatever other apps are open. On a 32 GB machine, keep models under ~26 GB so the system stays responsive.

For MoE models, the memory requirement is based on total parameters (not active), because all the experts' weights need to be resident in memory even if only a few are active per token.

Speed: what "tok/s" means and why it matters

tok/s means "tokens per second" — how fast the model generates output. A token is roughly 0.75 words, so 20 tok/s ≈ 15 words per second, which feels instant. 5 tok/s ≈ 4 words per second, which starts to feel slow for interactive use.

Speed Feel Good for
30+ tok/s Instant — feels like autocomplete Interactive, short responses
18–30 tok/s Fast — comfortable for reading Code generation, our target range
8–17 tok/s Moderate — noticeable lag on long output Batch tasks, not interactive
<8 tok/s Slow — 200 tokens takes 25+ seconds Background jobs only

Speed is primarily determined by memory bandwidth — how fast the CPU/GPU can read the model weights from RAM. This is why Apple Silicon is exceptional for local inference: unified memory means the GPU and CPU share the same RAM pool with extremely high bandwidth.

Why small models are sometimes faster than you'd expect
LLM inference is memory-bandwidth bound, not compute bound. A 7B model at 30 tok/s and a 32B model at 19 tok/s might use the same amount of raw GPU compute — the bottleneck is just reading the weights off the chip fast enough. This is why memory bandwidth specs matter more than TFLOPS for local inference.

🍎 Why Apple Silicon is great for this

Most machines have separate CPU RAM and GPU VRAM. A GPU with 16 GB VRAM can only load a model that fits in those 16 GB — even if your machine has 64 GB of system RAM, the GPU can't use it.

Apple Silicon (M1, M2, M3, M4, M5) uses unified memory architecture (UMA): the CPU, GPU, and Neural Engine all share the same physical RAM pool. This means:

  • A MacBook Pro with 32 GB RAM can load a model that uses all 32 GB
  • The GPU has full bandwidth to that entire pool
  • No expensive PCIe bus transfer between CPU and GPU — it's all on one chip
Chip Memory Bandwidth 32B 4-bit speed Notes
M1 Pro 200 GB/s ~10 tok/s First gen UMA
M2 Pro 200 GB/s ~12 tok/s Modest improvement
M3 Pro 150 GB/s ~10 tok/s Base M3 actually slower
M4 Pro 273 GB/s ~16 tok/s Big jump
M5 (base) 153 GB/s ~19 tok/s Our machine
M5 Pro / Max 273–500 GB/s ~25–40 tok/s Team hardware future

The M5 base chip's 153 GB/s is 2× faster than M1, explaining why the same 32B model that ran at ~10 tok/s on M1 runs at ~19 tok/s on your machine.

🗂️ Models we use

We use the Qwen2.5-Coder family, built by Alibaba's Qwen team. It's specifically trained on code — not just general text — making it significantly better at our use cases than general-purpose models of the same size.

Qwen2.5-Coder-14B-Instruct-4bit
Speed Tier
Dense, 14B, 4-bit quantized. Fast enough for interactive use. Good for quick lookups, simple completions, and high-frequency subagent calls.
~28 tok/s ~9 GB
🎯
Qwen2.5-Coder-14B-Instruct-8bit
Balanced Tier
Same architecture, higher-precision weights. Noticeably better code quality than 4-bit, still comfortable speed. A good middle ground.
~18 tok/s ~15 GB
🧠
Qwen2.5-Coder-32B-Instruct-4bit
Quality Tier
32B dense model at 4-bit. Substantially better at complex code, architecture reasoning, and multi-step problems. Fits in 32 GB RAM.
~19 tok/s ~18 GB
🏆
Qwen2.5-Coder-32B-Instruct-6bit
Best Quality (current default)
32B dense model at 6-bit. Higher weight precision than 4-bit means fewer hallucinations and more accurate multi-step code. Still fits on 32 GB.
~16 tok/s ~25 GB
Why not Qwen3 MoE?
The Qwen3.6-35B-A3B is genuinely fast (~48 tok/s) and capable — but its thinking mode generates hundreds of tokens of internal monologue we have to strip. It also reports the wrong model name in responses (a bug in our oMLX backend). Until those issues are solved, we stick with the Qwen2.5-Coder series for reliability.

Switching models

You can switch models at runtime — no Claude Code restart needed:

  • Interactive: Type /switch-model in any Claude Code session — you'll get a numbered menu with descriptions and tok/s estimates
  • Direct: Call the set_model MCP tool with a name or fragment (e.g. set_model("32b"))
  • The choice persists across restarts via ~/.config/mlx-mcp/active_model

🧭 Decision guide

Use this when picking which model to load for a task:

When to use which model

  • Quick autocomplete / single-function edits → 14B 4-bit (fastest)
  • Everyday coding tasks → 14B 8-bit (balanced)
  • Complex refactoring, architecture review → 32B 4-bit or 6-bit
  • Best possible quality, no speed pressure → 32B 6-bit
  • Multiple models loaded → oMLX can only load one at a time; switch via /switch-model
  • Model isn't responding / connection drops → it's probably loading a new model; retry in 10–20s
  • Getting garbage output → run quick_test hello to sanity-check; may need to switch models
  • Weird "Thinking Process:" text in output → you're on a Qwen3 model; switch to Qwen2.5-Coder
The speed rule of thumb
If a response is going to be under ~200 tokens (a short function, a one-liner explanation), nearly any model feels fast enough. If you need 500+ tokens (a full class, a long explanation), tok/s matters — use a faster model or accept the wait.

📖 Glossary

Parameters / Weights
The billions of numerical values that define a model's behavior. Set during training, frozen when you download the model. More = higher capacity.
Quantization
Compressing a model by storing each weight with fewer bits (4, 6, or 8 instead of 16). Reduces file size and RAM use at a small quality cost.
Dense model
A model where all parameters activate for every token. Predictable, reliable. Speed is proportional to total parameter count.
MoE (Mixture of Experts)
Architecture with many specialist sub-networks ("experts"). Only a fraction activate per token. Fast (only active params matter for speed) but requires all weights in RAM.
tok/s (tokens per second)
Generation speed. ~0.75 words per token. 20 tok/s ≈ 15 words/sec, which feels instant. Below 8 tok/s starts to feel slow for interactive use.
Unified Memory (UMA)
Apple Silicon's architecture where CPU and GPU share the same RAM pool. Allows running large models on laptops that would need a dedicated GPU on any other platform.
oMLX
Our local inference server. Runs on port 8000, exposes an OpenAI-compatible API. Manages model loading/unloading and serves requests from Claude Code via MCP.
MCP (Model Context Protocol)
Anthropic's protocol for giving Claude Code access to external tools. Our MCP server bridges Claude Code to oMLX, exposing tools like chat, set_model, and quick_test.
Thinking mode
A feature in Qwen3 models where the model generates an internal reasoning chain before answering. Improves quality on hard problems but costs tokens and time. Can't be fully disabled on Qwen3 via oMLX.
Context window
The maximum number of tokens a model can "see" at once (prompt + response combined). Qwen2.5-Coder-32B has a 128K context window — plenty for large code files.
Instruct model
A model fine-tuned to follow instructions and have conversations. Opposed to a "base" model, which just predicts the next token. We always use Instruct variants.
mlx-community
A Hugging Face organization that publishes MLX-optimized model files. When you download a model in oMLX, it typically comes from mlx-community/<model-name>.