Choose an LLM provider
Cantrip supports seven LLM providers. This guide helps you pick the right one for your situation.
Provider comparison
| Provider | API key needed | Best for | Cost |
|---|---|---|---|
| Inference snap | No | Air-gapped environments, privacy, Canonical-native stack | Free (local GPU) |
| Gemini (default) | Yes (GEMINI_API_KEY) |
General use, generous free tier | Free tier available |
| Claude | Yes (ANTHROPIC_API_KEY) |
Best output quality, complex charms | Paid |
| Fireworks.ai | Yes (FIREWORKS_API_KEY) |
Open-weights models (Kimi, GLM, DeepSeek) with tool use | Paid |
| OpenRouter | Yes (OPENROUTER_API_KEY) |
Meta-gateway to GPT, Claude, Llama, Grok, Mistral, … through one key | Paid |
| OpenCode Zen | Yes (OPENCODE_ZEN_API_KEY) |
OpenCode's curated gateway to Claude, GPT-5, Gemini 3, GLM, Kimi, Qwen behind one key | Paid (free tier) |
| OpenAI-compatible | Yes (OPENAI_COMPATIBLE_API_KEY) |
Any other OpenAI-compatible endpoint (Together, Groq, vLLM, …) | Depends |
Set up environment variables
Each cloud provider needs its API key in an environment variable. The
sections below show the one-line export for each provider; pick one
and you are set.
For a key to survive new shells, add the export to your shell profile
(~/.bashrc, ~/.zshrc, ~/.config/fish/config.fish):
$ echo 'export GEMINI_API_KEY="your-key-here"' >> ~/.bashrc
$ source ~/.bashrc
A one-shot export in the current shell is enough for testing — it
just disappears when the terminal closes.
This page is the setup walk-through for provider keys and the embed / rerank role keys. Operational tunables — memory directory overrides, MCP token storage, snapshot toggles, the self-update opt-out, and the rest — live in the CLI reference. Reach for that page when you want a single table of every variable Cantrip reads.
Use local inference snaps
Ubuntu inference snaps are the Canonical-native path: a production-grade OpenAI-compatible server, installed from the Snap Store, running models locally on your GPU with no API key or internet connection required. Install the snap first:
$ sudo snap install gemma3
$ cantrip --provider inference-snap --snap gemma3
Other supported snaps include gemma4 (Gemma 3n E4B, multimodal),
nemotron-3-nano for lighter workloads, qwen3-coder for
code-focused work with native tool calling, and qwen-vl for
vision tasks. The quality of output depends on your GPU and the
model size.
Local models produce lower-quality output than cloud APIs, particularly for complex charm paths. Consider using a cloud provider for the primary model and a local model for light tasks.
Cantrip clamps the conversation temperature to 0.2 for the inference-snap provider — the local quantised models intermittently break out of the OpenAI tool-call envelope at the frontier-default 0.7 and emit raw chat-template scaffolding inside the assistant content, which the conversation loop then mistakes for a final reply. The clamp is per-provider; cloud APIs still run at 0.7.
The provider also disables chain-of-thought on the snap. Thinking
models served through llama.cpp's --jinja (the Qwen3
family especially) route their reasoning into
reasoning_content, and on a tight per-slot context the
prompt alone can leave too little room for both a
<think> block and an answer — so the
turn comes back empty (no content, no tool calls) and
the agent loop stalls. Cantrip sends
chat_template_kwargs: {enable_thinking: false} on every
inference-snap request; chat templates that don't recognise the kwarg
(gemma3, deepseek-r1, …) ignore it, so it's harmless. If you want a
snap to think, that's not currently configurable on this provider —
use a cloud provider with thinking_budget support
instead.
The provider also auto-detects the runtime per-slot context size
from llama.cpp's /slots and /props
endpoints. The trained context (often 128 K or 256 K) is usually
bigger than the per-slot budget the snap actually serves
(typically 8 K – 32 K depending on --ctx-size and
--parallel); without the runtime probe Cantrip would
treat the model as having far more headroom than it does and skip
compaction entirely. Run <snap-name> status if
you suspect the wrong context size — Cantrip logs the
runtime/trained mismatch at INFO when it downgrades.
Tune the snap HTTP read timeout
Slow local snaps (qwen3-coder on a partial-offload setup, large quantised models on smaller GPUs) can take 8–15 minutes to finish a single big-file rewrite once the conversation is several KB long. Cantrip ships a 1200 s (20 min) read timeout by default — long enough for any plausible single-turn generation on the slowest local snap, short enough that a genuinely stuck server doesn't hang the conversation forever.
Override the timeout on faster GPUs to fail-fast instead:
$ cantrip --provider inference-snap --snap qwen3-coder --snap-read-timeout 300
$ CANTRIP_SNAP_READ_TIMEOUT=600 cantrip --provider inference-snap --snap gemma4
The CLI flag wins over the environment variable, which wins over the 1200 s default. A non-numeric or non-positive value logs a warning and falls back to the default rather than crashing.
When the snap drops mid-stream (the qwen3-coder snap occasionally hangs
up at long generations), Cantrip surfaces the recovery as a
[provider reconnect] system message in the chat and
retries with a short exponential backoff (~2/4/8 s) before giving up.
The conversation loop stays alive across the retry — no need to
re-launch.
Short-session mode for tight-context snaps
Some snaps run with a small per-slot context window — gemma4 (Gemma 3n
E4B) gives roughly 10 K tokens per slot, and the system prompt plus
tool schemas already fill a third of that before a conversation starts.
Cantrip detects this at startup: when the usable window is below
~16 K it switches into short-session mode, which compacts at
50 % of the window instead of 80 %, replaces the prose-summary
compaction with a one-line-per-tool-call history ledger (dropping the
raw older messages rather than keeping them around), trims the toolset to
just what the current phase needs, and treats each turn as a near-fresh
conversation. The status bar shows a [short-session] chip
while it is active, and /cost reports the compaction
strategy. The trade-off is real — the agent loses some cross-edit memory,
so a debugging loop that spans several files will be weaker than it would
be on a roomier model — but it lets a 10 K model actually finish a
multi-edit charm without erroring on exceed_context_size.
Force the mode with --short-session=on|off (or CANTRIP_SHORT_SESSION)
to opt a borderline ~16–32 K provider in or out:
$ cantrip --provider inference-snap --snap qwen3-coder --short-session on
Use Gemini (default)
Gemini is the default cloud provider. Get an API key from Google AI Studio, then:
$ export GEMINI_API_KEY="your-key-here"
$ cantrip
To use a specific Gemini model:
$ cantrip --model gemini-2.5-pro
Use Claude
Claude often produces higher-quality charm code, especially for complex infrastructure charms (Path C). Get a key from the Anthropic console:
$ export ANTHROPIC_API_KEY="your-key-here"
$ cantrip --provider claude
To specify a model:
$ cantrip --provider claude --model claude-sonnet-4-6
Use Fireworks.ai
Fireworks hosts strong open-weights models — Kimi K2 (agentic/coding), GLM, MiniMax, DeepSeek — behind an OpenAI-compatible API. Get a key from your Fireworks account, then:
$ export FIREWORKS_API_KEY="your-key-here"
$ cantrip --provider fireworks
The default model is
accounts/fireworks/models/kimi-k2p6, a 256k-context
agentic model with native tool-use. Override with --model:
$ cantrip --provider fireworks \
--model accounts/fireworks/models/glm-5p1
Cantrip auto-detects the selected model's context window and
capability flags (tool use, vision) from the Fireworks
/models endpoint at startup.
Kimi K2 (and the DeepSeek-R1 and GLM reasoning variants)
emits reasoning_content alongside the final
reply, and the reasoning tokens count against
max_tokens. A low cap will be consumed
entirely by reasoning and leave nothing for the answer — a
prompt with max_tokens=30 can come back with 30
completion tokens and an empty response string.
Cantrip surfaces the reasoning through the same
_thinking_content metadata channel Claude uses
for extended thinking, and honours
thinking_budget on Fireworks by raising
max_tokens to at least
thinking_budget + 4096. For manual
testing, set max_tokens to at least 4 096
for simple prompts and 16 000+ for tool-using turns.
Use OpenRouter
OpenRouter is a meta-gateway to hundreds of models — OpenAI GPT, Anthropic Claude, Meta Llama, Mistral, Grok, DeepSeek — behind a single OpenAI-compatible API and a single key. Useful when you want a model Cantrip doesn't ship a dedicated provider for, or when you want to A/B the same prompt across vendors.
Get a key from your OpenRouter keys page, then:
$ export OPENROUTER_API_KEY="your-key-here"
$ cantrip --provider openrouter
The default model is openai/gpt-4o —
a long-lived choice that sits outside the coverage of Cantrip's
other providers. Override with any OpenRouter slug:
$ cantrip --provider openrouter \
--model meta-llama/llama-3.3-70b-instruct
$ cantrip --provider openrouter \
--model x-ai/grok-4-fast
Cantrip auto-detects the selected model's context window, tool
support, and vision support from the OpenRouter
/models catalogue at startup, and sends
HTTP-Referer and X-Title
headers so Cantrip usage shows up on OpenRouter's public
ranking dashboards.
Prefer a dedicated provider when one exists — OpenRouter
adds a routing hop (and a small markup) on top of the
upstream vendor. Use claude for Anthropic,
gemini for Google, and fireworks
for open-weights models that Fireworks hosts directly.
OpenRouter is the right call when the model you want is not
on any of those.
Use OpenCode Zen
OpenCode Zen is a curated model gateway run by the OpenCode project. It exposes Anthropic Claude (Opus, Sonnet, Haiku), OpenAI GPT-5 family, Gemini 3 family, and a handful of strong open-weights models (GLM, Kimi, Qwen, MiniMax) behind a single OpenAI-compatible API and a single key, with a free tier for the lighter models.
Get a key from the OpenCode Zen page, then:
$ export OPENCODE_ZEN_API_KEY="your-key-here"
$ cantrip --provider opencode-zen
The default model is claude-haiku-4-5 — fast, cheap, with native
tool use. Override with any Zen slug (no vendor prefix):
$ cantrip --provider opencode-zen --model gpt-5.5
$ cantrip --provider opencode-zen --model gemini-3.1-pro
$ cantrip --provider opencode-zen --model kimi-k2.6
The legacy ZEN_API_KEY environment variable is also accepted as a
fallback when OPENCODE_ZEN_API_KEY is unset.
Like OpenRouter, Zen adds a routing hop on top of the upstream
vendor. Prefer a dedicated provider when one exists —
claude for Anthropic, gemini for
Google, fireworks for the open-weights models
Fireworks hosts directly. Reach for opencode-zen
when you want OpenCode's curation, its free tier, or a model
only available there.
Use any OpenAI-compatible endpoint
For any backend that speaks the OpenAI chat-completions wire format —
Together, Groq, DeepInfra, LiteLLM proxies, self-hosted vLLM — use
the openai-compatible provider. You must supply
the base URL and model ID explicitly:
$ export OPENAI_COMPATIBLE_API_KEY="your-key-here"
$ cantrip --provider openai-compatible \
--base-url https://api.together.xyz/v1 \
--model meta-llama/Llama-3.3-70B-Instruct-Turbo
For endpoints that don't require authentication (e.g. a local
vLLM instance on your network), set
OPENAI_COMPATIBLE_API_KEY to any non-empty
string.
Prefer a dedicated provider when one exists:
inference-snap for Canonical's local servers,
fireworks for Fireworks.ai. Dedicated providers
carry sensible defaults and model-catalogue probing; the
generic provider requires you to supply everything by hand.
Configure embed and rerank
Retrieval features — the planned @docs index and memory recall
— need an embedding provider, and rerank quality benefits
from a dedicated rerank provider. Cantrip routes these via a
separate RoleRouter so you pick them independently of the chat
provider.
The Anthropic-ecosystem recommendation is Voyage:
$ export VOYAGE_API_KEY=...
$ cantrip --provider claude \
--embed-provider voyage --embed-model voyage-3 \
--rerank-provider voyage --rerank-model rerank-2
OpenAI users can pair their embed endpoint with Voyage rerank (OpenAI does not ship a rerank API):
$ export OPENAI_API_KEY=... VOYAGE_API_KEY=...
$ cantrip --provider claude \
--embed-provider openai --embed-model text-embedding-3-large \
--rerank-provider voyage
Equivalent environment variables for stable shells:
CANTRIP_EMBED_PROVIDER, CANTRIP_EMBED_MODEL,
CANTRIP_RERANK_PROVIDER, CANTRIP_RERANK_MODEL. The CLI flag
wins when both are present.
Local embed servers (Ollama, vLLM, llama.cpp)
Anything that exposes the OpenAI /v1/embeddings wire format can
serve as the embed provider. Set OPENAI_EMBED_BASE_URL to the
endpoint — the API-key requirement is automatically relaxed
when this override is present, since most local servers do not
authenticate.
$ ollama pull nomic-embed-text
$ export OPENAI_EMBED_BASE_URL="http://localhost:11434/v1"
$ cantrip --provider claude \
--embed-provider openai --embed-model nomic-embed-text \
--rerank-provider voyage
Tested shapes:
- Ollama —
http://localhost:11434/v1, model name matches the pulled tag (e.g.nomic-embed-text,mxbai-embed-large,bge-m3). - vLLM —
http://localhost:8000/v1when launched withvllm serve <embed-model> --task embed. - llama.cpp
llama-server—http://localhost:8080/v1when launched with--embedding --pooling mean. - Canonical inference snaps — the chat snaps (gemma3,
gemma4, deepseek-r1, etc.) do not expose
/v1/embeddings; an embed-only inference snap is in development.
If the local server does require authentication, set
OPENAI_API_KEY alongside OPENAI_EMBED_BASE_URL and the bearer
token will be forwarded.
Costs surface in /cost under a separate By role section once
an embed or rerank call has fired, so retrieval spend is visible
without merging into chat. Pricing entries cover voyage-3,
voyage-3-lite, voyage-3-large, voyage-code-3, rerank-2,
rerank-2-lite, text-embedding-3-small, and text-embedding-3-large;
unknown models render as free.
An offline sentence-transformers fallback is on the roadmap but not
yet shipped — sessions without a configured embed provider
raise RoleNotConfigured from the first retrieval call rather than
silently degrading.
Hybrid setups
You can combine providers — use a powerful cloud model for code generation and a local model for internal tasks like research summarisation and log queries:
$ cantrip --provider claude \
--light-provider inference-snap --light-snap nemotron-3-nano
--light-provider accepts gemini,
claude, inference-snap, fireworks,
openrouter, or opencode-zen.
See Configure light models for full
details on cost routing.