Choose an LLM provider

Cantrip supports seven LLM providers. This guide helps you pick the right one for your situation.

Provider comparison

Provider | API key needed | Best for | Cost
Inference snap | No | Air-gapped environments, privacy, Canonical-native stack | Free (local GPU)
Gemini (default) | Yes (GEMINI_API_KEY) | General use, generous free tier | Free tier available
Claude | Yes (ANTHROPIC_API_KEY) | Best output quality, complex charms | Paid
Fireworks.ai | Yes (FIREWORKS_API_KEY) | Open-weights models (Kimi, GLM, DeepSeek) with tool use | Paid
OpenRouter | Yes (OPENROUTER_API_KEY) | Meta-gateway to GPT, Claude, Llama, Grok, Mistral, … through one key | Paid
OpenCode Zen | Yes (OPENCODE_ZEN_API_KEY) | OpenCode's curated gateway to Claude, GPT-5, Gemini 3, GLM, Kimi, Qwen behind one key | Paid (free tier)
OpenAI-compatible | Yes (OPENAI_COMPATIBLE_API_KEY) | Any other OpenAI-compatible endpoint (Together, Groq, vLLM, …) | Depends

Set up environment variables

Each cloud provider needs its API key in an environment variable. The sections below show the one-line export for each provider; pick one and you are set.

For a key to survive new shells, add the export to your shell profile (~/.bashrc, ~/.zshrc, ~/.config/fish/config.fish):

$ echo 'export GEMINI_API_KEY="your-key-here"' >> ~/.bashrc
$ source ~/.bashrc

A one-shot export in the current shell is enough for testing — it just disappears when the terminal closes.
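To confirm which provider keys a new shell actually exports, a small check like the following can help. It is only a sketch: check_keys is a hypothetical helper, not a Cantrip command, and it reports whether each variable from the comparison table is exported without printing its value.

```shell
# Hypothetical helper (not part of Cantrip): report which provider keys
# are exported in the current shell, without echoing the secrets.
check_keys() {
    for name in GEMINI_API_KEY ANTHROPIC_API_KEY FIREWORKS_API_KEY \
                OPENROUTER_API_KEY OPENCODE_ZEN_API_KEY OPENAI_COMPATIBLE_API_KEY; do
        # printenv only sees exported variables, which is what a child
        # process such as cantrip would inherit.
        if [ -n "$(printenv "$name")" ]; then
            echo "$name: set"
        else
            echo "$name: not set"
        fi
    done
}

check_keys
```

Run it in a fresh terminal after editing your shell profile; a key that shows as "not set" there was only exported in the previous session.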

This page is the setup walk-through for provider keys and the embed / rerank role keys. Operational tunables — memory directory overrides, MCP token storage, snapshot toggles, the self-update opt-out, and the rest — live in the CLI reference. Reach for that page when you want a single table of every variable Cantrip reads.

Use local inference snaps

Ubuntu inference snaps are the Canonical-native path: a production-grade OpenAI-compatible server, installed from the Snap Store, running models locally on your GPU with no API key or internet connection required. Install the snap first:

$ sudo snap install gemma3
$ cantrip --provider inference-snap --snap gemma3

Other supported snaps include gemma4 (Gemma 3n E4B, multimodal), nemotron-3-nano for lighter workloads, qwen3-coder for code-focused work with native tool calling, and qwen-vl for vision tasks. The quality of output depends on your GPU and the model size.

Local models produce lower-quality output than cloud APIs, particularly for complex charm paths. Consider using a cloud provider for the primary model and a local model for light tasks.

Cantrip clamps the conversation temperature to 0.2 for the inference-snap provider. At the frontier-default 0.7, the local quantised models intermittently break out of the OpenAI tool-call envelope and emit raw chat-template scaffolding inside the assistant content, which the conversation loop then mistakes for a final reply. The clamp is per-provider; cloud APIs still run at 0.7.

The provider also disables chain-of-thought on the snap. Thinking models served through llama.cpp's --jinja (the Qwen3 family especially) route their reasoning into reasoning_content, and on a tight per-slot context the prompt alone can leave too little room for both a <think> block and an answer — so the turn comes back empty (no content, no tool calls) and the agent loop stalls. Cantrip sends chat_template_kwargs: {enable_thinking: false} on every inference-snap request; chat templates that don't recognise the kwarg (gemma3, deepseek-r1, …) ignore it, so it's harmless. If you want a snap to think, that's not currently configurable on this provider — use a cloud provider with thinking_budget support instead.

The provider also auto-detects the runtime per-slot context size from llama.cpp's /slots and /props endpoints. The trained context (often 128 K or 256 K) is usually bigger than the per-slot budget the snap actually serves (typically 8 K – 32 K depending on --ctx-size and --parallel); without the runtime probe Cantrip would treat the model as having far more headroom than it does and skip compaction entirely. Run <snap-name> status if you suspect the wrong context size — Cantrip logs the runtime/trained mismatch at INFO when it downgrades.
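As a rough illustration of why the per-slot budget ends up so much smaller than the trained context, the sketch below assumes the server divides --ctx-size evenly across its --parallel slots. That even split is an assumption about the server's behaviour rather than something this guide specifies, and per_slot_ctx is a hypothetical helper.

```shell
# Assumption: the total context is split evenly across parallel slots.
# per_slot_ctx CTX_SIZE PARALLEL -> tokens available to one conversation.
per_slot_ctx() {
    ctx_size=$1
    parallel=$2
    echo $(( ctx_size / parallel ))
}

per_slot_ctx 32768 4    # prints 8192
per_slot_ctx 32768 1    # prints 32768
```

Under that assumption, a snap serving four slots from a 32 K context leaves each conversation about 8 K tokens, regardless of how large the model's trained context is.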

Tune the snap HTTP read timeout

Slow local snaps (qwen3-coder on a partial-offload setup, large quantised models on smaller GPUs) can take 8–15 minutes to finish a single big-file rewrite once the conversation is several KB long. Cantrip ships a 1200 s (20 min) read timeout by default — long enough for any plausible single-turn generation on the slowest local snap, short enough that a genuinely stuck server doesn't hang the conversation forever.

Override the timeout on faster GPUs to fail fast instead:

$ cantrip --provider inference-snap --snap qwen3-coder --snap-read-timeout 300
$ CANTRIP_SNAP_READ_TIMEOUT=600 cantrip --provider inference-snap --snap gemma4

The CLI flag wins over the environment variable, which wins over the 1200 s default. A non-numeric or non-positive value logs a warning and falls back to the default rather than crashing.
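The precedence rule can be sketched as a small shell function. This is an illustration, not Cantrip's code: resolve_timeout and is_positive_int are hypothetical names, the sketch accepts only whole seconds, and it reads "falls back to the default" as applying to whichever source would otherwise have won.

```shell
DEFAULT_SNAP_READ_TIMEOUT=1200

# Accept only positive whole numbers of seconds (an assumption of this sketch).
is_positive_int() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;
    esac
    [ "$1" -gt 0 ]
}

# resolve_timeout FLAG_VALUE
# The flag value wins, then CANTRIP_SNAP_READ_TIMEOUT, then the default.
resolve_timeout() {
    chosen=${1:-${CANTRIP_SNAP_READ_TIMEOUT:-}}
    if [ -z "$chosen" ]; then
        echo "$DEFAULT_SNAP_READ_TIMEOUT"
    elif is_positive_int "$chosen"; then
        echo "$chosen"
    else
        # Invalid input warns on stderr and falls back instead of failing.
        echo "warning: invalid timeout '$chosen', using default" >&2
        echo "$DEFAULT_SNAP_READ_TIMEOUT"
    fi
}

resolve_timeout 300    # prints 300
```

Passing an empty flag value with CANTRIP_SNAP_READ_TIMEOUT=600 set would print 600, and with neither set it prints 1200.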

When the snap drops mid-stream (the qwen3-coder snap occasionally hangs up at long generations), Cantrip surfaces the recovery as a [provider reconnect] system message in the chat and retries with a short exponential backoff (~2/4/8 s) before giving up. The conversation loop stays alive across the retry — no need to re-launch.
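For readers unfamiliar with the pattern, a doubling backoff of roughly this shape looks like the sketch below. It is a generic illustration, not Cantrip's retry loop: retry_with_backoff is a hypothetical helper, the limit of three retries is inferred from the 2/4/8 s sequence, and SLEEP_CMD exists only so the waits can be stubbed out.

```shell
# retry_with_backoff CMD [ARGS...]
# Runs CMD; on failure waits 2, 4, then 8 seconds between retries,
# and gives up after the third retry fails.
retry_with_backoff() {
    delay=2
    attempt=0
    until "$@"; do
        attempt=$(( attempt + 1 ))
        if [ "$attempt" -gt 3 ]; then
            echo "giving up after 3 retries" >&2
            return 1
        fi
        echo "retry $attempt in ${delay}s" >&2
        # SLEEP_CMD is a hook for dry runs; it defaults to the real sleep.
        ${SLEEP_CMD:-sleep} "$delay"
        delay=$(( delay * 2 ))
    done
}
```

A call such as `retry_with_backoff some-command` waits at most 14 seconds in total before reporting failure, which keeps a dropped connection from stalling the session for long.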

Short-session mode for tight-context snaps

Some snaps run with a small per-slot context window: gemma4 (Gemma 3n E4B) gives roughly 10 K tokens per slot, and the system prompt plus tool schemas already fill a third of that before a conversation starts.

Cantrip detects this at startup: when the usable window is below ~16 K it switches into short-session mode, which compacts at 50 % of the window instead of 80 %, replaces the prose-summary compaction with a one-line-per-tool-call history ledger (dropping the raw older messages rather than keeping them around), trims the toolset to just what the current phase needs, and treats each turn as a near-fresh conversation. The status bar shows a [short-session] chip while it is active, and /cost reports the compaction strategy.

The trade-off is real: the agent loses some cross-edit memory, so a debugging loop that spans several files will be weaker than it would be on a roomier model. In exchange, a 10 K model can actually finish a multi-edit charm without erroring on exceed_context_size.
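The effect of the automatic switch on the compaction point can be worked through with a little arithmetic. The sketch below is illustrative only: compaction_threshold is a hypothetical helper, and it assumes the "~16 K" cut-off means 16000 tokens, which this guide does not state precisely.

```shell
# compaction_threshold USABLE_TOKENS
# Prints the token count at which compaction would start:
# 50 % of the window in short-session mode, 80 % otherwise.
compaction_threshold() {
    window=$1
    if [ "$window" -lt 16000 ]; then    # assumed cut-off for "~16 K"
        echo $(( window * 50 / 100 ))
    else
        echo $(( window * 80 / 100 ))
    fi
}

compaction_threshold 10000    # prints 5000  (short-session mode)
compaction_threshold 32000    # prints 25600 (normal mode)
```

On a 10 K slot, with about a third already taken by the system prompt and tool schemas, compaction at 5000 tokens leaves only a couple of thousand tokens of conversation between compactions.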

Force the mode with --short-session=on|off (or CANTRIP_SHORT_SESSION) to opt a borderline ~16–32 K provider in or out:

$ cantrip --provider inference-snap --snap qwen3-coder --short-session on

Use Gemini (default)

Gemini is the default cloud provider. Get an API key from Google AI Studio, then:

$ export GEMINI_API_KEY="your-key-here"
$ cantrip

To use a specific Gemini model:

$ cantrip --model gemini-2.5-pro

Use Claude

Claude often produces higher-quality charm code, especially for complex infrastructure charms (Path C). Get a key from the Anthropic console:

$ export ANTHROPIC_API_KEY="your-key-here"
$ cantrip --provider claude

To specify a model:

$ cantrip --provider claude --model claude-sonnet-4-6

Use Fireworks.ai

Fireworks hosts strong open-weights models — Kimi K2 (agentic/coding), GLM, MiniMax, DeepSeek — behind an OpenAI-compatible API. Get a key from your Fireworks account, then:

$ export FIREWORKS_API_KEY="your-key-here"
$ cantrip --provider fireworks

The default model is accounts/fireworks/models/kimi-k2p6, a 256k-context agentic model with native tool-use. Override with --model:

$ cantrip --provider fireworks \
    --model accounts/fireworks/models/glm-5p1

Cantrip auto-detects the selected model's context window and capability flags (tool use, vision) from the Fireworks /models endpoint at startup.

Kimi K2 (and the DeepSeek-R1 and GLM reasoning variants) emits reasoning_content alongside the final reply, and the reasoning tokens count against max_tokens. A low cap will be consumed entirely by reasoning and leave nothing for the answer — a prompt with max_tokens=30 can come back with 30 completion tokens and an empty response string.

Cantrip surfaces the reasoning through the same _thinking_content metadata channel Claude uses for extended thinking, and honours thinking_budget on Fireworks by raising max_tokens to at least thinking_budget + 4096. For manual testing, set max_tokens to at least 4096 for simple prompts and 16000+ for tool-using turns.
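The max_tokens adjustment amounts to taking the larger of the requested cap and thinking_budget + 4096. A minimal sketch of that arithmetic follows; effective_max_tokens is a hypothetical name used for illustration, not a Cantrip function.

```shell
# effective_max_tokens MAX_TOKENS THINKING_BUDGET
# Raises the cap to at least THINKING_BUDGET + 4096 so reasoning tokens
# cannot consume the whole completion.
effective_max_tokens() {
    max_tokens=$1
    thinking_budget=$2
    floor=$(( thinking_budget + 4096 ))
    if [ "$max_tokens" -lt "$floor" ]; then
        echo "$floor"
    else
        echo "$max_tokens"
    fi
}

effective_max_tokens 4096 8000     # prints 12096
effective_max_tokens 32000 8000    # prints 32000
```

In the first call the requested cap is too small to hold both the reasoning and an answer, so it is raised; in the second it already exceeds the floor and is left alone.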

Use OpenRouter

OpenRouter is a meta-gateway to hundreds of models — OpenAI GPT, Anthropic Claude, Meta Llama, Mistral, Grok, DeepSeek — behind a single OpenAI-compatible API and a single key. Useful when you want a model Cantrip doesn't ship a dedicated provider for, or when you want to A/B the same prompt across vendors.

Get a key from your OpenRouter keys page, then:

$ export OPENROUTER_API_KEY="your-key-here"
$ cantrip --provider openrouter

The default model is openai/gpt-4o — a long-lived choice that sits outside the coverage of Cantrip's other providers. Override with any OpenRouter slug:

$ cantrip --provider openrouter \
    --model meta-llama/llama-3.3-70b-instruct

$ cantrip --provider openrouter \
    --model x-ai/grok-4-fast

Cantrip auto-detects the selected model's context window, tool support, and vision support from the OpenRouter /models catalogue at startup, and sends HTTP-Referer and X-Title headers so Cantrip usage shows up on OpenRouter's public ranking dashboards.

Prefer a dedicated provider when one exists — OpenRouter adds a routing hop (and a small markup) on top of the upstream vendor. Use claude for Anthropic, gemini for Google, and fireworks for open-weights models that Fireworks hosts directly. OpenRouter is the right call when the model you want is not on any of those.

Use OpenCode Zen

OpenCode Zen is a curated model gateway run by the OpenCode project. It exposes Anthropic Claude (Opus, Sonnet, Haiku), OpenAI GPT-5 family, Gemini 3 family, and a handful of strong open-weights models (GLM, Kimi, Qwen, MiniMax) behind a single OpenAI-compatible API and a single key, with a free tier for the lighter models.

Get a key from the OpenCode Zen page, then:

$ export OPENCODE_ZEN_API_KEY="your-key-here"
$ cantrip --provider opencode-zen

The default model is claude-haiku-4-5 — fast, cheap, with native tool use. Override with any Zen slug (no vendor prefix):

$ cantrip --provider opencode-zen --model gpt-5.5

$ cantrip --provider opencode-zen --model gemini-3.1-pro

$ cantrip --provider opencode-zen --model kimi-k2.6

The legacy ZEN_API_KEY environment variable is also accepted as a fallback when OPENCODE_ZEN_API_KEY is unset.
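In shell terms the fallback behaves roughly like a nested parameter expansion. The sketch below is an illustration with a hypothetical helper name, resolve_zen_key; it treats "unset" literally, so an OPENCODE_ZEN_API_KEY that is set but empty still takes precedence, which may or may not match Cantrip's handling of empty values.

```shell
# Prefer OPENCODE_ZEN_API_KEY; use the legacy ZEN_API_KEY only when the
# newer variable is unset. Prints an empty line if neither is defined.
resolve_zen_key() {
    echo "${OPENCODE_ZEN_API_KEY-${ZEN_API_KEY-}}"
}
```

If both variables are present in an older shell profile, the legacy one can be removed without changing which key is used.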

Like OpenRouter, Zen adds a routing hop on top of the upstream vendor. Prefer a dedicated provider when one exists — claude for Anthropic, gemini for Google, fireworks for the open-weights models Fireworks hosts directly. Reach for opencode-zen when you want OpenCode's curation, its free tier, or a model only available there.

Use any OpenAI-compatible endpoint

For any backend that speaks the OpenAI chat-completions wire format — Together, Groq, DeepInfra, LiteLLM proxies, self-hosted vLLM — use the openai-compatible provider. You must supply the base URL and model ID explicitly:

$ export OPENAI_COMPATIBLE_API_KEY="your-key-here"
$ cantrip --provider openai-compatible \
    --base-url https://api.together.xyz/v1 \
    --model meta-llama/Llama-3.3-70B-Instruct-Turbo

For endpoints that don't require authentication (e.g. a local vLLM instance on your network), set OPENAI_COMPATIBLE_API_KEY to any non-empty string.

Prefer a dedicated provider when one exists: inference-snap for Canonical's local servers, fireworks for Fireworks.ai. Dedicated providers carry sensible defaults and model-catalogue probing; the generic provider requires you to supply everything by hand.

Configure embed and rerank

Retrieval features — the planned @docs index and memory recall — need an embedding provider, and rerank quality benefits from a dedicated rerank provider. Cantrip routes these via a separate RoleRouter so you pick them independently of the chat provider.

The Anthropic-ecosystem recommendation is Voyage:

$ export VOYAGE_API_KEY=...
$ cantrip --provider claude \
    --embed-provider voyage --embed-model voyage-3 \
    --rerank-provider voyage --rerank-model rerank-2

OpenAI users can pair their embed endpoint with Voyage rerank (OpenAI does not ship a rerank API):

$ export OPENAI_API_KEY=... VOYAGE_API_KEY=...
$ cantrip --provider claude \
    --embed-provider openai --embed-model text-embedding-3-large \
    --rerank-provider voyage

To make these choices persistent across shells, set the equivalent environment variables: CANTRIP_EMBED_PROVIDER, CANTRIP_EMBED_MODEL, CANTRIP_RERANK_PROVIDER, CANTRIP_RERANK_MODEL. When both a flag and its variable are set, the CLI flag wins.
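For example, the Voyage pairing shown earlier could be expressed through the environment as follows. This assumes each variable accepts the same value as its matching flag.

```shell
# Assumed to mirror --embed-provider / --embed-model /
# --rerank-provider / --rerank-model from the Voyage example.
export CANTRIP_EMBED_PROVIDER="voyage"
export CANTRIP_EMBED_MODEL="voyage-3"
export CANTRIP_RERANK_PROVIDER="voyage"
export CANTRIP_RERANK_MODEL="rerank-2"
```

Add these lines to your shell profile alongside VOYAGE_API_KEY if you want every session to start with the same retrieval setup.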

Local embed servers (Ollama, vLLM, llama.cpp)

Anything that exposes the OpenAI /v1/embeddings wire format can serve as the embed provider. Set OPENAI_EMBED_BASE_URL to the endpoint — the API-key requirement is automatically relaxed when this override is present, since most local servers do not authenticate.

$ ollama pull nomic-embed-text
$ export OPENAI_EMBED_BASE_URL="http://localhost:11434/v1"
$ cantrip --provider claude \
    --embed-provider openai --embed-model nomic-embed-text \
    --rerank-provider voyage

Tested shapes:

If the local server does require authentication, set OPENAI_API_KEY alongside OPENAI_EMBED_BASE_URL and the bearer token will be forwarded.

Costs surface in /cost under a separate By role section once an embed or rerank call has fired, so retrieval spend stays visible instead of being folded into the chat total. Pricing entries cover voyage-3, voyage-3-lite, voyage-3-large, voyage-code-3, rerank-2, rerank-2-lite, text-embedding-3-small, and text-embedding-3-large; unknown models render as free.

An offline sentence-transformers fallback is on the roadmap but not yet shipped — sessions without a configured embed provider raise RoleNotConfigured from the first retrieval call rather than silently degrading.

Hybrid setups

You can combine providers — use a powerful cloud model for code generation and a local model for internal tasks like research summarisation and log queries:

$ cantrip --provider claude \
    --light-provider inference-snap --light-snap nemotron-3-nano

--light-provider accepts gemini, claude, inference-snap, fireworks, openrouter, or opencode-zen. See Configure light models for full details on cost routing.