Multi-model patterns: racing, arena, oracle
Three shapes for involving more than one model: an objective Best-of-N race scored by measurable charm quality, a blind A/B /arena that captures human preference, and an on-demand oracle_consult for one-shot second opinions.
Library-only surface for objective racing.
/arena is fully user-facing, but the Best-of-N
race coordinator is wired into the executor behind a
RaceConfig that has no CLI flag or environment
variable yet. Racing runs only when a caller constructs an
executor with a non-default config — see
Current limits. This page documents the
design so you can read race events in transcripts and anticipate
the surface when it lands.
Why race at all
Charm building has an unusually clean success signal. A generated charm can be packed, linted with charmlint, unit-tested, and scored against an operational-readiness checklist. Unlike open-ended writing tasks, the output is measurable: a charm with zero charmlint errors and 95 % readiness is objectively better than one with six errors and 60 % readiness, regardless of how either got there.
Given a measurable output and an embarrassingly parallel work pattern (each candidate runs in its own git worktree — see How Cantrip works), running several models on the same task and picking the winner is a natural fit. It also sidesteps the “which model is best today” argument: you don’t have to choose in advance if you race them and let the rubric decide.
Three mechanisms, same goal
Cantrip exposes three multi-model shapes. They share the goal of "don’t bet the session on a single model" but the implementations and intended uses are distinct.
| | Best-of-N race | Blind A/B Arena | Oracle consult |
|---|---|---|---|
| Module | cantrip.agent.race | cantrip.agent.arena | cantrip.agent.tools.oracle |
| Per call | Full subagent loop with tools | Single provider completion, no tools | Single provider completion, no tools |
| Scope | Per TaskCategory (BUILD, DESIGN, …) | One-off, user-triggered | One-off, agent-triggered |
| Picks the answer | Rubric score against charm outputs | You pick, blind to model names | The oracle’s answer is the answer |
| Outcome | Winner’s worktree merged back | Preference written to global memory | Tool result the agent cites |
| Trigger | Automatic, per RaceConfig | /arena <prompt> | Agent calls oracle_consult |
| Effect on chat history | Winner’s turns enter the transcript | Neither side enters state.messages | Does not enter state.messages |
The scoring rubric
Four signals combine into a single total in [0.0, 1.0].
Weights sum to one so scores are directly comparable across a
pool, even when candidates happen to run different test counts
or produce different diff sizes.
| Signal | Weight | Why this weight |
|---|---|---|
| Charmlint violations (weighted by severity) | 30 % | Errors are usually spec violations that break the charm |
| Operational-readiness percentage | 30 % | Captures whether the charm has the moving parts a real operator needs |
| Unit-test pass ratio | 25 % | High-signal but the other two lead for “shippable” |
| Diff size (smaller is better) | 15 % | Tie-breaker that nudges toward focused changes |
Each signal is scored into [0, 1] independently, then
the total is a weighted sum. The constants all live in
src/cantrip/agent/race.py; tune them in one place
rather than scattering magic numbers.
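As a rough illustration of how the four weights combine, here is a minimal sketch. The weights come from the table above; the names `WEIGHTS` and `total_score` are hypothetical and are not the constants or functions in src/cantrip/agent/race.py.

```python
from typing import Mapping

# Weights copied from the rubric table. The dictionary keys are
# illustrative names, not the identifiers used in race.py.
WEIGHTS = {
    "charmlint": 0.30,
    "readiness": 0.30,
    "tests": 0.25,
    "diff": 0.15,
}


def total_score(subscores: Mapping[str, float]) -> float:
    """Weighted sum of the four subscores, each clamped to [0, 1]."""
    total = 0.0
    for name, weight in WEIGHTS.items():
        value = min(1.0, max(0.0, subscores[name]))
        total += weight * value
    return total
```

Because the weights sum to one and every subscore is clamped, the total always stays inside [0.0, 1.0].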
What each signal measures
Charmlint — exponential decay on weighted violations
Each violation is weighted by severity: error × 3,
warning × 1, info × 0.1.
The weighted total feeds an exponential decay with a constant of
10, so a clean charm scores 1.0, one error drops to ~0.74, and
a charm with three errors and several warnings falls below 0.25.
Errors dominate because they block shipping; warnings are a
speed bump; infos are advisory.
The scorer calls the same charmlint tool the agent uses elsewhere, so the Rust-vs-Python backend selection stays in one place. A tool failure degrades to zeroed counts rather than crashing the race.
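The decay described above can be sketched as follows. This assumes the curve is `exp(-weighted / 10)`, which reproduces the ~0.74 figure quoted for a single error; the function and constant names are hypothetical.

```python
import math

# Severity weights and decay constant as described in the text.
# The names are illustrative, not the real constants in race.py.
SEVERITY_WEIGHTS = {"error": 3.0, "warning": 1.0, "info": 0.1}
DECAY_CONSTANT = 10.0


def charmlint_score(errors: int = 0, warnings: int = 0, infos: int = 0) -> float:
    """Map weighted violation counts to (0, 1]; a clean charm scores 1.0."""
    weighted = (
        errors * SEVERITY_WEIGHTS["error"]
        + warnings * SEVERITY_WEIGHTS["warning"]
        + infos * SEVERITY_WEIGHTS["info"]
    )
    # Assumed form of the decay: exp(-weighted / constant).
    return math.exp(-weighted / DECAY_CONSTANT)
```

Under this assumption, three errors plus five warnings give a weighted total of 14 and a score just under 0.25.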
Readiness — linear on the overall score
The operational-readiness tool produces an overall percentage
between 0 and 100. The scorer normalises it to [0, 1].
When the tool can’t evaluate the directory (for example, no
charmcraft.yaml), the signal returns 0 rather than
0.5: a candidate that isn’t actually a charm should lose to
one that is. The readiness tool writes an
OPERATIONAL_READINESS.md report into the worktree as
a side effect — the scorer measures diff before
running readiness so the uncommitted report doesn’t inflate
diff-size counts.
Tests — normalised pass ratio
The test signal is passed / total when any tests exist,
and 1.0 when none do — a candidate shouldn’t be
penalised for working in a test-free area. Integration test
counts are a follow-up; the current rubric scores unit tests
only. Baseline-aware scoring (“this candidate ran fewer
tests than the others; penalise it proportionally”) is not
yet implemented.
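The test signal is simple enough to state in a few lines. This is a sketch with a hypothetical function name, not the scorer's actual code.

```python
def unit_test_score(passed: int, total: int) -> float:
    """Pass ratio when tests exist; 1.0 when the candidate ran none."""
    if total <= 0:
        # No tests in this area: do not penalise the candidate.
        return 1.0
    return passed / total
```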
Diff size — linear penalty, capped at 2000 lines
Smaller diffs score higher. The decay is linear up to a cap of 2000 lines; anything above the cap scores 0 for this signal. A zero-line diff is suspicious (the candidate may have committed nothing) and gets a middling 0.5 so charmlint and readiness decide the winner rather than rewarding inaction.
The diff is taken against the worktree’s base_sha
via git diff --numstat base_sha..HEAD, so only
committed changes count. Binary files are skipped.
Git errors fall through to (0, 0) rather than
crashing — a broken measurement shouldn’t sink a race.
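A minimal sketch of the diff measurement and its score follows. It relies on the documented `git diff --numstat` output format (tab-separated added, deleted, and path columns, with `-` in the count columns for binary files). Treating the diff size as added plus deleted lines is an assumption, and the function names are hypothetical.

```python
from typing import Tuple

DIFF_CAP_LINES = 2000  # cap from the text; the constant name is illustrative


def parse_numstat(output: str) -> Tuple[int, int]:
    """Sum added/deleted lines from `git diff --numstat` output."""
    added = deleted = 0
    for line in output.splitlines():
        parts = line.split("\t")
        if len(parts) < 3:
            continue  # not a numstat row
        if parts[0] == "-" or parts[1] == "-":
            continue  # binary file: git prints "-" instead of counts
        added += int(parts[0])
        deleted += int(parts[1])
    return added, deleted


def diff_score(diff_lines: int) -> float:
    """Linear penalty up to the cap; a zero-line diff gets a middling 0.5."""
    if diff_lines <= 0:
        return 0.5
    if diff_lines >= DIFF_CAP_LINES:
        return 0.0
    return 1.0 - diff_lines / DIFF_CAP_LINES
```

For example, a 500-line diff scores 0.75 under this sketch, and anything at or beyond 2000 lines scores 0.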
Viability and tie-breaking
A candidate’s ExitState short-circuits the
rubric before subscores are combined:
- COMPLETED and BLOCKED are viable: blocked runs can still be worth merging if they produced partial progress while the user resolves the block.
- FAILED and NOOP force a total of 0.0 regardless of the other signals. A failed candidate with clean charmlint (because it never changed anything) is not a win.
Ties break on lower diff_lines (smaller change wins)
and then on lexicographic candidate_id, so repeated
races with the same pool produce the same winner when the
underlying measurements agree. An is_perfect threshold
of 0.999 exists as a hook for early cancellation
(RaceConfig.cancel_on_perfect), but early cancel
isn’t implemented yet — the coordinator waits for every
candidate.
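The viability rule and the tie-break order can be sketched with a single sort key. The `Scored` record and `pick_winner` function are hypothetical, and returning no winner when no candidate is viable is an assumption based on the "empty winner fields" behaviour of the race_finished event.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

VIABLE_STATES = {"COMPLETED", "BLOCKED"}


@dataclass(frozen=True)
class Scored:
    """Illustrative stand-in for a scored candidate."""
    candidate_id: str
    exit_state: str
    raw_total: float  # weighted sum before the viability check
    diff_lines: int

    @property
    def total(self) -> float:
        # FAILED and NOOP force 0.0 regardless of the subscores.
        return self.raw_total if self.exit_state in VIABLE_STATES else 0.0


def pick_winner(candidates: Iterable[Scored]) -> Optional[Scored]:
    """Highest total, then fewer diff lines, then lexicographic id."""
    viable = [c for c in candidates if c.exit_state in VIABLE_STATES]
    if not viable:
        return None  # assumption: no viable candidate means no winner
    return min(viable, key=lambda c: (-c.total, c.diff_lines, c.candidate_id))
```

Sorting on a fully ordered key is what makes repeated races deterministic: two candidates with equal totals and equal diff sizes always resolve to the same id.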
RaceConfig and cost gates
RaceConfig is the opt-in surface. The default
disables racing entirely — enabled_categories
is an empty frozenset, so should_race always returns
False and the executor falls through to a single-subagent run.
- enabled_categories (default: empty). The TaskCategory values that are allowed to race. Typical values are {BUILD, DESIGN}: objectively measurable work where Best-of-N pays off.
- max_candidates (default: 3). Upper bound on race width. clamp_candidates trims any pool larger than this. A setting of 0 or less disables racing even for enabled categories.
- budget_tokens (default: 500 000). Hard cap on estimated total tokens. Races whose pre-run estimate exceeds this budget downgrade silently to a single-subagent run. Set to 0 or a negative value to disable the cap.
- confirm_threshold_tokens (default: 200 000). Soft gate. Estimates above this threshold but below the hard budget surface a CONFIRM task so you can approve or decline the spend. Tuned so a two-way race on a typical BUILD task fires the gate but a cheap DESIGN race doesn’t.
- baseline_tokens_per_run (default: 75 000). Per-candidate token estimate used to multiply out the pre-race cost. Deliberately low so the CONFIRM gate fires early for racy tasks. Once streaming-usage aggregation lands, mid-flight accounting will replace this static estimate.
- cancel_on_perfect (default: True). Reserved for early cancellation when a candidate hits the perfect-score threshold. Not yet implemented; the coordinator waits for every candidate today.
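To make the defaults concrete, here is a stand-in dataclass that mirrors the fields listed above. It is a sketch only: `RaceConfigSketch`, the two-member `TaskCategory` enum, and the method signatures are assumptions, and the real `RaceConfig` in cantrip.agent.race may be shaped differently.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, List, Sequence


class TaskCategory(Enum):
    """Stand-in enum; only the two categories named in the text."""
    BUILD = "build"
    DESIGN = "design"


@dataclass(frozen=True)
class RaceConfigSketch:
    """Hypothetical mirror of the documented fields and defaults."""
    enabled_categories: FrozenSet[TaskCategory] = frozenset()
    max_candidates: int = 3
    budget_tokens: int = 500_000
    confirm_threshold_tokens: int = 200_000
    baseline_tokens_per_run: int = 75_000
    cancel_on_perfect: bool = True

    def should_race(self, category: TaskCategory) -> bool:
        # Racing needs an enabled category and a positive width.
        return self.max_candidates > 0 and category in self.enabled_categories

    def clamp_candidates(self, pool: Sequence[str]) -> List[str]:
        # Trim any pool wider than max_candidates.
        return list(pool)[: max(0, self.max_candidates)]
```

With the default (empty) `enabled_categories`, `should_race` returns False for every category, which matches the opt-in behaviour described above.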
The three-way gate
At dispatch time the executor classifies every would-be race into one of three outcomes:
| Outcome | Condition | What happens |
|---|---|---|
| RACE | Estimate ≤ confirm_threshold_tokens | Race runs silently |
| CONFIRM | Threshold < estimate ≤ budget_tokens | A CONFIRM task gates the parent; reply yes or no |
| DOWNGRADE | Estimate > budget_tokens or user declined | Falls through to a single-subagent run |
User decisions persist on the task (task.race_decision)
so a task that re-enters the executor for any reason is not
re-prompted. The CONFIRM task id is
race-confirm-<parent-task-id>; the executor
reuses an existing CONFIRM rather than creating duplicates.
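The three-way classification can be sketched as a small pure function. The names `Gate` and `classify_race` are hypothetical; the function takes the pre-race estimate as an input rather than computing it, and `user_approved` stands in for the persisted task.race_decision.

```python
from enum import Enum
from typing import Optional


class Gate(Enum):
    RACE = "race"
    CONFIRM = "confirm"
    DOWNGRADE = "downgrade"


def classify_race(
    estimate_tokens: int,
    confirm_threshold_tokens: int = 200_000,
    budget_tokens: int = 500_000,
    user_approved: Optional[bool] = None,
) -> Gate:
    """Classify a would-be race; a budget of 0 or less disables the hard cap."""
    if budget_tokens > 0 and estimate_tokens > budget_tokens:
        return Gate.DOWNGRADE
    if estimate_tokens <= confirm_threshold_tokens:
        return Gate.RACE
    if user_approved is None:
        return Gate.CONFIRM  # no stored decision yet: ask once
    # A stored decision is reused so the user is not re-prompted.
    return Gate.RACE if user_approved else Gate.DOWNGRADE
```

Passing the stored decision back in is what keeps a task that re-enters the executor from surfacing a second CONFIRM.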
Blind A/B Arena
/arena <prompt> sends the same prompt to both
the primary and light providers concurrently, shuffles the two
replies into labels A and B (hiding
model names), and asks you to pick. Responses are capped at
2 000 tokens so the A/B block stays readable side-by-side.
Recognised replies are forgiving and case-insensitive:
- A, pick A, left
- B, pick B, right
- tie, equal, both, neither, t
- skip, cancel, abort, never mind
Unrecognised replies fall through to normal chat — you aren’t locked out of talking to the agent while an arena is pending. The TUI, CLI, and Web frontends all intercept pending picks before routing the reply to the LLM.
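A forgiving, case-insensitive matcher for these replies might look like the sketch below. The phrase lists come from the text; the outcome labels ("A", "B", "tie", "skip"), the whitespace normalisation, and the function name are assumptions.

```python
from typing import Optional

# Phrases from the list above, grouped by assumed outcome label.
_REPLIES = {
    "A": {"a", "pick a", "left"},
    "B": {"b", "pick b", "right"},
    "tie": {"tie", "equal", "both", "neither", "t"},
    "skip": {"skip", "cancel", "abort", "never mind"},
}


def parse_pick(reply: str) -> Optional[str]:
    """Return the matched outcome, or None to fall through to normal chat."""
    # Lowercase and collapse runs of whitespace before matching.
    normalised = " ".join(reply.lower().split())
    for outcome, phrases in _REPLIES.items():
        if normalised in phrases:
            return outcome
    return None
```

Returning None for anything unrecognised is what lets a frontend route the reply to the LLM as ordinary chat instead of blocking on the pending pick.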
Picks and ties write a fact memory at
global scope (so the preference carries across
charms), tagged arena and model-preference,
with source="arena" and an
arena-preference-<8-hex> title. The body
names both models and includes a 200-character excerpt of the
prompt so the preference is attributable to a specific ask.
skip clears the session without writing. See
the memory how-to for the
full memory model and
the CLI reference
for the exact command syntax.
Arena refuses to start when both sides would resolve to the
same (provider, model) pair — a blind A/B
against identical configurations produces no signal and wastes
tokens. It also requires a configured light provider
(--light-provider or
CANTRIP_LIGHT_PROVIDER).
Oracle consults
oracle_consult is a tool the agent calls during a session
when it hits a hard, judgement-shaped question that the docs
cannot settle on their own. The tool sends a single focused
question — plus a compact context bundle (active charm,
caller-supplied hint, last few messages) — to a stronger
reasoning model and returns the answer. The main session
keeps running on its current model.
The intended uses are deliberately narrow:
- Charm-architecture choices (Path A / B / C is unclear, peer-relation topology, leader-election placement, sidecar-vs-separate-charm).
- Security-relevant design (secret rotation, TLS termination, RBAC boundary).
- Library-vs-custom-code trade-offs (use a charmlib, vendor a slice, write it yourself).
- Reactive-to-ops migration heuristics.
Not for syntax lookups or routine implementation steps — the docs and the active skill cover those without paying the oracle tax. The system prompt names both lists so the agent reaches for the oracle only when the call is justified.
Defaults
| Knob | Default |
|---|---|
| Provider | claude |
| Model | claude-opus-4-7 |
| Reasoning budget | 8000 tokens |
| Output cap | 4096 tokens |
| Temperature | 0.2 |
| Per-turn call cap | 1 |
| Per-session cost cap | $2 |
The provider and model can be overridden per-session via
state.oracle_provider_name and state.oracle_model. The
caps live on AgentState too — raise them when a session
genuinely benefits from more consults, not as a blanket policy.
Budget model
Two caps protect the session:
- Per-turn cap. state.oracle_max_calls_per_turn (default 1) limits invocations between user messages. The counter resets at the top of every conversation turn so the agent gets a fresh allowance with each user steering message.
- Per-session cap. state.oracle_max_session_cost_usd (default $2) is a cumulative USD ceiling. Cost is computed by cantrip.llm.pricing.estimate_cost from the response’s usage payload, so the meter is grounded in real billing rather than a fixed per-call charge.
Either cap returns a structured tool error the agent sees and explains in its summary. No half-state: a refused call leaves the counters untouched.
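The two caps can be modelled as a small meter that is checked before each call and updated only after a successful one. `OracleBudget` and its method names are hypothetical; the real counters live on AgentState. Checking the session cap against cost accumulated so far is an assumption, since a call's cost is only known after the response arrives.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OracleBudget:
    """Illustrative meter for the per-turn and per-session caps."""
    max_calls_per_turn: int = 1
    max_session_cost_usd: float = 2.0
    calls_this_turn: int = 0
    session_cost_usd: float = 0.0

    def check(self) -> Optional[str]:
        """Return a refusal reason, or None if a call is allowed.

        Read-only: a refused call leaves the counters untouched.
        """
        if self.calls_this_turn >= self.max_calls_per_turn:
            return "per_turn_cap"
        if self.session_cost_usd >= self.max_session_cost_usd:
            return "session_cost_cap"
        return None

    def record(self, cost_usd: float) -> None:
        """Account for a completed call."""
        self.calls_this_turn += 1
        self.session_cost_usd += cost_usd

    def new_turn(self) -> None:
        """Reset the per-turn counter at the top of a conversation turn."""
        self.calls_this_turn = 0
```

Keeping `check` free of side effects is one way to get the "no half-state" property: only `record` mutates the meters.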
Why not just compaction-aware in-line context?
Oracle is not a replacement for thoughtful prompting on the primary model. The pattern earns its keep when the agent has already spent context on the problem and a fresh heavyweight reading produces a better answer than yet more turns on the running session. Picking it for routine work would be expensive and pointless; picking it for one well-formed architecture question is the whole point.
The oracle’s answer does not enter state.messages. It comes
back as a tool result, which means: (a) the main context window
stays focused on the work in progress; (b) the agent must
restate the recommendation in its next text reply rather than
silently quoting the oracle. The transcript records the full
exchange (question, context hint, answer, usage, cost) as an
oracle_consult event, so nothing is lost for audits.
Transcript events
Races and arenas emit structured events alongside the regular task updates. They land in the session transcript so a reviewer can reconstruct what happened after the fact.
- race_confirm_requested. Emitted when the soft gate fires. Payload carries task_id, confirm_task_id, estimate_tokens, threshold_tokens, and the candidate id list.
- race_downgraded. Emitted when a would-be race runs as a single subagent instead. The reason field is either over_budget (hard cap) or user_declined (answered no to a CONFIRM). Over-budget downgrades include estimate_tokens and budget_tokens so you can see why.
- race_finished. One row per race. Carries the winner’s candidate_id and score, the candidate list, and elapsed_s. Empty winner fields mean every candidate failed.
- race_candidate. One row per candidate, winner or loser. Includes the candidate’s exit_state, total score, and transcript_task_id. The transcript task id is <parent_task_id>__<candidate_id>; join against subagent_messages on that key to read any loser’s full tool-call trace, not just the winner’s.
- oracle_consult. One row per Oracle call. Carries provider, model, the verbatim question and context_hint, the answer, the response usage dict, and cost_usd. calls_this_turn, calls_total, and session_cost_usd capture the budget meters at the moment of the call so an auditor can reconstruct cap-trip events without replaying the full session.
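The join on the transcript task id can be sketched over plain dictionaries. The row shapes here are assumptions for illustration: race_candidate rows are taken to carry `candidate_id` and `transcript_task_id`, and subagent_messages rows are taken to carry a `task_id` key. Check the actual transcript schema before relying on these field names.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Mapping


def transcript_task_id(parent_task_id: str, candidate_id: str) -> str:
    """Build the documented key: <parent_task_id>__<candidate_id>."""
    return f"{parent_task_id}__{candidate_id}"


def traces_by_candidate(
    candidate_rows: Iterable[Mapping[str, str]],
    subagent_messages: Iterable[Mapping[str, str]],
) -> Dict[str, List[Mapping[str, str]]]:
    """Group subagent messages under the candidate that produced them."""
    by_task: Dict[str, List[Mapping[str, str]]] = defaultdict(list)
    for message in subagent_messages:
        by_task[message["task_id"]].append(message)  # assumed key name
    return {
        row["candidate_id"]: by_task.get(row["transcript_task_id"], [])
        for row in candidate_rows
    }
```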
Current limits and planned work
- No user-facing surface for RaceConfig yet. The executor accepts a programmatic RaceConfig argument, but there are no CLI flags or environment variables to set enabled_categories and friends. Racing is reachable today through the Python API only; a proper surface is planned.
- No early cancellation. cancel_on_perfect is a config knob but the coordinator waits for every candidate before scoring. A perfect score doesn’t short-circuit the others yet.
- Static cost estimate. baseline_tokens_per_run is a rough guess, not measured usage. Mid-flight budget accounting (“cancel once we’ve burned through the budget”) is deferred until streaming-usage aggregation lands.
- Unit tests only. The test subscore measures unit-test pass/total. Integration test counts are not yet surfaced to the rubric.