Multi-model patterns: racing, arena, oracle

Three shapes for involving more than one model: an objective Best-of-N race scored by measurable charm quality, a blind A/B /arena that captures human preference, and an on-demand oracle_consult for one-shot second opinions.

Library-only surface for objective racing. /arena is fully user-facing, but the Best-of-N race coordinator is wired into the executor behind a RaceConfig that has no CLI flag or environment variable yet. Racing runs only when a caller constructs an executor with a non-default config — see Current limits. This page documents the design so you can read race events in transcripts and anticipate the surface when it lands.

Why race at all

Charm building has an unusually clean success signal. A generated charm can be packed, linted with charmlint, unit-tested, and scored against an operational-readiness checklist. Unlike open-ended writing tasks, the output is measurable: a charm with zero charmlint errors and 95 % readiness is objectively better than one with six errors and 60 % readiness, regardless of how either got there.

Given a measurable output and an embarrassingly parallel work pattern (each candidate runs in its own git worktree — see How Cantrip works), running several models on the same task and picking the winner is a natural fit. It also sidesteps the “which model is best today” argument: you don’t have to choose in advance if you race them and let the rubric decide.

Three mechanisms, same goal

Cantrip exposes three multi-model shapes. They share the goal of "don’t bet the session on a single model" but the implementations and intended uses are distinct.

| | Best-of-N race | Blind A/B Arena | Oracle consult |
| --- | --- | --- | --- |
| Module | cantrip.agent.race | cantrip.agent.arena | cantrip.agent.tools.oracle |
| Per call | Full subagent loop with tools | Single provider completion, no tools | Single provider completion, no tools |
| Scope | Per TaskCategory (BUILD, DESIGN, …) | One-off, user-triggered | One-off, agent-triggered |
| Picks the answer | Rubric score against charm outputs | You pick, blind to model names | The oracle’s answer is the answer |
| Outcome | Winner’s worktree merged back | Preference written to global memory | Tool result the agent cites |
| Trigger | Automatic, per RaceConfig | /arena <prompt> | Agent calls oracle_consult |
| Effect on chat history | Winner’s turns enter the transcript | Neither side enters state.messages | Does not enter state.messages |

The scoring rubric

Four signals combine into a single total in [0.0, 1.0]. Weights sum to one so scores are directly comparable across a pool, even when candidates happen to run different test counts or produce different diff sizes.

| Signal | Weight | Why this weight |
| --- | --- | --- |
| Charmlint violations (weighted by severity) | 30 % | Errors are usually spec violations that break the charm |
| Operational-readiness percentage | 30 % | Captures whether the charm has the moving parts a real operator needs |
| Unit-test pass ratio | 25 % | High-signal but the other two lead for “shippable” |
| Diff size (smaller is better) | 15 % | Tie-breaker that nudges toward focused changes |

Each signal is scored into [0, 1] independently, then the total is a weighted sum. The constants all live in src/cantrip/agent/race.py; tune them in one place rather than scattering magic numbers.
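As a minimal sketch of the combination step, assuming each subscore has already been normalised to [0, 1]: the weights come from the rubric table, while the names WEIGHTS and total_score are illustrative and may not match the constants in src/cantrip/agent/race.py.

```python
# Weights taken from the rubric table. The dict keys and function name are
# hypothetical; the real constants live in src/cantrip/agent/race.py.
WEIGHTS = {
    "charmlint": 0.30,
    "readiness": 0.30,
    "tests": 0.25,
    "diff": 0.15,
}


def total_score(subscores):
    """Return the weighted sum of per-signal scores, each expected in [0, 1]."""
    total = 0.0
    for name, weight in WEIGHTS.items():
        # Clamp defensively so one out-of-range signal cannot push the
        # total outside [0, 1].
        value = min(1.0, max(0.0, subscores[name]))
        total += weight * value
    return total
```

Because the weights sum to one, a candidate that scores 1.0 on every signal totals 1.0, and halving the charmlint subscore alone costs 0.15 of the total.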

What each signal measures

Charmlint — exponential decay on weighted violations

Each violation is weighted by severity: error × 3, warning × 1, info × 0.1. The weighted total feeds an exponential decay with a constant of 10, so a clean charm scores 1.0, one error (weighted total 3) drops to ~0.74, and a charm with three errors and five warnings (weighted total 14) falls just below 0.25. Errors dominate because they block shipping; warnings are a speed bump; infos are advisory.

The scorer calls the same charmlint tool the agent uses elsewhere, so the Rust-vs-Python backend selection stays in one place. A tool failure degrades to zeroed counts rather than crashing the race.
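The decay can be sketched as follows. This assumes the form exp(-weighted / 10), which reproduces the ~0.74 figure for a single error; the function and constant names are hypothetical rather than taken from race.py.

```python
import math

# Severity multipliers and decay constant as described in the text.
# Names are illustrative assumptions.
SEVERITY_WEIGHTS = {"error": 3.0, "warning": 1.0, "info": 0.1}
DECAY_CONSTANT = 10.0


def charmlint_score(errors=0, warnings=0, infos=0):
    """Map violation counts to a score in (0, 1] via exponential decay."""
    weighted = (
        errors * SEVERITY_WEIGHTS["error"]
        + warnings * SEVERITY_WEIGHTS["warning"]
        + infos * SEVERITY_WEIGHTS["info"]
    )
    return math.exp(-weighted / DECAY_CONSTANT)
```

Under this reading, `charmlint_score(errors=1)` is exp(-0.3), about 0.74, and ten info-level findings cost the same as one warning.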

Readiness — linear on the overall score

The operational-readiness tool produces an overall percentage between 0 and 100. The scorer normalises it to [0, 1]. When the tool can’t evaluate the directory (for example, no charmcraft.yaml), the signal returns 0 rather than 0.5: a candidate that isn’t actually a charm should lose to one that is. The readiness tool writes an OPERATIONAL_READINESS.md report into the worktree as a side effect — the scorer measures diff before running readiness so the uncommitted report doesn’t inflate diff-size counts.
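A minimal sketch of the normalisation, assuming the tool's result is surfaced as either a percentage or None when the directory cannot be evaluated (the None convention and the function name are assumptions for illustration):

```python
from typing import Optional


def readiness_score(overall_percent: Optional[float]) -> float:
    """Normalise a 0-100 readiness percentage to [0, 1]."""
    if overall_percent is None:
        # Not evaluable (for example, no charmcraft.yaml): score 0, not 0.5,
        # so a non-charm loses to a real charm.
        return 0.0
    return min(1.0, max(0.0, overall_percent / 100.0))
```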

Tests — normalised pass ratio

The test signal is passed / total when any tests exist, and 1.0 when none do — a candidate shouldn’t be penalised for working in a test-free area. Integration test counts are a follow-up; the current rubric scores unit tests only. Baseline-aware scoring (“this candidate ran fewer tests than the others; penalise it proportionally”) is not yet implemented.
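The pass ratio is small enough to sketch directly; the function name is hypothetical, and only unit-test counts are considered, as described above.

```python
def unit_test_score(passed: int, total: int) -> float:
    """Return passed / total, or 1.0 when no tests exist."""
    if total <= 0:
        # A test-free area is not penalised.
        return 1.0
    return passed / total
```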

Diff size — linear penalty, capped at 2000 lines

Smaller diffs score higher. The decay is linear up to a cap of 2000 lines; anything above the cap scores 0 for this signal. A zero-line diff is suspicious (the candidate may have committed nothing) and gets a middling 0.5 so charmlint and readiness decide the winner rather than rewarding inaction.

The diff is taken against the worktree’s base_sha via git diff --numstat base_sha..HEAD, so only committed changes count. Binary files are skipped. Git errors fall through to (0, 0) rather than crashing — a broken measurement shouldn’t sink a race.
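A sketch of both halves of this signal, under stated assumptions: diff size is taken as added plus deleted lines, the linear decay is 1 - lines / 2000, and the parser works on the text output of git diff --numstat (tab-separated added, deleted, path; binary files report "-" for both counts). Function names are illustrative.

```python
DIFF_CAP_LINES = 2000  # cap from the text; the constant name is an assumption


def parse_numstat(output):
    """Sum (added, deleted) over `git diff --numstat` output, skipping binaries."""
    added = deleted = 0
    for line in output.splitlines():
        parts = line.split("\t")
        if len(parts) < 3:
            continue  # blank or malformed line
        if parts[0] == "-" or parts[1] == "-":
            continue  # binary file: numstat prints "-" for both counts
        try:
            file_added, file_deleted = int(parts[0]), int(parts[1])
        except ValueError:
            continue  # unparsable counts are ignored rather than raising
        added += file_added
        deleted += file_deleted
    return added, deleted


def diff_size_score(diff_lines):
    """Smaller diffs score higher; an empty diff gets a middling 0.5."""
    if diff_lines <= 0:
        return 0.5
    if diff_lines >= DIFF_CAP_LINES:
        return 0.0
    return 1.0 - diff_lines / DIFF_CAP_LINES
```

With this shape, a 500-line diff scores 0.75, and a measurement that fell through to (0, 0) lands on the same 0.5 as a genuinely empty diff.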

Viability and tie-breaking

A candidate’s ExitState short-circuits the rubric before subscores are combined:

Ties break on lower diff_lines (smaller change wins) and then on lexicographic candidate_id, so repeated races with the same pool produce the same winner when the underlying measurements agree. An is_perfect threshold of 0.999 exists as a hook for early cancellation (RaceConfig.cancel_on_perfect), but early cancel isn’t implemented yet: the coordinator waits for every candidate.
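The ordering can be expressed as a single sort key. This is a sketch only: the Scored record is a hypothetical stand-in for whatever the coordinator stores per candidate, and viability handling is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scored:
    """Hypothetical per-candidate record used only for this illustration."""
    candidate_id: str
    total: float
    diff_lines: int


def pick_winner(candidates):
    """Highest total wins; ties break on smaller diff, then candidate_id."""
    if not candidates:
        return None
    return min(
        candidates,
        key=lambda c: (-c.total, c.diff_lines, c.candidate_id),
    )
```

Because every component of the key is deterministic, shuffling the input pool does not change the winner.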

RaceConfig and cost gates

RaceConfig is the opt-in surface. The default disables racing entirely — enabled_categories is an empty frozenset, so should_race always returns False and the executor falls through to a single-subagent run.

enabled_categories (default: empty)
The TaskCategory values that are allowed to race. Typical values are {BUILD, DESIGN}: objectively measurable work where Best-of-N pays off.
max_candidates (default: 3)
Upper bound on race width. clamp_candidates trims any pool larger than this. A setting of 0 or less disables racing even for enabled categories.
budget_tokens (default: 500 000)
Hard cap on estimated total tokens. Races whose pre-run estimate exceeds this budget downgrade silently to a single-subagent run. Set to 0 or a negative value to disable the cap.
confirm_threshold_tokens (default: 200 000)
Soft gate. Estimates above this threshold but within the hard budget surface a CONFIRM task so you can approve or decline the spend. At the default baseline_tokens_per_run of 75 000, a three-way race (225 000 tokens) fires the gate and a two-way race (150 000 tokens) does not.
baseline_tokens_per_run (default: 75 000)
Per-candidate token estimate used to multiply out the pre-race cost. Deliberately conservative, erring on the high side, so the CONFIRM gate fires early for racy tasks. Once streaming-usage aggregation lands, mid-flight accounting will replace this static estimate.
cancel_on_perfect (default: True)
Reserved for early cancellation when a candidate hits the perfect-score threshold. Not yet implemented; the coordinator waits for every candidate today.
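The fields and defaults above can be pictured as a frozen dataclass. This is a sketch, not the real class: RaceConfigSketch, the two-member TaskCategory enum, and the method signatures are assumptions for illustration, while the field names and default values come from the list above.

```python
from dataclasses import dataclass
from enum import Enum


class TaskCategory(Enum):
    """Stand-in enum; only the two categories named in the text."""
    BUILD = "build"
    DESIGN = "design"


@dataclass(frozen=True)
class RaceConfigSketch:
    enabled_categories: frozenset = frozenset()
    max_candidates: int = 3
    budget_tokens: int = 500_000
    confirm_threshold_tokens: int = 200_000
    baseline_tokens_per_run: int = 75_000
    cancel_on_perfect: bool = True  # reserved; early cancel is not implemented

    def should_race(self, category):
        # Empty enabled_categories (the default) means racing is off.
        return self.max_candidates > 0 and category in self.enabled_categories

    def clamp_candidates(self, pool):
        # Trim any pool wider than max_candidates.
        return list(pool)[: max(self.max_candidates, 0)]
```

A caller opting in would construct something like `RaceConfigSketch(enabled_categories=frozenset({TaskCategory.BUILD}))`; the default instance never races.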

The three-way gate

At dispatch time the executor classifies every would-be race into one of three outcomes:

| Outcome | Condition | What happens |
| --- | --- | --- |
| RACE | Estimate ≤ confirm_threshold_tokens | Race runs silently |
| CONFIRM | Threshold < estimate ≤ budget_tokens | A CONFIRM task gates the parent; reply yes or no |
| DOWNGRADE | Estimate > budget_tokens or user declined | Falls through to a single-subagent run |

User decisions persist on the task (task.race_decision) so a task that re-enters the executor for any reason is not re-prompted. The CONFIRM task id is race-confirm-<parent-task-id>; the executor reuses an existing CONFIRM rather than creating duplicates.
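The gate logic can be sketched as a pure function. Assumptions are labelled in the comments: the estimate is candidates × baseline_tokens_per_run, the stored decision is modelled as an optional boolean (the actual shape of task.race_decision is not specified here), and the function names are hypothetical.

```python
from enum import Enum
from typing import Optional


class GateOutcome(Enum):
    RACE = "race"
    CONFIRM = "confirm"
    DOWNGRADE = "downgrade"


def classify_race(
    n_candidates: int,
    baseline_tokens_per_run: int = 75_000,
    confirm_threshold_tokens: int = 200_000,
    budget_tokens: int = 500_000,
    approved: Optional[bool] = None,  # assumed stand-in for task.race_decision
) -> GateOutcome:
    estimate = n_candidates * baseline_tokens_per_run
    # Hard cap first; a budget of 0 or less disables the cap.
    if budget_tokens > 0 and estimate > budget_tokens:
        return GateOutcome.DOWNGRADE
    if estimate <= confirm_threshold_tokens:
        return GateOutcome.RACE
    # Soft gate: a persisted decision means the user is not re-prompted.
    if approved is True:
        return GateOutcome.RACE
    if approved is False:
        return GateOutcome.DOWNGRADE
    return GateOutcome.CONFIRM


def confirm_task_id(parent_task_id: str) -> str:
    """Build the CONFIRM task id in the race-confirm-<parent-task-id> form."""
    return f"race-confirm-{parent_task_id}"
```

With the defaults, a three-way race estimates 225 000 tokens and lands in CONFIRM until a decision is recorded.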

Blind A/B Arena

/arena <prompt> sends the same prompt to both the primary and light providers concurrently, shuffles the two replies into labels A and B (hiding model names), and asks you to pick. Responses are capped at 2 000 tokens so the A/B block stays readable side-by-side.

Recognised replies are forgiving and case-insensitive:

Unrecognised replies fall through to normal chat — you aren’t locked out of talking to the agent while an arena is pending. The TUI, CLI, and Web frontends all intercept pending picks before routing the reply to the LLM.

Picks and ties write a fact memory at global scope (so the preference carries across charms), tagged arena and model-preference, with source="arena" and an arena-preference-<8-hex> title. The body names both models and includes a 200-character excerpt of the prompt so the preference is attributable to a specific ask. skip clears the session without writing. See the memory how-to for the full memory model and the CLI reference for the exact command syntax.

Arena refuses to start when both sides would resolve to the same (provider, model) pair — a blind A/B against identical configurations produces no signal and wastes tokens. It also requires a configured light provider (--light-provider or CANTRIP_LIGHT_PROVIDER).
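A minimal sketch of the blinding step and the start check, assuming each side is represented as a simple tuple. The function names and data shapes are hypothetical and do not reflect the cantrip.agent.arena API.

```python
import random
from typing import Optional, Tuple


def can_start(primary: Tuple[str, str], light: Optional[Tuple[str, str]]) -> bool:
    """Each side is a (provider, model) pair; identical pairs give no signal."""
    return light is not None and primary != light


def blind_pair(primary, light, rng=None):
    """Shuffle two (model_name, reply) tuples into labels A and B.

    Returns (shown, key): `shown` maps label -> reply text for display,
    `key` maps label -> model name and stays hidden until a pick is made.
    """
    rng = rng or random.Random()
    entries = [primary, light]
    rng.shuffle(entries)
    labels = ("A", "B")
    shown = {label: reply for label, (_, reply) in zip(labels, entries)}
    key = {label: model for label, (model, _) in zip(labels, entries)}
    return shown, key
```

Keeping the label-to-model key separate from what is displayed is what makes the comparison blind: the pick is resolved against the key only after the reply arrives.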

Oracle consults

oracle_consult is a tool the agent calls during a session when it hits a hard, judgement-shaped question that the docs cannot settle on their own. The tool sends a single focused question — plus a compact context bundle (active charm, caller-supplied hint, last few messages) — to a stronger reasoning model and returns the answer. The main session keeps running on its current model.

The intended uses are deliberately narrow:

Not for syntax lookups or routine implementation steps — the docs and the active skill cover those without paying the oracle tax. The system prompt names both lists so the agent reaches for the oracle only when the call is justified.

Defaults

| Knob | Default |
| --- | --- |
| Provider | claude |
| Model | claude-opus-4-7 |
| Reasoning budget | 8000 tokens |
| Output cap | 4096 tokens |
| Temperature | 0.2 |
| Per-turn call cap | 1 |
| Per-session cost cap | $2 |

The provider and model can be overridden per-session via state.oracle_provider_name and state.oracle_model. The caps live on AgentState too — raise them when a session genuinely benefits from more consults, not as a blanket policy.

Budget model

Two caps protect the session:

Either cap returns a structured tool error the agent sees and explains in its summary. No half-state: a refused call leaves the counters untouched.
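One way to picture the check-then-record flow, using the per-turn call cap and per-session cost cap from the defaults table. The meter names mirror the oracle_consult event fields; the class, function names, error payload shape, and the use of >= comparisons are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OracleMeters:
    calls_this_turn: int = 0
    calls_total: int = 0
    session_cost_usd: float = 0.0


def check_caps(
    meters: OracleMeters,
    per_turn_call_cap: int = 1,
    session_cost_cap_usd: float = 2.0,
) -> Optional[dict]:
    """Return a structured refusal, or None when the call may proceed.

    The check never mutates the meters, so a refused call leaves them untouched.
    """
    if meters.calls_this_turn >= per_turn_call_cap:
        return {"error": "per_turn_call_cap", "limit": per_turn_call_cap}
    if meters.session_cost_usd >= session_cost_cap_usd:
        return {"error": "session_cost_cap", "limit": session_cost_cap_usd}
    return None


def record_call(meters: OracleMeters, cost_usd: float) -> None:
    """Update the meters only after a consult actually ran."""
    meters.calls_this_turn += 1
    meters.calls_total += 1
    meters.session_cost_usd += cost_usd
```

Separating the read-only check from the mutation is a simple way to get the no-half-state property: counters move only on the success path.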

Why not just compaction-aware in-line context?

Oracle is not a replacement for thoughtful prompting on the primary model. The pattern earns its keep when the agent has already spent context on the problem and a fresh heavyweight reading produces a better answer than yet more turns on the running session. Picking it for routine work would be expensive and pointless; picking it for one well-formed architecture question is the whole point.

The oracle’s answer does not enter state.messages. It comes back as a tool result, which means: (a) the main context window stays focused on the work in progress; (b) the agent must restate the recommendation in its next text reply rather than silently quoting the oracle. The transcript records the full exchange (question, context hint, answer, usage, cost) as an oracle_consult event so audits lose nothing.

Transcript events

Races and oracle consults emit structured events alongside the regular task updates. They land in the session transcript so a reviewer can reconstruct what happened after the fact.

race_confirm_requested
Emitted when the soft gate fires. Payload carries task_id, confirm_task_id, estimate_tokens, threshold_tokens, and the candidate id list.
race_downgraded
Emitted when a would-be race runs as a single subagent instead. The reason field is either over_budget (hard cap) or user_declined (answered no to a CONFIRM). Over-budget downgrades include estimate_tokens and budget_tokens so you can see why.
race_finished
One row per race. Carries the winner’s candidate_id and score, the candidate list, and elapsed_s. Empty winner fields mean every candidate failed.
race_candidate
One row per candidate, winner or loser. Includes the candidate’s exit_state, total score, and transcript_task_id. The transcript task id is <parent_task_id>__<candidate_id> — join against subagent_messages on that key to read any loser’s full tool-call trace, not just the winner’s.
oracle_consult
One row per Oracle call. Carries provider, model, the verbatim question and context_hint, the answer, the response usage dict, and cost_usd. calls_this_turn, calls_total, and session_cost_usd capture the budget meters at the moment of the call so an auditor can reconstruct cap-trip events without replaying the full session.
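To read a losing candidate's trace, the join key from race_candidate can be rebuilt and matched against subagent_messages. The sketch below assumes the messages are available as a list of dicts with a task_id field; that row shape and the function names are hypothetical, and only the <parent_task_id>__<candidate_id> format comes from the event description.

```python
def candidate_transcript_id(parent_task_id: str, candidate_id: str) -> str:
    """Build the transcript task id: <parent_task_id>__<candidate_id>."""
    return f"{parent_task_id}__{candidate_id}"


def messages_for_candidate(rows, parent_task_id, candidate_id):
    """Filter subagent message rows down to one candidate's trace.

    `rows` is assumed to be an iterable of dicts carrying a "task_id" key.
    """
    key = candidate_transcript_id(parent_task_id, candidate_id)
    return [row for row in rows if row.get("task_id") == key]
```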

Current limits and planned work

See also: