Score Cantrip against an eval spec
Use the bundled eval runner to drive Cantrip print-mode against a spec, score the result, and compare providers.
The eval suite under tests/eval/charms/ ships YAML specs that describe
what a charm should do (the prompt) and how to judge the result (the
rubric). Each spec sits beside zero or more gold-standard subdirectories
(gold-claude, gold-gemini, ...) plus any charm directories Cantrip
itself produced. The runner has five CLI verbs:
- validate: every gold standard must score 100 %.
- score: score one charm directory against its spec's rubric.
- generate: drive cantrip run --print against the spec and emit a fresh charm directory.
- run: generate then score in one invocation.
- compare: format the rubric scores of several charm directories in a single table.
Prerequisites
- A Cantrip checkout with uv sync --dev already run.
- The provider's API key exported in your shell (ANTHROPIC_API_KEY, GEMINI_API_KEY, FIREWORKS_API_KEY, or OPENROUTER_API_KEY), same as for any normal cantrip run invocation. See How to choose a provider for the full env-var catalogue.
- The cantrip command on your PATH. Inside the project venv this is uv run cantrip; outside, install Cantrip first.
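The prerequisites above can be checked before spending any tokens. The following is a minimal sketch, not part of the runner: the function name preflight and the provider-to-variable mapping are assumptions made for illustration, and only the variable names themselves come from the list above.

```python
import os
import shutil

# Assumed mapping from --provider value to API-key variable (illustrative only).
PROVIDER_ENV_VARS = {
    "claude": "ANTHROPIC_API_KEY",
    "gemini": "GEMINI_API_KEY",
    "fireworks": "FIREWORKS_API_KEY",
    "openrouter": "OPENROUTER_API_KEY",
}


def preflight(provider, env=None, which=shutil.which):
    """Return a list of human-readable problems; an empty list means ready."""
    env = os.environ if env is None else env
    problems = []
    var = PROVIDER_ENV_VARS.get(provider)
    if var is None:
        problems.append(f"unknown provider: {provider}")
    elif not env.get(var):
        problems.append(f"{var} is not set")
    # The cantrip executable must be resolvable on PATH.
    if which("cantrip") is None:
        problems.append("cantrip is not on PATH (try `uv run cantrip`)")
    return problems
```

Passing env and which as parameters keeps the check testable without touching the real environment.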
Score one provider end to end
run is the verb Phase 79.4 added so that a single command produces a
scored charm:
uv run python -m tests.eval.runner run \
tests/eval/charms/ntfy \
--provider claude \
--model opus-4.7
This:
- Picks a fresh subdirectory of tests/eval/charms/ntfy/. The naming convention is cantrip-<provider>-<model-slug>-<YYYYMMDD-HHMMSS>, so re-runs of the same model land in different directories without colliding with the gold standards.
- Shells out to cantrip run --print "<spec.prompt>" <charm-dir> --provider <X> --model <Y> --yolo. --yolo is the default because print-mode refuses to start when there are pending CONFIRM tasks and an unattended eval has no way to answer them. --no-tui is implied by --print itself.
- Hands the resulting charm directory to the rubric scorer and prints a Markdown report.
- Exits non-zero if the run produced any critical-severity failure, so CI invocations fail loudly.
A failed run that left no artefacts behind exits without scoring; the shell command Cantrip attempted is included in the error so you can re-run it interactively to see what happened.
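To make the directory naming and the shell-out concrete, here is a rough sketch of how the two could be assembled. Both function names are hypothetical, and the slug rule (lowercase, other characters collapsed to "-", "default" when no model is given) is an assumption inferred from directory names such as cantrip-gemini-default-20260509-123045; the real runner may slug differently.

```python
import re
from datetime import datetime


def charm_dir_name(provider, model=None, now=None):
    """Build cantrip-<provider>-<model-slug>-<YYYYMMDD-HHMMSS>."""
    now = now or datetime.now()
    # Assumed slug rule: lowercase, keep [a-z0-9.], collapse everything else to "-".
    slug = re.sub(r"[^a-z0-9.]+", "-", (model or "default").lower()).strip("-")
    return f"cantrip-{provider}-{slug}-{now:%Y%m%d-%H%M%S}"


def print_mode_argv(prompt, charm_dir, provider, model=None):
    """Assemble the cantrip command line described above as an argv list."""
    argv = ["cantrip", "run", "--print", prompt, str(charm_dir), "--provider", provider]
    if model:
        argv += ["--model", model]
    argv.append("--yolo")  # unattended runs cannot answer CONFIRM tasks
    return argv
```

Building an argv list rather than a shell string avoids quoting problems when the spec prompt contains spaces or quotes.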
Generate without scoring
generate runs only the print-mode step, which is what you want when
debugging the agent itself rather than measuring rubric coverage:
uv run python -m tests.eval.runner generate \
tests/eval/charms/ntfy \
--provider gemini
Pair it with score once you've inspected the result:
uv run python -m tests.eval.runner score \
tests/eval/charms/ntfy \
tests/eval/charms/ntfy/cantrip-gemini-default-20260509-123045
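When scoring after the fact, the directory to pass to score is usually the most recent run. A small helper along these lines could locate it; latest_run is a hypothetical name, and the sketch assumes every run directory ends in the 15-character YYYYMMDD-HHMMSS timestamp from the naming convention.

```python
from pathlib import Path


def latest_run(spec_dir, provider):
    """Return the newest cantrip-<provider>-* directory, or None if there is none."""
    runs = [p for p in Path(spec_dir).glob(f"cantrip-{provider}-*") if p.is_dir()]
    # Compare on the trailing YYYYMMDD-HHMMSS only, so different model slugs
    # do not influence the ordering.
    return max(runs, key=lambda p: p.name[-15:], default=None)
```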
Compare providers side by side
Once two or more providers have generated charms, the compare verb
formats their rubric scores in a single table:
uv run python -m tests.eval.runner compare \
tests/eval/charms/ntfy \
tests/eval/charms/ntfy/gold-claude \
tests/eval/charms/ntfy/cantrip-gemini-default-20260509-123045
compare reads the same rubric file and produces both an overall and
per-category breakdown plus the failure list per run. Gold standards
score 100 % by definition, so a real-provider run sitting next to the
gold for the same charm is the cleanest way to read regression deltas.
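Reading regression deltas amounts to subtracting the gold score from the run score per category. The sketch below illustrates that arithmetic only; the dictionary shape, the function names, and the category names used with it are all hypothetical and do not reflect the runner's actual data model.

```python
def regression_deltas(gold, run):
    """Map each rubric category to run score minus gold score (percentage points)."""
    return {cat: run.get(cat, 0.0) - score for cat, score in gold.items()}


def format_rows(deltas):
    """Render one 'category delta' line per category, worst regression first."""
    ordered = sorted(deltas.items(), key=lambda kv: kv[1])
    return [f"{cat:<20}{delta:+.1f}" for cat, delta in ordered]
```

Because the gold scores 100 % everywhere, every delta is zero or negative, and the first row is the category that needs attention.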
Add a baseline directory
Phase 79.4 commits to growing gold-gemini / gold-fireworks /
gold-openrouter baselines over time. The recipe:
- Run tests/eval/runner.py generate against the spec with the new provider.
- Inspect the output, hand-tune any sharp edges, and rename the directory to gold-<provider> (e.g. gold-gemini).
- Re-run tests/eval/runner.py validate. If the new gold scores 100 %, commit it; otherwise iterate on the rubric or the charm until it does.
- Add the resulting directory to the spec's containing folder; the runner picks it up automatically (no spec-file edits required).
Gold standards are checked into the repo so the rubric continues to score deterministically without any provider call.
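The rename step of the recipe can be scripted so that an existing baseline is never overwritten by accident. This is a sketch under that assumption; promote_to_gold is a hypothetical helper, not something the runner provides.

```python
from pathlib import Path


def promote_to_gold(run_dir, provider):
    """Rename a generated charm directory to gold-<provider> beside it."""
    run_dir = Path(run_dir)
    target = run_dir.with_name(f"gold-{provider}")
    if target.exists():
        # Refuse to clobber an existing baseline; review and remove it by hand first.
        raise FileExistsError(f"{target} already exists")
    return run_dir.rename(target)
```

After promoting, re-run validate as described in the recipe before committing.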
Drive the runner from a script
tests.eval.runner.generate_and_score is the public entry point if you
want to integrate the loop into a Python harness:
import pathlib
import sys

from tests.eval.runner import generate_and_score
from tests.eval.spec import EvalSpec

# Load the spec from its directory, then generate and score in one call.
spec_dir = pathlib.Path("tests/eval/charms/ntfy")
spec = EvalSpec.load(spec_dir)
generation, result = generate_and_score(
    spec,
    spec_dir,
    provider="claude",
    model="opus-4.7",
)
# Mirror the CLI: exit non-zero on any critical-severity failure.
if result is not None and result.critical_failures:
    sys.exit(1)
Tests inject a fake runner callable (a stub subprocess.run) so that
exercising the harness never burns tokens; see
tests/eval/test_runner_generate.py for the pattern.
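The general shape of such a stub is sketched below. It is self-contained and deliberately does not call into tests.eval.runner: run_print_mode and its runner parameter are hypothetical stand-ins, since the actual injection point is defined by the test file named above.

```python
import subprocess


def run_print_mode(argv, runner=subprocess.run):
    """Hypothetical stand-in for the generate step: run argv, return the exit code."""
    completed = runner(argv, capture_output=True, text=True)
    return completed.returncode


class FakeRunner:
    """Records calls and returns a canned CompletedProcess instead of spawning anything."""

    def __init__(self, returncode=0):
        self.calls = []
        self.returncode = returncode

    def __call__(self, argv, **kwargs):
        self.calls.append(list(argv))
        return subprocess.CompletedProcess(argv, self.returncode, stdout="", stderr="")
```

A test can then assert on fake.calls to check the exact command line without starting a process or contacting a provider.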
Ablate the system prompt to find the load-bearing sections
Once a per-provider smoke gate is in place (see Phase 79.2 / 79.3), the
next question is which sections of the system prompt actually pull
their weight. tests/eval/ablate.py is the harness that answers it:
it drops each top-level ## Section of the rendered prompt one at a
time, reruns the same two smoke invariants the gate uses, and prints a
table showing where each ablation regresses.
uv run python -m tests.eval.ablate \
--provider openrouter \
--model openai/gpt-4o-mini
Output is a fixed-width report with one row per section plus a
(baseline) row for the unmodified prompt:
section                              tool_call  non_empty  delta
------------------------------------ ---------- ---------- -----------------
(baseline)                           ✓          ✓
Your Purpose                         ✓          ✓          no change
Tool Bundles                         ✗          ✓          -tool_call
Task Planning                        ✓          ✓          no change
…
+tool_call / -non_empty etc. report which invariants flipped from
the baseline; err: … means the provider call itself failed (cell
shows as ?) and the regression should not be blamed on that section.
The harness exits non-zero when at least one ablation lost a passing invariant — useful as a lightweight regression hint for future prompt-tuning sessions.
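The delta column and the exit status follow from comparing each row with the baseline. The sketch below shows one way to express that logic; it is not the harness's code, both function names are hypothetical, and a failed provider call is modelled as None and simply skipped rather than rendered as err: ….

```python
def delta(baseline, ablated):
    """Describe which invariants flipped relative to the baseline row."""
    flips = []
    for name, before in baseline.items():
        after = ablated.get(name)
        if after is None:
            continue  # provider error ("?" cell): not attributable to the section
        if before and not after:
            flips.append(f"-{name}")
        elif after and not before:
            flips.append(f"+{name}")
    return " ".join(flips) or "no change"


def exit_code(baseline, ablations):
    """Return 1 when any ablation lost an invariant the baseline passed."""
    lost = any(
        baseline[name] and row.get(name) is False
        for row in ablations.values()
        for name in baseline
    )
    return 1 if lost else 0
```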
Cost is bounded: (~30 sections + 1 baseline) × 2 invariants ≈ 62 model
calls, which is pennies on a cheap model. --list-sections prints
the parsed section names and exits without any provider calls — handy
for sanity-checking the parser after a prompt rewrite.
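For intuition, section parsing and one-at-a-time ablation can be sketched in a few lines. This is an illustrative approximation rather than the code in tests/eval/ablate.py: the function names are hypothetical, and the regex assumes every top-level section starts with a line of the form "## Title" (so a "## " line inside a fenced block would be miscounted).

```python
import re

# Matches a top-level "## Title" line; "### ..." does not match because the
# third character must be a space.
_HEADING = re.compile(r"^## (.+?)[ \t]*$", re.MULTILINE)


def list_sections(prompt):
    """Return the titles of the top-level '## ' sections, in order."""
    return [m.group(1) for m in _HEADING.finditer(prompt)]


def ablations(prompt):
    """Yield (title, prompt_without_that_section) for each top-level section."""
    matches = list(_HEADING.finditer(prompt))
    for i, m in enumerate(matches):
        # A section runs from its heading to the next top-level heading (or the end).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(prompt)
        yield m.group(1), prompt[: m.start()] + prompt[end:]
```

With two invariants per variant, the number of model calls is (len(list_sections(prompt)) + 1) * 2, which matches the cost estimate above.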