Score Cantrip against an eval spec

Use the bundled eval runner to drive Cantrip print-mode against a spec, score the result, and compare providers.

The eval suite under tests/eval/charms/ ships YAML specs that describe what a charm should do (the prompt) and how to judge the result (the rubric). Each spec sits beside zero or more gold-standard subdirectories (gold-claude, gold-gemini, ...) plus any charm directories Cantrip itself produced. The runner has four CLI verbs:

Prerequisites

Score one provider end to end

run is the verb Phase 79.4 added so that a single command produces a scored charm:

uv run python -m tests.eval.runner run \
    tests/eval/charms/ntfy \
    --provider claude \
    --model opus-4.7

This:

  1. Picks a fresh subdirectory of tests/eval/charms/ntfy/ — the naming convention is cantrip-<provider>-<model-slug>-<YYYYMMDD-HHMMSS> so re-runs of the same model land in different directories without colliding with the gold standards.
  2. Shells out to cantrip run --print "<spec.prompt>" <charm-dir> --provider <X> --model <Y> --yolo. --yolo is the default because print-mode refuses to start when there are pending CONFIRM tasks and an unattended eval has no way to answer them. --no-tui is implied by --print itself, so you do not need to pass it.
  3. Hands the resulting charm directory to the rubric scorer and prints a Markdown report.
  4. Exits non-zero if the run produced any critical-severity failure, so CI invocations fail loudly.
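
The directory naming from step 1 and the command line from step 2 can be sketched as follows. This is a minimal illustration, not the runner's actual code: the helper names fresh_charm_dir_name and build_cantrip_command are hypothetical, and both the slug rule and the "default" fallback for a missing model are assumptions (the fallback is inferred from the cantrip-gemini-default-20260509-123045 path used elsewhere in this guide).

```python
import re
from datetime import datetime
from typing import List, Optional


def fresh_charm_dir_name(provider: str, model: Optional[str], now: datetime) -> str:
    """Build cantrip-<provider>-<model-slug>-<YYYYMMDD-HHMMSS>."""
    # Assumption: characters outside [a-z0-9.] collapse to "-", and a missing
    # model is written as "default". The real slug rule may differ.
    slug = re.sub(r"[^a-z0-9.]+", "-", (model or "default").lower()).strip("-")
    return f"cantrip-{provider}-{slug}-{now.strftime('%Y%m%d-%H%M%S')}"


def build_cantrip_command(prompt: str, charm_dir: str, provider: str,
                          model: Optional[str], yolo: bool = True) -> List[str]:
    """Assemble the print-mode invocation from step 2 as an argv list."""
    cmd = ["cantrip", "run", "--print", prompt, charm_dir, "--provider", provider]
    if model:
        # Assumption: --model is simply omitted when no model was requested.
        cmd += ["--model", model]
    if yolo:
        cmd.append("--yolo")
    return cmd


name = fresh_charm_dir_name("claude", "opus-4.7", datetime(2026, 5, 9, 12, 30, 45))
print(name)
print(build_cantrip_command("Build an ntfy charm", name, "claude", "opus-4.7"))
```

Passing the command as a list rather than a single string avoids shell quoting problems when the spec prompt contains quotes or newlines.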

A failed run that left no artefacts behind exits without scoring; the shell command Cantrip attempted is included in the error so you can re-run it interactively to see what happened.

Generate without scoring

generate runs only the print-mode step, which is what you want when debugging the agent itself rather than measuring rubric coverage:

uv run python -m tests.eval.runner generate \
    tests/eval/charms/ntfy \
    --provider gemini

Pair it with score once you've inspected the result:

uv run python -m tests.eval.runner score \
    tests/eval/charms/ntfy \
    tests/eval/charms/ntfy/cantrip-gemini-default-20260509-123045
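
Because generate creates a timestamped directory, it can help to look up the newest one before calling score. The helper below is a hypothetical convenience, not part of the runner; it assumes only the cantrip-<provider>-<model-slug>-<YYYYMMDD-HHMMSS> naming convention described above.

```python
import pathlib
import tempfile
from typing import Optional


def latest_generated_dir(spec_dir: pathlib.Path, provider: str) -> Optional[pathlib.Path]:
    """Return the newest cantrip-<provider>-* directory, or None if there is none."""
    candidates = [p for p in spec_dir.glob(f"cantrip-{provider}-*") if p.is_dir()]
    # The trailing YYYYMMDD-HHMMSS stamp is 15 characters wide, so comparing
    # that suffix as a string orders runs chronologically across model slugs.
    return max(candidates, key=lambda p: p.name[-15:], default=None)


# Demonstration against a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    spec_dir = pathlib.Path(tmp)
    for dir_name in ("gold-claude",
                     "cantrip-gemini-default-20260508-090000",
                     "cantrip-gemini-default-20260509-123045"):
        (spec_dir / dir_name).mkdir()
    newest = latest_generated_dir(spec_dir, "gemini")
    print(newest.name if newest else "no generated charm yet")
```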

Compare providers side by side

Once two or more providers have generated charms, the compare verb formats their rubric scores in a single table:

uv run python -m tests.eval.runner compare \
    tests/eval/charms/ntfy \
    tests/eval/charms/ntfy/gold-claude \
    tests/eval/charms/ntfy/cantrip-gemini-default-20260509-123045

compare reads the same rubric file and produces an overall score, a per-category breakdown, and the failure list for each run. Gold standards score 100 % by definition, so a real-provider run sitting next to the gold for the same charm is the cleanest way to read regression deltas.
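
Reading a regression delta amounts to subtracting the gold score from the run score, category by category. The sketch below shows the arithmetic only; the category names and numbers are hypothetical placeholders, not output from a real compare run, and the function is not part of the runner.

```python
from typing import Dict


def category_deltas(gold: Dict[str, float], run: Dict[str, float]) -> Dict[str, float]:
    """Per-category difference (run minus gold) in percentage points."""
    # A category missing from the run counts as 0 so that it shows up as a
    # full regression instead of silently disappearing.
    return {category: run.get(category, 0.0) - score for category, score in gold.items()}


# Hypothetical scores for illustration only.
gold_scores = {"structure": 100.0, "config": 100.0, "tests": 100.0}
run_scores = {"structure": 100.0, "config": 75.0, "tests": 50.0}

for category, delta in category_deltas(gold_scores, run_scores).items():
    print(f"{category:<12} {delta:+.1f}")
```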

Add a baseline directory

Phase 79.4 commits to growing gold-gemini / gold-fireworks / gold-openrouter baselines over time. The recipe:

  1. Run tests/eval/runner.py generate against the spec with the new provider.
  2. Inspect the output, hand-tune any sharp edges, and rename the directory to gold-<provider> (e.g. gold-gemini).
  3. Re-run tests/eval/runner.py validate — if the new gold scores 100 %, commit it; otherwise iterate on the rubric or the charm until it does.
  4. Add the resulting directory to the spec's containing folder; the runner picks it up automatically (no spec-file edits required).

Gold standards are checked into the repo so the rubric continues to score deterministically without any provider call.
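
The rename in step 2 and the automatic pickup in step 4 can be sketched with pathlib. Both helpers are hypothetical: promote_to_gold is a convenience you might write yourself, and discover_gold_dirs only illustrates discovery by the gold-<provider> naming convention, which may differ from how the runner actually finds baselines.

```python
import pathlib
import tempfile
from typing import List


def promote_to_gold(generated: pathlib.Path, provider: str) -> pathlib.Path:
    """Rename a hand-tuned charm directory to gold-<provider>, refusing to overwrite."""
    target = generated.parent / f"gold-{provider}"
    if target.exists():
        raise FileExistsError(f"{target} already exists")
    generated.rename(target)
    return target


def discover_gold_dirs(spec_dir: pathlib.Path) -> List[pathlib.Path]:
    """List the gold-* subdirectories sitting beside the spec, sorted by name."""
    return sorted(p for p in spec_dir.glob("gold-*") if p.is_dir())


# Demonstration against a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    spec_dir = pathlib.Path(tmp)
    (spec_dir / "gold-claude").mkdir()
    generated = spec_dir / "cantrip-gemini-default-20260509-123045"
    generated.mkdir()
    promote_to_gold(generated, "gemini")
    print([p.name for p in discover_gold_dirs(spec_dir)])
```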

Drive the runner from a script

tests.eval.runner.generate_and_score is the public entry point if you want to integrate the loop into a Python harness:

import pathlib
import sys

from tests.eval.runner import generate_and_score
from tests.eval.spec import EvalSpec

spec = EvalSpec.load(pathlib.Path("tests/eval/charms/ntfy"))
generation, result = generate_and_score(
    spec,
    pathlib.Path("tests/eval/charms/ntfy"),
    provider="claude",
    model="opus-4.7",
)

# result may be None, so guard before reading critical_failures.
# Mirror the CLI: exit non-zero on any critical-severity failure.
if result is not None and result.critical_failures:
    sys.exit(1)

Tests inject a fake runner callable (a stub subprocess.run) so that exercising the harness never burns tokens; see tests/eval/test_runner_generate.py for the pattern.
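
A stub of that kind might look like the sketch below. It is an illustration under stated assumptions, not a copy of the test file: the FakeRunner class, the README.md artefact it writes, and the rule that the charm directory is the argument right after the prompt are all hypothetical. How the callable is handed to the harness is defined by the runner itself, so check tests/eval/test_runner_generate.py for the real wiring.

```python
import pathlib
import subprocess
import tempfile
from typing import List


class FakeRunner:
    """Stand-in for subprocess.run that records calls instead of spawning Cantrip."""

    def __init__(self) -> None:
        self.calls: List[List[str]] = []

    def __call__(self, cmd: List[str], **kwargs) -> subprocess.CompletedProcess:
        self.calls.append(list(cmd))
        # Assumption: argv follows `cantrip run --print <prompt> <charm-dir> ...`,
        # so the charm directory sits two positions after --print.
        charm_dir = pathlib.Path(cmd[cmd.index("--print") + 2])
        charm_dir.mkdir(parents=True, exist_ok=True)
        # Leave a placeholder artefact so there is something to score.
        (charm_dir / "README.md").write_text("stub charm\n")
        return subprocess.CompletedProcess(args=cmd, returncode=0, stdout="", stderr="")


# Demonstration: no subprocess is started and no provider is contacted.
with tempfile.TemporaryDirectory() as tmp:
    runner = FakeRunner()
    target = str(pathlib.Path(tmp) / "cantrip-claude-stub")
    done = runner(["cantrip", "run", "--print", "prompt", target, "--yolo"], check=False)
    print(done.returncode, len(runner.calls))
```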

Ablate the system prompt to find the load-bearing sections

Once a per-provider smoke gate is in place (see Phase 79.2 / 79.3), the next question is which sections of the system prompt actually pull their weight. tests/eval/ablate.py is the harness that answers it: it drops each top-level ## Section of the rendered prompt one at a time, reruns the same two smoke invariants the gate uses, and prints a table showing where each ablation regresses.

uv run python -m tests.eval.ablate \
    --provider openrouter \
    --model openai/gpt-4o-mini
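
The core of the ablation, dropping one top-level ## section at a time, can be sketched as follows. This is a minimal illustration and not the code in tests/eval/ablate.py: the regular expression, the function names, and the sample prompt are assumptions.

```python
import re
from typing import List

# Matches top-level "## Title" headings only; "### Sub" headings do not match
# because the third character must be a space.
SECTION_RE = re.compile(r"^## (.+)$", re.MULTILINE)


def list_sections(prompt: str) -> List[str]:
    """Return the titles of all top-level ## sections in order."""
    return [m.group(1).strip() for m in SECTION_RE.finditer(prompt)]


def ablate(prompt: str, section: str) -> str:
    """Return the prompt with one ## section (heading and body) removed."""
    matches = list(SECTION_RE.finditer(prompt))
    for i, m in enumerate(matches):
        if m.group(1).strip() == section:
            # The section runs until the next ## heading or the end of the prompt.
            end = matches[i + 1].start() if i + 1 < len(matches) else len(prompt)
            return prompt[: m.start()] + prompt[end:]
    raise KeyError(section)


# Hypothetical prompt for illustration only.
PROMPT = """Preamble line.
## Your Purpose
Help the user.
## Tool Bundles
Use tools.
### Details
More.
## Task Planning
Plan first.
"""

for title in list_sections(PROMPT):
    print(title, len(ablate(PROMPT, title)))
```

Nested ### subsections travel with their parent, so removing a section removes everything up to the next ## heading.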

Output is a fixed-width report with one row per section plus a (baseline) row for the unmodified prompt:

section                               tool_call   non_empty   delta
------------------------------------  ----------  ----------  -----------------
(baseline)                            ✓           ✓
Your Purpose                          ✓           ✓           no change
Tool Bundles                          ✗           ✓           -tool_call
Task Planning                         ✓           ✓           no change
…

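Entries such as +tool_call or -non_empty in the delta column name the invariants that flipped relative to the baseline: + marks one that started passing, - marks one that stopped. err: … means the provider call itself failed (the cell shows as ?), so the regression should not be blamed on that section.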

The harness exits non-zero when at least one ablation lost a passing invariant — useful as a lightweight regression hint for future prompt-tuning sessions.
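
The delta column and the exit status described above can be sketched like this. The function names and the way several flips are joined are assumptions, not the harness's actual formatting; the sample rows mirror the table shown earlier.

```python
from typing import Dict, List, Optional, Tuple

Row = Dict[str, bool]  # invariant name -> passed?


def flipped(baseline: Row, ablated: Row) -> Tuple[List[str], List[str]]:
    """Return (gained, lost) invariant names relative to the baseline."""
    gained = [name for name, ok in baseline.items() if not ok and ablated[name]]
    lost = [name for name, ok in baseline.items() if ok and not ablated[name]]
    return gained, lost


def delta_label(baseline: Row, ablated: Row, error: Optional[str] = None) -> str:
    """Format the delta cell; a failed provider call is reported instead of a flip."""
    if error is not None:
        return f"err: {error}"
    gained, lost = flipped(baseline, ablated)
    parts = [f"+{name}" for name in gained] + [f"-{name}" for name in lost]
    return " ".join(parts) if parts else "no change"


def exit_status(baseline: Row, rows: Dict[str, Row]) -> int:
    """Non-zero when any ablation lost an invariant that the baseline passed."""
    return 1 if any(flipped(baseline, row)[1] for row in rows.values()) else 0


baseline = {"tool_call": True, "non_empty": True}
rows = {
    "Your Purpose": {"tool_call": True, "non_empty": True},
    "Tool Bundles": {"tool_call": False, "non_empty": True},
}
for section, row in rows.items():
    print(f"{section:<16} {delta_label(baseline, row)}")
print("exit status:", exit_status(baseline, rows))
```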

Cost is bounded: (~30 sections + 1 baseline) × 2 invariants ≈ 62 model calls, which is pennies on a cheap model. --list-sections prints the parsed section names and exits without any provider calls, which is handy for sanity-checking the parser after a prompt rewrite.
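
The call count works out as one baseline plus one ablated prompt per section, each checked against every invariant. A small sketch of that arithmetic, assuming each invariant costs exactly one model call per prompt variant (the function name is hypothetical):

```python
def ablation_call_count(sections: int, invariants: int = 2) -> int:
    """Model calls for one sweep: every section plus the baseline, once per invariant."""
    return (sections + 1) * invariants


print(ablation_call_count(30))  # 62
```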