TokenJam proof — email-assistant

tokenjam 0.5.2   n=12 tasks · k=1 sample(s) · anthropic:claude-opus-4-7 → anthropic:claude-haiku-4-5

Switching looks safe

We ran the same 12 email-assistant task(s) through both models and graded every answer with the same automated judge. The cheaper candidate (claude-haiku-4-5) cost 85% less than claude-opus-4-7, and scored about the same on this suite — 100% of tasks passed before, 100% after.

No statistically significant quality drop was detected, so moving to the cheaper model is a reasonable way to cut cost.

-85%
cheaper to run (measured API $)
+0 pts
accuracy vs the original model
0 worse · 0 better
tasks changed by the swap (of 12)

How each task did

TaskOriginalCheaper modelWhy — the judge’s reason
Reply reschedulepasspassThe actual output is professional and aligns well with the expected output. It i graded 0.90 (needs 0.50)
Reply refund firmpasspassThe actual output aligns well with the expected output in terms of factual accur graded 0.90 (needs 0.50)
Summarize threadpasspassThe actual output closely aligns with the expected output, accurately reflecting graded 0.90 (needs 0.50)
Triage prioritypasspassThe actual output is factually accurate and semantically equivalent to the expec graded 0.90 (needs 0.50)
Triage spampasspassThe actual output closely aligns with the expected output, accurately identifyin graded 0.90 (needs 0.50)
Extract meetingpasspassThe actual output accurately reflects the factual information from the expected graded 0.80 (needs 0.50)
Extract actionspasspassThe actual output closely matches the expected output in terms of factual inform graded 0.80 (needs 0.50)
Followup noreplypasspassThe actual output closely aligns with the expected output in terms of semantic m graded 0.90 (needs 0.50)
Reply decline meetingpasspassThe actual output contains all the factual information present in the expected o graded 0.90 (needs 0.50)
Summarize execpasspassThe actual output closely aligns with the expected output in terms of factual in graded 0.90 (needs 0.50)
Triage routepasspassThe actual output aligns well with the expected output in terms of factual accur graded 0.90 (needs 0.50)
Reply sensitive datapasspassThe actual output accurately conveys the key points of the expected output, such graded 0.80 (needs 0.50)

The statistics behind it

Verdict: No significant regression  ·  McNemar p=1.000 (α=0.05)  ·  candidate chosen by explicit --candidate override

McNemar’s test asks whether the difference between the two models is bigger than chance: a p-value above 0.05 means the change is not statistically significant. The 95% CI on the pass-rate delta is the range the true difference is likely in — if it crosses zero, the direction isn’t certain.

+0.0pp
pass-rate delta [95% CI +0.0, +0.0]

Pass rate (95% CI whiskers)

Original12/12 (100%)
Candidate12/12 (100%)

Cost (measured)

Original$0.050835
Candidate$0.007694

How to read this

Generated 2026-06-26 13:38 · tokenjam-bench · proof report