TokenJam proof — customer-support

tokenjam 0.5.2   n=12 tasks · k=1 sample(s) · anthropic:claude-opus-4-7 → anthropic:claude-haiku-4-5

Switching looks safe

We ran the same 12 customer-support tasks through both models and graded every answer with the same automated judge. The cheaper candidate (claude-haiku-4-5) cost 92% less than claude-opus-4-7 but scored 8 points lower on this suite: 100% of tasks passed before, 92% after.

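No statistically significant quality drop was detected (McNemar p=1.000), so moving to the cheaper model is a reasonable way to cut cost. The single regression was the Invoice request task, and with only 12 tasks the 95% CI on the pass-rate delta is wide (-24.0 to +7.3 points).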

Cost: -92% (cheaper to run, measured API $)
Accuracy: -8 pts vs the original model
Tasks changed by the swap: 1 worse · 0 better (of 12)

How each task did

Task                 Original  Cheaper model  Grade (needs 0.50)  Why (the judge's reason)
Refund window        pass      pass           0.90                The actual output accurately conveys the same factual information and meaning as
Delivery delay       pass      pass           0.90                The actual output closely matches the expected output in terms of factual inform
Wrong product        pass      pass           0.90                The actual output accurately reflects the factual information and intent of the
Password reset       pass      pass           0.90                The actual output closely aligns with the expected output in terms of factual in
Subscription cancel  pass      pass           0.90                The actual output is factually accurate and semantically equivalent to the expec
Payment failed       pass      pass           0.80                The actual output is factually accurate and semantically equivalent to the expec
Invoice request      pass      fail           0.10                The actual output does not align with the expected output in terms of factual ac
Account locked       pass      pass           0.90                The actual output is factually accurate and semantically equivalent to the expec
Shipping status      pass      pass           0.90                The actual output closely aligns with the expected output in terms of factual in
Feature request      pass      pass           1.00                The actual output matches the expected output exactly in terms of factual inform
Bug report           pass      pass           1.00                The actual output matches the expected output exactly in terms of factual accura
Escalation           pass      pass           0.90                The actual output is factually accurate and semantically equivalent to the expec
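
As a rough illustration of how the pass/fail column relates to the grades, the sketch below applies the 0.50 threshold to the grades copied from the table above. It assumes the listed grade belongs to the cheaper model's answer and that a grade equal to the threshold counts as a pass; the report states neither, and the names `PASS_THRESHOLD`, `candidate_scores` and `passed` are hypothetical.

```python
# Grades copied from the table above, in row order.
# Assumption: each grade is the judge's score for the cheaper model's answer.
PASS_THRESHOLD = 0.50  # "needs 0.50" in the report

candidate_scores = {
    "Refund window": 0.90,
    "Delivery delay": 0.90,
    "Wrong product": 0.90,
    "Password reset": 0.90,
    "Subscription cancel": 0.90,
    "Payment failed": 0.80,
    "Invoice request": 0.10,
    "Account locked": 0.90,
    "Shipping status": 0.90,
    "Feature request": 1.00,
    "Bug report": 1.00,
    "Escalation": 0.90,
}


def passed(score: float, threshold: float = PASS_THRESHOLD) -> bool:
    # Assumption: a grade exactly at the threshold is treated as a pass.
    return score >= threshold


failures = [task for task, score in candidate_scores.items() if not passed(score)]
pass_count = len(candidate_scores) - len(failures)

print(f"{pass_count}/{len(candidate_scores)} passed; failed: {failures}")
# prints: 11/12 passed; failed: ['Invoice request']
```

No grade in this suite sits near 0.50, so the choice between `>=` and `>` does not affect the result here.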

The statistics behind it

Verdict: No significant regression  ·  McNemar p=1.000 (α=0.05)  ·  candidate chosen by explicit --candidate override

McNemar's test asks whether the difference between the two models is larger than chance alone would explain. It looks only at the tasks where the two models disagree (here, 1 of 12); a p-value above 0.05 means the change is not statistically significant. The 95% CI on the pass-rate delta is the range the true difference is likely in. If it crosses zero, the direction isn't certain.
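
To make the p-value concrete, here is a minimal sketch of an exact (binomial) McNemar test applied to this suite's discordant tasks: 1 task went from pass to fail and 0 went from fail to pass. The report does not say which variant of the test tokenjam uses, so the exact form and the helper name `mcnemar_exact` are assumptions for illustration.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant counts.

    b: tasks that passed on the original and failed on the candidate
    c: tasks that failed on the original and passed on the candidate
    """
    n = b + c
    if n == 0:
        return 1.0  # the models never disagreed, so there is nothing to test
    k = min(b, c)
    # Under the null hypothesis each disagreement is a fair coin flip;
    # sum the probability of a split at least this lopsided on one side.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


p_value = mcnemar_exact(b=1, c=0)
print(f"McNemar p = {p_value:.3f}")
# prints: McNemar p = 1.000
```

With a single disagreement the test cannot reach significance whichever way that task went, which is why the p-value is 1.000.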

Pass-rate delta: -8.3pp [95% CI -24.0, +7.3]
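
The interval can be reproduced with a Wald-style (normal approximation) interval for a paired difference in proportions, sketched below. This formula happens to match the reported figures, but the report does not state how tokenjam computes the interval, so treat the method and the helper name `paired_delta_ci` as assumptions.

```python
from math import sqrt


def paired_delta_ci(b: int, c: int, n: int, z: float = 1.96):
    """Wald interval for the change in pass rate across n paired tasks.

    b: tasks that went pass -> fail, c: tasks that went fail -> pass.
    Returns (delta, low, high) in percentage points.
    """
    delta = (c - b) / n
    # Standard error of a paired difference in proportions.
    se = sqrt(b + c - (c - b) ** 2 / n) / n
    return tuple(100 * x for x in (delta, delta - z * se, delta + z * se))


delta, low, high = paired_delta_ci(b=1, c=0, n=12)
print(f"{delta:+.1f}pp [95% CI {low:+.1f}, {high:+.1f}]")
# prints: -8.3pp [95% CI -24.0, +7.3]
```

Normal-approximation intervals are rough at n=12, so the bounds are best read as an indication of how wide the uncertainty is rather than as precise limits.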

Pass rate

Original: 12/12 (100%)
Candidate: 11/12 (92%)

Cost (measured)

Original: $0.075920
Candidate: $0.006056
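
The headline 92% saving follows directly from the two measured costs above. A minimal sketch of the arithmetic (variable names are illustrative):

```python
# Measured API cost in USD, copied from the figures above.
original_cost = 0.075920
candidate_cost = 0.006056

# Fraction of the original cost that the swap removes.
savings = (original_cost - candidate_cost) / original_cost
# How many times cheaper the candidate is.
ratio = original_cost / candidate_cost

print(f"{savings:.0%} cheaper, about {ratio:.1f}x lower cost")
# prints: 92% cheaper, about 12.5x lower cost
```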


Generated 2026-06-26 13:33 · tokenjam-bench · proof report