TokenJam proof — enterprise-rag

tokenjam 0.5.2   n=12 tasks · k=1 sample · anthropic:claude-opus-4-7 → anthropic:claude-haiku-4-5

Switching looks safe

We ran the same 12 enterprise-rag tasks through both models and graded every answer with the same automated judge. The cheaper candidate (claude-haiku-4-5) cost 91% less than claude-opus-4-7 and scored 58 percentage points higher on this suite: 42% of tasks passed before, 100% after.

No statistically significant quality drop was detected, so moving to the cheaper model is a reasonable way to cut cost.

-91%: cheaper to run (measured API $)
+58 pts: accuracy vs the original model
0 worse · 7 better: tasks changed by the swap (of 12)

How each task did

Task                 Original  Cheaper model  Why (the judge’s reason)
Pto accrual          fail      pass           The actual output accurately reflects the factual information and meaning of the… · graded 1.00 (needs 0.50)
Parental leave       fail      pass           The actual output is factually accurate and semantically equivalent to the expec… · graded 0.90 (needs 0.50)
Oncall sev1          fail      pass           The actual output is factually accurate and semantically equivalent to the expec… · graded 0.90 (needs 0.50)
Db restore           fail      pass           The actual output accurately reflects the expected output, maintaining factual a… · graded 0.90 (needs 0.50)
Data retention       pass      pass           The actual output is factually accurate and semantically equivalent to the expec… · graded 0.90 (needs 0.50)
Vendor security      pass      pass           The actual output accurately reflects the expected output, detailing the require… · graded 0.90 (needs 0.50)
Expense limit        pass      pass           The actual output accurately lists the per-meal limits for breakfast and lunch,… · graded 0.80 (needs 0.50)
Api rate limit       fail      pass           The actual output accurately reflects the expected output, maintaining factual a… · graded 1.00 (needs 0.50)
Sso setup            fail      pass           The actual output accurately reflects the factual information and meaning of the… · graded 0.90 (needs 0.50)
Incident postmortem  pass      pass           The actual output accurately reflects the expected output in terms of factual in… · graded 1.00 (needs 0.50)
Not covered          pass      pass           The actual output accurately conveys the lack of information on the company's po… · graded 0.80 (needs 0.50)
Password rotation    fail      pass           The actual output accurately reflects the key details of the expected output, in… · graded 0.90 (needs 0.50)
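
The headline counts can be re-derived from the per-task results. The sketch below is not part of tokenjam; it transcribes the Original and Cheaper model columns by hand (the variable names are illustrative) and tallies how many tasks changed in each direction.

```python
from collections import Counter

# (original, candidate) outcomes transcribed from the table, in row order.
outcomes = [
    ("fail", "pass"),  # Pto accrual
    ("fail", "pass"),  # Parental leave
    ("fail", "pass"),  # Oncall sev1
    ("fail", "pass"),  # Db restore
    ("pass", "pass"),  # Data retention
    ("pass", "pass"),  # Vendor security
    ("pass", "pass"),  # Expense limit
    ("fail", "pass"),  # Api rate limit
    ("fail", "pass"),  # Sso setup
    ("pass", "pass"),  # Incident postmortem
    ("pass", "pass"),  # Not covered
    ("fail", "pass"),  # Password rotation
]

transitions = Counter(outcomes)
better = transitions[("fail", "pass")]  # failed before, passes now
worse = transitions[("pass", "fail")]   # passed before, fails now
original_passes = sum(o == "pass" for o, _ in outcomes)
candidate_passes = sum(c == "pass" for _, c in outcomes)

print(f"{worse} worse, {better} better (of {len(outcomes)})")
print(f"original {original_passes}/{len(outcomes)}, candidate {candidate_passes}/{len(outcomes)}")
```

Only the tasks whose outcome changed (the discordant pairs) feed the significance test; the five tasks that passed under both models carry no information about which model is better.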

The statistics behind it

Verdict: No significant regression  ·  McNemar p=0.016 (α=0.05)  ·  candidate chosen by explicit --candidate override

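McNemar’s test asks whether the difference between the two models is bigger than chance: a p-value above 0.05 means the change is not statistically significant. Here p=0.016 is below 0.05, so the difference is significant; because every changed task moved from fail to pass, it is an improvement rather than a regression. The 95% CI on the pass-rate delta is the range the true difference is likely in; if it crosses zero, the direction isn’t certain.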
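
As a sanity check on the reported p-value, here is a minimal sketch of the exact (binomial) form of McNemar's test, applied to the 7 better and 0 worse tasks from this run. The report does not say which variant tokenjam uses; the exact form is assumed here because it is the usual choice for small samples and it rounds to the reported 0.016. The function name `mcnemar_exact` is illustrative.

```python
from math import comb

def mcnemar_exact(better: int, worse: int) -> float:
    """Two-sided exact McNemar p-value from the discordant pair counts.

    Under the null hypothesis each changed task is equally likely to have
    moved in either direction, so the smaller count follows Binomial(n, 0.5).
    """
    n = better + worse
    if n == 0:
        return 1.0  # no task changed, so there is no evidence of a difference
    k = min(better, worse)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)

p_value = mcnemar_exact(better=7, worse=0)
print(f"p = {p_value:.3f}")  # p = 0.016
```

With all 7 changes in one direction, the p-value is 2 × 0.5^7 = 0.015625, which is below α=0.05.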

Pass-rate delta: +58.3pp [95% CI +30.4, +86.2]
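
The report does not state how the interval is computed. One formula that reproduces the reported bounds is the Wald interval for a difference of paired proportions, sketched below under that assumption (the function name `paired_delta_ci` is illustrative, not a tokenjam API).

```python
from math import sqrt

def paired_delta_ci(better: int, worse: int, n: int, z: float = 1.96):
    """Wald CI for the pass-rate difference between two paired runs.

    better / worse are the counts of tasks that flipped fail->pass and
    pass->fail; n is the total number of tasks; z=1.96 gives a 95% interval.
    """
    delta = (better - worse) / n
    se = sqrt((better + worse) - (better - worse) ** 2 / n) / n
    return delta, delta - z * se, delta + z * se

delta, low, high = paired_delta_ci(better=7, worse=0, n=12)
print(f"{delta * 100:+.1f}pp [95% CI {low * 100:+.1f}, {high * 100:+.1f}]")
# +58.3pp [95% CI +30.4, +86.2]
```

The lower bound stays well above zero, so the direction of the change is not in doubt on this suite, although the interval is wide because only 12 tasks were run.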

Pass rate (95% CI whiskers)

Original: 5/12 (42%)
Candidate: 12/12 (100%)

Cost (measured)

Original: $0.072525
Candidate: $0.006791
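
The 91% figure follows directly from the two measured costs. A short check, using only the numbers reported here:

```python
original_cost = 0.072525   # measured API $ for the original model
candidate_cost = 0.006791  # measured API $ for the cheaper candidate

# Fraction of the original spend that the swap removes.
reduction = 1 - candidate_cost / original_cost
print(f"{reduction:.0%} cheaper to run")
```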

Generated 2026-06-26 13:36 · tokenjam-bench · proof report