tokenjam 0.5.2 n=12 tasks · k=1 sample(s) · anthropic:claude-opus-4-7 → anthropic:claude-haiku-4-5
We ran the same 12 enterprise-rag tasks through both models and graded every answer with the same automated judge. The cheaper candidate (claude-haiku-4-5) cost 91% less than claude-opus-4-7 and scored 58 percentage points higher on this suite: 42% of tasks passed before (5 of 12), 100% after (12 of 12).
No statistically significant quality drop was detected, so moving to the cheaper model is a reasonable way to cut cost.
| Task | Original | Cheaper model | Judge’s reason (truncated) | Score (needs 0.50) |
|---|---|---|---|---|
| PTO accrual | fail | pass | The actual output accurately reflects the factual information and meaning of the … | 1.00 |
| Parental leave | fail | pass | The actual output is factually accurate and semantically equivalent to the expec… | 0.90 |
| On-call SEV1 | fail | pass | The actual output is factually accurate and semantically equivalent to the expec… | 0.90 |
| DB restore | fail | pass | The actual output accurately reflects the expected output, maintaining factual a… | 0.90 |
| Data retention | pass | pass | The actual output is factually accurate and semantically equivalent to the expec… | 0.90 |
| Vendor security | pass | pass | The actual output accurately reflects the expected output, detailing the require… | 0.90 |
| Expense limit | pass | pass | The actual output accurately lists the per-meal limits for breakfast and lunch, … | 0.80 |
| API rate limit | fail | pass | The actual output accurately reflects the expected output, maintaining factual a… | 1.00 |
| SSO setup | fail | pass | The actual output accurately reflects the factual information and meaning of the … | 0.90 |
| Incident postmortem | pass | pass | The actual output accurately reflects the expected output in terms of factual in… | 1.00 |
| Not covered | pass | pass | The actual output accurately conveys the lack of information on the company's po… | 0.80 |
| Password rotation | fail | pass | The actual output accurately reflects the key details of the expected output, in… | 0.90 |
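The headline pass rates can be reproduced from the pass/fail columns of the table above. The following is a minimal sketch, not tokenjam's own code: the snake_case task keys and the `pass_rate` helper are hypothetical names introduced for illustration, and the outcomes are transcribed by hand from the table.

```python
# (original, cheaper) outcome per task, transcribed from the results table.
# The keys are illustrative identifiers, not names taken from the tool.
results = {
    "pto_accrual": ("fail", "pass"),
    "parental_leave": ("fail", "pass"),
    "oncall_sev1": ("fail", "pass"),
    "db_restore": ("fail", "pass"),
    "data_retention": ("pass", "pass"),
    "vendor_security": ("pass", "pass"),
    "expense_limit": ("pass", "pass"),
    "api_rate_limit": ("fail", "pass"),
    "sso_setup": ("fail", "pass"),
    "incident_postmortem": ("pass", "pass"),
    "not_covered": ("pass", "pass"),
    "password_rotation": ("fail", "pass"),
}


def pass_rate(outcomes):
    """Fraction of outcomes equal to 'pass'."""
    outcomes = list(outcomes)
    return sum(o == "pass" for o in outcomes) / len(outcomes)


original_rate = pass_rate(orig for orig, _ in results.values())
cheaper_rate = pass_rate(cheap for _, cheap in results.values())
delta_points = (cheaper_rate - original_rate) * 100

# Discordant pairs: tasks where the two models disagree.
fail_to_pass = sum(o == "fail" and c == "pass" for o, c in results.values())
pass_to_fail = sum(o == "pass" and c == "fail" for o, c in results.values())

print(f"original: {original_rate:.0%}, cheaper: {cheaper_rate:.0%}")
print(f"delta: {delta_points:.0f} percentage points")
print(f"fail->pass: {fail_to_pass}, pass->fail: {pass_to_fail}")
```

Seven tasks flip from fail to pass and none flip the other way; the five tasks that pass under both models do not affect the comparison.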
McNemar’s test asks whether the difference between the two models is larger than chance would explain, using only the tasks where the two models disagree: a p-value above 0.05 means the change is not statistically significant. The 95% CI on the pass-rate delta is the range in which the true difference likely lies; if it crosses zero, the direction of the change is uncertain.
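As an illustration of both quantities, the sketch below computes an exact (binomial) McNemar p-value and a Wald-style 95% interval for the paired pass-rate difference, using the discordant counts read off the table (0 tasks went pass to fail, 7 went fail to pass). The report does not say which McNemar variant or interval method tokenjam uses, so both choices are assumptions, as are the function names `mcnemar_exact` and `paired_delta_ci`.

```python
from math import comb, sqrt


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value.

    b: tasks that passed originally but fail with the cheaper model.
    c: tasks that failed originally but pass with the cheaper model.
    Under the null hypothesis each discordant task is a fair coin flip.
    """
    n = b + c
    if n == 0:
        return 1.0  # no disagreements, so no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def paired_delta_ci(b, c, n_tasks, z=1.96):
    """Wald interval for the paired difference in pass rates (assumed method)."""
    delta = (c - b) / n_tasks
    se = sqrt((b + c) - (c - b) ** 2 / n_tasks) / n_tasks
    return delta, delta - z * se, delta + z * se


# Discordant counts transcribed from the results table.
b, c, n_tasks = 0, 7, 12

p_value = mcnemar_exact(b, c)
delta, lo, hi = paired_delta_ci(b, c, n_tasks)

print(f"McNemar exact p-value: {p_value:.4f}")
print(f"pass-rate delta: {delta:+.1%} (95% CI {lo:+.1%} to {hi:+.1%})")
```

Under these assumptions the p-value is 2 × 0.5⁷ ≈ 0.016 and the interval lies entirely above zero, which points to an improvement, not a drop. With only 12 tasks the Wald interval is a rough approximation, and the tool's own figures may differ.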