tokenjam 0.5.2 n=5 tasks · k=1 sample(s) · anthropic:claude-opus-4-7 → anthropic:claude-haiku-4-5
We ran the same 5 judged task(s) through both models and graded every answer with the same automated judge. The cheaper candidate (claude-haiku-4-5) cost 89% less than claude-opus-4-7, and scored about the same on this suite — 100% of tasks passed before, 100% after.
Too few tasks were run to be statistically sure either way. Run more before deciding.
| Task | Original | Cheaper model | Why — the judge’s reason |
|---|---|---|---|
| Refund policy | pass | pass | The actual output is factually accurate and semantically equivalent to the expec graded 1.00 (needs 0.50) |
| Capital | pass | pass | The actual output is factually accurate and semantically equivalent to the expec graded 1.00 (needs 0.50) |
| Retry summary | pass | pass | The actual output accurately reflects the factual information and meaning of the graded 0.90 (needs 0.50) |
| Shipping | pass | pass | The actual output matches the expected output in terms of factual information, m graded 1.00 (needs 0.50) |
| Define llm | pass | pass | The actual output provides a more detailed explanation of a large language model graded 0.70 (needs 0.50) |
McNemar’s test asks whether the difference between the two models is bigger than chance: a p-value above 0.05 means the change is not statistically significant. The 95% CI on the pass-rate delta is the range the true difference is likely in — if it crosses zero, the direction isn’t certain.