tokenjam 0.5.2 n=12 tasks · k=1 sample(s) · anthropic:claude-opus-4-7 → anthropic:claude-haiku-4-5
We ran the same 12 research-assistant tasks through both models and graded every answer with the same automated judge. The cheaper candidate (claude-haiku-4-5) cost 92% less than claude-opus-4-7 but scored 17 percentage points lower on this suite: 75% of tasks passed with claude-opus-4-7 (9 of 12) and 58% with claude-haiku-4-5 (7 of 12).
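No statistically significant quality drop was detected on this 12-task suite, so moving to the cheaper model is a reasonable way to cut cost. Two tasks that passed before now fail (Risk summary and Quantify claim), and no task improved.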
| Task | Original | Cheaper model | Judge’s reason (truncated) and score |
|---|---|---|---|
| Synthesize trend | pass | pass | The actual output aligns well with the expected output in terms of factual accur… (graded 0.90, needs 0.50) |
| Compare conflicting | pass | pass | The actual output accurately reflects the key factual details from the expected… (graded 0.90, needs 0.50) |
| Market sizing | pass | pass | The actual output accurately reflects the key figures and calculations from the… (graded 0.80, needs 0.50) |
| Competitive positioning | fail | fail | The actual output does not provide the expected competitive positioning summary. (graded 0.20, needs 0.50) |
| Timeline extract | pass | pass | The actual output accurately presents the factual information and semantic meani… (graded 0.80, needs 0.50) |
| Risk summary | pass | fail | The actual output correctly identifies the lack of provided sources, which is a… (graded 0.30, needs 0.50) |
| Define grounded | pass | pass | The actual output matches the expected output exactly, with no discrepancies in… (graded 1.00, needs 0.50) |
| Abstain missing | pass | pass | The actual output accurately identifies the lack of 2027 market share projection… (graded 0.90, needs 0.50) |
| Two source agree | pass | pass | The actual output is factually accurate and semantically equivalent to the expec… (graded 0.90, needs 0.50) |
| Trend report | fail | fail | The actual output does not provide the factual information or semantic content o… (graded 0.20, needs 0.50) |
| Quantify claim | pass | fail | The actual output does not provide any factual information or breakdown of cost… (graded 0.20, needs 0.50) |
| Recommendation | fail | fail | The actual output does not provide the specific recommendation found in the expe… (graded 0.20, needs 0.50) |
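The headline pass rates can be re-derived from the table. The sketch below is illustrative only: the `results` list is transcribed by hand from the pass/fail columns above, and the helper name `pass_rate` is hypothetical, not part of tokenjam.

```python
# Pass/fail outcomes transcribed by hand from the table above, in row order.
# Each tuple is (task, original_passed, cheaper_passed).
results = [
    ("Synthesize trend", True, True),
    ("Compare conflicting", True, True),
    ("Market sizing", True, True),
    ("Competitive positioning", False, False),
    ("Timeline extract", True, True),
    ("Risk summary", True, False),
    ("Define grounded", True, True),
    ("Abstain missing", True, True),
    ("Two source agree", True, True),
    ("Trend report", False, False),
    ("Quantify claim", True, False),
    ("Recommendation", False, False),
]


def pass_rate(flags):
    """Fraction of True values in an iterable of booleans (hypothetical helper)."""
    flags = list(flags)
    return sum(flags) / len(flags)


original_rate = pass_rate(o for _, o, _ in results)
cheaper_rate = pass_rate(c for _, _, c in results)

# Tasks where the two models disagree; only these matter for a paired test.
regressed = [task for task, o, c in results if o and not c]
improved = [task for task, o, c in results if c and not o]

print(f"original: {original_rate:.0%}, cheaper: {cheaper_rate:.0%}")
print(f"delta: {(cheaper_rate - original_rate) * 100:+.0f} points")
print("regressed:", regressed)
print("improved:", improved)
```

Ten of the twelve rows have the same outcome for both models, so the whole 17-point gap comes from the two regressed tasks.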
McNemar’s test asks whether the difference between the two models is larger than chance alone would explain, using only the tasks where the two models disagree: a p-value above 0.05 means the change is not statistically significant. The 95% CI on the pass-rate delta is the range in which the true difference is likely to lie; if it includes zero, the direction of the change is not certain.
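As a rough illustration of both quantities, the sketch below applies an exact two-sided McNemar test and a simple Wald interval for paired proportions to the disagreement counts in the table (2 tasks pass to fail, 0 tasks fail to pass, 12 tasks total). These method choices and the function names are assumptions made for the example; this report does not state how tokenjam computes its p-value or interval, so its figures may differ.

```python
from math import comb, sqrt


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value.

    b: tasks that passed originally and fail with the cheaper model
    c: tasks that failed originally and pass with the cheaper model
    Under the null hypothesis each disagreement is a fair coin flip.
    """
    n = b + c
    if n == 0:
        return 1.0  # no disagreements, so no evidence of a difference
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def paired_delta_wald_ci(b, c, n_tasks, z=1.96):
    """Pass-rate delta (cheaper minus original) with a Wald 95% interval.

    Assumption: a Wald interval is used here for simplicity; it is only
    approximate for a suite this small.
    """
    delta = (c - b) / n_tasks
    se = sqrt(b + c - (c - b) ** 2 / n_tasks) / n_tasks
    return delta, delta - z * se, delta + z * se


# Disagreement counts read off the table above.
b, c, n_tasks = 2, 0, 12

p_value = mcnemar_exact(b, c)
delta, low, high = paired_delta_wald_ci(b, c, n_tasks)

print(f"McNemar exact p-value: {p_value:.2f}")
print(f"delta: {delta * 100:+.1f} points, "
      f"95% CI: [{low * 100:+.1f}, {high * 100:+.1f}]")
```

Under these assumptions the p-value is 0.50 and the interval runs from roughly -38 to +4 points, so it includes zero. With only two disagreeing tasks, the test has very little power to detect a real drop.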