tokenjam 0.5.2 · n=12 tasks · k=1 sample · openai:gpt-4o → openai:gpt-4o-mini
We ran the same 12 research-assistant tasks through both models and graded every answer with the same automated judge. The cheaper candidate (gpt-4o-mini) cost 94% less than gpt-4o and scored the same on this suite: 1 of 12 tasks (8%) passed with each model.
No statistically significant quality drop was detected, so moving to the cheaper model is a reasonable way to cut cost. Note, however, that both models failed 11 of the 12 tasks on this hard suite, so spot-check the answers before relying on either.
| Task | Original | Cheaper model | Judge’s reason (truncated) | Score (needs 0.50) |
|---|---|---|---|---|
| Synthesize trend | fail | fail | The actual output discusses trends in AI adoption for 2025, focusing on integrat… | 0.20 |
| Compare conflicting | fail | fail | The actual output discusses general perspectives on remote work productivity, wh… | 0.20 |
| Market sizing | fail | fail | The actual output provides a detailed methodology for estimating a serviceable m… | 0.20 |
| Competitive positioning | fail | fail | The actual output does not provide any factual information or comparison about t… | 0.20 |
| Timeline extract | fail | fail | The actual output does not provide the specific timeline details present in the… | 0.20 |
| Risk summary | fail | fail | The actual output does not align with the expected output in terms of factual co… | 0.00 |
| Define grounded | pass | pass | The actual output accurately describes net revenue retention, aligning with the… | 0.80 |
| Abstain missing | fail | fail | The actual output provides a detailed analysis of factors influencing market sha… | 0.30 |
| Two source agree | fail | fail | The actual output does not provide the specific factual information found in the… | 0.20 |
| Trend report | fail | fail | The actual output discusses trends in evaluation tooling with a focus on AI and… | 0.20 |
| Quantify claim | fail | fail | The actual output does not provide the specific factual breakdown of cost saving… | 0.30 |
| Recommendation | fail | fail | The actual output discusses the benefits of continuous benchmarking in a general… | 0.30 |
McNemar’s test asks whether the difference between the two models’ paired pass/fail results is larger than chance alone would explain: a p-value above 0.05 means the change is not statistically significant. The 95% CI on the pass-rate delta is the range in which the true difference is likely to lie; if it crosses zero, the direction of the change is uncertain.
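
To make the two statistics concrete, here is a minimal Python sketch that applies an exact (binomial) McNemar test and a simple Wald interval to the pass/fail columns of the table above. It is an illustration only: the function names `mcnemar_exact` and `delta_ci` are hypothetical, and nothing here is taken from tokenjam’s implementation, which may use a different variant of the test or a different interval method.

```python
from math import comb, sqrt

def discordant(before, after):
    """Count pairs whose outcome changed between the two models."""
    b = sum(1 for x, y in zip(before, after) if x and not y)  # pass -> fail
    c = sum(1 for x, y in zip(before, after) if not x and y)  # fail -> pass
    return b, c

def mcnemar_exact(before, after):
    """Two-sided exact McNemar test; returns (b, c, p_value)."""
    b, c = discordant(before, after)
    n = b + c
    if n == 0:
        return b, c, 1.0  # no task changed outcome: no evidence of a difference
    # Under H0 each discordant pair is equally likely to flip either way,
    # so min(b, c) follows Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)

def delta_ci(before, after, z=1.96):
    """Wald 95% CI for the paired pass-rate delta (after minus before)."""
    n = len(before)
    b, c = discordant(before, after)
    delta = (c - b) / n
    se = sqrt((b + c) - (c - b) ** 2 / n) / n
    return delta, delta - z * se, delta + z * se

# Pass/fail per task, in table order (only "Define grounded" passed).
before = [False] * 6 + [True] + [False] * 5
after = list(before)

print(mcnemar_exact(before, after))  # (0, 0, 1.0)
print(delta_ci(before, after))       # (0.0, 0.0, 0.0)
```

Because no task changed outcome between the two models, there are no discordant pairs, so the p-value is 1.0 and the Wald interval collapses to zero width. With only 12 tasks, that zero-width interval understates the real uncertainty, which is one more reason to spot-check the answers.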