tokenjam 0.5.2 n=12 tasks · k=1 sample(s) · openai:gpt-4o → openai:gpt-4o-mini
We ran the same 12 email-assistant tasks through both models and graded every answer with the same automated judge. The cheaper candidate (gpt-4o-mini) cost 94% less than gpt-4o but passed fewer tasks on this suite: the pass rate fell 8 percentage points, from 100% (12 of 12) to 92% (11 of 12).
No statistically significant quality drop was detected, so moving to the cheaper model is a reasonable way to cut cost.
| Task | Original | Cheaper model | Judge’s reason (truncated) | Score (needs 0.50) |
|---|---|---|---|---|
| Reply reschedule | pass | fail | The actual output suggests rescheduling the demo with specific times, similar to… | 0.40 |
| Reply refund firm | pass | pass | The actual output accurately conveys the factual information and meaning of the… | 0.90 |
| Summarize thread | pass | pass | The actual output closely matches the expected output in terms of factual inform… | 0.90 |
| Triage priority | pass | pass | The actual output accurately reflects the urgency and impact on EU customers, si… | 0.80 |
| Triage spam | pass | pass | The actual output aligns well with the expected output in terms of identifying t… | 0.80 |
| Extract meeting | pass | pass | The actual output accurately reflects the date and time details as in the expect… | 0.70 |
| Extract actions | pass | pass | The actual output is factually accurate and semantically equivalent to the expec… | 0.80 |
| Followup noreply | pass | pass | The actual output maintains the core intent of following up on a proposal and of… | 0.60 |
| Reply decline meeting | pass | pass | The actual output is factually accurate and semantically equivalent to the expec… | 0.80 |
| Summarize exec | pass | pass | The actual output closely matches the expected output in terms of factual inform… | 0.90 |
| Triage route | pass | pass | The actual output aligns well with the expected output in terms of identifying t… | 0.80 |
| Reply sensitive data | pass | pass | The actual output correctly addresses the security concern of not sharing full c… | 0.60 |
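
The pass rates quoted in the summary can be re-derived from the table. The sketch below is only an illustration, not tokenjam's own code: it assumes the graded values belong to the cheaper model's answers and that "needs 0.50" means a score at or above 0.50 passes. The names `candidate_scores`, `THRESHOLD`, and `pass_rate` are hypothetical.

```python
# Graded values copied from the table, in row order.
# Assumption: these are the judge's scores for the cheaper model's answers.
candidate_scores = [0.40, 0.90, 0.90, 0.80, 0.80, 0.70,
                    0.80, 0.60, 0.80, 0.90, 0.80, 0.60]

THRESHOLD = 0.50  # "needs 0.50"; assumed to be an inclusive cutoff


def pass_rate(scores, threshold=THRESHOLD):
    """Fraction of scores that meet or exceed the threshold."""
    return sum(s >= threshold for s in scores) / len(scores)


baseline_rate = 12 / 12  # every row is marked "pass" in the Original column
candidate_rate = pass_rate(candidate_scores)
delta_points = (candidate_rate - baseline_rate) * 100

print(f"{baseline_rate:.0%} -> {candidate_rate:.0%} ({delta_points:+.1f} points)")
# prints: 100% -> 92% (-8.3 points)
```

The rounded figures match the summary: 11 of 12 tasks pass, and the 8-point drop comes from the single "Reply reschedule" failure.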
McNemar’s test asks whether the difference between the two models is larger than chance alone would explain, using only the tasks where the two models disagree; a p-value above 0.05 means the change is not statistically significant. The 95% CI on the pass-rate delta is the range in which the true difference is likely to lie. If it crosses zero, the direction of the change isn’t certain.
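
To make both statistics concrete, here is a minimal sketch applied to the table's outcomes: one task went from pass to fail and none went from fail to pass. It assumes an exact (binomial) McNemar test and a Wald interval for the paired difference; the report does not say which variants tokenjam uses, so its own numbers may differ. The function names `mcnemar_exact` and `paired_delta_ci` are hypothetical.

```python
from math import comb, sqrt


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the discordant counts.

    b: tasks that passed on the original model but failed on the cheaper one
    c: tasks that failed on the original model but passed on the cheaper one
    """
    n = b + c
    if n == 0:
        return 1.0  # the models never disagree, so there is nothing to test
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) * 0.5 ** n
    return min(1.0, 2 * tail)


def paired_delta_ci(b, c, total, z=1.96):
    """Wald 95% CI for the paired pass-rate delta (cheaper minus original)."""
    delta = (c - b) / total
    se = sqrt(b + c - (c - b) ** 2 / total) / total
    return delta, delta - z * se, delta + z * se


# From the table: 1 pass -> fail, 0 fail -> pass, 12 tasks in total.
b, c, total = 1, 0, 12
p = mcnemar_exact(b, c)
delta, lo, hi = paired_delta_ci(b, c, total)

print(f"p = {p:.3f}")
print(f"delta = {delta:.3f}, 95% CI = ({lo:.3f}, {hi:.3f})")
# prints: p = 1.000
# prints: delta = -0.083, 95% CI = (-0.240, 0.073)
```

Under these assumptions the interval crosses zero and the p-value is well above 0.05. With only one disagreeing task out of 12, the test has very little power, so a non-significant result here is weak evidence that quality is unchanged.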