tokenjam 0.5.2 · n=12 tasks · k=1 sample(s) · openai:gpt-4o → openai:gpt-4o-mini
We ran the same 12 customer-support tasks through both models and graded every answer with the same automated judge. The cheaper candidate (gpt-4o-mini) cost 94% less than gpt-4o and scored 8 percentage points higher on this suite: 58% of tasks passed before, 67% after.
No statistically significant quality drop was detected, so moving to the cheaper model is a reasonable way to cut cost.
| Task | Original | Cheaper model | Judge’s reason (truncated) | Score | Needs |
|---|---|---|---|---|---|
| Refund window | pass | pass | The actual output is factually accurate and semantically equivalent to the expec… | 0.80 | 0.50 |
| Delivery delay | fail | pass | The actual output acknowledges the issue with order #88412 and suggests checking… | 0.50 | 0.50 |
| Wrong product | pass | fail | The actual output acknowledges the order mix-up and provides steps to resolve it… | 0.40 | 0.50 |
| Password reset | pass | pass | The actual output provides a detailed and polite response with steps to resolve… | 0.50 | 0.50 |
| Subscription cancel | fail | pass | The actual output provides detailed instructions on how to cancel the subscripti… | 0.50 | 0.50 |
| Payment failed | pass | pass | The actual output provides a detailed explanation of potential reasons for payme… | 0.60 | 0.50 |
| Invoice request | fail | fail | The actual output does not match the expected output in terms of factual informa… | 0.30 | 0.50 |
| Account locked | pass | pass | The actual output accurately describes the account lock due to failed login atte… | 0.60 | 0.50 |
| Shipping status | fail | pass | The actual output provides a general response about tracking the order, which al… | 0.60 | 0.50 |
| Feature request | pass | pass | The actual output acknowledges the feedback and assures the user that their sugg… | 0.60 | 0.50 |
| Bug report | fail | fail | The actual output provides troubleshooting steps for the CSV export issue, while… | 0.20 | 0.50 |
| Escalation | pass | fail | The actual output acknowledges the customer's frustration and offers assistance,… | 0.30 | 0.50 |
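
The headline pass rates can be re-derived from the table above. The snippet below is a minimal sketch, not part of tokenjam: the `results` list is transcribed by hand from the pass/fail columns, and all variable names are illustrative.

```python
# (task, original outcome, cheaper-model outcome), copied from the table above
results = [
    ("Refund window", "pass", "pass"),
    ("Delivery delay", "fail", "pass"),
    ("Wrong product", "pass", "fail"),
    ("Password reset", "pass", "pass"),
    ("Subscription cancel", "fail", "pass"),
    ("Payment failed", "pass", "pass"),
    ("Invoice request", "fail", "fail"),
    ("Account locked", "pass", "pass"),
    ("Shipping status", "fail", "pass"),
    ("Feature request", "pass", "pass"),
    ("Bug report", "fail", "fail"),
    ("Escalation", "pass", "fail"),
]

n = len(results)
orig_pass = sum(o == "pass" for _, o, _ in results)
cheap_pass = sum(c == "pass" for _, _, c in results)

# Tasks whose outcome changed between the two models
regressed = [t for t, o, c in results if o == "pass" and c == "fail"]
improved = [t for t, o, c in results if o == "fail" and c == "pass"]

print(f"original: {orig_pass}/{n} = {orig_pass / n:.0%}")    # original: 7/12 = 58%
print(f"cheaper:  {cheap_pass}/{n} = {cheap_pass / n:.0%}")  # cheaper:  8/12 = 67%
print(f"delta: {(cheap_pass - orig_pass) / n * 100:+.1f} points")  # delta: +8.3 points
print("regressed:", regressed)
print("improved:", improved)
```

The net gain of one task comes from three tasks that flipped from fail to pass and two that flipped from pass to fail; the other seven had the same outcome under both models.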
McNemar’s test asks whether the difference between the two models is larger than chance alone would explain, using only the tasks whose outcome changed from one model to the other: a p-value above 0.05 means the change is not statistically significant. The 95% CI on the pass-rate delta is the range the true difference likely falls in; if it crosses zero, the direction of the change is uncertain.
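
As a rough illustration of both statistics, the sketch below applies an exact (binomial) McNemar test and a simple Wald interval for paired proportions to the discordant counts in the table: 2 tasks went from pass to fail and 3 from fail to pass. Both method choices are assumptions made for illustration; this report does not state which variant tokenjam uses, and the function names are hypothetical.

```python
from math import comb, sqrt

n = 12  # total tasks
b = 2   # original pass -> cheaper fail (Wrong product, Escalation)
c = 3   # original fail -> cheaper pass (Delivery delay, Subscription cancel, Shipping status)


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value: a binomial test on the discordant pairs."""
    m = b + c
    if m == 0:
        return 1.0  # no task changed outcome, so there is no evidence of a difference
    k = min(b, c)
    tail = sum(comb(m, i) for i in range(k + 1)) / 2 ** m
    return min(1.0, 2 * tail)


def paired_wald_ci(b: int, c: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wald interval for the paired pass-rate delta (assumed method; crude at small n)."""
    delta = (c - b) / n
    se = sqrt((b + c) - (c - b) ** 2 / n) / n
    return delta - z * se, delta + z * se


p = mcnemar_exact(b, c)
lo, hi = paired_wald_ci(b, c, n)
print(f"McNemar exact p = {p:.2f}")  # McNemar exact p = 1.00
print(f"95% CI on delta: [{lo:+.3f}, {hi:+.3f}]")
```

Under these assumptions the p-value is 1.00 and the interval runs from roughly -0.28 to +0.45, so it crosses zero: with only five tasks changing outcome, the data cannot distinguish a small improvement from a small regression.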