tokenjam 0.5.2 n=12 tasks · k=1 sample(s) · openai:gpt-4o → openai:gpt-4o-mini
We ran the same 12 enterprise-rag tasks through both models and graded every answer with the same automated judge. The cheaper candidate (gpt-4o-mini) cost 91% less than gpt-4o and passed one more task on this suite: 1 of 12 tasks (8%) passed with gpt-4o and 2 of 12 (17%) with gpt-4o-mini, a gain of about 8 percentage points.
No statistically significant quality drop was detected, so moving to the cheaper model is a reasonable way to cut cost. Note: both models scored low on this hard suite (gpt-4o failed 11 of 12 tasks, gpt-4o-mini failed 10 of 12), so spot-check the answers before relying on either.

| Task | Original | Cheaper model | Judge’s reason (truncated) | Score (needs 0.50) |
|---|---|---|---|---|
| Pto accrual | fail | fail | The actual output fails to provide the specific details about PTO accrual and ro… | 0.20 |
| Parental leave | fail | fail | The actual output fails to provide the specific details about the parental leave… | 0.20 |
| Oncall sev1 | fail | pass | The actual output correctly states the 15-minute response time for a Sev-1 incid… | 0.60 |
| Db restore | fail | fail | The actual output does not provide any factual information or procedural steps r… | 0.20 |
| Data retention | fail | fail | The actual output fails to provide the specific retention period of 30 days for… | 0.20 |
| Vendor security | fail | fail | The actual output provides a general overview of the security review process, me… | 0.30 |
| Expense limit | fail | fail | The actual output incorrectly states the per-meal expense limit as $75, while th… | 0.30 |
| Api rate limit | fail | fail | The actual output inaccurately states the rate limit as 60 requests per minute i… | 0.30 |
| Sso setup | fail | fail | The actual output fails to provide the specific details about SSO protocols and… | 0.20 |
| Incident postmortem | fail | fail | The actual output fails to provide any of the specific factual details present i… | 0.10 |
| Not covered | pass | pass | The actual output correctly identifies the lack of information on employee stock… | 0.70 |
| Password rotation | fail | fail | The actual output correctly identifies the lack of information on password rotat… | 0.30 |

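
The pass rates quoted in the summary follow directly from the pass/fail columns of the table. As a minimal sketch (the variable and function names are illustrative assumptions, and this is not tokenjam's own code), the tally can be reproduced like this:

```python
# Pass/fail outcomes copied from the table: (original, cheaper model).
outcomes = {
    "Pto accrual": ("fail", "fail"),
    "Parental leave": ("fail", "fail"),
    "Oncall sev1": ("fail", "pass"),
    "Db restore": ("fail", "fail"),
    "Data retention": ("fail", "fail"),
    "Vendor security": ("fail", "fail"),
    "Expense limit": ("fail", "fail"),
    "Api rate limit": ("fail", "fail"),
    "Sso setup": ("fail", "fail"),
    "Incident postmortem": ("fail", "fail"),
    "Not covered": ("pass", "pass"),
    "Password rotation": ("fail", "fail"),
}


def pass_rate(results, index):
    """Fraction of tasks passed by the model in column `index` (0 or 1)."""
    passed = sum(1 for pair in results.values() if pair[index] == "pass")
    return passed / len(results)


original_rate = pass_rate(outcomes, 0)
candidate_rate = pass_rate(outcomes, 1)
delta_points = (candidate_rate - original_rate) * 100

# Tasks where the two models disagree (the "discordant" pairs).
gained = [t for t, (o, c) in outcomes.items() if o == "fail" and c == "pass"]
lost = [t for t, (o, c) in outcomes.items() if o == "pass" and c == "fail"]

print(f"original: {original_rate:.0%}, cheaper: {candidate_rate:.0%}")
print(f"delta: {delta_points:+.1f} points")
print(f"gained: {gained}, lost: {lost}")
```

Only one task changes outcome between the two models, so the whole pass-rate difference rests on a single answer.
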
McNemar’s test asks whether the difference between the two models is larger than chance would explain, using only the tasks on which the two models disagree. A p-value above 0.05 means the change is not statistically significant. The 95% CI on the pass-rate delta is the range that likely contains the true difference; if it crosses zero, the direction of the change is uncertain.
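
To make the two statistics concrete, here is a rough sketch that applies them to the counts in the table. It assumes the exact (binomial) form of McNemar's test and a simple Wald interval for the paired difference; the report does not say which variants tokenjam uses, so its own numbers may differ. The function names are hypothetical.

```python
from math import comb, sqrt


def mcnemar_exact_p(b, c):
    """Two-sided exact McNemar p-value from the two discordant counts.

    b: tasks the original passed and the cheaper model failed
    c: tasks the original failed and the cheaper model passed
    """
    n = b + c
    if n == 0:
        return 1.0  # the models never disagree, so there is nothing to test
    k = min(b, c)
    # Binomial(n, 0.5) probability of a split at least this lopsided.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def paired_delta_ci(b, c, n, z=1.96):
    """Wald interval for the paired pass-rate delta (cheaper minus original)."""
    delta = (c - b) / n
    se = sqrt((b + c) - (c - b) ** 2 / n) / n
    return delta, delta - z * se, delta + z * se


# Counts read from the table: 12 tasks, one gained (Oncall sev1), none lost.
n_tasks = 12
lost = 0
gained = 1

p_value = mcnemar_exact_p(lost, gained)
delta, low, high = paired_delta_ci(lost, gained, n_tasks)

print(f"McNemar exact p = {p_value:.2f}")
print(f"delta = {delta:+.3f}, 95% CI = ({low:+.3f}, {high:+.3f})")
```

Under these assumptions, a single discordant task gives p = 1.0 and an interval that spans zero, which matches the report's conclusion that no significant change was detected. With only 12 tasks, this says more about the small sample than about the two models being equivalent.
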