tokenjam 0.5.2 n=50 tasks · k=1 sample(s) · openai:gpt-4o → openai:gpt-4o-mini
We ran the same 50 HumanEval tasks through both models and graded every answer with the same automated judge. The cheaper candidate (gpt-4o-mini) cost 95% less than gpt-4o but scored 10 percentage points lower on this suite: 90% of tasks passed before, 80% after.
The 10-point drop is not statistically significant on this 50-task suite, so moving to the cheaper model is a reasonable way to cut cost.
| Task | Original | Cheaper model | Judge’s reason (cheaper model) |
|---|---|---|---|
| 0 | pass | pass | ok |
| 1 | pass | fail | NameError: name 'List' is not defined. Did you mean: 'list'? |
| 2 | pass | pass | ok |
| 3 | pass | pass | ok |
| 4 | pass | pass | ok |
| 5 | pass | pass | ok |
| 6 | pass | fail | NameError: name 'List' is not defined. Did you mean: 'list'? |
| 7 | pass | pass | ok |
| 8 | pass | fail | NameError: name 'List' is not defined. Did you mean: 'list'? |
| 9 | fail | fail | NameError: name 'List' is not defined. Did you mean: 'list'? |
| 10 | fail | pass | ok |
| 11 | pass | pass | ok |
| 12 | pass | pass | ok |
| 13 | pass | pass | ok |
| 14 | pass | pass | ok |
| 15 | pass | pass | ok |
| 16 | pass | pass | ok |
| 17 | pass | pass | ok |
| 18 | pass | pass | ok |
| 19 | pass | pass | ok |
| 20 | pass | fail | NameError: name 'List' is not defined. Did you mean: 'list'? |
| 21 | pass | pass | ok |
| 22 | pass | pass | ok |
| 23 | pass | pass | ok |
| 24 | pass | pass | ok |
| 25 | pass | fail | NameError: name 'List' is not defined. Did you mean: 'list'? |
| 26 | pass | fail | NameError: name 'List' is not defined. Did you mean: 'list'? |
| 27 | pass | pass | ok |
| 28 | pass | pass | ok |
| 29 | pass | pass | ok |
| 30 | pass | pass | ok |
| 31 | pass | pass | ok |
| 32 | fail | fail | NameError: name 'poly' is not defined |
| 33 | pass | pass | ok |
| 34 | pass | pass | ok |
| 35 | pass | pass | ok |
| 36 | pass | pass | ok |
| 37 | pass | pass | ok |
| 38 | fail | fail | NameError: name 'encode_cyclic' is not defined. Did you mean: 'decode_cyclic'? |
| 39 | pass | pass | ok |
| 40 | pass | fail | AssertionError |
| 41 | fail | pass | ok |
| 42 | pass | pass | ok |
| 43 | pass | pass | ok |
| 44 | pass | pass | ok |
| 45 | pass | pass | ok |
| 46 | pass | pass | ok |
| 47 | pass | pass | ok |
| 48 | pass | pass | ok |
| 49 | pass | pass | ok |
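
Grouping the judge’s reasons shows which kind of failure dominates the cheaper model’s results. The sketch below is illustrative only: the `failures` dictionary is transcribed by hand from the table above, and the names `failures`, `error_kind`, `by_kind`, and `missing_list` are hypothetical, not part of tokenjam.

```python
from collections import Counter

LIST_ERR = "NameError: name 'List' is not defined. Did you mean: 'list'?"

# Cheaper-model failures, transcribed from the table (task id -> judge's reason).
failures = {
    1: LIST_ERR,
    6: LIST_ERR,
    8: LIST_ERR,
    9: LIST_ERR,
    20: LIST_ERR,
    25: LIST_ERR,
    26: LIST_ERR,
    32: "NameError: name 'poly' is not defined",
    38: "NameError: name 'encode_cyclic' is not defined. Did you mean: 'decode_cyclic'?",
    40: "AssertionError",
}


def error_kind(reason: str) -> str:
    """Return the exception class name at the start of a judge reason."""
    return reason.split(":", 1)[0]


# Count failures by exception type, and separately those that mention 'List'.
by_kind = Counter(error_kind(reason) for reason in failures.values())
missing_list = sum("'List'" in reason for reason in failures.values())

print(dict(by_kind))   # failures per exception type
print(missing_list)    # failures caused by an undefined 'List'
```

On this table, 9 of the 10 cheaper-model failures are `NameError`s, and 7 of those are an undefined `List`. In Python that error normally means `from typing import List` was not in scope when the code ran. The report does not say whether the import was missing from the model’s answer or from the program the harness assembled, so it may be worth checking before treating these as logic errors.
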
McNemar’s test asks whether the difference between the two models is larger than chance would explain. It looks only at the tasks where the two models disagree (one passed, the other failed); a p-value above 0.05 means the change is not statistically significant. The 95% CI on the pass-rate delta is the range the true difference is likely to fall in; if it crosses zero, even the direction of the change is uncertain.
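
The test described above can be reproduced from the table with a few lines of Python. This is a minimal sketch under stated assumptions: it uses the exact (binomial) form of McNemar’s test and a simple Wald interval for the paired difference. The report does not say which variants tokenjam uses, so its figures may differ. The function names and the task-id lists (read off the table by hand) are hypothetical.

```python
from math import comb, sqrt

# Discordant tasks, read off the table: the two models disagree on these.
regressions = [1, 6, 8, 20, 25, 26, 40]   # original pass -> cheaper fail
improvements = [10, 41]                   # original fail -> cheaper pass
N_TASKS = 50


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant counts."""
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Under the null hypothesis, each discordant task is a fair coin flip.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def paired_delta_ci(b: int, c: int, n_tasks: int, z: float = 1.96):
    """Wald interval for the pass-rate delta (cheaper minus original)."""
    delta = (c - b) / n_tasks
    se = sqrt((b + c) - (c - b) ** 2 / n_tasks) / n_tasks
    return delta - z * se, delta + z * se


b, c = len(regressions), len(improvements)
p_value = mcnemar_exact(b, c)
low, high = paired_delta_ci(b, c, N_TASKS)

print(f"p = {p_value:.3f}")                      # p = 0.180
print(f"95% CI on delta: ({low:.3f}, {high:.3f})")  # (-0.214, 0.014)
```

With 7 regressions against 2 improvements, the exact p-value is about 0.18, which is above 0.05, and the interval runs from roughly -21 to +1 percentage points, so it crosses zero. Both results are consistent with the report’s conclusion. The width of the interval also shows how little a 50-task suite can pin down.
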