tokenjam 0.5.2 n=50 tasks · k=1 sample(s) · openai:o3 → openai:o4-mini
We ran the same 50 GSM8K tasks through both models and graded every answer with the same automated judge. The cheaper candidate (o4-mini) cost 87% less than o3 and scored 2 percentage points higher on this suite: 98% of tasks passed with o3 (49 of 50) and 100% with o4-mini (50 of 50).
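No statistically significant quality drop was detected, so moving to the cheaper model is a reasonable way to cut cost. The 2-point gain comes from a single task (task 40), so it should not be read as evidence that o4-mini is the more accurate model.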
| Task | Original | Cheaper model | Judge’s reason |
|---|---|---|---|
| 0 | pass | pass | matched 18 |
| 1 | pass | pass | matched 3 |
| 2 | pass | pass | matched 70000 |
| 3 | pass | pass | matched 540 |
| 4 | pass | pass | matched 20 |
| 5 | pass | pass | matched 64 |
| 6 | pass | pass | matched 260 |
| 7 | pass | pass | matched 160 |
| 8 | pass | pass | matched 45 |
| 9 | pass | pass | matched 460 |
| 10 | pass | pass | matched 366 |
| 11 | pass | pass | matched 694 |
| 12 | pass | pass | matched 13 |
| 13 | pass | pass | matched 18 |
| 14 | pass | pass | matched 60 |
| 15 | pass | pass | matched 125 |
| 16 | pass | pass | matched 230 |
| 17 | pass | pass | matched 57500 |
| 18 | pass | pass | matched 7 |
| 19 | pass | pass | matched 6 |
| 20 | pass | pass | matched 15 |
| 21 | pass | pass | matched 14 |
| 22 | pass | pass | matched 7 |
| 23 | pass | pass | matched 8 |
| 24 | pass | pass | matched 26 |
| 25 | pass | pass | matched 2 |
| 26 | pass | pass | matched 243 |
| 27 | pass | pass | matched 16 |
| 28 | pass | pass | matched 25 |
| 29 | pass | pass | matched 104 |
| 30 | pass | pass | matched 109 |
| 31 | pass | pass | matched 80 |
| 32 | pass | pass | matched 35 |
| 33 | pass | pass | matched 70 |
| 34 | pass | pass | matched 23 |
| 35 | pass | pass | matched 9 |
| 36 | pass | pass | matched 75 |
| 37 | pass | pass | matched 2 |
| 38 | pass | pass | matched 10 |
| 39 | pass | pass | matched 18 |
| 40 | fail | pass | matched 8 |
| 41 | pass | pass | matched 200 |
| 42 | pass | pass | matched 26 |
| 43 | pass | pass | matched 48 |
| 44 | pass | pass | matched 20 |
| 45 | pass | pass | matched 104 |
| 46 | pass | pass | matched 163 |
| 47 | pass | pass | matched 800 |
| 48 | pass | pass | matched 8 |
| 49 | pass | pass | matched 30 |
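As a rough illustration of how the table reduces to the headline pass rates, the sketch below tallies the paired outcomes. The `results` dictionary is a hand-built stand-in for the table rows (it is not a tokenjam data structure), and the variable names are assumptions made for this example.

```python
from collections import Counter

# Paired outcomes (original, cheaper) as read from the table above:
# every task passed on both models except task 40, which failed on o3 only.
results = {task: ("pass", "pass") for task in range(50)}
results[40] = ("fail", "pass")

# Count each (original, cheaper) combination; the mixed ones are the
# "discordant" pairs that a paired significance test relies on.
pairs = Counter(results.values())

n = len(results)
original_rate = sum(orig == "pass" for orig, _ in results.values()) / n
cheaper_rate = sum(cheap == "pass" for _, cheap in results.values()) / n

print(pairs)
print(f"original: {original_rate:.0%}, cheaper: {cheaper_rate:.0%}")
```

Only one of the 50 pairs is discordant, which is why the difference in pass rate rests on a single task.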
McNemar’s test compares the two models on the same tasks and looks only at the tasks where they disagree, asking whether that split is larger than chance would produce. A p-value above 0.05 means the change is not statistically significant. The 95% CI on the pass-rate delta is the range in which the true difference is likely to lie; if it crosses zero, the direction of the change is uncertain.
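To make the two statistics concrete, here is a minimal sketch that applies them to the counts in the table. It assumes the exact (binomial) form of McNemar’s test and a simple Wald interval for the paired difference; the report does not say which variants tokenjam uses, so the function names and formulas here are illustrative rather than a description of the tool.

```python
from math import comb, sqrt

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant counts.

    b: tasks that passed on the original model only
    c: tasks that passed on the cheaper model only
    """
    n = b + c
    if n == 0:
        return 1.0  # the models never disagreed, so there is nothing to test
    # Under the null hypothesis each discordant pair is a fair coin flip.
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def paired_delta_ci(b: int, c: int, n_tasks: int, z: float = 1.96):
    """Wald 95% interval for the pass-rate delta (cheaper minus original).

    This approximation is coarse when discordant counts are small.
    """
    delta = (c - b) / n_tasks
    se = sqrt((b + c) - (c - b) ** 2 / n_tasks) / n_tasks
    return delta, delta - z * se, delta + z * se

# Counts from the table: no task passed on o3 only, one passed on o4-mini only.
b, c = 0, 1
p_value = mcnemar_exact(b, c)
delta, lo, hi = paired_delta_ci(b, c, n_tasks=50)

print(f"p = {p_value:.2f}")
print(f"delta = {delta:+.1%}, 95% CI = [{lo:+.1%}, {hi:+.1%}]")
```

With a single discordant pair the p-value is far above 0.05 and the interval straddles zero, which matches the report’s reading that no significant change was detected.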