tokenjam 0.5.1 n=12 tasks · k=1 sample(s) · deepseek:deepseek-reasoner → deepseek:deepseek-chat
We ran the same 12 HumanEval tasks through both models and graded every answer with the same automated judge. The cheaper candidate (deepseek-chat) cost 54% less than deepseek-reasoner and scored 8 percentage points higher on this suite: 11 of 12 tasks (92%) passed before, 12 of 12 (100%) after.
No statistically significant quality drop was detected, so moving to the cheaper model is a reasonable way to cut cost.
| Task | Original (deepseek-reasoner) | Cheaper model (deepseek-chat) | Judge’s reason |
|---|---|---|---|
| 0 | pass | pass | ok |
| 1 | pass | pass | ok |
| 2 | pass | pass | ok |
| 3 | pass | pass | ok |
| 4 | pass | pass | ok |
| 5 | pass | pass | ok |
| 6 | pass | pass | ok |
| 7 | pass | pass | ok |
| 8 | pass | pass | ok |
| 9 | pass | pass | ok |
| 10 | fail | pass | ok |
| 11 | pass | pass | ok |
McNemar’s test looks only at the tasks where the two models disagree (one passes, the other fails) and asks whether that imbalance is bigger than chance would produce: a p-value above 0.05 means the change is not statistically significant. The 95% CI on the pass-rate delta is the range the true difference is likely to fall in. If that range includes zero, the direction of the change isn’t certain.
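
For readers who want to check the statistics by hand, here is a minimal sketch that applies an exact (binomial) McNemar test and a simple Wald interval to the pass/fail table above. It is an illustration only: the function names `mcnemar_exact` and `paired_delta_ci` are hypothetical, and the method tokenjam uses internally (for example a continuity-corrected test or a bootstrap interval) may differ, so its reported numbers need not match these.

```python
from math import comb, sqrt

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value (hypothetical helper).

    b = tasks the original failed and the cheaper model passed
    c = tasks the original passed and the cheaper model failed
    Under the null hypothesis, each disagreement is a fair coin flip.
    """
    n = b + c
    if n == 0:
        return 1.0  # the models never disagree, so there is nothing to test
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def paired_delta_ci(b: int, c: int, n_tasks: int, z: float = 1.96):
    """Pass-rate delta with a Wald interval for paired outcomes (an assumption;
    this approximation is rough at n=12)."""
    delta = (b - c) / n_tasks
    se = sqrt(b + c - (b - c) ** 2 / n_tasks) / n_tasks
    return delta, delta - z * se, delta + z * se

# Pass/fail outcomes copied from the table above (task 10 is the only change).
original = [True] * 10 + [False] + [True]
cheaper = [True] * 12

b = sum(1 for o, ch in zip(original, cheaper) if not o and ch)
c = sum(1 for o, ch in zip(original, cheaper) if o and not ch)

p_value = mcnemar_exact(b, c)
delta, ci_low, ci_high = paired_delta_ci(b, c, len(original))

print(f"discordant pairs: b={b}, c={c}")
print(f"McNemar exact p-value: {p_value:.3f}")
print(f"pass-rate delta: {delta:+.3f}, 95% CI [{ci_low:+.3f}, {ci_high:+.3f}]")
```

With a single disagreement out of 12 tasks, the exact p-value is 1.0 and the interval runs from roughly -0.07 to +0.24, which includes zero. That agrees with the summary: the 8-point gain is not distinguishable from chance on a suite this small.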