tokenjam 0.5.1 n=12 tasks · k=1 sample(s) · deepseek:deepseek-reasoner → deepseek:deepseek-chat
We ran the same 12 gsm8k tasks through both models and graded every answer with the same automated judge. The cheaper candidate (deepseek-chat) cost 54% less than deepseek-reasoner and scored the same on this suite: 100% of tasks passed before and 100% after.
No statistically significant quality drop was detected on these 12 tasks, so moving to the cheaper model is a reasonable way to cut cost.
| Task | Original | Cheaper model | Judge’s reason |
|---|---|---|---|
| 0 | pass | pass | matched 18 |
| 1 | pass | pass | matched 3 |
| 2 | pass | pass | matched 70000 |
| 3 | pass | pass | matched 540 |
| 4 | pass | pass | matched 20 |
| 5 | pass | pass | matched 64 |
| 6 | pass | pass | matched 260 |
| 7 | pass | pass | matched 160 |
| 8 | pass | pass | matched 45 |
| 9 | pass | pass | matched 460 |
| 10 | pass | pass | matched 366 |
| 11 | pass | pass | matched 694 |
McNemar’s test asks whether the two models disagree on the same tasks more often than chance alone would explain: a p-value above 0.05 means the difference is not statistically significant. The 95% CI on the pass-rate delta is the range that likely contains the true difference. If that range includes zero, the direction of the change is not certain.
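To make the two statistics concrete, here is a minimal Python sketch of how they could be computed from paired pass/fail results. It is not tokenjam's actual implementation: the function names `mcnemar_exact` and `delta_ci` are hypothetical, and it assumes the exact (binomial) form of McNemar's test and a simple Wald interval for the paired difference, which may differ from what the tool uses.

```python
from math import comb, sqrt


def discordant(before, after):
    """Count the pairs where the two models disagree."""
    # b: passed with the original model, failed with the cheaper one
    b = sum(1 for x, y in zip(before, after) if x and not y)
    # c: failed with the original model, passed with the cheaper one
    c = sum(1 for x, y in zip(before, after) if not x and y)
    return b, c


def mcnemar_exact(before, after):
    """Two-sided exact McNemar p-value (assumed variant, not confirmed)."""
    b, c = discordant(before, after)
    n = b + c
    if n == 0:
        return 1.0  # no disagreements: no evidence of a difference
    # Under the null, each discordant pair is a fair coin flip.
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def delta_ci(before, after, z=1.96):
    """Wald 95% interval for the pass-rate delta (after minus before)."""
    b, c = discordant(before, after)
    n = len(before)
    delta = (c - b) / n
    se = sqrt((b + c) - (c - b) ** 2 / n) / n
    return delta - z * se, delta + z * se


# The 12 rows of the table above: every task passed with both models.
before = [True] * 12
after = [True] * 12
print(mcnemar_exact(before, after), delta_ci(before, after))
```

With no task where the two models disagree, this sketch returns a p-value of 1.0 and an interval that collapses to zero. The Wald interval is known to be unreliable for small samples like this one, so treat the collapsed range as a limitation of the approximation and not as proof of zero difference.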