TokenJam proof — gsm8k

tokenjam 0.5.1   n=12 tasks · k=1 sample(s) · deepseek:deepseek-reasoner → deepseek:deepseek-chat

Switching looks safe

We ran the same 12 gsm8k tasks through both models and graded every answer with the same automated judge. The cheaper candidate (deepseek-chat) cost 54% less than deepseek-reasoner and scored the same on this suite: 100% of tasks passed before, 100% after.

No statistically significant quality drop was detected on these 12 tasks, so moving to the cheaper model is a reasonable way to cut cost.

Cost: -54% (measured API $)
Accuracy vs the original model: +0 pts
Tasks changed by the swap: 0 worse · 0 better (of 12)

How each task did

Task  Original  Cheaper model  Why (the judge’s reason)
0     pass      pass           matched 18
1     pass      pass           matched 3
2     pass      pass           matched 70000
3     pass      pass           matched 540
4     pass      pass           matched 20
5     pass      pass           matched 64
6     pass      pass           matched 260
7     pass      pass           matched 160
8     pass      pass           matched 45
9     pass      pass           matched 460
10    pass      pass           matched 366
11    pass      pass           matched 694

The statistics behind it

Verdict: No significant regression  ·  McNemar p=1.000 (α=0.05)  ·  candidate chosen by explicit --candidate override

McNemar’s test asks whether the difference between the two models is larger than chance alone would explain. It looks only at the tasks where the two models disagree, and a p-value above 0.05 means the change is not statistically significant. The 95% CI on the pass-rate delta is the range the true difference is likely to lie in; if it crosses zero, the direction isn’t certain.
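
To make the McNemar figure concrete, here is a minimal sketch of the exact (binomial) form of the test on paired pass/fail results. The report does not say which variant tokenjam uses, so the exact form and the function name mcnemar_exact are assumptions made for illustration. With all 12 tasks passing under both models there are no disagreeing pairs, and the p-value comes out at 1.

```python
from math import comb


def mcnemar_exact(original, candidate):
    """Two-sided exact McNemar p-value for paired pass/fail results.

    Hypothetical helper, not tokenjam's own code.
    """
    if len(original) != len(candidate):
        raise ValueError("results must be paired task by task")
    # Discordant pairs: tasks where the two models disagree.
    worse = sum(o and not c for o, c in zip(original, candidate))
    better = sum(c and not o for o, c in zip(original, candidate))
    n = worse + better
    # Under the null hypothesis each discordant pair is a fair coin flip,
    # so the smaller count follows Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(min(worse, better) + 1)) / 2 ** n
    return worse, better, min(1.0, 2 * tail)


# The 12 tasks in this report: every task passed under both models.
original = [True] * 12
candidate = [True] * 12
worse, better, p = mcnemar_exact(original, candidate)
print(f"{worse} worse, {better} better, p={p:.3f}")  # 0 worse, 0 better, p=1.000
```

Because only disagreeing tasks carry information, a suite where both models pass everything cannot produce a small p-value, whatever its size.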

Pass-rate delta: +0.0pp [95% CI +0.0, +0.0]

Pass rate (95% CI whiskers)

Original: 12/12 (100%)
Candidate: 12/12 (100%)
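
The whiskers on a 12/12 result can be sketched with a Wilson score interval. The report does not state which interval method it uses, so Wilson and the helper name wilson_interval are assumptions chosen for illustration. The sketch shows that a perfect score on 12 tasks still leaves a lower bound well below 100%.

```python
from math import sqrt


def wilson_interval(passes, n, z=1.96):
    """Wilson score interval for a pass rate (z=1.96 gives roughly 95%).

    Hypothetical helper; the report's own interval method is not stated.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = passes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the edges.
    return max(0.0, center - half), min(1.0, center + half)


low, high = wilson_interval(12, 12)
print(f"12/12 -> lower bound {low:.3f}, upper bound {high:.3f}")
```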

Cost (measured)

Original: $0.007698
Candidate: $0.003507
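
The headline saving follows directly from these two measured totals. A small sketch of the arithmetic, with cost_saving as a hypothetical helper name:

```python
def cost_saving(original_cost, candidate_cost):
    """Fraction of the original cost saved by switching to the candidate."""
    if original_cost <= 0:
        raise ValueError("original cost must be positive")
    return (original_cost - candidate_cost) / original_cost


# Measured totals from this report, in US dollars.
saving = cost_saving(0.007698, 0.003507)
print(f"{saving:.0%} cheaper")  # 54% cheaper
```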

Generated 2026-06-26 18:51 · tokenjam-bench · proof report