TokenJam proof — humaneval

tokenjam 0.5.1   n=12 tasks · k=1 sample(s) · deepseek:deepseek-reasoner → deepseek:deepseek-chat

Switching looks safe

We ran the same 12 humaneval tasks through both models and graded every answer with the same automated judge. The cheaper candidate (deepseek-chat) cost 54% less than deepseek-reasoner and scored 8 percentage points higher on this suite: 92% of tasks passed before, 100% after.

No statistically significant quality drop was detected, so moving to the cheaper model is a reasonable way to cut cost.

-54% cheaper to run (measured API $)
+8 pts accuracy vs the original model
0 worse · 1 better: tasks changed by the swap (of 12)

How each task did

Task  Original  Cheaper model  Why (the judge’s reason)
0     pass      pass           ok
1     pass      pass           ok
2     pass      pass           ok
3     pass      pass           ok
4     pass      pass           ok
5     pass      pass           ok
6     pass      pass           ok
7     pass      pass           ok
8     pass      pass           ok
9     pass      pass           ok
10    fail      pass           ok
11    pass      pass           ok

The statistics behind it

Verdict: No significant regression  ·  McNemar p=1.000 (α=0.05)  ·  candidate chosen by explicit --candidate override

McNemar’s test asks whether the difference between the two models is bigger than chance would explain, using only the tasks where their results differ (here, 1 of 12). A p-value above 0.05 means the change is not statistically significant. The 95% CI on the pass-rate delta is the range the true difference is likely to lie in. If it crosses zero, as it does here, the direction isn’t certain.
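As a rough illustration of where p=1.000 can come from, the sketch below computes an exact two-sided McNemar p-value from the discordant counts in the summary (0 worse, 1 better). It assumes the exact binomial form of the test; the report does not say which variant tokenjam uses, and the function name mcnemar_exact is hypothetical.

```python
from math import comb


def mcnemar_exact(worse: int, better: int) -> float:
    """Exact two-sided McNemar p-value from discordant pair counts.

    Assumption: exact binomial variant (not the chi-square approximation).
    Under the null, each discordant task is equally likely to go either way.
    """
    n = worse + better
    if n == 0:
        # No task changed outcome, so there is no evidence of a difference.
        return 1.0
    k = min(worse, better)
    # Probability of a split at least as lopsided, in one direction.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    # Double for a two-sided test, capped at 1.
    return min(1.0, 2 * tail)


# Counts from the report: 0 tasks got worse, 1 task got better.
p = mcnemar_exact(worse=0, better=1)
print(f"McNemar exact p = {p:.3f}")
```

With a single discordant task the test cannot reach significance whichever way that task goes, which is why a suite of 12 tasks gives limited evidence in either direction.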

+8.3pp pass-rate delta [95% CI -7.3, +24.0]
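The interval above can be approximated with a Wald interval for a paired difference in proportions. This is a minimal sketch under that assumption: the report does not state how tokenjam computes its CI, and paired_delta_wald_ci is a hypothetical helper. The inputs are the report's counts (0 worse, 1 better, 12 tasks).

```python
from math import sqrt


def paired_delta_wald_ci(worse: int, better: int, n: int, z: float = 1.96):
    """Wald CI for the pass-rate delta of two models scored on the same n tasks.

    Assumption: normal-approximation interval on paired data; only the
    discordant tasks (worse, better) contribute to the delta and its variance.
    """
    delta = (better - worse) / n
    se = sqrt((better + worse) - (better - worse) ** 2 / n) / n
    return delta, delta - z * se, delta + z * se


delta, lo, hi = paired_delta_wald_ci(worse=0, better=1, n=12)
print(f"{delta * 100:+.1f}pp [95% CI {lo * 100:+.1f}, {hi * 100:+.1f}]")
```

The normal approximation is crude at n=12, so the bounds are best read as a rough range rather than precise limits.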

Pass rate (95% CI whiskers)

Original: 11/12 (92%)
Candidate: 12/12 (100%)

Cost (measured)

Original: $0.008754
Candidate: $0.004054
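The headline saving follows from these two measured totals. The short sketch below shows the arithmetic; relative_saving is a hypothetical helper name, and the dollar figures are copied from the report.

```python
def relative_saving(original_cost: float, candidate_cost: float) -> float:
    """Fraction of the original cost saved by switching to the candidate."""
    return (original_cost - candidate_cost) / original_cost


# Measured API cost for the 12-task run, in dollars, as reported.
saving = relative_saving(original_cost=0.008754, candidate_cost=0.004054)
print(f"{saving:.0%} cheaper")
```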


Generated 2026-06-26 18:52 · tokenjam-bench · proof report