TokenJam proof — humaneval
tokenjam 0.5.1 · n=10 tasks · k=1 sample(s) · deepseek:deepseek-reasoner → deepseek:deepseek-chat
Verdict: No significant regression · McNemar p=1.000 (α=0.05) · candidate chosen by explicit --candidate override
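The McNemar p-value in the verdict depends only on the discordant tasks, meaning those where the two models disagree. The sketch below assumes the exact (binomial) two-sided form of the test. The report does not say which variant TokenJam uses, and the function name `mcnemar_exact_p` is hypothetical. Under that assumption, the reported 0 broken / 1 fixed counts reproduce p=1.000.

```python
from math import comb

def mcnemar_exact_p(broken: int, fixed: int) -> float:
    """Two-sided exact McNemar p-value from the discordant-pair counts.

    Assumption: exact binomial variant; TokenJam's own choice is not documented here.
    """
    n = broken + fixed
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    k = min(broken, fixed)
    # Lower tail of Binomial(n, 0.5) at k, doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Counts taken from this report: 0 tasks broken, 1 task fixed.
p_value = mcnemar_exact_p(broken=0, fixed=1)
print(f"McNemar p={p_value:.3f}")  # McNemar p=1.000
```

With a single discordant task, no outcome can reach α=0.05, so this run cannot detect a regression or an improvement on its own.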
Cost delta (measured): -43.9%
Pass-rate delta: +10.0pp [95% CI -8.6, +28.6]
Tasks broken / fixed by the swap: 0 / 1
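The pass-rate interval can be checked by hand. The sketch below assumes a Wald interval for a paired difference in proportions, built from the broken/fixed counts. That choice is an inference from the reported numbers, not something the report states, and `paired_diff_ci` is a hypothetical helper name.

```python
from math import sqrt

def paired_diff_ci(broken: int, fixed: int, n: int, z: float = 1.96):
    """Wald CI for (candidate pass rate - original pass rate) on paired tasks.

    Assumption: paired Wald interval; z=1.96 gives a nominal 95% level.
    """
    diff = (fixed - broken) / n
    # Variance of a paired difference uses only the discordant counts.
    se = sqrt(broken + fixed - (fixed - broken) ** 2 / n) / n
    return diff, diff - z * se, diff + z * se

# Counts taken from this report: n=10 tasks, 0 broken, 1 fixed.
diff, lo, hi = paired_diff_ci(broken=0, fixed=1, n=10)
print(f"{diff * 100:+.1f}pp [95% CI {lo * 100:+.1f}, {hi * 100:+.1f}]")
# +10.0pp [95% CI -8.6, +28.6]
```

The interval spans zero, which is consistent with the "No significant regression" verdict. At n=10 it is wide enough that the +10.0pp point estimate should not be read as an improvement.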
Pass rate (95% CI whiskers)
Original: 9/10 (90%)
Candidate: 10/10 (100%)
Cost (measured)
Original: $0.005847
Candidate: $0.003283
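The cost delta is the relative change between the two measured totals above. A minimal check, with variable names chosen here for illustration:

```python
# Measured totals copied from this report (USD).
original_cost = 0.005847
candidate_cost = 0.003283

# Relative change of the candidate against the original, in percent.
delta_pct = (candidate_cost - original_cost) / original_cost * 100
print(f"{delta_pct:+.1f}%")  # -43.9%
```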
Per-task
| task | orig | cand | candidate detail |
| HumanEval/0 | 1/1 | 1/1 | ok |
| HumanEval/1 | 0/1 | 1/1 | ok |
| HumanEval/2 | 1/1 | 1/1 | ok |
| HumanEval/3 | 1/1 | 1/1 | ok |
| HumanEval/4 | 1/1 | 1/1 | ok |
| HumanEval/5 | 1/1 | 1/1 | ok |
| HumanEval/6 | 1/1 | 1/1 | ok |
| HumanEval/7 | 1/1 | 1/1 | ok |
| HumanEval/8 | 1/1 | 1/1 | ok |
| HumanEval/9 | 1/1 | 1/1 | ok |
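The summary counts follow directly from the per-task rows. The sketch below hard-codes the table's pass counts (k=1, so each entry is 0 or 1) and classifies a task as broken when it passed originally and fails with the candidate, and as fixed in the opposite case. The data layout is an illustration, not TokenJam's internal format.

```python
# Pass counts per task, copied from the table above (index = HumanEval task id).
orig = [1, 0, 1, 1, 1, 1, 1, 1, 1, 1]
cand = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

tasks = [f"HumanEval/{i}" for i in range(len(orig))]

# Broken: original passed, candidate failed. Fixed: the reverse.
broken = [t for t, o, c in zip(tasks, orig, cand) if o == 1 and c == 0]
fixed = [t for t, o, c in zip(tasks, orig, cand) if o == 0 and c == 1]

print(f"original {sum(orig)}/{len(orig)}, candidate {sum(cand)}/{len(cand)}")
print(f"broken / fixed: {len(broken)} / {len(fixed)}")
print("fixed tasks:", fixed)
```

Only HumanEval/1 differs between the two models, so it alone drives the pass-rate delta and the significance test.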
How to read this
- Accuracy is the pass-rate on THIS benchmark suite, not a general “quality preserved” claim. Confidence is the CI + p-value, not a single “safe %”.
- A model had no TokenJam rate, so its cost was computed with the $0.50/$2.00 default placeholder rate. Treat the savings figure as approximate.