TokenJam proof — gsm8k
tokenjam 0.5.1 · n=12 tasks · k=1 sample(s) · deepseek:deepseek-reasoner → deepseek:deepseek-chat
Verdict: No significant regression · McNemar p=1.000 (α=0.05) · candidate chosen by explicit --candidate override
Cost delta (measured): -57.0%
Pass-rate delta: +0.0pp [95% CI +0.0, +0.0]
Tasks broken / fixed by the swap: 0 / 0
Pass rate (95% CI whiskers)
Original: 12/12 (100%)
Candidate: 12/12 (100%)
Cost (measured)
Original: $0.007900
Candidate: $0.003394
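
The cost delta reported in the summary can be checked from the two measured totals above. This is a minimal sketch, assuming the delta is the candidate's relative change against the original; `cost_delta_pct` is a hypothetical helper name, not part of TokenJam.

```python
def cost_delta_pct(original_cost: float, candidate_cost: float) -> float:
    """Relative cost change of the candidate versus the original, in percent."""
    if original_cost <= 0:
        raise ValueError("original_cost must be positive")
    return (candidate_cost - original_cost) / original_cost * 100.0


# Measured totals taken from this report.
delta = cost_delta_pct(0.007900, 0.003394)
print(f"{delta:+.1f}%")  # -57.0%
```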
Per-task
| task | orig | cand | candidate detail |
| gsm8k/0 | 1/1 | 1/1 | matched 18 |
| gsm8k/1 | 1/1 | 1/1 | matched 3 |
| gsm8k/2 | 1/1 | 1/1 | matched 70000 |
| gsm8k/3 | 1/1 | 1/1 | matched 540 |
| gsm8k/4 | 1/1 | 1/1 | matched 20 |
| gsm8k/5 | 1/1 | 1/1 | matched 64 |
| gsm8k/6 | 1/1 | 1/1 | matched 260 |
| gsm8k/7 | 1/1 | 1/1 | matched 160 |
| gsm8k/8 | 1/1 | 1/1 | matched 45 |
| gsm8k/9 | 1/1 | 1/1 | matched 460 |
| gsm8k/10 | 1/1 | 1/1 | matched 366 |
| gsm8k/11 | 1/1 | 1/1 | matched 694 |
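
The broken / fixed counts and the McNemar p-value in the verdict can be reproduced from the paired per-task outcomes in the table above. The following is a minimal sketch, assuming an exact (binomial) two-sided McNemar test; the report does not state which variant TokenJam uses, and `mcnemar_exact` and `results` are hypothetical names used only for illustration.

```python
from math import comb


def mcnemar_exact(broken: int, fixed: int) -> float:
    """Two-sided exact McNemar p-value from the discordant pair counts.

    broken: tasks the original passed and the candidate failed.
    fixed:  tasks the original failed and the candidate passed.
    """
    n = broken + fixed
    if n == 0:
        # No discordant pairs: there is no evidence of any difference.
        return 1.0
    k = min(broken, fixed)
    # Binomial(n, 0.5) lower tail up to k, doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# (original passed, candidate passed) for each of the 12 tasks in this report.
results = [(True, True)] * 12
broken = sum(o and not c for o, c in results)
fixed = sum(c and not o for o, c in results)
p = mcnemar_exact(broken, fixed)
print(broken, fixed, f"p={p:.3f}")  # 0 0 p=1.000
```

Only discordant pairs (tasks where the two models disagree) carry information in this test, so a run with no disagreements yields p=1.000 regardless of how many tasks both models passed.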
How to read this
- Accuracy is the pass-rate on THIS benchmark suite, not a general “quality preserved” claim. Confidence is the CI + p-value, not a single “safe %”.
- A model in this comparison had no TokenJam rate, so its cost was computed with the $0.50/$2.00 default placeholder; the savings figure is approximate.
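
To illustrate why a 12/12 pass rate is not a general quality claim, the sketch below computes a 95% interval for a single pass rate. It assumes a Wilson score interval; the report does not say which interval method TokenJam uses, and `wilson_interval` is a hypothetical helper name.

```python
from math import sqrt


def wilson_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate of passes/n (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = passes / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)


lo, hi = wilson_interval(12, 12)
print(f"{lo:.2f} to {hi:.2f}")  # 0.76 to 1.00
```

Under this assumption, 12 passes out of 12 is still compatible with a true pass rate of roughly three quarters, which is why a small suite cannot by itself show that quality is preserved.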