TokenJam proof — gsm8k

tokenjam 0.5.1   n=12 tasks · k=1 sample(s) · deepseek:deepseek-reasoner → deepseek:deepseek-chat

Verdict: No significant regression  ·  McNemar p=1.000 (α=0.05)  ·  candidate chosen by explicit --candidate override
Cost delta (measured): -57.0%
Pass-rate delta: +0.0pp [95% CI +0.0, +0.0]
Tasks broken / fixed by the swap: 0 / 0
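
The verdict's p=1.000 is consistent with the broken / fixed counts: McNemar's test looks only at discordant tasks (those whose pass/fail outcome changed after the swap), and here there are none. The report does not say which McNemar variant tokenjam uses, so the following is only a sketch assuming the exact (binomial) two-sided form; `mcnemar_exact_p` is a hypothetical helper name, not a tokenjam function.

```python
from math import comb


def mcnemar_exact_p(broken: int, fixed: int) -> float:
    """Two-sided exact McNemar p-value from discordant pair counts.

    Assumption: exact binomial variant; tokenjam's actual method is not
    stated in the report.
    """
    n = broken + fixed  # number of discordant tasks
    if n == 0:
        # No task changed outcome, so there is no evidence of a difference.
        return 1.0
    k = min(broken, fixed)
    # Probability of a split at least this lopsided under a fair coin.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Counts from this report: 0 tasks broken, 0 tasks fixed.
print(f"p = {mcnemar_exact_p(0, 0):.3f}")  # p = 1.000
```

With only 12 tasks and no discordant pairs, the test cannot detect a regression; it can only report that none was observed.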

Pass rate (95% CI whiskers)

Original: 12/12 (100%)
Candidate: 12/12 (100%)

Cost (measured)

Original: $0.007900
Candidate: $0.003394
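
The -57.0% cost delta reported in the summary can be reproduced from these two measured totals. A minimal sketch, assuming the delta is the relative change of the candidate cost against the original cost (`relative_delta_pct` is a hypothetical name used for illustration):

```python
def relative_delta_pct(original: float, candidate: float) -> float:
    """Relative change from original to candidate, in percent."""
    return (candidate - original) / original * 100.0


# Measured totals from this report, in USD.
original_cost = 0.007900
candidate_cost = 0.003394

print(f"{relative_delta_pct(original_cost, candidate_cost):+.1f}%")  # -57.0%
```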

Per-task

task       orig  cand  candidate detail
gsm8k/0    1/1   1/1   matched 18
gsm8k/1    1/1   1/1   matched 3
gsm8k/2    1/1   1/1   matched 70000
gsm8k/3    1/1   1/1   matched 540
gsm8k/4    1/1   1/1   matched 20
gsm8k/5    1/1   1/1   matched 64
gsm8k/6    1/1   1/1   matched 260
gsm8k/7    1/1   1/1   matched 160
gsm8k/8    1/1   1/1   matched 45
gsm8k/9    1/1   1/1   matched 460
gsm8k/10   1/1   1/1   matched 366
gsm8k/11   1/1   1/1   matched 694

Generated 2026-06-26 08:48 · tokenjam-bench · executable-accuracy proof