TokenJam proof — humaneval

tokenjam 0.5.1   n=10 tasks · k=1 sample(s) · deepseek:deepseek-reasoner → deepseek:deepseek-chat

Verdict: No significant regression  ·  McNemar p=1.000 (α=0.05)  ·  candidate chosen by explicit --candidate override
Cost delta (measured): -43.9%
Pass-rate delta: +10.0pp [95% CI -8.6, +28.6]
Tasks broken / fixed by the swap: 0 / 1
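
The p-value in the verdict can be reproduced from the broken / fixed counts. The sketch below assumes a two-sided exact (binomial) McNemar test; the report does not say which variant tokenjam uses, and the function name `mcnemar_exact_p` is hypothetical.

```python
from math import comb

def mcnemar_exact_p(broken: int, fixed: int) -> float:
    """Two-sided exact McNemar p-value from the discordant task counts."""
    n = broken + fixed
    if n == 0:
        return 1.0  # no discordant tasks, so no evidence of a difference
    k = min(broken, fixed)
    # Under H0 each discordant task is equally likely to be broken or fixed,
    # so the smaller count follows Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Counts from this report: 0 tasks broken, 1 task fixed.
p = mcnemar_exact_p(broken=0, fixed=1)
print(f"McNemar p={p:.3f} (alpha=0.05)")
```

With a single discordant task the test has no power to reject at α=0.05, which is why the verdict reads "No significant regression" rather than "improvement".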

Pass rate (95% CI whiskers)

Original   9/10 (90%)
Candidate  10/10 (100%)
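
The +10.0pp delta and its interval can be recomputed from these pass counts plus the 0 broken / 1 fixed tally. The sketch below assumes a Wald (normal-approximation) interval on the paired difference. That method is an assumption: the report does not name it, though it does reproduce the printed bounds. The name `paired_delta_ci` is hypothetical.

```python
from math import sqrt

def paired_delta_ci(n: int, broken: int, fixed: int, z: float = 1.959964):
    """Pass-rate delta (candidate minus original) with a Wald interval."""
    delta = (fixed - broken) / n
    # Variance of the mean paired difference, where each task contributes
    # +1 (fixed), -1 (broken), or 0 (unchanged).
    var = (broken + fixed - (fixed - broken) ** 2 / n) / n ** 2
    half_width = z * sqrt(var)
    return delta, delta - half_width, delta + half_width

delta, lo, hi = paired_delta_ci(n=10, broken=0, fixed=1)
print(f"{delta * 100:+.1f}pp [95% CI {lo * 100:+.1f}, {hi * 100:+.1f}]")
```

At n=10 a normal approximation is rough, and the interval spans zero, consistent with the non-significant verdict.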

Cost (measured)

Original   $0.005847
Candidate  $0.003283
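
The measured cost delta is the relative change between these two totals. A minimal check, with variable names chosen for illustration:

```python
orig_cost = 0.005847  # USD, original model
cand_cost = 0.003283  # USD, candidate model

# Relative change of the candidate's cost against the original's.
cost_delta_pct = (cand_cost - orig_cost) / orig_cost * 100
print(f"{cost_delta_pct:+.1f}%")
```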

Per-task

task          orig  cand  candidate detail
HumanEval/0   1/1   1/1   ok
HumanEval/1   0/1   1/1   ok
HumanEval/2   1/1   1/1   ok
HumanEval/3   1/1   1/1   ok
HumanEval/4   1/1   1/1   ok
HumanEval/5   1/1   1/1   ok
HumanEval/6   1/1   1/1   ok
HumanEval/7   1/1   1/1   ok
HumanEval/8   1/1   1/1   ok
HumanEval/9   1/1   1/1   ok
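
The summary figures follow from the per-task rows. As a sketch, the rows are transcribed by hand below (k=1, so each task is a single pass or fail) and tallied. The tuple layout is an illustrative choice, not tokenjam's output format.

```python
# (task, original passed, candidate passed), transcribed from the per-task rows.
results = [
    ("HumanEval/0", True, True),
    ("HumanEval/1", False, True),
    ("HumanEval/2", True, True),
    ("HumanEval/3", True, True),
    ("HumanEval/4", True, True),
    ("HumanEval/5", True, True),
    ("HumanEval/6", True, True),
    ("HumanEval/7", True, True),
    ("HumanEval/8", True, True),
    ("HumanEval/9", True, True),
]

orig_pass = sum(o for _, o, _ in results)
cand_pass = sum(c for _, _, c in results)
broken = [t for t, o, c in results if o and not c]  # passed before, fails now
fixed = [t for t, o, c in results if c and not o]   # failed before, passes now

print(f"Original {orig_pass}/{len(results)}, Candidate {cand_pass}/{len(results)}")
print(f"broken / fixed: {len(broken)} / {len(fixed)}")
```

Only HumanEval/1 changes outcome between the two models, so it is the single discordant task behind every headline statistic.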

Generated 2026-06-26 08:49 · tokenjam-bench · executable-accuracy proof