TokenJam proof — humaneval

tokenjam 0.5.1   n=20 tasks · k=1 sample(s) · deepseek:deepseek-reasoner → deepseek:deepseek-chat

Verdict: No significant regression  ·  McNemar p=1.000 (α=0.05)  ·  candidate chosen by explicit --candidate override
Cost delta (measured): -75.6%
Pass-rate delta: +5.0pp [95% CI -4.5, +14.6]
Tasks broken / fixed by the swap: 0 / 1
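
The p-value in the verdict can be reproduced from the broken / fixed counts alone, since McNemar's test only looks at tasks whose outcome changed. The sketch below assumes the exact (binomial) form of the test. The report does not say which variant tokenjam uses, and the name mcnemar_exact is hypothetical, not part of tokenjam.

```python
from math import comb


def mcnemar_exact(broken: int, fixed: int) -> float:
    """Two-sided exact McNemar p-value from the discordant pair counts.

    Assumption: exact binomial variant; tokenjam's actual choice is not stated.
    """
    n = broken + fixed
    if n == 0:
        # No task changed outcome, so there is no evidence of a difference.
        return 1.0
    k = min(broken, fixed)
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Counts taken from this report: 0 tasks broken, 1 task fixed.
p_value = mcnemar_exact(broken=0, fixed=1)
print(f"McNemar p={p_value:.3f}")  # McNemar p=1.000
```

With a single discordant task, no outcome can reach significance at α=0.05, which is why n=20 with one changed task yields "No significant regression".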

Pass rate (95% CI whiskers)

Original   19/20 (95%)
Candidate  20/20 (100%)

Cost (measured)

Original   $0.015005
Candidate  $0.003660
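
The -75.6% cost delta is consistent with a plain relative change between the two measured totals. A minimal sketch of that arithmetic follows. The helper name relative_delta_pct is hypothetical, and the formula is an assumption about how the figure was derived.

```python
def relative_delta_pct(original: float, candidate: float) -> float:
    """Relative change from original to candidate, in percent.

    Assumption: the report's cost delta is (candidate - original) / original.
    """
    return (candidate - original) / original * 100


# Measured totals from this report, in US dollars.
original_cost = 0.015005
candidate_cost = 0.003660

cost_delta = relative_delta_pct(original_cost, candidate_cost)
print(f"{cost_delta:+.1f}%")  # -75.6%
```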

Per-task

task          orig  cand  candidate detail
HumanEval/0   1/1   1/1   ok
HumanEval/1   1/1   1/1   ok
HumanEval/2   1/1   1/1   ok
HumanEval/3   1/1   1/1   ok
HumanEval/4   1/1   1/1   ok
HumanEval/5   1/1   1/1   ok
HumanEval/6   1/1   1/1   ok
HumanEval/7   1/1   1/1   ok
HumanEval/8   1/1   1/1   ok
HumanEval/9   1/1   1/1   ok
HumanEval/10  0/1   1/1   ok
HumanEval/11  1/1   1/1   ok
HumanEval/12  1/1   1/1   ok
HumanEval/13  1/1   1/1   ok
HumanEval/14  1/1   1/1   ok
HumanEval/15  1/1   1/1   ok
HumanEval/16  1/1   1/1   ok
HumanEval/17  1/1   1/1   ok
HumanEval/18  1/1   1/1   ok
HumanEval/19  1/1   1/1   ok
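
The headline numbers follow directly from the per-task rows. The sketch below tallies pass counts, broken / fixed tasks and the pass-rate delta from (task, orig, cand) tuples. The row layout and the summarize helper are hypothetical, not tokenjam's own data model. With k=1, each entry is 1 if the single sample passed and 0 otherwise.

```python
from typing import List, Tuple

Row = Tuple[str, int, int]  # (task id, orig passed, cand passed), k=1


def summarize(rows: List[Row]) -> dict:
    """Tally pass counts and outcome changes between original and candidate."""
    n = len(rows)
    orig_pass = sum(orig for _, orig, _ in rows)
    cand_pass = sum(cand for _, _, cand in rows)
    # "Broken": passed with the original model, failed with the candidate.
    broken = sum(1 for _, orig, cand in rows if orig == 1 and cand == 0)
    # "Fixed": failed with the original model, passed with the candidate.
    fixed = sum(1 for _, orig, cand in rows if orig == 0 and cand == 1)
    return {
        "orig_pass": orig_pass,
        "cand_pass": cand_pass,
        "broken": broken,
        "fixed": fixed,
        "delta_pp": 100 * (cand_pass - orig_pass) / n,
    }


# Rows as listed in this report: only HumanEval/10 failed with the original.
rows = [(f"HumanEval/{i}", 0 if i == 10 else 1, 1) for i in range(20)]
summary = summarize(rows)
print(summary)
```

For this run the tally gives 19/20 and 20/20 passes, 0 broken, 1 fixed and a +5.0pp delta, matching the summary at the top of the report.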


Generated 2026-06-25 10:19 · tokenjam-bench · executable-accuracy proof