TokenJam proof — humaneval

tokenjam 0.5.2   n=50 tasks · k=1 sample(s) · openai:o3 → openai:o4-mini

Switching looks safe

We ran the same 50 HumanEval tasks through both models and graded every answer with the same automated judge. The cheaper candidate (o4-mini) cost 88% less than o3 but scored 4 percentage points lower on this suite: 88% of tasks passed before, 84% after.

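No statistically significant quality drop was detected (McNemar p=0.688), so moving to the cheaper model is a reasonable way to cut cost. With only 50 tasks, though, the 95% CI on the pass-rate delta runs from -13.5 to +5.5 points, so a real drop of several points cannot be ruled out.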

Cost to run (measured API $): -88%
Accuracy vs the original model: -4 pts
Tasks changed by the swap (of 50): 4 worse · 2 better

How each task did

Task  Original  Cheaper model  Why (the judge’s reason)
0     pass      pass           ok
1     pass      pass           ok
2     pass      pass           ok
3     fail      fail           NameError: name 'List' is not defined. Did you mean: 'list'?
4     pass      pass           ok
5     pass      fail           NameError: name 'List' is not defined. Did you mean: 'list'?
6     fail      pass           ok
7     fail      pass           ok
8     pass      pass           ok
9     pass      pass           ok
10    pass      fail           NameError: name 'is_palindrome' is not defined. Did you mean: 'make_palindrome'?
11    pass      pass           ok
12    pass      fail           NameError: name 'List' is not defined. Did you mean: 'list'?
13    pass      pass           ok
14    pass      pass           ok
15    pass      pass           ok
16    pass      pass           ok
17    pass      pass           ok
18    pass      pass           ok
19    pass      pass           ok
20    fail      fail           NameError: name 'List' is not defined. Did you mean: 'list'?
21    pass      pass           ok
22    pass      pass           ok
23    pass      pass           ok
24    pass      pass           ok
25    pass      pass           ok
26    pass      pass           ok
27    pass      pass           ok
28    pass      fail           NameError: name 'List' is not defined. Did you mean: 'list'?
29    pass      pass           ok
30    pass      pass           ok
31    pass      pass           ok
32    fail      fail           NameError: name 'math' is not defined
33    pass      pass           ok
34    pass      pass           ok
35    pass      pass           ok
36    pass      pass           ok
37    pass      pass           ok
38    fail      fail           NameError: name 'encode_cyclic' is not defined. Did you mean: 'decode_cyclic'?
39    pass      pass           ok
40    pass      pass           ok
41    pass      pass           ok
42    pass      pass           ok
43    pass      pass           ok
44    pass      pass           ok
45    pass      pass           ok
46    pass      pass           ok
47    pass      pass           ok
48    pass      pass           ok
49    pass      pass           ok
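
The headline counts can be re-derived from the per-task results. The sketch below is an illustration, not TokenJam's own code: the failing task IDs are transcribed by hand from the table, and the variable names are made up for this example.

```python
# Failing task IDs, transcribed from the per-task table.
N_TASKS = 50
original_fail = {3, 6, 7, 20, 32, 38}
candidate_fail = {3, 5, 10, 12, 20, 28, 32, 38}

# Pass counts for each model.
original_pass = N_TASKS - len(original_fail)
candidate_pass = N_TASKS - len(candidate_fail)

# Discordant pairs: tasks whose outcome changed with the swap.
worse = sorted(candidate_fail - original_fail)   # passed before, fail now
better = sorted(original_fail - candidate_fail)  # failed before, pass now

print(f"original  {original_pass}/{N_TASKS}")
print(f"candidate {candidate_pass}/{N_TASKS}")
print(f"worse: {worse}  better: {better}")
```

Only the tasks in `worse` and `better` carry information about the difference between the two models; the tasks where both models agree cancel out in a paired comparison.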

The statistics behind it

Verdict: No significant regression  ·  McNemar p=0.688 (α=0.05)  ·  candidate chosen by explicit --candidate override

McNemar’s test asks whether the difference between the two models is bigger than chance: a p-value above 0.05 means the change is not statistically significant. The 95% CI on the pass-rate delta is the range the true difference is likely in — if it crosses zero, the direction isn’t certain.
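
As a rough check on the p-value, here is a minimal sketch of the exact (binomial) form of McNemar's test applied to the 4 worse and 2 better tasks. The report does not say which variant of the test TokenJam uses, so the exact form is an assumption; the function name `mcnemar_exact` is hypothetical.

```python
from math import comb

def mcnemar_exact(worse: int, better: int) -> float:
    """Two-sided exact McNemar p-value from the discordant pair counts.

    Under the null hypothesis each changed task is equally likely to be
    a regression or an improvement, so the smaller count follows a
    Binomial(n, 0.5) distribution.
    """
    n = worse + better
    if n == 0:
        return 1.0  # no task changed, so there is no evidence either way
    k = min(worse, better)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

p_value = mcnemar_exact(worse=4, better=2)
print(f"p = {p_value:.4f}")
```

With only six changed tasks, a 4-to-2 split is well within what chance alone would produce, which is why the test does not flag a regression.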

Pass-rate delta: -4.0pp [95% CI -13.5, +5.5]
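
The report does not state how the interval is computed. One common choice for paired pass/fail data is the Wald interval on the difference of paired proportions, sketched below as an assumption; the function name `paired_delta_ci` and the z value of 1.96 are illustrative choices.

```python
from math import sqrt

def paired_delta_ci(worse: int, better: int, n: int, z: float = 1.96):
    """Wald CI for the paired pass-rate difference (candidate - original).

    Only the discordant tasks contribute: `better` raises the candidate's
    pass rate, `worse` lowers it.
    """
    delta = (better - worse) / n
    se = sqrt((worse + better) - (worse - better) ** 2 / n) / n
    return delta, delta - z * se, delta + z * se

delta, low, high = paired_delta_ci(worse=4, better=2, n=50)
print(f"delta = {delta * 100:+.1f}pp, 95% CI [{low * 100:+.1f}, {high * 100:+.1f}]")
```

The interval is wide because it rests on 50 tasks, and it spans zero, so the data are compatible with both a moderate drop and a small gain.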

Pass rate (95% CI whiskers)

Original: 44/50 (88%)
Candidate: 42/50 (84%)

Cost (measured)

Original: $1.192450
Candidate: $0.148118
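
The 88% figure follows directly from these two totals. A short arithmetic sketch, with variable names chosen for illustration and the per-task averages assuming the totals cover all 50 tasks:

```python
# Measured totals for the whole run, taken from the report.
original_cost = 1.192450
candidate_cost = 0.148118
n_tasks = 50

# Fraction of spend removed by switching models.
saving = 1 - candidate_cost / original_cost

# Average cost per task for each model.
original_per_task = original_cost / n_tasks
candidate_per_task = candidate_cost / n_tasks

print(f"saving: {saving:.1%}")
print(f"per task: ${original_per_task:.5f} -> ${candidate_per_task:.5f}")
```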

Generated 2026-06-26 13:22 · tokenjam-bench · proof report