TokenJam proof — humaneval

tokenjam 0.5.2   n=50 tasks · k=1 sample(s) · openai:gpt-4o → openai:gpt-4o-mini

Switching looks safe

We ran the same 50 humaneval tasks through both models and graded every answer with the same automated judge. The cheaper candidate (gpt-4o-mini) cost 95% less than gpt-4o but scored 10 percentage points lower on this suite: 90% of tasks passed before, 80% after.

No statistically significant quality drop was detected (McNemar p=0.180, above the 0.05 threshold), so moving to the cheaper model is a reasonable way to cut cost.

-95%: cheaper to run (measured API $)
-10 pts: accuracy vs the original model
7 worse · 2 better: tasks changed by the swap (of 50)

How each task did

Task  Original  Cheaper model  Why (the judge’s reason)
0     pass      pass           ok
1     pass      fail           NameError: name 'List' is not defined. Did you mean: 'list'?
2     pass      pass           ok
3     pass      pass           ok
4     pass      pass           ok
5     pass      pass           ok
6     pass      fail           NameError: name 'List' is not defined. Did you mean: 'list'?
7     pass      pass           ok
8     pass      fail           NameError: name 'List' is not defined. Did you mean: 'list'?
9     fail      fail           NameError: name 'List' is not defined. Did you mean: 'list'?
10    fail      pass           ok
11    pass      pass           ok
12    pass      pass           ok
13    pass      pass           ok
14    pass      pass           ok
15    pass      pass           ok
16    pass      pass           ok
17    pass      pass           ok
18    pass      pass           ok
19    pass      pass           ok
20    pass      fail           NameError: name 'List' is not defined. Did you mean: 'list'?
21    pass      pass           ok
22    pass      pass           ok
23    pass      pass           ok
24    pass      pass           ok
25    pass      fail           NameError: name 'List' is not defined. Did you mean: 'list'?
26    pass      fail           NameError: name 'List' is not defined. Did you mean: 'list'?
27    pass      pass           ok
28    pass      pass           ok
29    pass      pass           ok
30    pass      pass           ok
31    pass      pass           ok
32    fail      fail           NameError: name 'poly' is not defined
33    pass      pass           ok
34    pass      pass           ok
35    pass      pass           ok
36    pass      pass           ok
37    pass      pass           ok
38    fail      fail           NameError: name 'encode_cyclic' is not defined. Did you mean: 'decode_cyclic'?
39    pass      pass           ok
40    pass      fail           AssertionError
41    fail      pass           ok
42    pass      pass           ok
43    pass      pass           ok
44    pass      pass           ok
45    pass      pass           ok
46    pass      pass           ok
47    pass      pass           ok
48    pass      pass           ok
49    pass      pass           ok
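
The per-task outcomes can be tallied mechanically. The sketch below is illustrative only and is not TokenJam's own code: the two sets of failing task IDs were transcribed by hand from the table, and the variable names are hypothetical.

```python
# Failing task IDs transcribed from the per-task table (hypothetical names).
N_TASKS = 50
original_failed = {9, 10, 32, 38, 41}
candidate_failed = {1, 6, 8, 9, 20, 25, 26, 32, 38, 40}

# Tasks whose outcome flipped between the two models.
worse = candidate_failed - original_failed    # pass -> fail
better = original_failed - candidate_failed   # fail -> pass
both_failed = original_failed & candidate_failed

original_pass_rate = (N_TASKS - len(original_failed)) / N_TASKS
candidate_pass_rate = (N_TASKS - len(candidate_failed)) / N_TASKS

print(f"worse: {sorted(worse)}")
print(f"better: {sorted(better)}")
print(f"failed under both: {sorted(both_failed)}")
print(f"pass rate: {original_pass_rate:.0%} -> {candidate_pass_rate:.0%}")
```

This reproduces the "7 worse · 2 better" split and the 90% and 80% pass rates. Six of the seven regressions (tasks 1, 6, 8, 20, 25, 26) share the same NameError on 'List', which in Python usually points to a missing `from typing import List` rather than to faulty logic, though the report itself does not say so.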

The statistics behind it

Verdict: No significant regression  ·  McNemar p=0.180 (α=0.05)  ·  candidate chosen by explicit --candidate override

McNemar’s test asks whether the difference between the two models is larger than chance alone would produce, using only the tasks whose outcome changed between models: a p-value above 0.05 means the change is not statistically significant. The 95% CI on the pass-rate delta is the range the true difference likely falls in; if it crosses zero, the direction isn’t certain.
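
As a rough check, the p-value can be recomputed from the two discordant counts in this report (7 tasks got worse, 2 got better). The sketch assumes the exact two-sided binomial form of McNemar's test; whether TokenJam uses this variant is an assumption, and `mcnemar_exact` is a hypothetical helper name.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant counts b and c."""
    n = b + c
    if n == 0:
        return 1.0  # no task changed outcome, so there is nothing to test
    k = min(b, c)
    # Under the null hypothesis each discordant task is a fair coin flip,
    # so the smaller count follows Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


p_value = mcnemar_exact(7, 2)
print(f"p = {p_value:.3f}")  # p = 0.180
```

Under that assumption the result agrees with the reported p=0.180. Tasks that passed or failed under both models do not enter the calculation.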

-10.0pp: pass-rate delta [95% CI -21.4, +1.4]
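
The interval can be approximated with a Wald interval for a paired difference in proportions. This is a sketch under the assumption that the report uses that method; the report does not name it, and `paired_delta_ci` is a hypothetical helper name. The inputs are the 7 worse and 2 better tasks out of 50.

```python
from math import sqrt


def paired_delta_ci(worse: int, better: int, n: int, z: float = 1.959964):
    """Wald 95% CI for the paired pass-rate difference (candidate - original)."""
    delta = (better - worse) / n
    # Standard error of a paired difference depends only on the discordant counts.
    se = sqrt(worse + better - (better - worse) ** 2 / n) / n
    return delta, delta - z * se, delta + z * se


delta, lo, hi = paired_delta_ci(worse=7, better=2, n=50)
print(f"{delta * 100:+.1f}pp [95% CI {lo * 100:+.1f}, {hi * 100:+.1f}]")
# -10.0pp [95% CI -21.4, +1.4]
```

With these inputs the sketch matches the reported bounds. The upper bound sits just above zero, which is why the verdict is "no significant regression" even though the point estimate is a 10-point drop.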

Pass rate (95% CI whiskers)

Original: 45/50 (90%)
Candidate: 40/50 (80%)

Cost (measured)

Original: $0.084517
Candidate: $0.004287
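
The headline saving follows directly from these two totals. A minimal sketch, with hypothetical variable names and the assumption that both figures are totals over the 50-task run:

```python
# Measured totals from the report (assumed to cover the whole 50-task run).
original_cost = 0.084517
candidate_cost = 0.004287
n_tasks = 50

# Fraction of the original spend that the cheaper model avoids.
saving = 1 - candidate_cost / original_cost

print(f"saving: {saving:.1%}")
print(f"per task: ${original_cost / n_tasks:.6f} -> ${candidate_cost / n_tasks:.6f}")
```

The saving comes out just under 95%, which rounds to the -95% shown at the top of the report.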


Generated 2026-06-26 13:06 · tokenjam-bench · proof report