TokenJam proof — humaneval

tokenjam 0.5.2   n=50 tasks · k=1 sample(s) · anthropic:claude-opus-4-7 → anthropic:claude-haiku-4-5

Don't switch

We ran the same 50 humaneval tasks through both models and graded every answer with the same automated judge. The cheaper candidate (claude-haiku-4-5) cost 82% less than claude-opus-4-7 but scored 34 points lower on this suite: 90% of tasks (45/50) passed with the original, 56% (28/50) with the candidate.

The cheaper model was measurably worse on this suite — keep the original.

82% cheaper to run (measured API $)
-34 pts accuracy vs the original model
17 worse · 0 better: tasks changed by the swap (of 50)

How each task did

Task  Original  Cheaper model  Why (the judge’s reason)
0     pass      fail           NameError: name 'List' is not defined. Did you mean: 'list'?
1     pass      fail           NameError: name 'List' is not defined. Did you mean: 'list'?
2     pass      pass           ok
3     fail      fail           NameError: name 'List' is not defined. Did you mean: 'list'?
4     pass      fail           NameError: name 'List' is not defined. Did you mean: 'list'?
5     pass      fail           NameError: name 'List' is not defined. Did you mean: 'list'?
6     pass      fail           NameError: name 'List' is not defined. Did you mean: 'list'?
7     pass      fail           NameError: name 'List' is not defined. Did you mean: 'list'?
8     pass      fail           NameError: name 'List' is not defined. Did you mean: 'list'?
9     fail      fail           NameError: name 'List' is not defined. Did you mean: 'list'?
10    fail      fail           NameError: name 'is_palindrome' is not defined. Did you mean: 'make_palindrome'?
11    pass      pass           ok
12    pass      fail           NameError: name 'List' is not defined. Did you mean: 'list'?
13    pass      pass           ok
14    pass      fail           NameError: name 'List' is not defined. Did you mean: 'list'?
15    pass      pass           ok
16    pass      pass           ok
17    pass      fail           NameError: name 'List' is not defined. Did you mean: 'list'?
18    pass      pass           ok
19    pass      pass           ok
20    pass      fail           NameError: name 'List' is not defined. Did you mean: 'list'?
21    pass      fail           NameError: name 'List' is not defined. Did you mean: 'list'?
22    pass      fail           NameError: name 'List' is not defined. Did you mean: 'list'?
23    pass      pass           ok
24    pass      pass           ok
25    pass      fail           NameError: name 'List' is not defined. Did you mean: 'list'?
26    pass      fail           NameError: name 'List' is not defined. Did you mean: 'list'?
27    pass      pass           ok
28    pass      fail           NameError: name 'List' is not defined. Did you mean: 'list'?
29    pass      fail           NameError: name 'List' is not defined. Did you mean: 'list'?
30    pass      pass           ok
31    pass      pass           ok
32    fail      fail           NameError: name 'poly' is not defined
33    pass      pass           ok
34    pass      pass           ok
35    pass      pass           ok
36    pass      pass           ok
37    pass      pass           ok
38    fail      fail           NameError: name 'encode_cyclic' is not defined. Did you mean: 'decode_cyclic'?
39    pass      pass           ok
40    pass      pass           ok
41    pass      pass           ok
42    pass      pass           ok
43    pass      pass           ok
44    pass      pass           ok
45    pass      pass           ok
46    pass      pass           ok
47    pass      pass           ok
48    pass      pass           ok
49    pass      pass           ok
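
All 17 tasks that flipped from pass to fail carry the same judge reason, a NameError on 'List'. The report does not say why the name was missing. The Python sketch below shows one way such an error can arise, assuming the graded program uses typing.List in a signature without `from typing import List`. The function body and the `run` helper are illustrative and are not taken from TokenJam or from the graded answers.

```python
# Illustrative completion: annotates with List but never imports it.
COMPLETION = '''
def has_close_elements(numbers: List[float], threshold: float) -> bool:
    return any(
        abs(a - b) < threshold
        for i, a in enumerate(numbers)
        for b in numbers[i + 1:]
    )
'''


def run(source: str, func_name: str) -> str:
    """Execute source in a fresh namespace; return 'ok' or the NameError text."""
    namespace: dict = {}
    try:
        exec(source, namespace)
        # Interpreters that defer annotation evaluation only fail when the
        # annotations are read, so read them explicitly here.
        namespace[func_name].__annotations__
    except NameError as err:
        return f"NameError: {err}"
    return "ok"


without_import = run(COMPLETION, "has_close_elements")
with_import = run("from typing import List\n" + COMPLETION, "has_close_elements")

print(without_import)
print(with_import)
```

If the missing import is indeed the cause, the failures reflect how the answer was assembled for execution rather than the logic of the solution, which is worth checking before reading the 34-point drop as a pure capability gap.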

The statistics behind it

Verdict: Significant regression  ·  McNemar p<0.001 (α=0.05)  ·  candidate chosen by explicit --candidate override

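McNemar’s test asks whether the difference between the two models is bigger than chance. It looks only at the tasks where the two models disagree: a p-value below 0.05 means the difference is statistically significant, and a p-value above 0.05 means it is not. The 95% CI on the pass-rate delta is the range the true difference is likely to fall in; if it crosses zero, the direction isn’t certain. Here the p-value is well below 0.05 and the whole interval lies below zero.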
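
As a rough check, the p-value can be recomputed from the two discordant counts in this report (17 worse, 0 better). The sketch below uses the exact binomial form of McNemar's test. The report does not state which variant TokenJam applies, so the function name and the choice of the exact form are assumptions.

```python
from math import comb


def mcnemar_exact(worse: int, better: int) -> float:
    """Two-sided exact McNemar p-value from the discordant pair counts."""
    n = worse + better
    if n == 0:
        return 1.0  # no disagreements, so no evidence of a difference
    k = min(worse, better)
    # Probability of a split at least this lopsided if each disagreement
    # were equally likely to go either way.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


p_value = mcnemar_exact(worse=17, better=0)
print(f"p = {p_value:.2e}")  # p = 1.53e-05
```

Tasks where both models pass or both fail do not enter the test, which is why only the 17 changed tasks matter here.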

-34.0pp pass-rate delta [95% CI -47.1, -20.9]
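
The interval above can be reproduced with a Wald interval for a paired difference in proportions, using the 17 worse and 0 better tasks out of 50. That the numbers match suggests, but does not confirm, that this is the formula behind the report. The function name and the z value of 1.96 are assumptions.

```python
from math import sqrt


def paired_delta_ci(worse: int, better: int, n: int, z: float = 1.96):
    """Wald CI for the candidate-minus-original pass rate on paired tasks."""
    delta = (better - worse) / n
    # Variance of a paired difference depends only on the discordant counts.
    variance = (worse + better - (better - worse) ** 2 / n) / n ** 2
    half_width = z * sqrt(variance)
    return delta, delta - half_width, delta + half_width


delta, lower, upper = paired_delta_ci(worse=17, better=0, n=50)
print(f"{delta * 100:.1f}pp [{lower * 100:.1f}, {upper * 100:.1f}]")
# -34.0pp [-47.1, -20.9]
```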

Pass rate (95% CI whiskers)

Original: 45/50 (90%)
Candidate: 28/50 (56%)

Cost (measured)

Original: $0.221305
Candidate: $0.040702
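
The 82% figure follows directly from the two measured totals. The short sketch below shows the arithmetic and also derives a per-task cost. The variable names are illustrative, and the per-task figure assumes the totals cover exactly the 50 tasks at k=1.

```python
original_cost = 0.221305   # measured API $ for the original model
candidate_cost = 0.040702  # measured API $ for the cheaper candidate
n_tasks = 50

# Relative change in total cost when switching to the candidate.
change = (candidate_cost - original_cost) / original_cost
print(f"{change:+.0%}")  # -82%

# Average spend per task for each model.
per_task_original = original_cost / n_tasks
per_task_candidate = candidate_cost / n_tasks
print(f"${per_task_original:.6f} vs ${per_task_candidate:.6f} per task")
```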

Generated 2026-06-26 13:01 · tokenjam-bench · proof report