TokenJam proof — gsm8k

tokenjam 0.5.2   n=50 tasks · k=1 sample · anthropic:claude-sonnet-4-6 → anthropic:claude-haiku-4-5

Switching looks safe

We ran the same 50 gsm8k tasks through both models and graded every answer with the same automated judge. The cheaper candidate (claude-haiku-4-5) cost 68% less than claude-sonnet-4-6 but scored 2 percentage points lower on this suite: 98% of tasks passed before, 96% after.

No statistically significant quality drop was detected (McNemar p=1.000 at α=0.05), so moving to the cheaper model is a reasonable way to cut cost.

Cost to run (measured API $): -68%
Accuracy vs the original model: -2 pts
Tasks changed by the swap (of 50): 1 worse · 0 better

How each task did

Task  Original  Cheaper model  Why (the judge’s reason)
0     pass      pass           matched 18
1     pass      pass           matched 3
2     pass      pass           matched 70000
3     pass      pass           matched 540
4     pass      pass           matched 20
5     pass      pass           matched 64
6     pass      pass           matched 260
7     pass      pass           matched 160
8     pass      pass           matched 45
9     pass      pass           matched 460
10    pass      pass           matched 366
11    pass      pass           matched 694
12    fail      fail           expected 13, got 12
13    pass      pass           matched 18
14    pass      pass           matched 60
15    pass      pass           matched 125
16    pass      pass           matched 230
17    pass      pass           matched 57500
18    pass      pass           matched 7
19    pass      pass           matched 6
20    pass      pass           matched 15
21    pass      pass           matched 14
22    pass      pass           matched 7
23    pass      pass           matched 8
24    pass      pass           matched 26
25    pass      pass           matched 2
26    pass      pass           matched 243
27    pass      pass           matched 16
28    pass      pass           matched 25
29    pass      pass           matched 104
30    pass      pass           matched 109
31    pass      pass           matched 80
32    pass      pass           matched 35
33    pass      pass           matched 70
34    pass      pass           matched 23
35    pass      pass           matched 9
36    pass      pass           matched 75
37    pass      fail           expected 2, got 0
38    pass      pass           matched 10
39    pass      pass           matched 18
40    pass      pass           matched 8
41    pass      pass           matched 200
42    pass      pass           matched 26
43    pass      pass           matched 48
44    pass      pass           matched 20
45    pass      pass           matched 104
46    pass      pass           matched 163
47    pass      pass           matched 800
48    pass      pass           matched 8
49    pass      pass           matched 30
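
The headline counts can be re-derived from the per-task results above. The following is a minimal sketch, not tokenjam's own code: the `outcomes` dictionary and the `tally` helper are hypothetical names, and the data is transcribed by hand from the table (only tasks 12 and 37 differ from pass/pass).

```python
# Per-task outcomes transcribed from the table: (original, cheaper model).
# Hypothetical structure for illustration; only tasks 12 and 37 are not pass/pass.
outcomes = {task: ("pass", "pass") for task in range(50)}
outcomes[12] = ("fail", "fail")
outcomes[37] = ("pass", "fail")


def tally(outcomes):
    """Count passes per model and the tasks whose result changed."""
    pairs = list(outcomes.values())
    return {
        "orig_pass": sum(o == "pass" for o, _ in pairs),
        "cand_pass": sum(c == "pass" for _, c in pairs),
        "worse": sum(o == "pass" and c == "fail" for o, c in pairs),
        "better": sum(o == "fail" and c == "pass" for o, c in pairs),
    }


summary = tally(outcomes)
print(summary)
```

Task 12 fails under both models, so it lowers both pass rates but does not count as a change; only task 37 is a regression.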

The statistics behind it

Verdict: No significant regression  ·  McNemar p=1.000 (α=0.05)  ·  candidate chosen by explicit --candidate override

McNemar’s test asks whether the difference between the two models is bigger than chance alone would explain: a p-value above 0.05 means the change is not statistically significant. The test looks only at the tasks where the two models disagree (here 1 worse, 0 better). The 95% CI on the pass-rate delta is the range the true difference is likely to lie in; if it crosses zero, the direction isn’t certain.
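
As a rough illustration of where p=1.000 can come from, here is a sketch of the exact (binomial) form of McNemar's test applied to the discordant counts in this report. The report does not say which variant tokenjam uses, so the exact form is an assumption, and `mcnemar_exact` is a hypothetical helper name.

```python
from math import comb


def mcnemar_exact(worse: int, better: int) -> float:
    """Two-sided exact McNemar p-value from the discordant pair counts."""
    n = worse + better
    if n == 0:
        return 1.0  # no disagreements, nothing to test
    k = min(worse, better)
    # Under the null hypothesis each discordant task is equally likely
    # to go either way, so the smaller count follows Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


p_value = mcnemar_exact(worse=1, better=0)
print(f"p = {p_value:.3f}")  # p = 1.000
```

With a single discordant task, no outcome could reach significance, which is why one regression in 50 tasks yields p=1.000.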

Pass-rate delta: -2.0pp [95% CI -5.9, +1.9]
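
The interval above is consistent with a Wald interval for a paired difference in proportions. The sketch below assumes that method (the report does not name it); `paired_delta_ci` is a hypothetical helper, and z=1.96 is the usual normal quantile for 95% coverage.

```python
from math import sqrt


def paired_delta_ci(worse: int, better: int, n: int, z: float = 1.96):
    """Wald interval for the pass-rate difference (candidate - original) on paired tasks."""
    delta = (better - worse) / n
    # For paired outcomes, only the discordant counts contribute to the variance.
    se = sqrt((worse + better) - (better - worse) ** 2 / n) / n
    return delta, delta - z * se, delta + z * se


delta, low, high = paired_delta_ci(worse=1, better=0, n=50)
print(f"{delta * 100:+.1f}pp [95% CI {low * 100:+.1f}, {high * 100:+.1f}]")
# -2.0pp [95% CI -5.9, +1.9]
```

Because the upper bound is above zero, these 50 tasks cannot rule out that the two models are equally accurate.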

Pass rate (95% CI whiskers)

Original: 49/50 (98%)
Candidate: 48/50 (96%)

Cost (measured)

Original: $0.138756
Candidate: $0.044054
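
The -68% headline follows directly from these two measured totals. A small sketch of the arithmetic, with `relative_change` as a hypothetical helper name:

```python
original_cost = 0.138756   # USD, measured total for the original model (from the report)
candidate_cost = 0.044054  # USD, measured total for the candidate model (from the report)


def relative_change(before: float, after: float) -> float:
    """Fractional change from `before` to `after`; negative means cheaper."""
    return (after - before) / before


change = relative_change(original_cost, candidate_cost)
print(f"{change:+.0%}")  # -68%
```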


Generated 2026-06-26 13:06 · tokenjam-bench · proof report