TokenJam proof — judged

tokenjam 0.5.1   n=5 tasks · k=1 sample(s) · deepseek:deepseek-reasoner → deepseek:deepseek-chat

Not enough evidence yet

We ran the same 5 judged tasks through both models and graded every answer with the same automated judge. The cheaper candidate (deepseek-chat) cost 59% less than deepseek-reasoner and scored 20 percentage points higher on this suite: 40% of tasks passed before, 60% after.

Too few tasks were run to be statistically sure either way. Run more before deciding.

Cost to run (measured API $): -59%
Accuracy vs the original model: +20 pts
Tasks changed by the swap (of 5): 0 worse · 1 better

How each task did

Task | Original | Cheaper model | Why (the judge’s reason)
Refund policy | fail | fail | The actual output does not provide the specific 30-day refund window from the ex… · graded 0.00 (needs 0.50)
Capital | pass | pass | The actual output exactly matches the expected output, with no inaccuracies, omi… · graded 1.00 (needs 0.50)
Retry summary | pass | pass | The actual output states 'attempted the task five times' and 'ultimately failed'… · graded 1.00 (needs 0.50)
Shipping | fail | fail | The actual output does not provide a specific time frame of 5 to 7 business days… · graded 0.00 (needs 0.50)
Define llm | fail | pass | The actual output adds extra details about training on vast text data and genera… · graded 0.60 (needs 0.50)

The statistics behind it

Verdict: Insufficient evidence  ·  McNemar p=1.000 (α=0.05)  ·  candidate chosen by explicit --candidate override

McNemar’s test asks whether the difference between the two models is larger than chance alone would explain, using only the tasks where the two models disagree. A p-value above 0.05 means the change is not statistically significant. The 95% CI on the pass-rate delta is the range the true difference is likely to fall in; if it crosses zero, the direction of the change is not certain.
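As a rough illustration of where p=1.000 can come from, the sketch below runs an exact two-sided McNemar test on the pass/fail pairs listed in the task table. It assumes the exact binomial form of the test; the report does not say which variant tokenjam uses, and the function name `mcnemar_exact` is made up for this example.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant counts.

    b = tasks that went fail -> pass, c = tasks that went pass -> fail.
    Under the null hypothesis each discordant task is a fair coin flip.
    """
    n = b + c
    if n == 0:
        return 1.0  # no disagreements, nothing to test
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# (original, candidate) outcomes copied from the task table; True = pass
outcomes = [
    (False, False),  # Refund policy
    (True, True),    # Capital
    (True, True),    # Retry summary
    (False, False),  # Shipping
    (False, True),   # Define llm
]

b = sum(1 for orig, cand in outcomes if not orig and cand)  # improved
c = sum(1 for orig, cand in outcomes if orig and not cand)  # regressed
p_value = mcnemar_exact(b, c)
print(f"improved={b} regressed={c} p={p_value:.3f}")
```

With a single discordant task the test cannot reach significance whichever way that task went, which is why the verdict stays at insufficient evidence.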

Pass-rate delta: +20.0pp [95% CI -15.1, +55.1]
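The interval can be approximated by hand. The sketch below uses a Wald interval for a paired difference in proportions, which happens to reproduce the reported bounds from the counts in this report (1 task better, 0 worse, 5 tasks). That tokenjam computes it this way is an assumption, and `paired_wald_ci` is a hypothetical helper name.

```python
from math import sqrt


def paired_wald_ci(better: int, worse: int, n: int, z: float = 1.96):
    """Wald CI for the paired pass-rate difference (candidate - original).

    better / worse are the discordant task counts, n is the total number
    of tasks, z is the normal quantile (1.96 for a 95% interval).
    """
    delta = (better - worse) / n
    se = sqrt((better + worse) - (better - worse) ** 2 / n) / n
    return delta, delta - z * se, delta + z * se


delta, low, high = paired_wald_ci(better=1, worse=0, n=5)
print(f"{delta * 100:+.1f}pp [95% CI {low * 100:+.1f}, {high * 100:+.1f}]")
# prints: +20.0pp [95% CI -15.1, +55.1]
```

The interval spans zero, so a true difference of nothing at all, or even a small regression, is still consistent with these five tasks.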

Pass rate (95% CI whiskers)

Original: 2/5 (40%)
Candidate: 3/5 (60%)

Cost (measured)

Original: $0.001270
Candidate: $0.000526
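The 59% saving quoted in the summary follows directly from these two measured totals. A minimal check, with illustrative variable names:

```python
original_cost = 0.001270   # measured API $ for deepseek-reasoner
candidate_cost = 0.000526  # measured API $ for deepseek-chat

# Relative change in cost when swapping to the candidate model
change = (candidate_cost - original_cost) / original_cost
print(f"{change:+.0%}")
```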

Generated 2026-06-26 18:53 · tokenjam-bench · proof report