TokenJam proof — judged

tokenjam 0.5.1   n=5 tasks · k=1 sample(s) · deepseek:deepseek-reasoner → deepseek:deepseek-chat

Verdict: Insufficient evidence  ·  McNemar p=1.000 (α=0.05)  ·  candidate chosen by explicit --candidate override
Cost delta (measured): -47.6%
Pass-rate delta: +20.0pp [95% CI -15.1, +55.1]
Tasks broken / fixed by the swap: 0 / 1
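
The verdict figures can be re-derived from the broken/fixed counts alone. The sketch below is an illustration, not TokenJam's own code: it assumes the p-value comes from the exact (binomial) McNemar test on the discordant pairs and that the interval is a Wald interval on the paired pass-rate difference. The report does not say which variants are used; these two simply reproduce the printed numbers. The function names are hypothetical.

```python
from math import comb, sqrt


def mcnemar_exact_p(broken: int, fixed: int) -> float:
    """Two-sided exact McNemar p-value from the discordant pair counts.

    Assumption: exact binomial variant (the report does not name one).
    """
    n = broken + fixed
    if n == 0:
        return 1.0  # no discordant pairs, nothing to test
    k = min(broken, fixed)
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def paired_wald_ci(broken: int, fixed: int, n_tasks: int, z: float = 1.959964):
    """Wald interval for the paired difference (candidate minus original).

    Assumption: z = 1.959964 for a 95% interval; method inferred, not documented.
    """
    delta = (fixed - broken) / n_tasks
    se = sqrt((broken + fixed) - (fixed - broken) ** 2 / n_tasks) / n_tasks
    return delta, delta - z * se, delta + z * se


p = mcnemar_exact_p(broken=0, fixed=1)
delta, lo, hi = paired_wald_ci(broken=0, fixed=1, n_tasks=5)
print(f"p={p:.3f}  delta={delta * 100:+.1f}pp  CI=[{lo * 100:+.1f}, {hi * 100:+.1f}]")
```

With a single discordant pair the exact test cannot return anything below 1.0, which is why n=5 with k=1 yields "Insufficient evidence" regardless of the direction of the change.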

Pass rate (95% CI whiskers)

Original   2/5 (40%)
Candidate  3/5 (60%)

Cost (measured)

Original   $0.001080
Candidate  $0.000566
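
As a quick check, the cost delta in the summary follows from these two figures if it is taken as the relative change against the original. This is a minimal sketch under that assumption; the report does not state whether the costs are totals across the run or per-task averages, and the variable names are illustrative.

```python
# Measured costs in USD, copied from the report
original_cost = 0.001080
candidate_cost = 0.000566

# Relative change of the candidate against the original (assumed definition)
cost_delta = (candidate_cost - original_cost) / original_cost
print(f"{cost_delta * 100:+.1f}%")
```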

Per-task

task                  orig  cand  candidate detail
judged/refund-policy  0/1   0/1   deepeval:correctness@deepseek:deepseek-chat score=0.00 (>= 0.5) — The actual output does n
judged/capital        1/1   1/1   deepeval:correctness@deepseek:deepseek-chat score=1.00 (>= 0.5) — The actual output exactl
judged/retry-summary  1/1   1/1   deepeval:correctness@deepseek:deepseek-chat score=1.00 (>= 0.5) — The actual output adds u
judged/shipping       0/1   0/1   deepeval:correctness@deepseek:deepseek-chat score=0.00 (>= 0.5) — The actual output does n
judged/define-llm     0/1   1/1   deepeval:correctness@deepseek:deepseek-chat score=0.50 (>= 0.5) — The actual output includ
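
The summary counts can be recovered from the per-task rows. The sketch below assumes a candidate sample passes when its judge score meets the 0.5 threshold shown in the detail column, and takes the original's pass/fail from the orig column, since the original's scores are not listed. The data layout and names are illustrative, not TokenJam's internal format.

```python
PASS_THRESHOLD = 0.5  # from "(>= 0.5)" in the detail column

# (task, original_passed, candidate_score) copied from the per-task table
rows = [
    ("judged/refund-policy", False, 0.00),
    ("judged/capital", True, 1.00),
    ("judged/retry-summary", True, 1.00),
    ("judged/shipping", False, 0.00),
    ("judged/define-llm", False, 0.50),
]

# Apply the threshold to get the candidate's pass/fail per task
outcomes = [(task, orig, score >= PASS_THRESHOLD) for task, orig, score in rows]

broken = [task for task, orig, cand in outcomes if orig and not cand]
fixed = [task for task, orig, cand in outcomes if cand and not orig]
orig_passes = sum(orig for _, orig, _ in outcomes)
cand_passes = sum(cand for _, _, cand in outcomes)

print(f"original {orig_passes}/{len(rows)}, candidate {cand_passes}/{len(rows)}")
print(f"broken {len(broken)}, fixed {len(fixed)}: {fixed}")
```

Note that judged/define-llm is the only task that changes outcome, and it passes with a score exactly at the threshold, so the entire +20.0pp delta rests on one borderline judgment.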

How to read this

Generated 2026-06-26 08:50 · tokenjam-bench · executable-accuracy proof