TokenJam proof — judged

tokenjam 0.5.1   n=5 tasks · k=1 sample(s) · deepseek:deepseek-reasoner → deepseek:deepseek-chat

Verdict: Insufficient evidence  ·  McNemar p=1.000 (α=0.05)  ·  candidate chosen by explicit --candidate override
Cost delta (measured): -75.3%
Pass-rate delta: +0.0pp [95% CI +0.0, +0.0]
Tasks broken / fixed by the swap: 0 / 0
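
The broken/fixed counts and the McNemar p-value can be recomputed from the per-task pass/fail pairs listed under Per-task. The following is a minimal sketch, assuming the exact two-sided binomial form of McNemar's test; the report does not say which variant tokenjam uses, and the function names `discordant_counts` and `mcnemar_exact_p` are hypothetical, not part of tokenjam.

```python
from math import comb

def discordant_counts(orig, cand):
    """Return (broken, fixed): tasks that pass only before / only after the swap."""
    broken = sum(1 for o, c in zip(orig, cand) if o and not c)
    fixed = sum(1 for o, c in zip(orig, cand) if c and not o)
    return broken, fixed

def mcnemar_exact_p(broken, fixed):
    """Exact two-sided McNemar p-value (assumed variant, see lead-in)."""
    n = broken + fixed
    if n == 0:
        # No discordant pairs: the data cannot distinguish the two models.
        return 1.0
    k = min(broken, fixed)
    # Binomial(n, 0.5) lower tail up to k, doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Pass (1) / fail (0) per task, in the order of the Per-task listing.
orig = [0, 1, 1, 0, 0]
cand = [0, 1, 1, 0, 0]

broken, fixed = discordant_counts(orig, cand)
p_value = mcnemar_exact_p(broken, fixed)
print(broken, fixed, p_value)  # 0 0 1.0
```

With no task changing outcome, there are no discordant pairs, so under this variant the test returns p=1.0, in line with the reported p=1.000 and the "Insufficient evidence" verdict.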

Pass rate (95% CI whiskers)

Original: 2/5 (40%)
Candidate: 2/5 (40%)

Cost (measured)

Original: $0.001400
Candidate: $0.000345
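
As a quick check, the cost delta can be recomputed from the two displayed costs. This is a sketch that assumes the delta is the relative change from original to candidate; `relative_delta_pct` is a hypothetical helper, not a tokenjam function.

```python
def relative_delta_pct(original: float, candidate: float) -> float:
    """Relative change from original to candidate, in percent."""
    return (candidate - original) / original * 100.0

# Costs as displayed in the report (rounded to six decimals).
delta = relative_delta_pct(0.001400, 0.000345)
print(f"{delta:.1f}%")
```

From the displayed figures this gives about -75.4%. The reported -75.3% differs in the last digit, presumably because the report computes the delta from unrounded costs (an assumption).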

Per-task

task                   orig  cand  candidate detail
judged/refund-policy   0/1   0/1   deepeval:correctness@deepseek:deepseek-chat score=0.00 (>= 0.5) — The actual output does n
judged/capital         1/1   1/1   deepeval:correctness@deepseek:deepseek-chat score=1.00 (>= 0.5) — The actual output exactl
judged/retry-summary   1/1   1/1   deepeval:correctness@deepseek:deepseek-chat score=1.00 (>= 0.5) — The actual output states
judged/shipping        0/1   0/1   deepeval:correctness@deepseek:deepseek-chat score=0.20 (>= 0.5) — The actual output adds c
judged/define-llm      0/1   0/1   deepeval:correctness@deepseek:deepseek-chat score=0.30 (>= 0.5) — The actual output descri

Generated 2026-06-25 10:13 · tokenjam-bench · executable-accuracy proof