TokenJam proof — email-assistant

tokenjam 0.5.1   n=12 tasks · k=1 sample(s) · deepseek:deepseek-reasoner → deepseek:deepseek-chat

Switching looks safe

We ran the same 12 email-assistant tasks through both models and graded every answer with the same automated judge. The cheaper candidate (deepseek-chat) cost 50% less than deepseek-reasoner and scored the same on this suite: 83% of tasks passed before, 83% after.

No statistically significant quality drop was detected, so moving to the cheaper model is a reasonable way to cut cost.

-50%: cheaper to run (measured API $)
+0 pts: accuracy vs the original model
0 worse · 0 better: tasks changed by the swap (of 12)

How each task did

Task                  | Original | Cheaper model | Grade (needs 0.50) | Why (the judge’s reason)
Reply reschedule      | fail     | fail          | 0.20               | The actual output uses placeholders like [Client's Name] and [Day 1] instead of…
Reply refund firm     | pass     | pass          | 1.00               | The actual output includes a formal email structure with subject line, salutatio…
Summarize thread      | pass     | pass          | 1.00               | The actual output conveys the same meaning as the expected output, with only min…
Triage priority       | pass     | pass          | 1.00               | The actual output adds 'Priority: Urgent' and 'Reason:' labels, and specifies 'c…
Triage spam           | pass     | pass          | 0.80               | The actual output correctly identifies the email as spam/phishing and mentions t…
Extract meeting       | pass     | pass          | 0.50               | The actual output omits the recipient's RevOps lead and instead mentions 'your R…
Extract actions       | pass     | pass          | 1.00               | The actual output matches the expected output in content and meaning, with only…
Followup noreply      | fail     | fail          | 0.20               | The actual output is a formal email with a subject line and salutation, while th…
Reply decline meeting | pass     | pass          | 1.00               | The actual output is more formal and detailed, but conveys the same core message…
Summarize exec        | pass     | pass          | 1.00               | The actual output includes the same key facts (churn increase to 4.1% from 3.2%,…
Triage route          | pass     | pass          | 0.90               | The actual output correctly identifies the triage result as Security and explain…
Reply sensitive data  | pass     | pass          | 1.00               | The actual output is more formal and structured, but it conveys the same core me…

The statistics behind it

Verdict: No significant regression  ·  McNemar p=1.000 (α=0.05)  ·  candidate chosen by explicit --candidate override

McNemar’s test asks whether the difference between the two models is larger than chance alone would explain: a p-value above 0.05 means the change is not statistically significant. The 95% CI on the pass-rate delta is the range the true difference is likely to fall in; if it crosses zero, the direction isn’t certain.
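As a rough illustration of the test described above, the sketch below recomputes a McNemar p-value from the per-task pass/fail outcomes in the table. It assumes the exact (binomial) two-sided form of the test; the report does not say which variant TokenJam uses, and the names `mcnemar_exact` and `results` are hypothetical, not part of the tool.

```python
from math import comb


def mcnemar_exact(worse: int, better: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant-pair counts.

    Assumption: exact binomial form; TokenJam may use a different variant.
    """
    n = worse + better
    if n == 0:
        # No task changed outcome, so there is no evidence of any difference.
        return 1.0
    k = min(worse, better)
    # Probability of a split at least this lopsided under a fair coin.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# (task, original passed, candidate passed), copied from the per-task table.
results = [
    ("Reply reschedule", False, False),
    ("Reply refund firm", True, True),
    ("Summarize thread", True, True),
    ("Triage priority", True, True),
    ("Triage spam", True, True),
    ("Extract meeting", True, True),
    ("Extract actions", True, True),
    ("Followup noreply", False, False),
    ("Reply decline meeting", True, True),
    ("Summarize exec", True, True),
    ("Triage route", True, True),
    ("Reply sensitive data", True, True),
]

# Only tasks whose outcome changed between models inform the test.
worse = sum(1 for _, orig, cand in results if orig and not cand)
better = sum(1 for _, orig, cand in results if cand and not orig)
p_value = mcnemar_exact(worse, better)
print(f"worse={worse} better={better} p={p_value:.3f}")
```

Because no task changed outcome in this run, both discordant counts are zero and the p-value is 1, in line with the verdict line above.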

Pass-rate delta: +0.0pp [95% CI +0.0, +0.0]

Pass rate (95% CI whiskers)

Original: 10/12 (83%)
Candidate: 10/12 (83%)
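The whisker values themselves are not listed in this report. As a sketch of how a 95% interval for a 10/12 pass rate could be computed, the snippet below uses the Wilson score interval. That choice is an assumption: the report does not state which interval method TokenJam uses, and `wilson_interval` is a hypothetical helper.

```python
from math import sqrt


def wilson_interval(passed: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (z=1.96 gives roughly 95%).

    Assumption: Wilson is used here for illustration only; the report's
    own whiskers may come from a different method.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = passed / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


low, high = wilson_interval(10, 12)
print(f"10/12 passed: 95% CI [{low:.1%}, {high:.1%}]")
```

With only 12 tasks the interval is wide, which is worth keeping in mind when reading two identical pass rates as equivalent quality.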

Cost (measured)

Original: $0.005185
Candidate: $0.002591
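The headline saving follows directly from these two measured totals. A minimal sketch of the arithmetic, with hypothetical variable names:

```python
# Measured totals for the 12-task run, copied from the report.
original_cost = 0.005185   # deepseek-reasoner, USD
candidate_cost = 0.002591  # deepseek-chat, USD

# Relative saving from switching to the candidate model.
saving = (original_cost - candidate_cost) / original_cost
print(f"saving: {saving:.1%}")
```

This rounds to the 50% figure quoted at the top of the report.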

Generated 2026-06-26 19:00 · tokenjam-bench · proof report