tokenjam 0.5.1 n=12 tasks · k=1 sample · deepseek:deepseek-reasoner → deepseek:deepseek-chat
We ran the same 12 email-assistant tasks through both models and graded every answer with the same automated judge. The cheaper candidate (deepseek-chat) cost 50% less than deepseek-reasoner and had the same pass rate on this suite: 83% of tasks passed before and 83% after (10 of 12 in both runs).
Every task received the same pass/fail verdict from both models, so no statistically significant quality drop was detected on this suite, and moving to the cheaper model is a reasonable way to cut cost.
| Task | Original | Cheaper model | Judge’s reason (truncated) | Score (needs 0.50) |
|---|---|---|---|---|
| Reply reschedule | fail | fail | The actual output uses placeholders like [Client's Name] and [Day 1] instead of… | 0.20 |
| Reply refund firm | pass | pass | The actual output includes a formal email structure with subject line, salutatio… | 1.00 |
| Summarize thread | pass | pass | The actual output conveys the same meaning as the expected output, with only min… | 1.00 |
| Triage priority | pass | pass | The actual output adds 'Priority: Urgent' and 'Reason:' labels, and specifies 'c… | 1.00 |
| Triage spam | pass | pass | The actual output correctly identifies the email as spam/phishing and mentions t… | 0.80 |
| Extract meeting | pass | pass | The actual output omits the recipient's RevOps lead and instead mentions 'your R… | 0.50 |
| Extract actions | pass | pass | The actual output matches the expected output in content and meaning, with only… | 1.00 |
| Followup noreply | fail | fail | The actual output is a formal email with a subject line and salutation, while th… | 0.20 |
| Reply decline meeting | pass | pass | The actual output is more formal and detailed, but conveys the same core message… | 1.00 |
| Summarize exec | pass | pass | The actual output includes the same key facts (churn increase to 4.1% from 3.2%,… | 1.00 |
| Triage route | pass | pass | The actual output correctly identifies the triage result as Security and explain… | 0.90 |
| Reply sensitive data | pass | pass | The actual output is more formal and structured, but it conveys the same core me… | 1.00 |
McNemar’s test asks whether the difference between the two models is larger than chance alone would explain. It looks only at the tasks whose pass/fail outcome changed between models. A p-value above 0.05 means the change is not statistically significant. The 95% CI on the pass-rate delta is the range the true difference is likely to lie in; if it includes zero, the direction of the change isn’t certain.
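
To make these two statistics concrete, here is a minimal Python sketch that recomputes them from the pass/fail columns of the table. It is an illustration under stated assumptions, not tokenjam's implementation: the report does not say which McNemar variant or interval method the tool uses, so the exact (binomial) test and a Wald interval for paired proportions are assumed here, and the function names `mcnemar_exact` and `paired_delta_ci` are hypothetical.

```python
from math import comb, sqrt

# (original, cheaper) pass/fail outcomes, in the row order of the table above.
OUTCOMES = [
    (False, False), (True, True), (True, True), (True, True),
    (True, True), (True, True), (True, True), (False, False),
    (True, True), (True, True), (True, True), (True, True),
]

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant counts.

    b = tasks that passed before and failed after; c = the reverse.
    Under the null hypothesis each discordant task is a fair coin flip.
    """
    n = b + c
    if n == 0:
        return 1.0  # no task changed outcome: no evidence of any difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def paired_delta_ci(b: int, c: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wald 95% CI on the pass-rate delta (after minus before), paired data.

    Assumed method; the interval used by the tool may differ.
    """
    delta = (c - b) / n
    se = sqrt(max(0.0, (b + c) - (c - b) ** 2 / n)) / n
    return delta - z * se, delta + z * se

n = len(OUTCOMES)
b = sum(1 for before, after in OUTCOMES if before and not after)
c = sum(1 for before, after in OUTCOMES if not before and after)

print(f"pass rate before: {sum(x for x, _ in OUTCOMES) / n:.0%}")
print(f"pass rate after:  {sum(y for _, y in OUTCOMES) / n:.0%}")
print(f"discordant pairs: b={b}, c={c}")
print(f"McNemar exact p:  {mcnemar_exact(b, c)}")
print(f"95% CI on delta:  {paired_delta_ci(b, c, n)}")
```

On this suite no task changed outcome, so both discordant counts are zero, the p-value is 1.0, and the Wald interval collapses to a single point at zero. That zero width is a property of the Wald formula rather than a sign of certainty; with 12 tasks, an interval method that stays informative when there are no discordant pairs would likely be wider.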