tokenjam 0.5.1 n=12 tasks · k=1 sample(s) · deepseek:deepseek-reasoner → deepseek:deepseek-chat
We ran the same 12 customer-support tasks through both models and graded every answer with the same automated judge. The cheaper candidate (deepseek-chat) cost 47% less than deepseek-reasoner and scored about the same on this suite: 1 of 12 tasks (8%) passed before and 1 of 12 after, although the passing task differs (Escalation before, Feature request after).
No statistically significant quality drop was detected, so moving to the cheaper model is a reasonable way to cut cost. Note: both models scored low on this hard suite, so spot-check the answers before relying on either.
| Task | Original | Cheaper model | Why: the judge’s reason (truncated) | Score (needs 0.50) |
|---|---|---|---|---|
| Refund window | fail | fail | The actual output does not confirm the return window and instead asks the custom… | 0.00 |
| Delivery delay | fail | fail | The actual output does not match the expected output's key facts: it states stan… | 0.20 |
| Wrong product | fail | fail | The actual output suggests waiting for the return before sending a replacement o… | 0.20 |
| Password reset | fail | fail | The actual output provides a detailed step-by-step guide, but contradicts the ex… | 0.30 |
| Subscription cancel | fail | fail | The actual output adds unnecessary details about account verification and policy… | 0.30 |
| Payment failed | fail | fail | The actual output provides a detailed list of possible reasons and steps, while… | 0.40 |
| Invoice request | fail | fail | The actual output asks for confirmation of company details before processing, wh… | 0.00 |
| Account locked | fail | fail | The actual output does not provide the specific policy detail that accounts auto… | 0.10 |
| Shipping status | fail | fail | The actual output does not provide any tracking information or estimated deliver… | 0.00 |
| Feature request | fail | pass | The actual output is more detailed and formal, but it does not convey the key pr… | 0.60 |
| Bug report | fail | fail | The actual output provides troubleshooting steps for the CSV export issue, while… | 0.10 |
| Escalation | pass | fail | The actual output does not match the expected output's commitment to immediate e… | 0.20 |
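The pass rates quoted in the summary can be re-derived from the pass/fail columns of the table. The sketch below is illustrative only: `results` and `tally` are hypothetical names, not part of tokenjam, and the outcomes are copied by hand from the table rows.

```python
# Pass/fail outcomes copied from the table: (task, original, cheaper model).
# `results` and `tally` are illustrative names, not tokenjam APIs.
results = [
    ("Refund window", "fail", "fail"),
    ("Delivery delay", "fail", "fail"),
    ("Wrong product", "fail", "fail"),
    ("Password reset", "fail", "fail"),
    ("Subscription cancel", "fail", "fail"),
    ("Payment failed", "fail", "fail"),
    ("Invoice request", "fail", "fail"),
    ("Account locked", "fail", "fail"),
    ("Shipping status", "fail", "fail"),
    ("Feature request", "fail", "pass"),
    ("Bug report", "fail", "fail"),
    ("Escalation", "pass", "fail"),
]


def tally(rows):
    """Return the task count, each model's pass count, and the tasks that flipped."""
    n = len(rows)
    orig_pass = sum(1 for _, o, _ in rows if o == "pass")
    cheap_pass = sum(1 for _, _, c in rows if c == "pass")
    # Discordant pairs: tasks where the two models disagree.
    lost = [t for t, o, c in rows if o == "pass" and c == "fail"]
    gained = [t for t, o, c in rows if o == "fail" and c == "pass"]
    return n, orig_pass, cheap_pass, lost, gained


n, orig_pass, cheap_pass, lost, gained = tally(results)
print(f"original: {orig_pass}/{n} = {orig_pass / n:.0%}")
print(f"cheaper:  {cheap_pass}/{n} = {cheap_pass / n:.0%}")
print("lost:", lost, "gained:", gained)
```

Both models pass 1 of 12 tasks, and the two tasks that changed outcome (one lost, one gained) are the only inputs a paired test such as McNemar's uses.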
McNemar’s test asks whether the difference between the two models is larger than chance alone would explain. It looks only at the tasks on which the two models disagree; a p-value above 0.05 means the change is not statistically significant. The 95% CI on the pass-rate delta is the range in which the true difference is likely to lie; if it crosses zero, the direction of the change is uncertain.
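As a rough illustration of these two statistics, the sketch below applies an exact (binomial) McNemar test and a simple Wald interval for paired proportions to the counts in the table: one task passed only with the original model and one passed only with the cheaper model. These are assumptions about method, not a description of how tokenjam computes its figures; the function names `mcnemar_exact` and `paired_delta_ci` are hypothetical, and a Wald interval is only a crude approximation with 12 tasks.

```python
from math import comb, sqrt


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant-pair counts.

    b: tasks that passed with the original model only.
    c: tasks that passed with the cheaper model only.
    Under the null hypothesis each discordant pair is a fair coin flip.
    """
    n = b + c
    if n == 0:
        return 1.0  # no disagreements, so no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def paired_delta_ci(b: int, c: int, n: int, z: float = 1.96):
    """Wald CI for (cheaper pass rate - original pass rate) on paired tasks."""
    delta = (c - b) / n
    se = sqrt(b + c - (c - b) ** 2 / n) / n
    return delta - z * se, delta + z * se


# Counts read from the table: 1 task lost, 1 task gained, 12 tasks in total.
b, c, n = 1, 1, 12
p_value = mcnemar_exact(b, c)
low, high = paired_delta_ci(b, c, n)
print(f"p = {p_value:.2f}, 95% CI = [{low:+.2f}, {high:+.2f}]")
```

Under these assumptions the p-value is 1.00 and the interval runs from about -0.23 to +0.23. It crosses zero and is wide, so a suite of this size cannot rule out a moderate change in either direction.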