tokenjam 0.5.1 n=5 tasks · k=1 sample(s) · deepseek:deepseek-reasoner → deepseek:deepseek-chat
We ran the same 5 judged tasks through both models and graded every answer with the same automated judge. The cheaper candidate (deepseek-chat) cost 59% less than deepseek-reasoner and scored 20 percentage points higher on this suite: 40% of tasks passed before (2 of 5), 60% after (3 of 5).
With only 5 tasks and a single changed outcome, too few tasks were run to be statistically sure either way. Run more before deciding.
| Task | Original | Cheaper model | Why — the judge’s reason |
|---|---|---|---|
| Refund policy | fail | fail | The actual output does not provide the specific 30-day refund window from the ex… (graded 0.00, needs 0.50) |
| Capital | pass | pass | The actual output exactly matches the expected output, with no inaccuracies, omi… (graded 1.00, needs 0.50) |
| Retry summary | pass | pass | The actual output states 'attempted the task five times' and 'ultimately failed'… (graded 1.00, needs 0.50) |
| Shipping | fail | fail | The actual output does not provide a specific time frame of 5 to 7 business days… (graded 0.00, needs 0.50) |
| Define llm | fail | pass | The actual output adds extra details about training on vast text data and genera… (graded 0.60, needs 0.50) |
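
The headline numbers can be re-derived from the table. The sketch below is only an illustration of that arithmetic, not tokenjam's own code; the `results` list and all variable names are assumptions introduced here, with the pass/fail values copied from the rows above.

```python
# Pass/fail outcomes copied from the table: (task, original, cheaper model).
# The data structure and names are illustrative, not part of tokenjam.
results = [
    ("Refund policy", "fail", "fail"),
    ("Capital", "pass", "pass"),
    ("Retry summary", "pass", "pass"),
    ("Shipping", "fail", "fail"),
    ("Define llm", "fail", "pass"),
]

n = len(results)
original_rate = sum(orig == "pass" for _, orig, _ in results) / n
cheaper_rate = sum(cheap == "pass" for _, _, cheap in results) / n

# Difference in pass rate, expressed in percentage points.
delta_points = round((cheaper_rate - original_rate) * 100)

# Tasks whose outcome differs between the two models.
changed = [task for task, orig, cheap in results if orig != cheap]

print(f"original: {original_rate:.0%}, cheaper: {cheaper_rate:.0%}")
print(f"delta: {delta_points} points, changed tasks: {changed}")
```

Only one task ("Define llm") changed outcome, so the whole 20-point difference rests on a single answer.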
McNemar’s test asks whether the difference between the two models is larger than chance alone would explain. It looks only at the tasks whose outcome changed between the two models; a p-value above 0.05 means the change is not statistically significant. The 95% CI on the pass-rate delta is the range the true difference is likely to fall in; if it crosses zero, the direction isn’t certain.
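
To make these two statistics concrete, here is a minimal sketch applied to the table's counts. It assumes an exact (binomial) two-sided McNemar test and a simple Wald interval for the paired difference. The report does not say which variants tokenjam uses, so its numbers may differ, and the function names `mcnemar_exact` and `paired_delta_ci` are hypothetical.

```python
from math import comb, sqrt

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b = tasks that passed originally but fail on the cheaper model,
    c = tasks that failed originally but pass on the cheaper model.
    Under the null hypothesis each changed task is a fair coin flip.
    """
    n = b + c
    if n == 0:
        return 1.0  # no changed outcomes, no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def paired_delta_ci(b: int, c: int, n_tasks: int, z: float = 1.96):
    """Wald confidence interval for the paired pass-rate difference (assumed method)."""
    delta = (c - b) / n_tasks
    se = sqrt(b + c - (c - b) ** 2 / n_tasks) / n_tasks
    return delta, delta - z * se, delta + z * se

# Counts from the table: 0 tasks went pass -> fail, 1 task went fail -> pass.
p_value = mcnemar_exact(b=0, c=1)
delta, low, high = paired_delta_ci(b=0, c=1, n_tasks=5)

print(f"p-value: {p_value:.2f}")
print(f"delta: {delta:+.2f}, 95% CI: [{low:+.3f}, {high:+.3f}]")
```

Under these assumptions the p-value is 1.0 and the interval spans zero, which matches the report's caution that 5 tasks are too few to decide. A Wald interval is rough at this sample size; it is used here only because it is easy to follow.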