tokenjam 0.5.1 n=12 tasks · k=1 sample(s) · deepseek:deepseek-reasoner → deepseek:deepseek-chat
We ran the same 12 enterprise-rag tasks through both models and graded every answer with the same automated judge. The cheaper candidate (deepseek-chat) cost 43% less than deepseek-reasoner and scored the same on this suite: 1 of 12 tasks (8%) passed with each model.
No statistically significant quality drop was detected, so moving to the cheaper model is a reasonable way to cut cost. Note: both models failed 11 of the 12 tasks, so this suite gives little evidence about either model's quality. Spot-check the answers before relying on either.
| Task | Original | Cheaper model | Judge’s reason (truncated) and score |
|---|---|---|---|
| PTO accrual | fail | fail | The actual output states it cannot answer the question, but the expected output… (graded 0.00, needs 0.50) |
| Parental leave | fail | fail | The actual output claims up to 20 weeks of paid leave, while the expected output… (graded 0.00, needs 0.50) |
| On-call SEV1 | fail | fail | The actual output states that no information is available, but the expected outp… (graded 0.00, needs 0.50) |
| DB restore | fail | fail | The actual output states there is no documented procedure, but the expected outp… (graded 0.00, needs 0.50) |
| Data retention | fail | fail | The actual output states no information is available, but the expected output sp… (graded 0.00, needs 0.50) |
| Vendor security | fail | fail | The actual output states no information is available, but the expected output sp… (graded 0.00, needs 0.50) |
| Expense limit | fail | fail | The actual output states the information is not found, while the expected output… (graded 0.00, needs 0.50) |
| API rate limit | fail | fail | The actual output states that no information is available, but the expected outp… (graded 0.00, needs 0.50) |
| SSO setup | fail | fail | The actual output states it cannot answer the question, while the expected outpu… (graded 0.00, needs 0.50) |
| Incident postmortem | fail | fail | The actual output states that the documents do not contain the required informat… (graded 0.00, needs 0.50) |
| Not covered | pass | pass | The actual output correctly states that the documents lack information on the to… (graded 0.80, needs 0.50) |
| Password rotation | fail | fail | The actual output states it cannot answer the question, while the expected outpu… (graded 0.00, needs 0.50) |
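
The pass rates in the summary can be reproduced from the pass/fail columns of the table. The following is a minimal sketch, assuming the outcomes are transcribed by hand as (original, cheaper) pairs; the names `pairs` and `tally` are illustrative and are not part of tokenjam.

```python
from collections import Counter

# (original, cheaper) outcomes transcribed from the table:
# 11 tasks failed with both models, 1 task passed with both.
pairs = [("fail", "fail")] * 11 + [("pass", "pass")]


def tally(pairs):
    """Count the four paired outcomes for a two-model comparison."""
    counts = Counter(pairs)
    return {
        "both_pass": counts[("pass", "pass")],
        "both_fail": counts[("fail", "fail")],
        "regressed": counts[("pass", "fail")],  # original passed, cheaper failed
        "improved": counts[("fail", "pass")],   # original failed, cheaper passed
    }


cells = tally(pairs)
n = len(pairs)
pass_rate_original = sum(orig == "pass" for orig, _ in pairs) / n
pass_rate_cheaper = sum(cheap == "pass" for _, cheap in pairs) / n

print(cells)
print(f"original: {pass_rate_original:.0%}, cheaper: {pass_rate_cheaper:.0%}")
```

The `regressed` and `improved` cells are the discordant pairs. Both are zero here, which means the two models agreed on every task.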
McNemar’s test asks whether the difference between the two models is larger than chance would explain. It uses only the tasks on which the two models’ pass/fail outcomes differ; a p-value above 0.05 means the change is not statistically significant. The 95% CI on the pass-rate delta is the range the true difference is likely to fall in; if it crosses zero, the direction isn’t certain.
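
To make the two statistics concrete, here is a hedged sketch of an exact (binomial) McNemar test and a Wald interval for the paired pass-rate delta. The report does not say which variants tokenjam computes, so both choices are assumptions, and the function names are illustrative.

```python
from math import comb, sqrt


def mcnemar_exact(regressed: int, improved: int) -> float:
    """Two-sided exact McNemar p-value from the discordant pair counts.

    regressed: tasks the original passed and the cheaper model failed.
    improved:  tasks the original failed and the cheaper model passed.
    """
    n_disc = regressed + improved
    if n_disc == 0:
        return 1.0  # the models never disagree, so there is no evidence of a change
    k = min(regressed, improved)
    # Binomial(n_disc, 0.5) tail probability, doubled for a two-sided test.
    tail = sum(comb(n_disc, i) for i in range(k + 1)) / 2 ** n_disc
    return min(1.0, 2 * tail)


def paired_delta_ci(regressed: int, improved: int, n: int, z: float = 1.96):
    """Wald 95% interval for (cheaper pass rate - original pass rate)."""
    delta = (improved - regressed) / n
    se = sqrt((regressed + improved) - (improved - regressed) ** 2 / n) / n
    return delta - z * se, delta + z * se


# This run: 12 tasks, no discordant pairs.
p_value = mcnemar_exact(regressed=0, improved=0)
low, high = paired_delta_ci(regressed=0, improved=0, n=12)
print(p_value, low, high)
```

With no discordant pairs the p-value is 1.0 and the Wald interval collapses to a single point at zero. That zero width is an artifact of the Wald formula and understates the uncertainty that remains with only 12 tasks.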