tokenjam 0.5.1 n=12 tasks · k=1 sample(s) · deepseek:deepseek-reasoner → deepseek:deepseek-chat
We ran the same 12 research-assistant tasks through both models and graded every answer with the same automated judge. The cheaper candidate (deepseek-chat) cost 34% less than deepseek-reasoner and scored 17 percentage points higher on this suite: 0 of 12 tasks (0%) passed with deepseek-reasoner, and 2 of 12 (17%) passed with deepseek-chat.
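No statistically significant quality drop was detected, so moving to the cheaper model is a reasonable way to cut cost. The apparent gain rests on only 2 of 12 tasks changing verdict, so treat it as suggestive rather than established. Both models also scored low on this hard suite, so spot-check the answers before relying on either.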
| Task | Original | Cheaper model | Judge's reason (truncated) | Score (needs 0.50) |
|---|---|---|---|---|
| Synthesize trend | fail | fail | The actual output describes a general shift from experimentation to production a… | 0.30 |
| Compare conflicting | fail | fail | The actual output introduces specific statistics (13% faster, 8% more work) and… | 0.20 |
| Market sizing | fail | fail | The actual output does not provide any market estimate or reasoning, instead req… | 0.00 |
| Competitive positioning | fail | fail | The actual output states it cannot synthesize a comparison due to missing source… | 0.00 |
| Timeline extract | fail | fail | The actual output does not contain any timeline or milestones; instead, it asks… | 0.00 |
| Risk summary | fail | fail | The actual output does not contain the specific risks mentioned in the expected… | 0.00 |
| Define grounded | fail | pass | The actual output omits that net revenue retention is expressed as a percentage… | 0.60 |
| Abstain missing | fail | pass | The actual output correctly states that the sources lack a 2027 market share pro… | 1.00 |
| Two source agree | fail | fail | The actual output states it cannot determine the answer due to missing sources,… | 0.00 |
| Trend report | fail | fail | The actual output discusses a general trend toward automated evaluation tooling… | 0.20 |
| Quantify claim | fail | fail | The actual output states that no information is available, but the expected outp… | 0.00 |
| Recommendation | fail | fail | The actual output provides a general recommendation for continuous benchmarking… | 0.40 |
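The headline pass rates can be re-derived from the two verdict columns of the table above. The snippet below is a minimal sketch of that arithmetic, not tokenjam's own code; the `verdicts` list and the `pass_rate` helper are hypothetical names introduced here, with the values transcribed by hand from the table.

```python
# Verdicts transcribed from the table: (original, cheaper model) per task.
# The list and helper names are illustrative, not part of tokenjam.
verdicts = [
    ("fail", "fail"),  # Synthesize trend
    ("fail", "fail"),  # Compare conflicting
    ("fail", "fail"),  # Market sizing
    ("fail", "fail"),  # Competitive positioning
    ("fail", "fail"),  # Timeline extract
    ("fail", "fail"),  # Risk summary
    ("fail", "pass"),  # Define grounded
    ("fail", "pass"),  # Abstain missing
    ("fail", "fail"),  # Two source agree
    ("fail", "fail"),  # Trend report
    ("fail", "fail"),  # Quantify claim
    ("fail", "fail"),  # Recommendation
]


def pass_rate(results):
    """Fraction of results equal to 'pass'."""
    return sum(r == "pass" for r in results) / len(results)


original_rate = pass_rate([orig for orig, _ in verdicts])
cheaper_rate = pass_rate([cheap for _, cheap in verdicts])
delta_points = (cheaper_rate - original_rate) * 100  # percentage points

print(f"original: {original_rate:.0%}, cheaper: {cheaper_rate:.0%}, "
      f"delta: {delta_points:+.0f} points")
```

With 12 tasks, each task is worth about 8.3 percentage points, so the 17-point difference corresponds to exactly two tasks changing from fail to pass.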
McNemar’s test asks whether the difference between the two models is larger than chance alone would explain, using only the tasks on which the two models' verdicts disagree. A p-value above 0.05 means the change is not statistically significant. The 95% CI on the pass-rate delta is the range in which the true difference likely lies; if it includes zero, the direction of the change is uncertain.
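To make the two statistics concrete, the sketch below applies them to the verdicts in the table: 0 tasks went from pass to fail and 2 went from fail to pass. It assumes the exact (binomial) form of McNemar's test and a simple Wald interval for the paired difference; the report does not say which variants tokenjam uses, so its figures may differ. The function names `mcnemar_exact` and `paired_delta_ci` are hypothetical.

```python
from math import comb, sqrt


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value.

    b: tasks that passed on the original and failed on the cheaper model.
    c: tasks that failed on the original and passed on the cheaper model.
    Under the null hypothesis, each discordant task is a fair coin flip.
    """
    n = b + c
    if n == 0:
        return 1.0  # no disagreements, so no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def paired_delta_ci(b, c, n_tasks, z=1.96):
    """Wald 95% CI for the paired pass-rate difference (cheaper - original)."""
    delta = (c - b) / n_tasks
    se = sqrt((b + c) - (c - b) ** 2 / n_tasks) / n_tasks
    return delta - z * se, delta + z * se


# Discordant counts read off the table (assumed transcription, 12 tasks).
b, c, n_tasks = 0, 2, 12

p_value = mcnemar_exact(b, c)
low, high = paired_delta_ci(b, c, n_tasks)

print(f"p-value: {p_value:.2f}")
print(f"95% CI on delta: [{low:+.3f}, {high:+.3f}]")
```

Under these assumptions the p-value is 0.50 and the interval runs from roughly -0.04 to +0.38, which includes zero. Both point the same way as the summary: two discordant tasks out of 12 are too few to establish a real difference in either direction.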