TokenJam proof — research-assistant

tokenjam 0.5.1   n=12 tasks · k=1 sample(s) · deepseek:deepseek-reasoner → deepseek:deepseek-chat

Switching looks safe

We ran the same 12 research-assistant tasks through both models and graded every answer with the same automated judge. The cheaper candidate (deepseek-chat) cost 34% less than deepseek-reasoner and scored 17 percentage points higher on this suite: 0% of tasks passed before, 17% after.

No statistically significant quality drop was detected, so moving to the cheaper model is a reasonable way to cut cost. Note: both models scored low on this hard suite (0 of 12 and 2 of 12 tasks passed), so spot-check the answers before relying on either.

-34%
cheaper to run (measured API $)
+17 pts
accuracy vs the original model
0 worse · 2 better
tasks changed by the swap (of 12)

How each task did

Task | Original | Cheaper model | Why (the judge’s reason)
Synthesize trend | fail | fail | The actual output describes a general shift from experimentation to production a… graded 0.30 (needs 0.50)
Compare conflicting | fail | fail | The actual output introduces specific statistics (13% faster, 8% more work) and… graded 0.20 (needs 0.50)
Market sizing | fail | fail | The actual output does not provide any market estimate or reasoning, instead req… graded 0.00 (needs 0.50)
Competitive positioning | fail | fail | The actual output states it cannot synthesize a comparison due to missing source… graded 0.00 (needs 0.50)
Timeline extract | fail | fail | The actual output does not contain any timeline or milestones; instead, it asks… graded 0.00 (needs 0.50)
Risk summary | fail | fail | The actual output does not contain the specific risks mentioned in the expected… graded 0.00 (needs 0.50)
Define grounded | fail | pass | The actual output omits that net revenue retention is expressed as a percentage… graded 0.60 (needs 0.50)
Abstain missing | fail | pass | The actual output correctly states that the sources lack a 2027 market share pro… graded 1.00 (needs 0.50)
Two source agree | fail | fail | The actual output states it cannot determine the answer due to missing sources,… graded 0.00 (needs 0.50)
Trend report | fail | fail | The actual output discusses a general trend toward automated evaluation tooling… graded 0.20 (needs 0.50)
Quantify claim | fail | fail | The actual output states that no information is available, but the expected outp… graded 0.00 (needs 0.50)
Recommendation | fail | fail | The actual output provides a general recommendation for continuous benchmarking… graded 0.40 (needs 0.50)
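
The pass/fail columns above can be tallied to recover the headline counts (pass rates and the number of tasks that flipped). The sketch below is only an illustration: the `outcomes` dictionary is a hand transcription of the table, and the variable names are hypothetical, not part of tokenjam.

```python
# Hand-transcribed (original, cheaper model) results from the table above.
outcomes = {
    "Synthesize trend": ("fail", "fail"),
    "Compare conflicting": ("fail", "fail"),
    "Market sizing": ("fail", "fail"),
    "Competitive positioning": ("fail", "fail"),
    "Timeline extract": ("fail", "fail"),
    "Risk summary": ("fail", "fail"),
    "Define grounded": ("fail", "pass"),
    "Abstain missing": ("fail", "pass"),
    "Two source agree": ("fail", "fail"),
    "Trend report": ("fail", "fail"),
    "Quantify claim": ("fail", "fail"),
    "Recommendation": ("fail", "fail"),
}

n_tasks = len(outcomes)
original_passes = sum(1 for orig, _ in outcomes.values() if orig == "pass")
candidate_passes = sum(1 for _, cand in outcomes.values() if cand == "pass")

# Discordant tasks: the only ones whose result changed with the swap.
better = sum(1 for o, c in outcomes.values() if o == "fail" and c == "pass")
worse = sum(1 for o, c in outcomes.values() if o == "pass" and c == "fail")

print(f"Original  {original_passes}/{n_tasks} ({original_passes / n_tasks:.0%})")
print(f"Candidate {candidate_passes}/{n_tasks} ({candidate_passes / n_tasks:.0%})")
print(f"{worse} worse · {better} better (of {n_tasks})")
```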

The statistics behind it

Verdict: No significant regression  ·  McNemar p=0.500 (α=0.05)  ·  candidate chosen by explicit --candidate override

McNemar’s test asks whether the difference between the two models is bigger than chance, using only the tasks whose pass/fail result changed between models: a p-value above 0.05 means the change is not statistically significant. The 95% CI on the pass-rate delta is the range the true difference is likely in; if it crosses zero, the direction isn’t certain.
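
As a rough check on the p-value, here is a minimal sketch of the exact (binomial) form of McNemar's test applied to the discordant counts reported above (0 worse, 2 better). The report does not say which variant tokenjam uses, so the exact form is an assumption; it does reproduce p=0.500 for these counts. The function name `mcnemar_exact_p` is hypothetical.

```python
from math import comb


def mcnemar_exact_p(worse: int, better: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant counts.

    Under the null hypothesis, each changed task is equally likely to
    flip in either direction, so the smaller count follows Binomial(n, 0.5).
    """
    n = worse + better
    if n == 0:
        return 1.0  # no task changed, so there is no evidence either way
    k = min(worse, better)
    one_sided_tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * one_sided_tail)


p_value = mcnemar_exact_p(worse=0, better=2)
print(f"McNemar exact p = {p_value:.3f}")
```

With only two changed tasks, the smallest p-value this test can produce is 0.5, so a suite of this size cannot reach significance at α=0.05 from two flips alone.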

+16.7pp
pass-rate delta [95% CI -4.4, +37.8]
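
The interval above is consistent with a simple Wald (normal-approximation) interval on the paired difference in pass rates. That is an assumption about the method, not something the report states; the sketch below, with a hypothetical function name, shows the arithmetic using the reported counts (0 worse, 2 better, 12 tasks).

```python
from math import sqrt


def paired_delta_wald_ci(worse: int, better: int, n: int, z: float = 1.96):
    """Wald interval for the paired pass-rate difference (candidate - original)."""
    delta = (better - worse) / n
    # Variance of a paired difference depends only on the discordant counts.
    variance = ((better + worse) - (better - worse) ** 2 / n) / n ** 2
    half_width = z * sqrt(variance)
    return delta, delta - half_width, delta + half_width


delta, ci_low, ci_high = paired_delta_wald_ci(worse=0, better=2, n=12)
print(f"{delta * 100:+.1f}pp [95% CI {ci_low * 100:+.1f}, {ci_high * 100:+.1f}]")
```

The Wald interval is known to be unreliable for samples this small, so the bounds are best read as indicative.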

Pass rate (95% CI whiskers)

Original: 0/12 (0%)
Candidate: 2/12 (17%)

Cost (measured)

Original: $0.004632
Candidate: $0.003068
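
The headline saving follows directly from the two measured totals. A short worked calculation, with illustrative variable names:

```python
# Measured API cost for the whole 12-task run, in US dollars (from the report).
original_cost = 0.004632
candidate_cost = 0.003068

# Relative saving of the candidate against the original model.
saving = (original_cost - candidate_cost) / original_cost
print(f"{-saving:.0%} cost change")
print(f"${original_cost - candidate_cost:.6f} saved over the run")
```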

Generated 2026-06-26 19:03 · tokenjam-bench · proof report