TokenJam proof — enterprise-rag

tokenjam 0.5.1   n=12 tasks · k=1 sample(s) · deepseek:deepseek-reasoner → deepseek:deepseek-chat

Switching looks safe

We ran the same 12 enterprise-rag tasks through both models and graded every answer with the same automated judge. The cheaper candidate (deepseek-chat) cost 43% less than deepseek-reasoner and scored the same on this suite: 1 of 12 tasks (8%) passed before, and 1 of 12 (8%) after.

No statistically significant quality drop was detected, so moving to the cheaper model is a reasonable way to cut cost. Note: both models failed 11 of the 12 tasks on this hard suite, so spot-check the answers before relying on either.

-43%               cheaper to run (measured API $)
+0 pts             accuracy vs the original model
0 worse · 0 better tasks changed by the swap (of 12)

How each task did

Task                 Original  Cheaper model  Why (the judge’s reason)
Pto accrual          fail      fail           The actual output states it cannot answer the question, but the expected output… graded 0.00 (needs 0.50)
Parental leave       fail      fail           The actual output claims up to 20 weeks of paid leave, while the expected output… graded 0.00 (needs 0.50)
Oncall sev1          fail      fail           The actual output states that no information is available, but the expected outp… graded 0.00 (needs 0.50)
Db restore           fail      fail           The actual output states there is no documented procedure, but the expected outp… graded 0.00 (needs 0.50)
Data retention       fail      fail           The actual output states no information is available, but the expected output sp… graded 0.00 (needs 0.50)
Vendor security      fail      fail           The actual output states no information is available, but the expected output sp… graded 0.00 (needs 0.50)
Expense limit        fail      fail           The actual output states the information is not found, while the expected output… graded 0.00 (needs 0.50)
Api rate limit       fail      fail           The actual output states that no information is available, but the expected outp… graded 0.00 (needs 0.50)
Sso setup            fail      fail           The actual output states it cannot answer the question, while the expected outpu… graded 0.00 (needs 0.50)
Incident postmortem  fail      fail           The actual output states that the documents do not contain the required informat… graded 0.00 (needs 0.50)
Not covered          pass      pass           The actual output correctly states that the documents lack information on the to… graded 0.80 (needs 0.50)
Password rotation    fail      fail           The actual output states it cannot answer the question, while the expected outpu… graded 0.00 (needs 0.50)

The statistics behind it

Verdict: No significant regression  ·  McNemar p=1.000 (α=0.05)  ·  candidate chosen by explicit --candidate override

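McNemar’s test asks whether the difference between the two models is bigger than chance. It looks only at the tasks where the two models’ pass/fail results differ; a p-value above 0.05 means the change is not statistically significant. Here no task changed outcome (0 worse, 0 better), which is why p=1.000. The 95% CI on the pass-rate delta is the range the true difference is likely in; if it crosses zero, the direction isn’t certain.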
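
As a rough illustration of where the p-value comes from, the sketch below computes a two-sided exact (binomial) McNemar p-value from the two discordant counts shown above (0 worse, 0 better). The report does not say which variant of the test tokenjam uses, so the exact form is an assumption, and the function name mcnemar_exact_p is made up for this example.

```python
from math import comb


def mcnemar_exact_p(worse: int, better: int) -> float:
    """Two-sided exact McNemar p-value from discordant-pair counts.

    Assumption: the exact binomial form of the test; the report does not
    state whether tokenjam uses this or the chi-square approximation.
    """
    n = worse + better          # tasks whose pass/fail result changed
    k = min(worse, better)      # the smaller of the two discordant counts
    # Under the null hypothesis, each changed task is equally likely to
    # flip in either direction, so the smaller count follows Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)   # double the tail, capped at 1


# Counts from this report: 0 tasks got worse, 0 got better.
p_value = mcnemar_exact_p(worse=0, better=0)
print(f"McNemar p = {p_value:.3f}")
```

With no discordant tasks the test has nothing to compare, so the p-value is 1 regardless of how many tasks both models failed.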

+0.0pp  pass-rate delta [95% CI +0.0, +0.0]

Pass rate (95% CI whiskers)

Original   1/12 (8%)
Candidate  1/12 (8%)
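
To give a sense of how wide the whiskers on a 1/12 pass rate can be, here is a minimal sketch using the Wilson score interval. The report does not state which interval method it plots, so Wilson is an assumption, and wilson_interval is a hypothetical helper name.

```python
from math import sqrt


def wilson_interval(passes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (z=1.96 gives roughly 95%).

    Assumption: this is one common choice for small samples; the report
    does not say which method produced its whiskers.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p_hat = passes / total
    denom = 1 + z * z / total
    center = (p_hat + z * z / (2 * total)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / total + z * z / (4 * total * total)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Both models passed 1 of 12 tasks in this report.
low, high = wilson_interval(passes=1, total=12)
print(f"pass rate {1 / 12:.0%}, 95% interval about {low:.1%} to {high:.1%}")
```

With only 12 tasks the interval is wide, which is one reason to spot-check answers rather than rely on the pass rate alone.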

Cost (measured)

Original   $0.002298
Candidate  $0.001314
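
The headline cost figure can be checked from the two measured totals above. This is a small arithmetic sketch, assuming the percentage is the relative change of the candidate's cost against the original's; the variable names are illustrative.

```python
# Measured API cost for the whole 12-task run, taken from this report.
original_cost = 0.002298   # deepseek-reasoner, in dollars
candidate_cost = 0.001314  # deepseek-chat, in dollars

# Relative change of the candidate against the original (negative = cheaper).
cost_change = (candidate_cost - original_cost) / original_cost
print(f"cost change: {cost_change:+.0%}")  # prints "cost change: -43%"
```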

Generated 2026-06-26 18:58 · tokenjam-bench · proof report