TokenJam proof — customer-support

tokenjam 0.5.1   n=12 tasks · k=1 sample(s) · deepseek:deepseek-reasoner → deepseek:deepseek-chat

Switching looks safe

We ran the same 12 customer-support tasks through both models and graded every answer with the same automated judge. The cheaper candidate (deepseek-chat) cost 47% less than deepseek-reasoner and scored the same on this suite: 1 of 12 tasks (8%) passed before and 1 of 12 (8%) after, although not the same task.

No statistically significant quality drop was detected, so moving to the cheaper model is a reasonable way to cut cost. Note: both models passed only 1 of 12 tasks on this hard suite, and 12 tasks is a small sample, so spot-check the answers before relying on either.

Cost to run (measured API $): -47%
Accuracy vs the original model: +0 pts
Tasks changed by the swap (of 12): 1 worse · 1 better

How each task did

Task                | Original | Cheaper model | Graded (needs 0.50) | Why (the judge’s reason, truncated in the source)
Refund window       | fail     | fail          | 0.00                | The actual output does not confirm the return window and instead asks the custom
Delivery delay      | fail     | fail          | 0.20                | The actual output does not match the expected output's key facts: it states stan
Wrong product       | fail     | fail          | 0.20                | The actual output suggests waiting for the return before sending a replacement o
Password reset      | fail     | fail          | 0.30                | The actual output provides a detailed step-by-step guide, but contradicts the ex
Subscription cancel | fail     | fail          | 0.30                | The actual output adds unnecessary details about account verification and policy
Payment failed      | fail     | fail          | 0.40                | The actual output provides a detailed list of possible reasons and steps, while
Invoice request     | fail     | fail          | 0.00                | The actual output asks for confirmation of company details before processing, wh
Account locked      | fail     | fail          | 0.10                | The actual output does not provide the specific policy detail that accounts auto
Shipping status     | fail     | fail          | 0.00                | The actual output does not provide any tracking information or estimated deliver
Feature request     | fail     | pass          | 0.60                | The actual output is more detailed and formal, but it does not convey the key pr
Bug report          | fail     | fail          | 0.10                | The actual output provides troubleshooting steps for the CSV export issue, while
Escalation          | pass     | fail          | 0.20                | The actual output does not match the expected output's commitment to immediate e
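
The "1 worse · 1 better" headline can be recomputed from the per-task outcomes. The snippet below is a minimal sketch, not TokenJam's own code: the `results` list is transcribed by hand from the table, and the `tally` helper is a hypothetical name introduced for illustration.

```python
# Outcomes transcribed from the table: (task, original, cheaper model).
results = [
    ("Refund window", "fail", "fail"),
    ("Delivery delay", "fail", "fail"),
    ("Wrong product", "fail", "fail"),
    ("Password reset", "fail", "fail"),
    ("Subscription cancel", "fail", "fail"),
    ("Payment failed", "fail", "fail"),
    ("Invoice request", "fail", "fail"),
    ("Account locked", "fail", "fail"),
    ("Shipping status", "fail", "fail"),
    ("Feature request", "fail", "pass"),
    ("Bug report", "fail", "fail"),
    ("Escalation", "pass", "fail"),
]


def tally(rows):
    """Count passes per model and the tasks that flipped in each direction."""
    return {
        "original_pass": sum(1 for _, before, _ in rows if before == "pass"),
        "candidate_pass": sum(1 for _, _, after in rows if after == "pass"),
        "worse": sum(1 for _, before, after in rows if (before, after) == ("pass", "fail")),
        "better": sum(1 for _, before, after in rows if (before, after) == ("fail", "pass")),
    }


summary = tally(results)
print(summary)
```

Only the two flipped tasks (Feature request and Escalation) carry information about a difference between the models; the ten tasks that failed under both do not distinguish them.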

The statistics behind it

Verdict: No significant regression  ·  McNemar p=1.000 (α=0.05)  ·  candidate chosen by explicit --candidate override

McNemar’s test looks only at the tasks whose result changed between the two models (here 1 got worse and 1 got better) and asks whether that imbalance is larger than chance would produce. A p-value above α=0.05 means the change is not statistically significant. The 95% CI on the pass-rate delta is the range the true difference is likely to fall in; if it crosses zero, as it does here, the direction of the change isn’t certain.
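
As a rough check on the p-value, the exact (binomial) form of McNemar's test can be computed from the two flip counts alone. This is a sketch under an assumption: the report does not say whether TokenJam uses the exact test or the chi-square approximation, and `mcnemar_exact_p` is a hypothetical helper name, not part of the tool.

```python
from math import comb


def mcnemar_exact_p(worse: int, better: int) -> float:
    """Two-sided exact McNemar p-value from the discordant pair counts.

    Under the null hypothesis each flipped task is equally likely to have
    flipped in either direction, so the smaller count follows
    Binomial(worse + better, 0.5).
    """
    n = worse + better
    if n == 0:
        return 1.0  # no task changed, so there is no evidence of a difference
    k = min(worse, better)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Flip counts from this report: 1 task got worse, 1 got better.
p_value = mcnemar_exact_p(worse=1, better=1)
print(f"McNemar p={p_value:.3f}")  # prints "McNemar p=1.000"
```

With only two flipped tasks split evenly, the exact test cannot return anything below 1.0, which agrees with the reported verdict.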

Pass-rate delta: +0.0pp [95% CI -23.1, +23.1]
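
The report does not state which interval formula it uses. One standard choice, a Wald interval for the difference of paired proportions, happens to reproduce the printed numbers, so the sketch below uses it as an assumption. The function name `paired_delta_ci` and the `z=1.96` default are illustrative, not taken from TokenJam.

```python
from math import sqrt


def paired_delta_ci(worse: int, better: int, n_tasks: int, z: float = 1.96):
    """Wald confidence interval for a paired pass-rate difference.

    delta = (better - worse) / n
    se    = sqrt((worse + better) - (better - worse)**2 / n) / n
    """
    delta = (better - worse) / n_tasks
    se = sqrt((worse + better) - (better - worse) ** 2 / n_tasks) / n_tasks
    return delta, delta - z * se, delta + z * se


# Flip counts and suite size from this report.
delta, low, high = paired_delta_ci(worse=1, better=1, n_tasks=12)
print(f"{delta * 100:+.1f}pp [95% CI {low * 100:+.1f}, {high * 100:+.1f}]")
# prints "+0.0pp [95% CI -23.1, +23.1]"
```

The width of the interval comes almost entirely from the small suite: with 12 tasks, two flips are enough to leave a swing of about 23 points in either direction unexcluded.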

Pass rate

Original: 1/12 (8%)
Candidate: 1/12 (8%)

Cost (measured)

Original: $0.009207
Candidate: $0.004861
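
The 47% headline follows directly from these two measured figures. A minimal arithmetic sketch (the variable names are illustrative):

```python
# Measured API cost in dollars, as printed in this report.
original_cost = 0.009207
candidate_cost = 0.004861

# Relative saving of the candidate against the original model.
saving = (original_cost - candidate_cost) / original_cost
print(f"{saving:.1%} cheaper")  # prints "47.2% cheaper"
```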

Generated 2026-06-26 18:56 · tokenjam-bench · proof report