TokenJam proof — customer-support

tokenjam 0.5.2   n=12 tasks · k=1 sample(s) · openai:gpt-4o → openai:gpt-4o-mini

Switching looks safe

We ran the same 12 customer-support tasks through both models and graded every answer with the same automated judge. The cheaper candidate (gpt-4o-mini) cost 94% less than gpt-4o and scored about 8 percentage points higher on this suite: 58% of tasks passed before, 67% after.

No statistically significant quality drop was detected, so moving to the cheaper model is a reasonable way to cut cost.

-94%: cheaper to run (measured API $)
+8 pts: accuracy vs the original model
2 worse · 3 better: tasks changed by the swap (of 12)

How each task did

Task                | Original | Cheaper model | Graded (needs 0.50) | Why (the judge's reason)
Refund window       | pass     | pass          | 0.80                | The actual output is factually accurate and semantically equivalent to the expec…
Delivery delay      | fail     | pass          | 0.50                | The actual output acknowledges the issue with order #88412 and suggests checking…
Wrong product       | pass     | fail          | 0.40                | The actual output acknowledges the order mix-up and provides steps to resolve it…
Password reset      | pass     | pass          | 0.50                | The actual output provides a detailed and polite response with steps to resolve…
Subscription cancel | fail     | pass          | 0.50                | The actual output provides detailed instructions on how to cancel the subscripti…
Payment failed      | pass     | pass          | 0.60                | The actual output provides a detailed explanation of potential reasons for payme…
Invoice request     | fail     | fail          | 0.30                | The actual output does not match the expected output in terms of factual informa…
Account locked      | pass     | pass          | 0.60                | The actual output accurately describes the account lock due to failed login atte…
Shipping status     | fail     | pass          | 0.60                | The actual output provides a general response about tracking the order, which al…
Feature request     | pass     | pass          | 0.60                | The actual output acknowledges the feedback and assures the user that their sugg…
Bug report          | fail     | fail          | 0.20                | The actual output provides troubleshooting steps for the CSV export issue, while…
Escalation          | pass     | fail          | 0.30                | The actual output acknowledges the customer's frustration and offers assistance,…
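
The headline counts can be re-derived from the per-task outcomes above. The following is a minimal sketch, not TokenJam's own code: the RESULTS list is transcribed by hand from the table, and the tally helper is a hypothetical name introduced only for illustration.

```python
# (task, original outcome, cheaper-model outcome), transcribed from the table.
RESULTS = [
    ("Refund window", "pass", "pass"),
    ("Delivery delay", "fail", "pass"),
    ("Wrong product", "pass", "fail"),
    ("Password reset", "pass", "pass"),
    ("Subscription cancel", "fail", "pass"),
    ("Payment failed", "pass", "pass"),
    ("Invoice request", "fail", "fail"),
    ("Account locked", "pass", "pass"),
    ("Shipping status", "fail", "pass"),
    ("Feature request", "pass", "pass"),
    ("Bug report", "fail", "fail"),
    ("Escalation", "pass", "fail"),
]


def tally(results):
    """Count passes per model and list the tasks whose outcome changed."""
    return {
        "n": len(results),
        "orig_pass": sum(o == "pass" for _, o, _ in results),
        "cand_pass": sum(c == "pass" for _, _, c in results),
        # pass -> fail after the swap
        "worse": [t for t, o, c in results if o == "pass" and c == "fail"],
        # fail -> pass after the swap
        "better": [t for t, o, c in results if o == "fail" and c == "pass"],
    }


summary = tally(RESULTS)
print(f"original  {summary['orig_pass']}/{summary['n']}")
print(f"candidate {summary['cand_pass']}/{summary['n']}")
print("worse: ", summary["worse"])
print("better:", summary["better"])
```

Only the tasks in the worse and better lists carry information about a difference between the models; the seven tasks with identical outcomes do not enter the significance test.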

The statistics behind it

Verdict: No significant regression  ·  McNemar p=1.000 (α=0.05)  ·  candidate chosen by explicit --candidate override

McNemar’s test looks only at the tasks whose pass/fail result changed between the two models (here 2 worse, 3 better) and asks whether that split is more lopsided than chance alone would produce: a p-value above 0.05 means the change is not statistically significant. The 95% CI on the pass-rate delta is the range the true difference is likely to lie in; if it crosses zero, the direction isn’t certain.
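
As a rough check on the reported p-value, the exact (binomial) form of McNemar's test can be computed from the two discordant counts alone. This is a sketch under an assumption: the report does not say which variant of the test TokenJam runs, and mcnemar_exact is a hypothetical helper, not part of the tool.

```python
from math import comb


def mcnemar_exact(worse: int, better: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant counts.

    Under the null hypothesis each changed task is equally likely to have
    gone either way, so the smaller count follows Binomial(n, 0.5).
    """
    n = worse + better
    if n == 0:
        return 1.0  # no task changed, so there is no evidence of a difference
    k = min(worse, better)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)  # double the one-sided tail, capped at 1


p_value = mcnemar_exact(worse=2, better=3)
print(f"McNemar p = {p_value:.3f}")  # McNemar p = 1.000
```

With only five changed tasks, a 2-versus-3 split is about as balanced as it can be, which is why the p-value sits at its maximum.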

+8.3pp pass-rate delta [95% CI -27.9, +44.5]
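
One interval that reproduces these figures is a Wald interval for the difference of paired proportions. Whether TokenJam uses this exact formula is an assumption; the sketch below only shows that the formula is consistent with the numbers in this report. The function name paired_delta_ci is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def paired_delta_ci(worse: int, better: int, n: int, level: float = 0.95):
    """Wald confidence interval for a paired pass-rate difference.

    worse / better are the counts of tasks that flipped pass->fail and
    fail->pass; n is the total number of paired tasks.
    """
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for 95%
    delta = (better - worse) / n
    se = sqrt((better + worse) - (better - worse) ** 2 / n) / n
    return delta, delta - z * se, delta + z * se


delta, low, high = paired_delta_ci(worse=2, better=3, n=12)
print(f"{delta * 100:+.1f}pp [95% CI {low * 100:+.1f}, {high * 100:+.1f}]")
# +8.3pp [95% CI -27.9, +44.5]
```

The interval is wide because it rests on 12 tasks; it includes both a sizeable improvement and a sizeable regression.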

Pass rate (95% CI whiskers)

Original: 7/12 (58%)
Candidate: 8/12 (67%)

Cost (measured)

Original: $0.019840
Candidate: $0.001175
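
The 94% headline follows directly from these two totals. A small sketch of the arithmetic, assuming both figures are the summed cost of the full 12-task run (the variable and function names are illustrative only):

```python
ORIGINAL_COST = 0.019840   # measured API $, original model
CANDIDATE_COST = 0.001175  # measured API $, cheaper candidate
N_TASKS = 12


def relative_savings(original: float, candidate: float) -> float:
    """Fraction of the original spend saved by switching models."""
    return 1 - candidate / original


savings = relative_savings(ORIGINAL_COST, CANDIDATE_COST)
print(f"savings: {savings:.1%}")  # savings: 94.1%
print(f"per task: ${ORIGINAL_COST / N_TASKS:.6f} -> ${CANDIDATE_COST / N_TASKS:.6f}")
```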


Generated 2026-06-26 13:44 · tokenjam-bench · proof report