TokenJam proof — enterprise-rag

tokenjam 0.5.2   n=12 tasks · k=1 sample(s) · openai:gpt-4o → openai:gpt-4o-mini

Switching looks safe

We ran the same 12 enterprise-rag tasks through both models and graded every answer with the same automated judge. The cheaper candidate (gpt-4o-mini) cost 91% less than gpt-4o and scored 8 percentage points higher on this suite: 8% of tasks passed before (1 of 12), 17% after (2 of 12).

No statistically significant quality drop was detected, so moving to the cheaper model is a reasonable way to cut cost. Note: both models scored low on this hard suite, so spot-check the answers before relying on either.

Cost: -91% (cheaper to run, measured API $)
Accuracy: +8 pts vs the original model
Tasks changed by the swap (of 12): 0 worse · 1 better

How each task did

Task | Original | Cheaper model | Why (the judge’s reason)
PTO accrual | fail | fail | The actual output fails to provide the specific details about PTO accrual and ro… graded 0.20 (needs 0.50)
Parental leave | fail | fail | The actual output fails to provide the specific details about the parental leave… graded 0.20 (needs 0.50)
Oncall sev1 | fail | pass | The actual output correctly states the 15-minute response time for a Sev-1 incid… graded 0.60 (needs 0.50)
DB restore | fail | fail | The actual output does not provide any factual information or procedural steps r… graded 0.20 (needs 0.50)
Data retention | fail | fail | The actual output fails to provide the specific retention period of 30 days for… graded 0.20 (needs 0.50)
Vendor security | fail | fail | The actual output provides a general overview of the security review process, me… graded 0.30 (needs 0.50)
Expense limit | fail | fail | The actual output incorrectly states the per-meal expense limit as $75, while th… graded 0.30 (needs 0.50)
API rate limit | fail | fail | The actual output inaccurately states the rate limit as 60 requests per minute i… graded 0.30 (needs 0.50)
SSO setup | fail | fail | The actual output fails to provide the specific details about SSO protocols and… graded 0.20 (needs 0.50)
Incident postmortem | fail | fail | The actual output fails to provide any of the specific factual details present i… graded 0.10 (needs 0.50)
Not covered | pass | pass | The actual output correctly identifies the lack of information on employee stock… graded 0.70 (needs 0.50)
Password rotation | fail | fail | The actual output correctly identifies the lack of information on password rotat… graded 0.30 (needs 0.50)
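
The headline counts can be rechecked from the pass/fail columns alone. The sketch below is illustrative and is not TokenJam's own code: the snake_case task keys and the tally function are hypothetical names, and the outcomes are copied from the rows above as (original, candidate) pairs.

```python
# Pass/fail outcomes per task as (original, candidate).
# The snake_case keys are hypothetical labels, not TokenJam identifiers.
results = {
    "pto_accrual": ("fail", "fail"),
    "parental_leave": ("fail", "fail"),
    "oncall_sev1": ("fail", "pass"),
    "db_restore": ("fail", "fail"),
    "data_retention": ("fail", "fail"),
    "vendor_security": ("fail", "fail"),
    "expense_limit": ("fail", "fail"),
    "api_rate_limit": ("fail", "fail"),
    "sso_setup": ("fail", "fail"),
    "incident_postmortem": ("fail", "fail"),
    "not_covered": ("pass", "pass"),
    "password_rotation": ("fail", "fail"),
}


def tally(outcomes):
    """Count passes per model and the tasks whose result changed."""
    pairs = list(outcomes.values())
    return {
        "n": len(pairs),
        "original_pass": sum(o == "pass" for o, _ in pairs),
        "candidate_pass": sum(c == "pass" for _, c in pairs),
        "worse": sum(o == "pass" and c == "fail" for o, c in pairs),
        "better": sum(o == "fail" and c == "pass" for o, c in pairs),
    }


summary = tally(results)
print(summary)
```

Only the "worse" and "better" counts (the tasks whose result flipped) feed into a paired comparison; the ten tasks that fail under both models carry no information about which model is better.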

The statistics behind it

Verdict: No significant regression  ·  McNemar p=1.000 (α=0.05)  ·  candidate chosen by explicit --candidate override

McNemar’s test looks at the tasks whose pass/fail result changed between the two models and asks whether that shift is bigger than chance alone would produce: a p-value above 0.05 means the change is not statistically significant. The 95% CI on the pass-rate delta is the range the true difference is likely to fall in; if it crosses zero, the direction isn’t certain.
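
As a rough illustration of both quantities, the sketch below computes an exact (binomial) McNemar p-value and a Wald interval for a paired difference in proportions. The report does not say which McNemar variant or interval formula TokenJam uses, so both choices are assumptions, and the function names are hypothetical. With this report's counts (0 worse, 1 better, 12 tasks) these formulas do give p = 1.000 and a delta of +8.3 pp with bounds of -7.3 and +24.0.

```python
from math import comb, sqrt


def mcnemar_exact(worse, better):
    """Two-sided exact McNemar p-value from the two discordant counts.

    Under the null hypothesis, each changed task is equally likely to
    flip in either direction, so the smaller count is Binomial(n, 0.5).
    """
    n = worse + better
    if n == 0:
        return 1.0  # nothing changed, so there is no evidence of a difference
    k = min(worse, better)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def paired_delta_ci(worse, better, n_tasks, z=1.96):
    """Pass-rate delta (candidate minus original) with a Wald interval."""
    delta = (better - worse) / n_tasks
    se = sqrt((worse + better) - (better - worse) ** 2 / n_tasks) / n_tasks
    return delta, delta - z * se, delta + z * se


p_value = mcnemar_exact(worse=0, better=1)
delta, low, high = paired_delta_ci(worse=0, better=1, n_tasks=12)
print(f"p={p_value:.3f}")
print(f"delta={delta * 100:+.1f}pp  95% CI [{low * 100:+.1f}, {high * 100:+.1f}]")
```

With a single changed task, the exact test cannot return a p-value below 1.0, which is why the interval is the more informative of the two numbers here.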

Pass-rate delta: +8.3 pp (95% CI -7.3 to +24.0)

Pass rate (95% CI whiskers)

Original: 1/12 (8%)
Candidate: 2/12 (17%)

Cost (measured)

Original: $0.004237
Candidate: $0.000379
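
The 91% figure follows from these two measured costs. A minimal sketch of the arithmetic, with a hypothetical helper name:

```python
def percent_cheaper(original_cost, candidate_cost):
    """Relative saving of the candidate versus the original, in percent."""
    return (original_cost - candidate_cost) / original_cost * 100


saving = percent_cheaper(0.004237, 0.000379)
print(f"{saving:.0f}% cheaper")  # prints "91% cheaper"
```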

How to read this

Generated 2026-06-26 13:46 · tokenjam-bench · proof report