TokenJam proof — email-assistant

tokenjam 0.5.2   n=12 tasks · k=1 sample(s) · openai:gpt-4o → openai:gpt-4o-mini

Switching looks safe

We ran the same 12 email-assistant tasks through both models and graded every answer with the same automated judge. The cheaper candidate (gpt-4o-mini) cost 94% less than gpt-4o but passed one fewer task: the pass rate fell 8 percentage points on this suite, from 100% (12/12) to 92% (11/12).

No statistically significant quality drop was detected (McNemar p=1.000 at α=0.05), so moving to the cheaper model is a reasonable way to cut cost.

-94%  cheaper to run (measured API $)
-8 pts  accuracy vs the original model
1 worse · 0 better  tasks changed by the swap (of 12)

How each task did

Task | Original | Cheaper model | Why: the judge’s reason | Grade
--- | --- | --- | --- | ---
Reply reschedule | pass | fail | The actual output suggests rescheduling the demo with specific times, similar to ... | 0.40 (needs 0.50)
Reply refund firm | pass | pass | The actual output accurately conveys the factual information and meaning of the ... | 0.90 (needs 0.50)
Summarize thread | pass | pass | The actual output closely matches the expected output in terms of factual inform... | 0.90 (needs 0.50)
Triage priority | pass | pass | The actual output accurately reflects the urgency and impact on EU customers, si... | 0.80 (needs 0.50)
Triage spam | pass | pass | The actual output aligns well with the expected output in terms of identifying t... | 0.80 (needs 0.50)
Extract meeting | pass | pass | The actual output accurately reflects the date and time details as in the expect... | 0.70 (needs 0.50)
Extract actions | pass | pass | The actual output is factually accurate and semantically equivalent to the expec... | 0.80 (needs 0.50)
Followup noreply | pass | pass | The actual output maintains the core intent of following up on a proposal and of... | 0.60 (needs 0.50)
Reply decline meeting | pass | pass | The actual output is factually accurate and semantically equivalent to the expec... | 0.80 (needs 0.50)
Summarize exec | pass | pass | The actual output closely matches the expected output in terms of factual inform... | 0.90 (needs 0.50)
Triage route | pass | pass | The actual output aligns well with the expected output in terms of identifying t... | 0.80 (needs 0.50)
Reply sensitive data | pass | pass | The actual output correctly addresses the security concern of not sharing full c... | 0.60 (needs 0.50)
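
The pass/fail column can be reproduced from the grades with a simple threshold check. The sketch below is illustrative only: it assumes the grade shown in each row belongs to the cheaper model, and that a grade equal to the 0.50 threshold would count as a pass (the report does not say which way ties go; no grade here sits exactly on the threshold). The names candidate_scores and passed are hypothetical, not part of tokenjam.

```python
# Grades copied from the table above (assumed to be the cheaper model's).
candidate_scores = {
    "Reply reschedule": 0.40,
    "Reply refund firm": 0.90,
    "Summarize thread": 0.90,
    "Triage priority": 0.80,
    "Triage spam": 0.80,
    "Extract meeting": 0.70,
    "Extract actions": 0.80,
    "Followup noreply": 0.60,
    "Reply decline meeting": 0.80,
    "Summarize exec": 0.90,
    "Triage route": 0.80,
    "Reply sensitive data": 0.60,
}

PASS_THRESHOLD = 0.50  # "needs 0.50" in the table


def passed(score, threshold=PASS_THRESHOLD):
    # Assumption: a grade equal to the threshold counts as a pass.
    return score >= threshold


results = {task: passed(score) for task, score in candidate_scores.items()}
n_pass = sum(results.values())
failed = [task for task, ok in results.items() if not ok]

print(f"{n_pass}/{len(results)} passed; failed: {failed}")
# 11/12 passed; failed: ['Reply reschedule']
```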

The statistics behind it

Verdict: No significant regression  ·  McNemar p=1.000 (α=0.05)  ·  candidate chosen by explicit --candidate override

McNemar’s test looks only at the tasks whose outcome changed between the two models and asks whether they flipped in one direction more often than chance would explain: a p-value above 0.05 means the change is not statistically significant. The 95% CI on the pass-rate delta is the range in which the true difference is likely to lie; if it crosses zero, the direction isn’t certain.
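
As a rough check on the verdict, the p-value can be recomputed from the two discordant counts in this report (1 task got worse, 0 got better). The sketch below uses the exact (binomial) form of McNemar's test; the report does not state which variant tokenjam uses, so that choice is an assumption, although it does reproduce p=1.000. The function name mcnemar_exact is hypothetical.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the discordant pair counts.

    b: tasks that passed on the original and failed on the candidate
    c: tasks that failed on the original and passed on the candidate
    """
    n = b + c
    if n == 0:
        return 1.0  # no task changed outcome, so there is nothing to test
    k = min(b, c)
    # Under the null, each discordant task is a fair coin flip:
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


p = mcnemar_exact(b=1, c=0)
print(f"McNemar p={p:.3f}")  # McNemar p=1.000
```

With a single discordant task the test cannot reach significance in either direction, which is why p is exactly 1 here.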

-8.3pp  pass-rate delta [95% CI -24.0, +7.3]
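
The delta and its interval can be approximated by hand. The sketch below assumes a Wald (normal-approximation) interval for a paired difference in proportions; the report does not name the method, but this one reproduces the figures shown. The function name paired_delta_ci is hypothetical.

```python
from math import sqrt


def paired_delta_ci(b, c, n, z=1.96):
    """Wald interval for the paired pass-rate difference (candidate - original).

    b: tasks that got worse, c: tasks that got better, n: total tasks.
    """
    delta = (c - b) / n
    variance = (b + c - (c - b) ** 2 / n) / n ** 2
    half_width = z * sqrt(variance)
    return delta, delta - half_width, delta + half_width


delta, lo, hi = paired_delta_ci(b=1, c=0, n=12)
print(f"{delta * 100:+.1f}pp [95% CI {lo * 100:+.1f}, {hi * 100:+.1f}]")
# -8.3pp [95% CI -24.0, +7.3]
```

The interval spans zero, so on 12 tasks the data are compatible with anything from a sizeable drop to a small gain.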

Pass rate (95% CI whiskers)

Original: 12/12 (100%)
Candidate: 11/12 (92%)

Cost (measured)

Original: $0.012122
Candidate: $0.000706
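
The headline saving follows directly from these two figures. A minimal sketch of the arithmetic, assuming both amounts are totals for the same run of 12 tasks (the report does not say whether they are totals or per-task averages; the percentage is the same either way):

```python
original_cost = 0.012122   # measured API $ for gpt-4o
candidate_cost = 0.000706  # measured API $ for gpt-4o-mini

saving = 1 - candidate_cost / original_cost
ratio = original_cost / candidate_cost

print(f"{saving:.0%} cheaper, about {ratio:.0f}x less per run")
# 94% cheaper, about 17x less per run
```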


Generated 2026-06-26 13:48 · tokenjam-bench · proof report