TokenJam proof — research-assistant

tokenjam 0.5.2   n=12 tasks · k=1 sample · openai:gpt-4o → openai:gpt-4o-mini

Switching looks safe

We ran the same 12 research-assistant tasks through both models and graded every answer with the same automated judge. The cheaper candidate (gpt-4o-mini) cost 94% less than gpt-4o and scored the same on this suite: 1 of 12 tasks (8%) passed with each model.

No statistically significant quality drop was detected, so moving to the cheaper model is a reasonable way to cut cost. Note: both models failed 11 of the 12 tasks on this hard suite, which leaves little room to detect a regression, so spot-check the answers before relying on either model.

-94%: cheaper to run (measured API $)
+0 pts: accuracy vs the original model
0 worse · 0 better: tasks changed by the swap (of 12)

How each task did

Task | Original | Cheaper model | Why (the judge’s reason)
Synthesize trend | fail | fail | The actual output discusses trends in AI adoption for 2025, focusing on integrat… graded 0.20 (needs 0.50)
Compare conflicting | fail | fail | The actual output discusses general perspectives on remote work productivity, wh… graded 0.20 (needs 0.50)
Market sizing | fail | fail | The actual output provides a detailed methodology for estimating a serviceable m… graded 0.20 (needs 0.50)
Competitive positioning | fail | fail | The actual output does not provide any factual information or comparison about t… graded 0.20 (needs 0.50)
Timeline extract | fail | fail | The actual output does not provide the specific timeline details present in the… graded 0.20 (needs 0.50)
Risk summary | fail | fail | The actual output does not align with the expected output in terms of factual co… graded 0.00 (needs 0.50)
Define grounded | pass | pass | The actual output accurately describes net revenue retention, aligning with the… graded 0.80 (needs 0.50)
Abstain missing | fail | fail | The actual output provides a detailed analysis of factors influencing market sha… graded 0.30 (needs 0.50)
Two source agree | fail | fail | The actual output does not provide the specific factual information found in the… graded 0.20 (needs 0.50)
Trend report | fail | fail | The actual output discusses trends in evaluation tooling with a focus on AI and… graded 0.20 (needs 0.50)
Quantify claim | fail | fail | The actual output does not provide the specific factual breakdown of cost saving… graded 0.30 (needs 0.50)
Recommendation | fail | fail | The actual output discusses the benefits of continuous benchmarking in a general… graded 0.30 (needs 0.50)

The statistics behind it

Verdict: No significant regression  ·  McNemar p=1.000 (α=0.05)  ·  candidate chosen by explicit --candidate override

McNemar’s test compares the two models only on the tasks where their results differ (one passes, the other fails) and asks whether that difference is bigger than chance: a p-value above 0.05 means the change is not statistically significant. Here no task changed, so p=1.000. The 95% CI on the pass-rate delta is the range the true difference is likely to fall in; if it crosses zero, the direction isn’t certain.
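To make the test concrete, here is a minimal Python sketch of a two-sided exact McNemar test on paired pass/fail results. It is an illustration only: the report does not say whether tokenjam uses the exact (binomial) form or the chi-square approximation, and the function name `mcnemar_exact` is hypothetical. The pass/fail lists are read from the task table in this report.

```python
from math import comb

def mcnemar_exact(original_pass, candidate_pass):
    """Two-sided exact McNemar test on paired pass/fail outcomes.

    Returns (worse, better, p_value). Only discordant pairs, where the
    two models disagree on a task, carry information for the test.
    """
    if len(original_pass) != len(candidate_pass):
        raise ValueError("both models must be graded on the same tasks")
    # worse: passed with the original model, failed with the candidate
    worse = sum(o and not c for o, c in zip(original_pass, candidate_pass))
    # better: failed with the original model, passed with the candidate
    better = sum(c and not o for o, c in zip(original_pass, candidate_pass))
    n = worse + better
    if n == 0:
        # No task changed outcome, so there is no evidence of a difference.
        return worse, better, 1.0
    k = min(worse, better)
    # Binomial(n, 0.5) tail probability, doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return worse, better, min(1.0, 2 * tail)

# Outcomes from the table: only the 7th task ("Define grounded") passes.
original = [False] * 6 + [True] + [False] * 5
candidate = [False] * 6 + [True] + [False] * 5
worse, better, p = mcnemar_exact(original, candidate)
print(f"{worse} worse, {better} better, p={p:.3f}")
```

With zero discordant pairs the p-value is 1.0 by construction, which matches the reported verdict but also shows that the test had nothing to work with on this suite.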

+0.0pp: pass-rate delta [95% CI +0.0, +0.0]

Pass rate (95% CI whiskers)

Original: 1/12 (8%)
Candidate: 1/12 (8%)
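The interval behind each whisker is not printed in this report, and the method tokenjam uses is not stated. As a rough sketch, assuming a Wilson score interval (a common choice for small samples), the uncertainty on a 1/12 pass rate can be estimated as follows. The function name `wilson_interval` is hypothetical.

```python
from math import sqrt

def wilson_interval(passes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = passes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)

low, high = wilson_interval(1, 12)
print(f"pass rate 1/12: {low:.1%} to {high:.1%}")
```

Under this assumption the interval is wide, from roughly 1.5% to about 35%, which is a reminder that 12 tasks pin down the pass rate only loosely.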

Cost (measured)

Original: $0.027948
Candidate: $0.001661
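The 94% figure can be checked directly from the two measured costs above. This is a small arithmetic sketch; the function name `cost_reduction` is hypothetical and not part of tokenjam.

```python
def cost_reduction(original_cost, candidate_cost):
    """Fraction of the original cost saved by switching to the candidate."""
    if original_cost <= 0:
        raise ValueError("original cost must be positive")
    return (original_cost - candidate_cost) / original_cost

# Measured costs reported above, in US dollars.
original_cost = 0.027948
candidate_cost = 0.001661

saving = cost_reduction(original_cost, candidate_cost)
ratio = original_cost / candidate_cost
print(f"saving: {saving:.0%}, original costs {ratio:.1f}x the candidate")
```

The saving rounds to 94%, which is equivalent to the original model costing a little under 17 times as much as the candidate on this run.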

Generated 2026-06-26 13:51 · tokenjam-bench · proof report