TokenJam proof — research-assistant

tokenjam 0.5.2   n=12 tasks · k=1 sample · anthropic:claude-opus-4-7 → anthropic:claude-haiku-4-5

Switching looks safe

We ran the same 12 research-assistant tasks through both models and graded every answer with the same automated judge. The cheaper candidate (claude-haiku-4-5) cost 92% less than claude-opus-4-7 but scored 17 percentage points lower on this suite: 75% of tasks passed before, 58% after.

The drop was not statistically significant (McNemar p=0.500 at α=0.05), so moving to the cheaper model is a reasonable way to cut cost.

Cost to run (measured API $): -92%
Accuracy vs the original model: -17 pts
Tasks changed by the swap (of 12): 2 worse · 0 better

How each task did

Task | Original | Cheaper model | Why (the judge’s reason)
Synthesize trend | pass | pass | The actual output aligns well with the expected output in terms of factual accur… (graded 0.90, needs 0.50)
Compare conflicting | pass | pass | The actual output accurately reflects the key factual details from the expected… (graded 0.90, needs 0.50)
Market sizing | pass | pass | The actual output accurately reflects the key figures and calculations from the… (graded 0.80, needs 0.50)
Competitive positioning | fail | fail | The actual output does not provide the expected competitive positioning summary. (graded 0.20, needs 0.50)
Timeline extract | pass | pass | The actual output accurately presents the factual information and semantic meani… (graded 0.80, needs 0.50)
Risk summary | pass | fail | The actual output correctly identifies the lack of provided sources, which is a… (graded 0.30, needs 0.50)
Define grounded | pass | pass | The actual output matches the expected output exactly, with no discrepancies in… (graded 1.00, needs 0.50)
Abstain missing | pass | pass | The actual output accurately identifies the lack of 2027 market share projection… (graded 0.90, needs 0.50)
Two source agree | pass | pass | The actual output is factually accurate and semantically equivalent to the expec… (graded 0.90, needs 0.50)
Trend report | fail | fail | The actual output does not provide the factual information or semantic content o… (graded 0.20, needs 0.50)
Quantify claim | pass | fail | The actual output does not provide any factual information or breakdown of cost… (graded 0.20, needs 0.50)
Recommendation | fail | fail | The actual output does not provide the specific recommendation found in the expe… (graded 0.20, needs 0.50)
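
The headline counts can be re-derived from the per-task verdicts. The sketch below is illustrative only: the `results` list is transcribed by hand from the table above, and the `tally` helper is a hypothetical name, not part of tokenjam.

```python
# (task, original verdict, candidate verdict), transcribed from the table above.
results = [
    ("Synthesize trend", "pass", "pass"),
    ("Compare conflicting", "pass", "pass"),
    ("Market sizing", "pass", "pass"),
    ("Competitive positioning", "fail", "fail"),
    ("Timeline extract", "pass", "pass"),
    ("Risk summary", "pass", "fail"),
    ("Define grounded", "pass", "pass"),
    ("Abstain missing", "pass", "pass"),
    ("Two source agree", "pass", "pass"),
    ("Trend report", "fail", "fail"),
    ("Quantify claim", "pass", "fail"),
    ("Recommendation", "fail", "fail"),
]


def tally(rows):
    """Return (original passes, candidate passes, worse, better)."""
    original_pass = sum(1 for _, o, _ in rows if o == "pass")
    candidate_pass = sum(1 for _, _, c in rows if c == "pass")
    # "Worse" means the original passed and the candidate failed; "better" is the reverse.
    worse = sum(1 for _, o, c in rows if o == "pass" and c == "fail")
    better = sum(1 for _, o, c in rows if o == "fail" and c == "pass")
    return original_pass, candidate_pass, worse, better


original_pass, candidate_pass, worse, better = tally(results)
n = len(results)
print(f"original {original_pass}/{n}, candidate {candidate_pass}/{n}")
print(f"{worse} worse, {better} better")
```

Only the two tasks that flipped from pass to fail (Risk summary and Quantify claim) feed the significance test; the ten tasks where both models agree carry no information about which model is better.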

The statistics behind it

Verdict: No significant regression  ·  McNemar p=0.500 (α=0.05)  ·  candidate chosen by explicit --candidate override

McNemar’s test asks whether the difference between the two models is larger than chance alone would explain: a p-value above 0.05 means the change is not statistically significant. The 95% CI on the pass-rate delta is the range the true difference likely falls in. If it crosses zero, the direction isn’t certain.
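
As a rough illustration of where p=0.500 can come from, the sketch below computes a two-sided exact (binomial) McNemar p-value from the two discordant counts in this run: 2 tasks got worse and 0 got better. The report does not say which McNemar variant tokenjam uses, so the exact form is an assumption here, and `mcnemar_exact` is a hypothetical helper name.

```python
from math import comb


def mcnemar_exact(worse: int, better: int) -> float:
    """Two-sided exact McNemar p-value from the discordant pair counts."""
    n = worse + better
    if n == 0:
        # No task changed verdict, so there is no evidence of any difference.
        return 1.0
    k = min(worse, better)
    # Chance of a split at least this lopsided if each flip were a fair coin toss.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


p = mcnemar_exact(worse=2, better=0)
print(f"p = {p:.3f}")  # p = 0.500
```

With only two discordant tasks, the smallest p-value this test can produce is 0.5, so no outcome on those two tasks could have reached α=0.05.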

Pass-rate delta: -16.7pp [95% CI -37.8, +4.4]
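
The interval above is consistent with a Wald (normal-approximation) confidence interval for a difference of paired proportions. That is an assumption: the report does not name the method, and `paired_delta_ci` is a hypothetical helper. A minimal sketch using the 2 worse, 0 better and 12 total tasks from this run:

```python
from math import sqrt


def paired_delta_ci(worse: int, better: int, n: int, z: float = 1.96):
    """Wald CI for the paired pass-rate difference (candidate minus original)."""
    delta = (better - worse) / n
    # Standard error for paired proportions, built from the discordant counts only.
    se = sqrt(worse + better - (better - worse) ** 2 / n) / n
    return delta, delta - z * se, delta + z * se


delta, lo, hi = paired_delta_ci(worse=2, better=0, n=12)
print(f"{delta * 100:+.1f}pp [{lo * 100:+.1f}, {hi * 100:+.1f}]")  # -16.7pp [-37.8, +4.4]
```

The upper bound sits just above zero, which is why the report treats the direction of the change as uncertain.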

Pass rate (95% CI whiskers)

Original: 9/12 (75%)
Candidate: 7/12 (58%)

Cost (measured)

Original: $0.115765
Candidate: $0.009568
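
The 92% headline saving follows from these two measured costs. A small worked check, where `relative_saving` is a hypothetical helper name:

```python
original_cost = 0.115765   # measured API $ for the original model
candidate_cost = 0.009568  # measured API $ for the cheaper candidate


def relative_saving(original: float, candidate: float) -> float:
    """Fraction of the original cost saved by switching to the candidate."""
    return (original - candidate) / original


saving = relative_saving(original_cost, candidate_cost)
print(f"{saving:.1%} cheaper")  # 91.7% cheaper
```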

Generated 2026-06-26 13:41 · tokenjam-bench · proof report