A guided walkthrough of one real upgrade. Live in your terminal.
Paired comparison of opus-4-6 and opus-4-7 on the context_rot_reasoning__context_rot suite (32 cases, McNemar's exact test). Data is replayed from committed outcomes for reproducibility.
Accuracy is the number every benchmark leaderboard reports. By this measure, the upgrade looks like a win.
Same prompts. Nearly the same correctness. But the bill tells a different story.
By context length — does the inflation hit you everywhere, or only on long prompts?
| Subgroup | n | Baseline acc | Challenger acc | Δ acc | Baseline $/correct | Challenger $/correct | Δ % |
|---|---|---|---|---|---|---|---|
| distractor:0k | 8 | 0.875 | 0.875 | +0.000 | $0.0015 | $0.0020 | +30.6% |
| distractor:2k | 8 | 0.875 | 0.875 | +0.000 | $0.0358 | $0.0517 | +44.4% |
| distractor:8k | 8 | 0.750 | 0.875 | +0.125 | $0.1618 | $0.2008 | +24.2% |
| distractor:32k | 8 | 0.875 | 0.875 | +0.000 | $0.5501 | $0.7974 | +45.0% |
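The $/correct and Δ % columns above can be sketched as follows. This is a minimal illustration, not the repo's implementation: the function names `cost_per_correct` and `pct_change` are hypothetical, and the inputs are the rounded dollar values displayed in the table. The Δ % column was presumably computed from unrounded costs, so recomputing from displayed values can differ by a few tenths of a point on the smallest figures.

```python
def cost_per_correct(total_cost_usd: float, n_correct: int) -> float:
    """Total spend divided by the number of correct answers (hypothetical helper)."""
    if n_correct == 0:
        return float("inf")  # nothing correct: cost per correct answer is unbounded
    return total_cost_usd / n_correct


def pct_change(baseline: float, challenger: float) -> float:
    """Relative change from baseline to challenger, in percent."""
    return (challenger - baseline) / baseline * 100.0


# Displayed (rounded) $/correct values copied from two rows of the table.
rows = {
    "distractor:2k": (0.0358, 0.0517),
    "distractor:32k": (0.5501, 0.7974),
}

for name, (base, chal) in rows.items():
    print(f"{name}: {pct_change(base, chal):+.1f}%")
```

When accuracy is identical between the two models, as it is in three of the four subgroups, the change in $/correct is simply the change in total cost.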
Recommendation. Do NOT migrate short-prompt workloads to opus-4-7 on price-parity assumptions. The quality lift on this suite does not pay for tokenizer inflation.
Action items.
Findings replicate on the committed outcomes file (n=32, paired, McNemar's exact test). Significance threshold α=0.05. Cost figures use list pricing; apply your enterprise multiplier for contracted rates.
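For readers unfamiliar with the test, here is a minimal sketch of a two-sided exact McNemar test using only the Python standard library. It is an illustration, not the repo's code: `mcnemar_exact_p` is a hypothetical name, and the discordant counts in the example are assumed, since the per-case split is not shown here. The test looks only at the pairs where the two models disagree.

```python
from math import comb

ALPHA = 0.05  # significance threshold stated in the text


def mcnemar_exact_p(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from discordant pair counts.

    b: cases only the baseline answered correctly
    c: cases only the challenger answered correctly
    Under the null hypothesis, min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no disagreements, so no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Assumed counts for illustration: one case only the challenger got right.
p = mcnemar_exact_p(0, 1)
print(f"p = {p:.3f}, significant at alpha={ALPHA}: {p < ALPHA}")
```

With so few disagreements between the models, an exact test of this kind cannot reach significance, which is why a small accuracy difference on 32 cases should be read with caution.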
Runs offline, no API key required. The recorded outcomes are committed to the repo.
rift demo # or: python benchmarks/run_context_rot.py --mode record
Sources:
benchmarks/context_rot_outcomes.yaml, benchmarks/context_rot_opus47_analysis.md