Rift demo · cost-per-correct +39.7% · replay · n=32

Rift demo — Opus 4.6 → 4.7

A guided walkthrough of one real upgrade. Live in your terminal.

What we tested

Paired comparison of opus-4-6 (baseline) and opus-4-7 (challenger) on the context_rot_reasoning__context_rot suite (32 cases, McNemar's exact test). Results are replayed from committed outcomes for reproducibility.

Accuracy delta: +3.12pp
$/correct delta: +39.7%
Input tokens: 1.45×

Act 1 — Quality (what a casual eval sees)

Accuracy is the number every benchmark leaderboard reports. By this measure, the upgrade looks like a win.

Accuracy: opus-4-6 0.844, opus-4-7 0.875

Headline reading

Accuracy +3.12pp (McNemar's exact p = 1.000, not significant at α=0.05). 95% CI on the accuracy delta: [-0.156, +0.219] as a proportion, i.e. -15.6pp to +21.9pp.
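As a rough check on the headline reading, the sketch below re-derives the accuracy delta and an exact McNemar p-value. The correct counts (27 and 28 of 32) follow from the reported accuracies; the discordant-pair split is not shown in this report, so the `b=0, c=1` used here is an assumption (the smallest split consistent with a net gain of one case). The function name `mcnemar_exact` is illustrative, not part of Rift.

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant pair counts.

    b: cases the baseline got right and the challenger got wrong
    c: cases the baseline got wrong and the challenger got right
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    # Under H0 each discordant pair is a fair coin flip: X ~ Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Accuracies 0.844 and 0.875 on 32 cases correspond to 27 and 28 correct.
baseline_correct, challenger_correct, n_cases = 27, 28, 32
delta_pp = 100 * (challenger_correct - baseline_correct) / n_cases

# ASSUMPTION: the report does not list discordant pairs; b=0, c=1 is the
# smallest split consistent with the challenger being one case ahead.
p_value = mcnemar_exact(b=0, c=1)
print(f"accuracy delta = {delta_pp:+.2f}pp, p = {p_value:.3f}")
```

With so few discordant pairs, the test cannot distinguish the two models, which is why a one-case lead reads as not significant.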

Act 2 — Cost (what Rift sees)

Same prompts. Statistically indistinguishable accuracy. But the bill tells a different story.

$/correct: opus-4-6 $0.1882, opus-4-7 $0.2630

The twist

$/correct rose 39.7% ($0.1882 → $0.2630). For byte-identical prompts, the challenger consumes 1.45× as many input tokens (337,920 → 489,984). At list-price parity, this is a silent per-prompt cost increase on migration.
95% CI on Δ $/correct: [+0.0225, +0.1437] (paired bootstrap, n=32)
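A minimal sketch of how a paired percentile bootstrap on Δ $/correct can be computed, assuming one row per case with each model's cost and correctness. The row layout, the function names, and the synthetic data are all hypothetical; this is not Rift's implementation and it does not use the committed outcomes.

```python
import random

# Each row is one case: (baseline_cost, baseline_ok, challenger_cost, challenger_ok)

def cost_per_correct(costs, oks):
    """Total spend divided by the number of correct answers."""
    return sum(costs) / sum(oks)

def delta_cost_per_correct(rows):
    base = cost_per_correct([r[0] for r in rows], [r[1] for r in rows])
    chal = cost_per_correct([r[2] for r in rows], [r[3] for r in rows])
    return chal - base

def paired_bootstrap_ci(rows, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI: resample whole cases so both models stay paired."""
    rng = random.Random(seed)
    deltas = []
    for _ in range(n_boot):
        sample = [rng.choice(rows) for _ in rows]
        try:
            deltas.append(delta_cost_per_correct(sample))
        except ZeroDivisionError:
            continue  # a resample with zero correct answers has no $/correct
    deltas.sort()
    lo = deltas[int((alpha / 2) * len(deltas))]
    hi = deltas[int((1 - alpha / 2) * len(deltas)) - 1]
    return lo, hi

# HYPOTHETICAL data: 32 cases, challenger costs 1.45x baseline, 28 correct each.
rows = [
    (0.05 + 0.01 * i, int(i % 8 != 0), 1.45 * (0.05 + 0.01 * i), int(i % 8 != 0))
    for i in range(32)
]
point = delta_cost_per_correct(rows)
lo, hi = paired_bootstrap_ci(rows)
print(f"delta $/correct = {point:+.4f}, 95% CI [{lo:+.4f}, {hi:+.4f}]")
```

Resampling cases rather than individual model outcomes keeps each prompt's two results together, which is what makes the interval paired. An interval that excludes zero, as the reported one does, indicates the cost increase is unlikely to be resampling noise.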

Act 3 — Where the cost concentrates

By context length — does the inflation hit you everywhere, or only on long prompts?

[Chart: $/correct by distractor length, opus-4-6 vs opus-4-7]

Subgroup        n  Baseline acc  Challenger acc  Δ acc   Baseline $/correct  Challenger $/correct  Δ %
distractor:0k   8  0.875         0.875           +0.000  $0.0015             $0.0020               +30.6%
distractor:2k   8  0.875         0.875           +0.000  $0.0358             $0.0517               +44.4%
distractor:8k   8  0.750         0.875           +0.125  $0.1618             $0.2008               +24.2%
distractor:32k  8  0.875         0.875           +0.000  $0.5501             $0.7974               +45.0%
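To see where the increase lands in dollars as well as in percent, the sketch below recomputes the per-subgroup change from the $/correct values in the table above. Because it works from the rounded display values, the recomputed percentages can differ slightly from the reported Δ % (most visibly in the 0k bucket, where the costs are tiny); presumably the report computes from unrounded costs, which is an assumption here.

```python
# (baseline $/correct, challenger $/correct, reported delta %) from the table above
subgroups = {
    "distractor:0k": (0.0015, 0.0020, 30.6),
    "distractor:2k": (0.0358, 0.0517, 44.4),
    "distractor:8k": (0.1618, 0.2008, 24.2),
    "distractor:32k": (0.5501, 0.7974, 45.0),
}

def pct_change(baseline: float, challenger: float) -> float:
    """Relative change of challenger over baseline, in percent."""
    return 100 * (challenger / baseline - 1)

abs_increase = {}
for name, (base, chal, reported) in subgroups.items():
    abs_increase[name] = chal - base
    print(
        f"{name:<15} recomputed {pct_change(base, chal):+.1f}% "
        f"(reported {reported:+.1f}%), absolute +${chal - base:.4f}"
    )

# The subgroup with the largest absolute rise in $/correct.
largest = max(abs_increase, key=abs_increase.get)
print("largest absolute increase:", largest)
```

The relative increase is broadly similar across buckets, while the absolute increase per correct answer is dominated by the longest contexts.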

Verdict — what to do Monday

Headline

Accuracy ticked up (+3.12pp, not significant at α=0.05), but $/correct rose 39.7%.

Recommendation. Do NOT migrate short-prompt workloads to opus-4-7 on price-parity assumptions. The quality lift on this suite does not pay for tokenizer inflation.

Action items.

Findings replicate on the committed outcomes file (n=32, paired, McNemar's exact). Significance threshold α=0.05; cost figures use list pricing — apply your enterprise multiplier for contracted rates.
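A small sketch of applying a contracted-rate multiplier to the list-price figures. The multiplier value is hypothetical, and the sketch assumes the same multiplier applies to both models; under that assumption the absolute $/correct values scale, but the relative increase does not change.

```python
LIST_BASELINE = 0.1882    # $/correct for opus-4-6 at list price (from this report)
LIST_CHALLENGER = 0.2630  # $/correct for opus-4-7 at list price (from this report)

def relative_increase(baseline: float, challenger: float) -> float:
    """Fractional increase of challenger over baseline."""
    return challenger / baseline - 1

# HYPOTHETICAL: a 20% enterprise discount applied uniformly to both models.
multiplier = 0.80

contract_baseline = LIST_BASELINE * multiplier
contract_challenger = LIST_CHALLENGER * multiplier

rel_list = relative_increase(LIST_BASELINE, LIST_CHALLENGER)
rel_contract = relative_increase(contract_baseline, contract_challenger)

print(f"contract $/correct: {contract_baseline:.4f} -> {contract_challenger:.4f}")
print(f"relative increase at list: {rel_list:.1%}, at contract: {rel_contract:.1%}")
```

If the two models carry different contracted multipliers, the relative increase would need to be recomputed from the adjusted figures.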

Reproduce this

Runs offline, no API key required. The recorded outcomes are committed to the repo.

rift demo  # or: python benchmarks/run_context_rot.py --mode record

Sources: benchmarks/context_rot_outcomes.yaml, benchmarks/context_rot_opus47_analysis.md