Sonnet is the only model that flags Akamai correctly.
Keep Sonnet for scan_synthesis and intent_filter. Swap the rest to Grok 3 Mini.
Cheaper models on the two judgement-heavy prompts return overconfident, production-unsafe advice (httpx with no proxy at 0.90–0.97 confidence) on a site sitting behind Akamai Bot Manager. On the two extraction-style prompts (notes, difficulty_drivers) every model produces comparable output, so the cheapest competent option wins.
One scan, replayed seven ways.
The T48 eval matrix replays a real production capture against alternate providers. Each cell is one prompt, one model, one full call against the same packed input. Production rows (Claude Sonnet for synthesis + intent filter, Haiku for everything else) are the baseline; OpenAI and xAI rows are the challengers.
| Prompt | Production | OpenAI runs | xAI runs | Challenger cells |
|---|---|---|---|---|
| scan_synthesis | claude-sonnet-4-6 | gpt-5.4, mini, nano | grok-4.3, grok-3-mini | 5 |
| intent_filter | claude-sonnet-4-6 | gpt-5.4, mini, nano | grok-4.3, grok-3-mini | 5 |
| notes | claude-haiku-4-5 | gpt-5.4, mini, nano | grok-4.3, grok-3-mini | 5 |
| difficulty_drivers | claude-haiku-4-5 | gpt-5.4, mini, nano | grok-4.3, grok-3-mini | 5 |
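As a rough illustration, the matrix can be enumerated as the cross product of prompts and models. This is a minimal sketch, not the eval harness itself: the `Cell` type and `build_matrix` helper are hypothetical, and the short names `mini` and `nano` from the table are expanded to `gpt-5.4-mini` and `gpt-5.4-nano` as they appear elsewhere in this report.

```python
from dataclasses import dataclass
from itertools import product

# Production baseline per prompt, taken from the table above.
PRODUCTION = {
    "scan_synthesis": "claude-sonnet-4-6",
    "intent_filter": "claude-sonnet-4-6",
    "notes": "claude-haiku-4-5",
    "difficulty_drivers": "claude-haiku-4-5",
}
CHALLENGERS = ["gpt-5.4", "gpt-5.4-mini", "gpt-5.4-nano", "grok-4.3", "grok-3-mini"]


@dataclass(frozen=True)
class Cell:
    """One prompt, one model, one full call (hypothetical container)."""
    prompt: str
    model: str
    is_baseline: bool


def build_matrix() -> list[Cell]:
    # One baseline cell per prompt, then every prompt against every challenger.
    cells = [Cell(p, m, True) for p, m in PRODUCTION.items()]
    cells += [Cell(p, m, False) for p, m in product(PRODUCTION, CHALLENGERS)]
    return cells
```

Under these assumptions each prompt gets one baseline cell and five challenger cells, all run against the same packed input.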
The Akamai blind spot.
Sonnet correctly reads the inventory — _abck, bm_sz, ak_bmsc cookies, plus a getReviews endpoint that explicitly preferred curl_cffi over httpx in Bucket A validation. Every cheaper model anchors on the surface fact that Bucket A passed with httpx and stops there.
| Model | Library | Impersonation | Proxy | Confidence | Cost | Verdict |
|---|---|---|---|---|---|---|
| claude-sonnet-4-6 | curl_cffi | chrome120 | datacenter | 0.82 | $0.3092 | Production-safe |
| gpt-5.4 | httpx | — | none | 0.93 | $0.1912 | Overconfident |
| gpt-5.4-mini | httpx | — | none | 0.97 | $0.0576 | Overconfident |
| gpt-5.4-nano | httpx | — | none | 0.93 | $0.0159 | Overconfident |
| grok-4.3 | httpx | — | none | 0.90 | $0.0800 | Overconfident |
| grok-3-mini | httpx | — | none | 0.90 | $0.0194 | Overconfident |
Sonnet caught the getReviews tie-break that every other model missed.
Sonnet · production
Validation flagged one Bucket A endpoint (getReviews) where curl_cffi beat httpx head-to-head. Sonnet promoted this into the global recommendation. The five cheaper models all noticed Akamai cookies in the inventory but reasoned away from them: "Bucket A passed with httpx, therefore the site is fine."
"Akamai Bot Manager is the primary protection on staples.com, with _abck and bm_sz cookies present across all endpoints. curl_cffi with Chrome impersonation is recommended because one key data endpoint (getReviews) preferred it, and Akamai's TLS fingerprinting can block plain httpx at scale; estimated cost is $0.05–$0.15 per 1,000 requests using datacenter proxies." — claude-sonnet-4-6, verdict.verdict
The cheap-tier failure mode is identical across providers.
5 / 5 models. Every non-Sonnet model converges on the same wrong answer at high confidence. Different prose, same recommendation.
gpt-5.4 (0.93): "Bucket B prerequisite probes also succeeded with httpx and no blocking."
gpt-5.4-mini (0.97): "the validated data endpoints themselves did not require a proxy or browser-only tooling."
gpt-5.4-nano (0.93): "While Akamai cookies are present in the environment, the tested data endpoints indicate cookies are required=false in validation, so proxies are not needed to start."
grok-3-mini (0.90): "Httpx was the best-performing library across all validated Bucket A endpoints, successfully retrieving data without blocks."
grok-4.3 (0.90): "Akamai signals are present but do not trigger blocking under plain httpx with the captured baseline headers."
This is the prompt where judgement under uncertainty matters. Saving $0.25/scan by swapping in gpt-5.4-mini here means shipping a scraper that breaks the first time Akamai tightens its challenge, after the model told you with 97% confidence it wouldn't.
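The failure pattern in the table is mechanical enough to check for. The sketch below is a hypothetical guard, not part of the pipeline: the function name, argument shapes, and the 0.9 threshold are all assumptions chosen to match the rows above.

```python
# Akamai Bot Manager cookies named in the captured inventory.
AKAMAI_COOKIES = {"_abck", "bm_sz", "ak_bmsc"}


def is_overconfident(cookies, library, proxy, confidence, threshold=0.9):
    """Flag a plain-httpx, no-proxy recommendation made at high confidence
    even though Akamai cookies are present in the capture.

    The threshold is an assumption, not a tuned value.
    """
    akamai_seen = bool(AKAMAI_COOKIES & set(cookies))
    unprotected_stack = library == "httpx" and proxy in (None, "none")
    return akamai_seen and unprotected_stack and confidence >= threshold
```

A check like this would flag all five cheaper rows and pass the Sonnet row, which recommends curl_cffi with a datacenter proxy at 0.82.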
Cheaper models over-prune Bucket B.
The intent filter sorts endpoints into A (scrape directly), B (prerequisite — token, config, session) and C (analytics noise). Wrongly dropping a B endpoint into C breaks the scrape silently downstream. Sonnet keeps 14 endpoints in B; cheaper models collapse it as low as 1.
Raw bucket assignments — all 6 models (38 endpoints)
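Over-pruning can be measured by diffing a candidate's bucket assignments against the baseline. This is a minimal sketch under assumed data shapes: each model's output is taken to be a mapping from endpoint to bucket letter, and the endpoint paths in the usage example are made up for illustration.

```python
def over_pruned(baseline: dict[str, str], candidate: dict[str, str]) -> list[str]:
    """Endpoints the baseline keeps in B that the candidate moves to C."""
    return sorted(
        endpoint
        for endpoint, bucket in baseline.items()
        if bucket == "B" and candidate.get(endpoint) == "C"
    )


# Hypothetical assignments, not taken from the real capture.
baseline = {"/api/token": "B", "/api/config": "B", "/collect": "C", "/api/items": "A"}
candidate = {"/api/token": "C", "/api/config": "B", "/collect": "C", "/api/items": "A"}
dropped = over_pruned(baseline, candidate)  # the prerequisite that would silently go missing
```

Any non-empty result is a prerequisite the scrape would lose without an error at classification time.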
No meaningful spread.
The notes and difficulty_drivers prompts both ask the model to summarize structured signals from the capture: what's already there, restated. Every model produces comparable output; quality variance is well inside what re-running Sonnet would produce. Pick the cheapest competent option.
| Model | notes | difficulty_drivers | Combined | Quality |
|---|---|---|---|---|
| claude-haiku-4-5 (prod) | $0.0177 | $0.0156 | $0.0333 | Baseline |
| gpt-5.4 | $0.0403 | $0.0396 | $0.0799 | Comparable |
| gpt-5.4-mini | $0.0111 | $0.0108 | $0.0219 | Comparable |
| gpt-5.4-nano | $0.0030 | $0.0036 | $0.0067 | Acceptable |
| grok-4.3 | $0.0167 | $0.0138 | $0.0305 | Comparable |
| grok-3-mini (recommended) | $0.0039 | $0.0039 | $0.0078 | Comparable |
Why grok-3-mini wins on these two.
76% cheaper than Haiku
At $0.30 / $0.50 per 1M tokens (input / output), Grok 3 Mini sits below Haiku's effective rate after caching, with a 131K context window and native structured-output support. Output quality on these two prompts is indistinguishable from Sonnet/Haiku; they restate captured signals rather than synthesize, so reasoning depth doesn't pay off.
gpt-5.4-nano is technically cheaper (~$0.007 combined vs. $0.008) but had a higher error rate during this run: one nano call errored on difficulty_drivers before completing on retry. Grok-3-mini completed both extraction prompts on the first attempt.
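The saving relative to Haiku follows directly from the per-call figures in the table. A small sketch of the arithmetic, with hypothetical helper names:

```python
# (notes, difficulty_drivers) cost in USD per call, from the table above.
COSTS = {
    "claude-haiku-4-5": (0.0177, 0.0156),
    "grok-3-mini": (0.0039, 0.0039),
}


def combined(model: str) -> float:
    """Cost of running both extraction prompts once on the given model."""
    return sum(COSTS[model])


def saving_vs(model: str, baseline: str = "claude-haiku-4-5") -> float:
    """Fractional saving of `model` relative to `baseline` on the combined cost."""
    return 1 - combined(model) / combined(baseline)


grok_saving = saving_vs("grok-3-mini")  # roughly 0.77 on these rounded inputs
```

The figure is computed from the rounded table values, so it can differ in the last digit from one derived from unrounded billing data.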
$0.353 per scan, end-to-end.
Per-prompt model selection is unchanged from T45: set BROWSER_RECON_LLM_MODEL_NOTES=grok-3-mini and BROWSER_RECON_LLM_MODEL_DIFFICULTY_DRIVERS=grok-3-mini. Everything else stays.
| Prompt | Current model | Current $ | Proposed model | Proposed $ | Δ |
|---|---|---|---|---|---|
| scan_synthesis | claude-sonnet-4-6 | $0.3092 | claude-sonnet-4-6 | $0.3092 | — |
| intent_filter | claude-sonnet-4-6 | $0.0294 | claude-sonnet-4-6 | $0.0294 | — |
| notes | claude-haiku-4-5 | $0.0177 | grok-3-mini | $0.0039 | −$0.0138 |
| difficulty_drivers | claude-haiku-4-5 | $0.0156 | grok-3-mini | $0.0039 | −$0.0117 |
| flow_confirm | claude-haiku-4-5 | $0.0050 | claude-haiku-4-5 | $0.0050 | — |
| intent_clarifier | claude-haiku-4-5 | $0.0014 | claude-haiku-4-5 | $0.0014 | — |
| Total | | $0.3783 | | $0.3528 | −$0.0255 |
A cheaper model on scan_synthesis buys a recommendation that's wrong on the only protection signal that matters here. Cost optimization stops at the judgement boundary.
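The per-scan totals can be recomputed from the per-prompt rows as a sanity check. The sketch below simply sums the table's figures; the dictionary and function names are illustrative.

```python
# Per-prompt cost in USD per scan under the current configuration.
CURRENT = {
    "scan_synthesis": 0.3092,
    "intent_filter": 0.0294,
    "notes": 0.0177,
    "difficulty_drivers": 0.0156,
    "flow_confirm": 0.0050,
    "intent_clarifier": 0.0014,
}
# Proposed configuration: only the two extraction prompts change.
PROPOSED = {**CURRENT, "notes": 0.0039, "difficulty_drivers": 0.0039}


def total(costs: dict[str, float]) -> float:
    """Sum per-prompt costs, rounded to the table's four decimal places."""
    return round(sum(costs.values()), 4)


delta = round(total(PROPOSED) - total(CURRENT), 4)
```

Summing the rows rather than copying a total keeps the headline number honest when a single row changes.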
Grok is slower; nobody cares.
Scans are an asynchronous background job. The 30–40s p50 on xAI is not user-visible. Documented here for completeness only.
| Model | scan_synthesis | intent_filter | notes | difficulty_drivers |
|---|---|---|---|---|
| claude-sonnet-4-6 / haiku-4-5 | 43.3 s | 17.0 s | 6.4 s | 9.5 s |
| gpt-5.4 | 26.9 s | 12.0 s | 10.1 s | 19.6 s |
| gpt-5.4-mini | 20.2 s | 7.7 s | 3.4 s | 5.2 s |
| gpt-5.4-nano | 21.7 s | 7.6 s | 4.2 s | 10.8 s |
| grok-4.3 | 24.7 s | 27.3 s | 12.2 s | 15.2 s |
| grok-3-mini | 37.4 s | 31.0 s | 10.4 s | 21.9 s |
Three actions, in order.
1. Set BROWSER_RECON_LLM_MODEL_NOTES=grok-3-mini and BROWSER_RECON_LLM_MODEL_DIFFICULTY_DRIVERS=grok-3-mini on Render. No code change, no migration. Revert is one env-var edit.
2. If cheaper models get scan_synthesis right on the unprotected site, consider a per-difficulty router, but only after seeing the data.
3. A replay eval over stored llm_calls rows would catch a regression next time we touch prompt wording or model selection.
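For readers unfamiliar with the override mechanism, per-prompt selection via environment variables might look roughly like the following. This is a sketch under assumptions: only the NOTES and DIFFICULTY_DRIVERS variable names appear in this report, the same naming pattern is assumed for the other prompts, and `model_for` is a hypothetical helper rather than the service's actual lookup.

```python
import os

ENV_PREFIX = "BROWSER_RECON_LLM_MODEL_"

# Current production defaults, from the cost table in this report.
DEFAULTS = {
    "scan_synthesis": "claude-sonnet-4-6",
    "intent_filter": "claude-sonnet-4-6",
    "notes": "claude-haiku-4-5",
    "difficulty_drivers": "claude-haiku-4-5",
    "flow_confirm": "claude-haiku-4-5",
    "intent_clarifier": "claude-haiku-4-5",
}


def model_for(prompt: str, env=None) -> str:
    """Return the env-var override for a prompt, else its default model."""
    env = os.environ if env is None else env
    return env.get(ENV_PREFIX + prompt.upper(), DEFAULTS[prompt])


# With the two overrides set, only the extraction prompts change model.
overrides = {
    "BROWSER_RECON_LLM_MODEL_NOTES": "grok-3-mini",
    "BROWSER_RECON_LLM_MODEL_DIFFICULTY_DRIVERS": "grok-3-mini",
}
chosen = {prompt: model_for(prompt, overrides) for prompt in DEFAULTS}
```

Because unset variables fall back to the defaults, removing the two overrides restores the current behaviour, which is what makes the revert a single env-var edit per prompt.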