Eval Report · T48 Matrix · Scan e5ce0769

Sonnet is the only model that flags Akamai correctly.

Seven models, four prompts, twenty-seven complete evals on a single Staples scan. The cheap tier shipped a confidently-wrong scraping strategy. Here is what to keep, what to swap, and what it costs.
Scan: e5ce0769
Domain: staples.com
Models: 7
Prompts: 4 evaluated
Date: 2026-05-12

Keep Sonnet for scan_synthesis and intent_filter. Swap notes and difficulty_drivers to Grok 3 Mini.

Cheaper models on the two judgement-heavy prompts return overconfident, production-unsafe advice (httpx with no proxy, at 0.90–0.97 confidence) on a site sitting behind Akamai Bot Manager. On the two extraction-style prompts (notes, difficulty_drivers) every model produces comparable output, so the cheapest competent option wins.

Cost per scan, proposed config: $0.353
Cost per scan today, all-Claude: $0.378
Savings vs status quo: −6.7%
Cheap models that caught Akamai: 0 / 5
01 / Methodology

One scan, replayed seven ways.

The T48 eval matrix replays a real production capture against alternate providers. Each cell is one prompt, one model, one full call against the same packed input. Production rows (Claude Sonnet for synthesis + intent filter, Haiku for everything else) are the baseline; OpenAI and xAI rows are the challengers.

[Figure: staples.com scan comparison matrix. Packed scan input (capture, buckets, dump) → provider router (model prefix → SDK) → anthropic SDK (claude-sonnet-4-6, claude-haiku-4-5), openai SDK against openai.com (gpt-5.4, gpt-5.4-mini, gpt-5.4-nano), openai SDK against api.x.ai (grok-4.3, grok-3-mini) → llm_evals (jsonb, append-only history).]
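The prefix-based routing described above could look roughly like the following. This is a minimal sketch, not the production router: the `Route` type, the `ROUTES` table and `route_for` are hypothetical names, and the xAI base URL is an assumption inferred from the "api.x.ai" label.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Route:
    sdk: str                  # client library to use for this provider
    base_url: Optional[str]   # None means the SDK's default endpoint


# Hypothetical prefix table. The prefixes mirror the model names in this
# report; the xAI base URL is an assumption, not taken from the codebase.
ROUTES = (
    ("claude-", Route(sdk="anthropic", base_url=None)),
    ("gpt-", Route(sdk="openai", base_url=None)),
    ("grok-", Route(sdk="openai", base_url="https://api.x.ai/v1")),
)


def route_for(model: str) -> Route:
    """Pick the SDK and endpoint for a model name by its prefix."""
    for prefix, route in ROUTES:
        if model.startswith(prefix):
            return route
    raise ValueError(f"no provider route for model {model!r}")


for name in ("claude-sonnet-4-6", "gpt-5.4-nano", "grok-3-mini"):
    print(name, "->", route_for(name))
```

Raising on an unknown prefix, rather than falling back to a default provider, keeps a typo in a model name from silently running the eval against the wrong vendor.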
Run summary — all complete rows for scan e5ce0769
Prompt Production OpenAI runs xAI runs Total cells
scan_synthesis claude-sonnet-4-6 gpt-5.4, mini, nano grok-4.3, grok-3-mini 5
intent_filter claude-sonnet-4-6 gpt-5.4, mini, nano grok-4.3, grok-3-mini 5
notes claude-haiku-4-5 gpt-5.4, mini, nano grok-4.3, grok-3-mini 5
difficulty_drivers claude-haiku-4-5 gpt-5.4, mini, nano grok-4.3, grok-3-mini 5
02 / Headline finding · scan_synthesis

The Akamai blind spot.

Sonnet correctly reads the inventory — _abck, bm_sz, ak_bmsc cookies, plus a getReviews endpoint that explicitly preferred curl_cffi over httpx in Bucket A validation. Every cheaper model anchors on the surface fact that Bucket A passed with httpx and stops there.

Model Library Impersonation Proxy Confidence Cost Verdict
claude-sonnet-4-6 curl_cffi chrome120 datacenter 0.82 $0.3092 Production-safe
gpt-5.4 httpx none 0.93 $0.1912 Overconfident
gpt-5.4-mini httpx none 0.97 $0.0576 Overconfident
gpt-5.4-nano httpx none 0.93 $0.0159 Overconfident
grok-4.3 httpx none 0.90 $0.0800 Overconfident
grok-3-mini httpx none 0.90 $0.0194 Overconfident

Sonnet caught the getReviews tie-break that every other model missed.

Sonnet · production

Validation flagged one Bucket A endpoint (getReviews) where curl_cffi beat httpx head-to-head. Sonnet promoted this into the global recommendation. The five cheaper models all noticed Akamai cookies in the inventory but reasoned away from them: "Bucket A passed with httpx, therefore the site is fine."

"Akamai Bot Manager is the primary protection on staples.com, with _abck and bm_sz cookies present across all endpoints. curl_cffi with Chrome impersonation is recommended because one key data endpoint (getReviews) preferred it, and Akamai's TLS fingerprinting can block plain httpx at scale; estimated cost is $0.05–$0.15 per 1,000 requests using datacenter proxies." — claude-sonnet-4-6, verdict.verdict

The cheap-tier failure mode is identical across providers.

5 / 5 models

Every non-Sonnet model converges on the same wrong answer at high confidence. Different prose, same recommendation.

gpt-5.4 (0.93): "Bucket B prerequisite probes also succeeded with httpx and no blocking."
gpt-5.4-mini (0.97): "the validated data endpoints themselves did not require a proxy or browser-only tooling."
gpt-5.4-nano (0.93): "While Akamai cookies are present in the environment, the tested data endpoints indicate cookies are required=false in validation, so proxies are not needed to start."
grok-3-mini (0.90): "Httpx was the best-performing library across all validated Bucket A endpoints, successfully retrieving data without blocks."
grok-4.3 (0.90): "Akamai signals are present but do not trigger blocking under plain httpx with the captured baseline headers."

This is the prompt where judgement under uncertainty matters. Saving $0.25/scan by swapping in gpt-5.4-mini here ($0.3092 down to $0.0576) means shipping a scraper that breaks the first time Akamai tightens its challenge, and the model told you with 0.97 confidence it wouldn't.
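The two signals discussed in this section (Akamai cookies in the inventory, and an endpoint where curl_cffi beat httpx) are mechanical enough to check in code. The sketch below is an illustrative heuristic only, not Sonnet's reasoning and not the production prompt logic; `recommend_client` and its parameters are hypothetical.

```python
# Cookie names cited in this report as Akamai Bot Manager signals.
AKAMAI_COOKIES = frozenset({"_abck", "bm_sz", "ak_bmsc"})


def recommend_client(cookie_names, curl_cffi_preferred_endpoints):
    """Return (library, reasons) from two inventory signals.

    Illustrative rule: any protection signal outweighs the surface fact
    that Bucket A endpoints passed with plain httpx.
    """
    reasons = []
    seen = sorted(AKAMAI_COOKIES & set(cookie_names))
    if seen:
        reasons.append("Akamai cookies present: " + ", ".join(seen))
    if curl_cffi_preferred_endpoints:
        preferred = ", ".join(sorted(curl_cffi_preferred_endpoints))
        reasons.append("curl_cffi preferred on: " + preferred)
    library = "curl_cffi" if reasons else "httpx"
    return library, reasons


library, reasons = recommend_client(
    cookie_names=["_abck", "bm_sz", "ak_bmsc", "session_id"],
    curl_cffi_preferred_endpoints=["getReviews"],
)
print(library)
for reason in reasons:
    print(" -", reason)
```

A rule like this cannot replace the synthesis prompt, but it is a cheap cross-check: when it says curl_cffi and a model says httpx at high confidence, the disagreement is worth a look.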

03 / Bucket triage · intent_filter

Cheaper models over-prune Bucket B.

The intent filter sorts endpoints into A (scrape directly), B (prerequisite — token, config, session) and C (analytics noise). Wrongly dropping a B endpoint into C breaks the scrape silently downstream. Sonnet keeps 14 endpoints in B; cheaper models collapse it as low as 1.

Model Bucket B endpoints
claude-sonnet-4-6 (production) 14 (full prerequisite set retained)
grok-3-mini 6
grok-4.3 3
gpt-5.4 2
gpt-5.4-mini 7
gpt-5.4-nano 1
Silent breakage risk: no cheaper model keeps both prerequisite pairs that Sonnet retains in Bucket B, namely ep_021 / ep_022 (checkout / payment gateway session init) and ep_004 / ep_014 (storeLocator, needed for pricing & availability). gpt-5.4-mini keeps only ep_021 / ep_022, grok-3-mini keeps only ep_004 / ep_014, and gpt-5.4, gpt-5.4-nano and grok-4.3 drop both pairs.
Aggressive demotion: gpt-5.4-nano collapses Bucket B to one endpoint (ep_011 getMenuContent) and routes 29 of 38 to Bucket C, half of which are real first-party endpoints, not analytics noise.
A/C confusion: grok-3-mini misclassifies ep_001 / ep_015 (atRecommendation / Criteo retargeting pixels) into Bucket A, sending the scraper after ad-tech URLs.
Raw bucket assignments, all 6 models (38 endpoints)
claude-sonnet-4-6 (production):
  A( 9): ep_012, ep_019, ep_020, ep_027, ep_028, ep_031, ep_032, ep_037, ep_038
  B(14): ep_003, ep_004, ep_005, ep_011, ep_013, ep_014, ep_021, ep_022, ep_025, ep_026, ep_029, ep_033, ep_034, ep_035
  C(15): ep_001, ep_002, ep_006-010, ep_015-018, ep_023, ep_024, ep_030, ep_036
gpt-5.4:
  A(11): ep_011, ep_012, ep_019, ep_020, ep_025, ep_027, ep_031, ep_032, ep_035, ep_037, ep_038
  B( 2): ep_005, ep_013
  C(25): ep_001-004, ep_006-010, ep_014-018, ep_021-024, ep_026, ep_028…
gpt-5.4-mini:
  A(12): ep_012, ep_019, ep_020, ep_025, ep_026, ep_027, ep_028, ep_031, ep_032, ep_035, ep_037, ep_038
  B( 7): ep_005, ep_013, ep_018, ep_021, ep_022, ep_029, ep_034
  C(19)
gpt-5.4-nano:
  A( 8): ep_012, ep_019, ep_020, ep_027, ep_031, ep_032, ep_037, ep_038
  B( 1): ep_011
  C(29)
grok-3-mini:
  A( 8): ep_001, ep_012, ep_015, ep_019, ep_020, ep_027, ep_037, ep_038
  B( 6): ep_004, ep_011, ep_014, ep_029, ep_030, ep_033
  C(24)
grok-4.3:
  A(13): ep_011, ep_012, ep_019, ep_020, ep_025, ep_026, ep_027, ep_028, ep_031, ep_032, ep_035, ep_037, ep_038
  B( 3): ep_029, ep_033, ep_034
  C(22)
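The Bucket B comparison can be reproduced from the raw assignments with plain set arithmetic. The sketch below copies the Bucket B lists from this section; the helper names (`ep`, `dropped_from_b`) are illustrative, and "dropped" here means only "not in B", whether the model moved the endpoint to A or to C.

```python
def ep(*nums):
    """Build a set of endpoint ids such as ep_004 from bare numbers."""
    return {f"ep_{n:03d}" for n in nums}


# Bucket B membership per model, copied from the raw assignments.
BUCKET_B = {
    "claude-sonnet-4-6": ep(3, 4, 5, 11, 13, 14, 21, 22, 25, 26, 29, 33, 34, 35),
    "gpt-5.4": ep(5, 13),
    "gpt-5.4-mini": ep(5, 13, 18, 21, 22, 29, 34),
    "gpt-5.4-nano": ep(11),
    "grok-3-mini": ep(4, 11, 14, 29, 30, 33),
    "grok-4.3": ep(29, 33, 34),
}
BASELINE = "claude-sonnet-4-6"


def dropped_from_b(model):
    """Endpoints the baseline keeps in Bucket B that `model` does not."""
    return sorted(BUCKET_B[BASELINE] - BUCKET_B[model])


for model in BUCKET_B:
    if model == BASELINE:
        continue
    missing = dropped_from_b(model)
    print(f"{model}: drops {len(missing)} of {len(BUCKET_B[BASELINE])} -> {', '.join(missing)}")
```

Diffing against the production baseline treats Sonnet's assignment as ground truth, which is itself an assumption at N = 1; a hand-labelled prerequisite set would be a stronger reference.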
04 / Extraction prompts · notes & difficulty_drivers

No meaningful spread.

Both prompts ask the model to summarize structured signals from the capture — what's already there, restated. Every model produces comparable output; quality variance is well inside what re-running Sonnet would produce. Pick the cheapest competent option.

Cost per call — both extraction prompts, all models, complete rows only
Model notes difficulty_drivers Combined Quality
claude-haiku-4-5 (prod) $0.0177 $0.0156 $0.0333 Baseline
gpt-5.4 $0.0403 $0.0396 $0.0799 Comparable
gpt-5.4-mini $0.0111 $0.0108 $0.0219 Comparable
gpt-5.4-nano $0.0030 $0.0036 $0.0067 Acceptable
grok-4.3 $0.0167 $0.0138 $0.0305 Comparable
grok-3-mini (recommended) $0.0039 $0.0039 $0.0078 Comparable

Why grok-3-mini wins on these two.

76% cheaper than Haiku

At $0.30 / $0.50 per 1M input / output tokens, Grok 3 Mini sits below Haiku's effective rate after caching, with a 131K context window and native structured-output support. Output quality on these two prompts is indistinguishable from Sonnet/Haiku: they restate captured signals rather than synthesize, so reasoning depth doesn't pay off.

gpt-5.4-nano is technically cheaper (~$0.007 combined vs. $0.008) but had a higher error rate during this run: one nano call errored on difficulty_drivers before completing on retry. grok-3-mini completed both extraction prompts on the first attempt.

05 / Cost model · proposed config

$0.353 per scan, end-to-end.

The per-prompt model selection mechanism is unchanged from T45: set BROWSER_RECON_LLM_MODEL_NOTES=grok-3-mini and BROWSER_RECON_LLM_MODEL_DIFFICULTY_DRIVERS=grok-3-mini. Everything else stays.
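A per-prompt override of this kind is typically resolved with a small lookup. The following is a minimal sketch under stated assumptions: only the NOTES and DIFFICULTY_DRIVERS variable names appear in this report, so the naming pattern for other prompts, the `DEFAULT_MODELS` table and `model_for` are all hypothetical.

```python
import os

ENV_PREFIX = "BROWSER_RECON_LLM_MODEL_"

# Production defaults as described in this report.
DEFAULT_MODELS = {
    "scan_synthesis": "claude-sonnet-4-6",
    "intent_filter": "claude-sonnet-4-6",
    "notes": "claude-haiku-4-5",
    "difficulty_drivers": "claude-haiku-4-5",
    "flow_confirm": "claude-haiku-4-5",
    "intent_clarifier": "claude-haiku-4-5",
}


def model_for(prompt, env=None):
    """Return the env-var override for `prompt`, else the default model."""
    env = os.environ if env is None else env
    override = env.get(ENV_PREFIX + prompt.upper(), "").strip()
    return override or DEFAULT_MODELS[prompt]


proposed_env = {
    "BROWSER_RECON_LLM_MODEL_NOTES": "grok-3-mini",
    "BROWSER_RECON_LLM_MODEL_DIFFICULTY_DRIVERS": "grok-3-mini",
}
for prompt in DEFAULT_MODELS:
    print(f"{prompt}: {model_for(prompt, proposed_env)}")
```

Treating an empty or whitespace-only variable as "no override" means that clearing the value in the dashboard reverts to the default without deleting the key.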

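Prompt Current model Current $ Proposed model Proposed $ Δ
scan_synthesis claude-sonnet-4-6 $0.3092 claude-sonnet-4-6 $0.3092 $0.0000
intent_filter claude-sonnet-4-6 $0.0294 claude-sonnet-4-6 $0.0294 $0.0000
notes claude-haiku-4-5 $0.0177 grok-3-mini $0.0039 −$0.0138
difficulty_drivers claude-haiku-4-5 $0.0156 grok-3-mini $0.0039 −$0.0117
flow_confirm claude-haiku-4-5 $0.0050 claude-haiku-4-5 $0.0050 $0.0000
intent_clarifier claude-haiku-4-5 $0.0014 claude-haiku-4-5 $0.0014 $0.0000
Total $0.3783 $0.3528 −$0.0255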
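The totals can be rechecked from the per-call costs reported in this document: the four unchanged prompts, plus the notes and difficulty_drivers costs for Haiku and grok-3-mini from the extraction table. This is a sketch of the arithmetic only; the `COSTS` layout is illustrative, and `Decimal` is used so that four-decimal dollar amounts add up exactly.

```python
from decimal import Decimal

# prompt: (current $ per call, proposed $ per call), from this report.
COSTS = {
    "scan_synthesis": ("0.3092", "0.3092"),
    "intent_filter": ("0.0294", "0.0294"),
    "notes": ("0.0177", "0.0039"),               # Haiku -> grok-3-mini
    "difficulty_drivers": ("0.0156", "0.0039"),  # Haiku -> grok-3-mini
    "flow_confirm": ("0.0050", "0.0050"),
    "intent_clarifier": ("0.0014", "0.0014"),
}

current = sum(Decimal(cur) for cur, _ in COSTS.values())
proposed = sum(Decimal(new) for _, new in COSTS.values())
delta = proposed - current
saving_pct = (current - proposed) / current * 100

print(f"current   ${current}")
print(f"proposed  ${proposed}")
print(f"delta     ${delta}")
print(f"saving    {saving_pct:.1f}%")
```

Because only the two extraction prompts change, the per-scan saving can never exceed their combined Haiku cost of $0.0333, which is a quick sanity bound on any headline figure.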
Why not flip everything to Grok 3 Mini? Because the $0.29/scan you'd save on scan_synthesis buys a recommendation that's wrong on the only protection signal that matters here. Cost optimization stops at the judgement boundary.
Decision rule
Why not flip everything to GPT-5.4-nano? Nano is the cheapest non-Sonnet model overall, but it had the worst error rate in this run (3 errored cells before completing on retry) and the worst Bucket B triage (collapsed to 1 endpoint). Penny-wise, scrape-foolish.
Decision rule
Confidence band: One scan, one domain. Production-safe to roll out behind the existing per-prompt env-var override; revert is one redeploy. Worth A/B'ing on two more domains (one Cloudflare-protected, one unprotected) before locking in the default.
N = 1
06 / Latency footprint

Grok is slower; nobody cares.

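Scans are an asynchronous background job, so the extra latency is not user-visible. On the two prompts being swapped, grok-3-mini took 10.4 s and 21.9 s against Haiku's 6.4 s and 9.5 s; its 30–40 s timings are on scan_synthesis and intent_filter, which stay on Sonnet. These are single-run timings rather than percentiles, documented here for completeness only.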

Model scan_synthesis intent_filter notes difficulty_drivers
claude-sonnet-4-6 / haiku-4-5 43.3 s 17.0 s 6.4 s 9.5 s
gpt-5.4 26.9 s 12.0 s 10.1 s 19.6 s
gpt-5.4-mini 20.2 s 7.7 s 3.4 s 5.2 s
gpt-5.4-nano 21.7 s 7.6 s 4.2 s 10.8 s
grok-4.3 24.7 s 27.3 s 12.2 s 15.2 s
grok-3-mini 37.4 s 31.0 s 10.4 s 21.9 s
07 / What to do next

Three actions, in order.

1. Ship the env-var swap. Set BROWSER_RECON_LLM_MODEL_NOTES=grok-3-mini and BROWSER_RECON_LLM_MODEL_DIFFICULTY_DRIVERS=grok-3-mini on Render. No code change, no migration. Revert is one env-var edit.
15 min
2. Re-run evals on two more domains. One Cloudflare-protected (amazon.com and walmart.com are comparable large-retail targets, but confirm from the capture which vendor each actually uses before counting it as the Cloudflare case), one unprotected (a SaaS API or news site). If Sonnet's edge holds across all three, lock in the config. If a cheap model gets scan_synthesis right on the unprotected site, consider a per-difficulty router, but only after seeing the data.
2 scans · ~$1
3. Add an alert on `recommendation.confidence >= 0.90 AND protections detected`. The threshold is inclusive because both Grok models reported exactly 0.90 in this run. The cheap-tier failure mode is mechanical: high confidence, ignored protection signals. A single SQL check on llm_calls rows would catch a regression next time we touch prompt wording or model selection.
Defensive
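The alert in step 3 could be prototyped as a single query. The sketch below runs against an in-memory SQLite table seeded with the scan_synthesis results from this report. The schema is hypothetical (the real llm_calls layout is not shown here, and production presumably runs Postgres), and the threshold is written as >= 0.90 on the assumption that rows at exactly 0.90 should fire.

```python
import sqlite3

# Hypothetical, simplified schema; column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE llm_calls ("
    "model TEXT, prompt TEXT, library TEXT, "
    "confidence REAL, protections_detected INTEGER)"
)

# scan_synthesis rows from this report; Akamai was present for all of them.
rows = [
    ("claude-sonnet-4-6", "scan_synthesis", "curl_cffi", 0.82, 1),
    ("gpt-5.4", "scan_synthesis", "httpx", 0.93, 1),
    ("gpt-5.4-mini", "scan_synthesis", "httpx", 0.97, 1),
    ("gpt-5.4-nano", "scan_synthesis", "httpx", 0.93, 1),
    ("grok-4.3", "scan_synthesis", "httpx", 0.90, 1),
    ("grok-3-mini", "scan_synthesis", "httpx", 0.90, 1),
]
conn.executemany("INSERT INTO llm_calls VALUES (?, ?, ?, ?, ?)", rows)

ALERT_SQL = """
    SELECT model, confidence
    FROM llm_calls
    WHERE confidence >= 0.90
      AND protections_detected = 1
    ORDER BY model
"""

flagged = conn.execute(ALERT_SQL).fetchall()
for model, confidence in flagged:
    print(f"ALERT {model}: confidence {confidence:.2f} with protections detected")
```

On this data the query flags the five cheaper models and leaves the Sonnet row alone. A strict > 0.90 comparison would skip both Grok rows.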