Stop guessing what works. Measure it through the right proxy.
Validation currently takes ~150s per Bucket A endpoint. The fix is a two-axis library × proxy cascade with parallelism, bandwidth tracking, and a persisted JSON shape rich enough to drive a real cost projection rather than a guess.
Eight sub-tasks. One source of truth: the persisted validation_data blob.
Library and proxy decisions stop being inferred at synthesis time. Validation captures every library × proxy × header × scenario attempt, persists the full attempt log to a new append-only validation_runs table, and exposes two slim derived views — one for the LLM prompt, one for the report's new Validation section. The synthesis prompt's cost_band becomes a real number computed from observed bytes × the configured $/GB.
Three concrete failures in production.
The two Walmart scans on 2026-05-12 exposed every weakness in the current validation flow. The first scan (gpt-5.4 driving synthesis) shipped httpx + no proxy at 0.91 confidence against a site running PerimeterX, Akamai, Imperva and Cloudflare. The second (Sonnet driving synthesis) caught the protections — but its residential proxy recommendation was reasoned from priors, not measured. Validation should be doing this measurement.
| Claim in the report | What was actually measured | Gap |
|---|---|---|
| "Use curl_cffi + chrome120" | curl_cffi returned 200 from the developer's residential IP | Datacenter never tested |
| "Recommend residential proxy" | Nothing; no proxy was used in validation | Pure inference |
| "Cost $0.40–$1.50 per 1k requests" | Hand-typed default in the synthesis system prompt | Not derived from data |
| "min_required_headers = [User-Agent, Referer]" | 17 one-at-a-time removal requests over ~26s with curl_cffi only | Library-specific; serial |
| "Rate-limit safe at 0.75s" | 75 sequential requests at 5 graduated delays, single library | No proxy context |
Two-axis cascade: library × proxy.
Every sub-test walks the same fallback ladder: cheapest library at its cheapest working proxy first; escalate the proxy before escalating the library; promote to cloudscraper only if the first three libraries all fail with both proxies. Each sub-test caps at 6 attempts.
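The escalation order can be sketched as a small generator. This is a minimal illustration, assuming the ladder always starts at the cheapest library and the datacenter tier; `attempt_ladder` and its constants are hypothetical names, and the cloudscraper promotion is left out.

```python
from itertools import islice
from typing import Iterator

LIBRARIES = ["requests", "httpx", "curl_cffi"]  # cheapest first
PROXIES = ["datacenter", "residential"]         # cheapest first
MAX_ATTEMPTS = 6                                # per-sub-test cap


def attempt_ladder(
    libraries: list[str] = LIBRARIES,
    proxies: list[str] = PROXIES,
    cap: int = MAX_ATTEMPTS,
) -> Iterator[tuple[str, str]]:
    """Yield (library, proxy) pairs, escalating the proxy before the library."""
    pairs = ((lib, proxy) for lib in libraries for proxy in proxies)
    return islice(pairs, cap)


ladder = list(attempt_ladder())
```

Because the proxy loop is the inner loop, each library is tried on both tiers before the next library is touched, and the cap of 6 exactly covers 3 libraries × 2 proxies.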
Wall-clock budget per Bucket A endpoint
| Step | Today | After T51 | How |
|---|---|---|---|
| 1 · Library compare | ~6 s | ~3 s | 3 libs × 2 proxies all in one ThreadPoolExecutor pass |
| 2 · Header reduction | ~26 s | ~4 s | Tier-list skips 9 headers + parallel tests cap=5 |
| 3 · Cookie dependency | ~5 s | ~3 s | Parallel across 3 scenarios; cascade on failures only |
| 4 · Rate-limit probe | ~120 s worst | ~15 s worst | 3 rounds × 5 reqs (was 5 × 15) |
| Total | ~150 s | ~25 s | −83% |
Eight changes, sequenced.
Each sub-task is independently shippable behind the flag BROWSER_RECON_USE_T51_VALIDATION=1. T51.8 (compaction + report) is the final flip — until it lands the existing payload shape remains.
Proxy ladder + 3-sample residential
What changes
Read DATACENTER_PROXY and RESIDENTIAL_PROXY from env. Every HTTP call inside library_compare.py threads a proxy_url argument through. Per-library ladder: try datacenter once; on block/fail, fire 3 sequential requests through residential. Residential outcome:
- 3 / 3 pass → record working_tier = "residential", spot-check passed
- 2 / 3 pass → record working_tier = "residential", label as flaky
- 0–1 / 3 pass → library marked blocked under residential too
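A minimal sketch of the per-library ladder and the 3-sample classification, assuming a hypothetical `probe(library, proxy_url)` callable that returns True when the response is not a block. The `label` strings are illustrative, not the persisted field values.

```python
from typing import Callable

# Hypothetical probe: (library, proxy_url) -> True when the response is not a block.
Probe = Callable[[str, str], bool]
RESIDENTIAL_SAMPLES = 3


def run_proxy_ladder(library: str, probe: Probe,
                     datacenter_url: str, residential_url: str) -> dict:
    """Try datacenter once, then three sequential residential samples."""
    if probe(library, datacenter_url):
        return {"library": library, "working_tier": "datacenter", "label": None}

    # The generator is consumed in order, so the samples run sequentially.
    passes = sum(probe(library, residential_url) for _ in range(RESIDENTIAL_SAMPLES))
    if passes == RESIDENTIAL_SAMPLES:
        return {"library": library, "working_tier": "residential", "label": "spot-check passed"}
    if passes == RESIDENTIAL_SAMPLES - 1:
        return {"library": library, "working_tier": "residential", "label": "flaky"}
    return {"library": library, "working_tier": None, "label": "blocked"}
```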
When env vars are unset
Validation runs on the developer's direct IP — it's the user's CLI machine, not a server. The report's Validation section surfaces a No proxies configured banner and the synthesis system prompt is told proxy_status="direct_ip" so its recommendation includes a caveat.
Env vars to wire
DATACENTER_PROXY=http://user:pass@dc.example.com:8080
RESIDENTIAL_PROXY=http://user:pass@residential.example.com:8080
PROXY_DATACENTER_PRICE_USD_PER_GB=2.00
PROXY_RESIDENTIAL_PRICE_USD_PER_GB=3.00
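One way to read these variables and derive proxy_status is sketched below. `load_proxy_config` is a hypothetical helper; falling back to 2.00 and 3.00 when the price variables are unset is an assumption, and the status strings other than "direct_ip" are taken from the proxy_status field in the full validation_data shape.

```python
import os
from typing import Mapping


def load_proxy_config(env: Mapping[str, str] = os.environ) -> dict:
    """Read proxy URLs and per-GB prices; empty strings count as unset."""
    datacenter = env.get("DATACENTER_PROXY") or None
    residential = env.get("RESIDENTIAL_PROXY") or None

    if datacenter and residential:
        status = "both_tiers_configured"
    elif datacenter:
        status = "datacenter_only"
    elif residential:
        status = "residential_only"
    else:
        status = "direct_ip"  # validation runs from the developer's own IP

    return {
        "datacenter_url": datacenter,
        "residential_url": residential,
        "proxy_status": status,
        "price_usd_per_gb": {
            # Assumed defaults, matching the example values above.
            "datacenter": float(env.get("PROXY_DATACENTER_PRICE_USD_PER_GB", "2.00")),
            "residential": float(env.get("PROXY_RESIDENTIAL_PRICE_USD_PER_GB", "3.00")),
        },
    }
```

Passing the environment in as a mapping keeps the function testable without mutating `os.environ`.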
Acceptance
- Walmart re-validation with both env vars set produces working_tier on every passing library
- Same scan without env vars set produces direct-IP results + "No proxies configured" banner in the report
Parallel library compare + cloudscraper fallback
What changes
Replace the serial for lib_name, func in _LIBRARY_FUNCS.items() loop in library_compare.py:351 with a ThreadPoolExecutor(max_workers=6) that fans out 3 libraries × 2 proxy attempts in one wave. Once Phase A returns, pick the winner by preference order, not lowest latency:
# Preference cascade: first passing library wins.
LIBRARY_PREFERENCE = ["requests", "httpx", "curl_cffi"]
LIBRARY_FALLBACK = ["cloudscraper"]  # Phase B only

passing = [
    lib for lib in LIBRARY_PREFERENCE
    if phase_a[lib].working_tier is not None
]
if passing:
    best = passing[0]  # requests > httpx > curl_cffi
else:
    best = run_phase_b(LIBRARY_FALLBACK)  # cloudscraper, both proxies
Why preference order, not latency
A scraper that works under requests is the simplest possible production artifact — fewer deps, more readable starter code. Latency differences inside validation are noise (single sample, 200–400ms variance is normal). Preference order is a deliberate simplicity bias.
Cloudscraper as Phase B only
cloudscraper pulls in js2py, can take 4–15s when it has to solve a challenge, and fails noisily. It fires only when all 6 Phase A attempts fail — meaning the site is genuinely hard and we need a JS-aware tool. On easy sites it never runs.
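The Phase A fan-out and the Phase B gate might look like the sketch below. It is simplified to one attempt per (library, proxy) combo, so the 3-sample residential check is omitted; `probe` and `run_phase_b` are hypothetical callables standing in for the real library functions.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional

LIBRARY_PREFERENCE = ["requests", "httpx", "curl_cffi"]
PROXY_TIERS = ["datacenter", "residential"]  # cheapest first

Probe = Callable[[str, str], bool]  # (library, proxy_tier) -> passed?
Combo = tuple[str, str]


def run_phase_a(probe: Probe) -> dict[str, Optional[str]]:
    """Fire all 6 combos in one wave; return the cheapest working tier per library."""
    combos = [(lib, tier) for lib in LIBRARY_PREFERENCE for tier in PROXY_TIERS]
    with ThreadPoolExecutor(max_workers=6) as pool:
        results = list(pool.map(lambda combo: probe(*combo), combos))

    working: dict[str, Optional[str]] = {lib: None for lib in LIBRARY_PREFERENCE}
    for (lib, tier), ok in zip(combos, results):
        if ok and working[lib] is None:  # combos are ordered cheapest tier first
            working[lib] = tier
    return working


def pick_best(working: dict[str, Optional[str]],
              run_phase_b: Callable[[], Optional[Combo]]) -> Optional[Combo]:
    """Preference order wins, not latency; Phase B runs only if Phase A is all-blocked."""
    for lib in LIBRARY_PREFERENCE:
        tier = working[lib]
        if tier is not None:
            return lib, tier
    return run_phase_b()
```

`pool.map` preserves input order, so results can be zipped back onto the combos without tracking futures by hand.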
Bucket B endpoints — inherit scan-level winner, verify cheaply
Bucket A endpoints run the full library × proxy fan-out. Bucket B endpoints do not — they're prerequisites (bootstrap calls, session config), and production code will hit them with the same (library, proxy_tier) the scraper already uses for Bucket A. Running an independent ladder for each B endpoint would waste requests on data we know we won't act on differently.
- Let scan_winner = (best_library, best_proxy_tier) emerge from the union of A endpoints. (0 req)
- Hit each B endpoint once with scan_winner. Record status + bytes. (1 req)
Expected cost: 1 request per B endpoint on most scans (winner inherits cleanly). The mini-fan-out only fires when a B endpoint has stricter checks than the A endpoints; that is rare, but the cascade catches it.
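A rough sketch of the inherit-then-verify flow for Bucket B, assuming hypothetical `probe` and `mini_fan_out` callables; the result keys are illustrative only.

```python
from typing import Callable, Optional

Combo = tuple[str, str]  # (library, proxy_tier)


def verify_bucket_b(
    endpoints: list[str],
    scan_winner: Combo,
    probe: Callable[[str, Combo], bool],             # hypothetical: one request, True on success
    mini_fan_out: Callable[[str], Optional[Combo]],  # hypothetical: full ladder for one endpoint
) -> dict[str, dict]:
    """Spend one request per B endpoint; escalate only on verification failure."""
    results: dict[str, dict] = {}
    for endpoint in endpoints:
        if probe(endpoint, scan_winner):
            results[endpoint] = {"combo": scan_winner, "inherited": True}
        else:
            results[endpoint] = {"combo": mini_fan_out(endpoint), "inherited": False}
    return results
```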
Dependency
rye add cloudscraper
Acceptance
- 3 Phase A libraries × 2 proxies fire in < 2s wall-clock (was ~6s serial)
- Cloudscraper never appears in library_compare.attempts when any Phase A library passed
- Cloudscraper appears in library_compare.attempts when Phase A is all-blocked
- Bucket B reachability completes in < (1.5s × N_b_endpoints) when scan_winner inherits cleanly
Header reduction — tier list + parallel + 2-axis cascade
Tier list (skip-without-testing)
| Tier | Behavior | Members |
|---|---|---|
| always_required | kept by default; never tested | User-Agent |
| always_optional | dropped without testing | Accept-Encoding, Cache-Control, Pragma, DNT, Upgrade-Insecure-Requests, sec-fetch-*, sec-ch-ua-* |
| test_individually | actual removal probe | Everything else: Referer, Origin, Content-Type, Authorization, X-*, Accept, Accept-Language |
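The tier lookup can be sketched with glob-style patterns. This is an illustration, not the shipped list: matching is assumed to be case-insensitive, and the bare `sec-ch-ua` header is assumed to belong with the `sec-ch-ua-*` family.

```python
from fnmatch import fnmatchcase

ALWAYS_REQUIRED = ["user-agent"]
ALWAYS_OPTIONAL = [
    "accept-encoding", "cache-control", "pragma", "dnt",
    "upgrade-insecure-requests", "sec-fetch-*",
    "sec-ch-ua", "sec-ch-ua-*",  # bare sec-ch-ua included by assumption
]


def classify_header(name: str) -> str:
    """Return the tier for one header name; anything unlisted gets a removal probe."""
    lowered = name.lower()
    if lowered in ALWAYS_REQUIRED:
        return "always_required"
    if any(fnmatchcase(lowered, pattern) for pattern in ALWAYS_OPTIONAL):
        return "always_optional"
    return "test_individually"


def split_headers(names: list[str]) -> dict[str, list[str]]:
    """Group header names by tier, preserving input order within each tier."""
    tiers: dict[str, list[str]] = {
        "always_required": [], "always_optional": [], "test_individually": [],
    }
    for name in names:
        tiers[classify_header(name)].append(name)
    return tiers
```

Only the `test_individually` bucket costs requests, which is where the wall-clock saving comes from.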
Parallel + cascade algorithm
- Build the test_individually list (typically 4–8 per endpoint). (0 req)
- In parallel (max_in_flight = 5): fire each removal test with (passing[0], working_tier[0]). (~1 wave)
- Failed removals retry with (passing[0], residential). (+1 wave)
- Still failing: move to (passing[1], working_tier[1]). And so on, hard cap = 6 attempts per header. (+up to 4 waves)
Result shape (one header)
{
"Referer": {
"required_under": [
{"library": "requests", "proxy": "datacenter"},
{"library": "requests", "proxy": "residential"}
],
"optional_under": [
{"library": "curl_cffi", "proxy": "datacenter"}
],
"attempts": [ /* full attempt log, persisted */ ]
}
}
What the report shows
A two-row strip: REQUIRED (collapsed to a single combo recommendation) and DROPPED. The richer per-library detail lives in the evidence accordion. Rationale: production scrapers want one config, not a decision tree.
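Collapsing the per-header results to one recommendation could work roughly as follows. The sketch assumes the per-header shape shown above (with `library` and `proxy` keys) and that the recommendation is taken for the winning combo; `required_for_combo` is a hypothetical name.

```python
def required_for_combo(per_header: dict, library: str, proxy: str) -> list[str]:
    """List the headers a single (library, proxy) combo must send."""
    combo = {"library": library, "proxy": proxy}
    required = ["User-Agent"]  # always_required tier, never tested
    for name, result in per_header.items():
        if combo in result["required_under"]:
            required.append(name)
    return required


per_header = {
    "Referer": {
        "required_under": [
            {"library": "requests", "proxy": "datacenter"},
            {"library": "requests", "proxy": "residential"},
        ],
        "optional_under": [{"library": "curl_cffi", "proxy": "datacenter"}],
        "attempts": [],
    }
}
```

With this input, the `requests` + datacenter combo needs `User-Agent` and `Referer`, while `curl_cffi` + datacenter needs only `User-Agent`.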
Acceptance
- 15-header endpoint completes header reduction in < 6s wall-clock (was ~26s)
- Tier-skipped headers never appear as attempts in the persisted blob
Cookie dependency — 2-axis cascade
Algorithm
Three scenarios (cold, warmup, full) fired in parallel with (passing[0], working_tier[0]). Any scenario that fails cascades through the same 6-attempt ladder as T51.3.
Result shape
{
"per_scenario": {
"cold": { "passes_under": [{"lib":"curl_cffi","proxy":"datacenter"}], ... },
"warmup": { "passes_under": [{"lib":"requests","proxy":"residential"}], ... },
"full": { "passes_under": [...] }
},
"minimum_scenario": {
"requests": "warmup_required",
"curl_cffi": "cold",
"cloudscraper": "warmup_required"
},
"caveat": "point-in-time; anti-bot cookies have TTL"
}
Acceptance
- 3 scenarios × winning combo fire in < 3s parallel
- Cascade only fires on failed scenarios
Rate-limit probe — shrink + rotation flag
Numerics
| | Today | After T51.5 |
|---|---|---|
| Rounds | [3.0, 2.0, 1.0, 0.5, 0.2] | [2.0, 0.5, 0.2] |
| Requests per round | 15 | 5 |
| Max total | 75 | 15 |
| Run policy | always | always (never skipped) |
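As a sanity check on the numbers in the table, the worst-case time spent sleeping between requests can be computed directly. The sketch assumes one full delay after every request, which is an upper bound; request latency comes on top of it.

```python
def sleep_budget_s(delays_s: list[float], requests_per_round: int) -> float:
    """Upper bound on time spent sleeping: one delay per request, every round."""
    return sum(delay * requests_per_round for delay in delays_s)


legacy_budget = sleep_budget_s([3.0, 2.0, 1.0, 0.5, 0.2], 15)
t51_budget = sleep_budget_s([2.0, 0.5, 0.2], 5)
```

This gives 100.5 s of sleep for the legacy schedule and 13.5 s for the new one, which leaves only about 1.5 s for request latency inside the ~15 s worst case.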
Rotation labelling — the critical addition
If the winning proxy is rotating-residential, the probe fires 5 requests from 5 different IPs and the target physically cannot rate-limit. We still run the probe (per your call), but the result carries an explicit warning so the synthesis prompt does not take the measured delay at face value:
{
"library_used": "requests",
"proxy_used": "residential",
"proxy_rotation_mode": "rotating", // NEW
"measurement_caveat": "Cadence measured across rotating IPs.
Production single-IP behavior may trigger
limits sooner. Treat estimated_safe_delay_s
as a lower bound; run extensive tests.",
"estimated_safe_delay_s": 0.75,
"rounds": [ ... ]
}
Synthesis prompt addendum
If `proxy_rotation_mode == "rotating"` on a rate_limit result, prefer a conservative production delay (≥ 1.5 s) in your recommendation regardless of the measured `estimated_safe_delay_s`. Surface the caveat in the verdict.
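The same rule expressed as code, for callers that want to enforce it outside the prompt. This is a sketch; `recommended_delay_s` is a hypothetical helper, and the 1.5 s floor is the value from the addendum.

```python
ROTATING_FLOOR_S = 1.5  # conservative floor from the prompt addendum


def recommended_delay_s(rate_limit: dict) -> float:
    """Apply the floor when the probe ran across rotating IPs."""
    measured = rate_limit["estimated_safe_delay_s"]
    if rate_limit.get("proxy_rotation_mode") == "rotating":
        return max(measured, ROTATING_FLOOR_S)
    return measured
```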
Acceptance
- Probe completes in < 15s worst-case
- proxy_rotation_mode set correctly based on a per-proxy config flag
Bandwidth tracking + cost projection
Per-request additions
Every library function in library_compare.py already returns status, elapsed_ms, body_full, headers. Add two fields, computed before the verbose ones are stripped:
"request_bytes" : 1840, # sum(name+value+4) for headers + len(body)
"response_bytes" : 54120, # sum of response headers + body
Per-endpoint rollup
{
  "bandwidth_summary": {
    "request_count": 38,
    "bytes_sent_total": 70450,
    "bytes_received_total": 1872400,
    "avg_request_bytes": 1854,
    "avg_response_bytes": 49273,
    "per_1k_requests_mb": 51.1
  }
}
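A minimal sketch of the per-request byte count and the per-endpoint rollup. It assumes the `+4` per header covers the `": "` separator and the trailing CRLF, that megabytes are decimal, and that `per_1k_requests_mb` sums request and response bytes as the cost formula does; `wire_bytes` and `summarize_bandwidth` are hypothetical names.

```python
def wire_bytes(headers: dict[str, str], body: bytes) -> int:
    """Approximate on-the-wire size: name + value + 4 per header, plus the body."""
    header_bytes = sum(
        len(name.encode("utf-8")) + len(value.encode("utf-8")) + 4
        for name, value in headers.items()
    )
    return header_bytes + len(body)


def summarize_bandwidth(attempts: list[dict]) -> dict:
    """Roll per-attempt byte counts up into the bandwidth_summary shape."""
    if not attempts:
        raise ValueError("need at least one attempt to summarize")
    count = len(attempts)
    sent = sum(a["request_bytes"] for a in attempts)
    received = sum(a["response_bytes"] for a in attempts)
    avg_request = round(sent / count)
    avg_response = round(received / count)
    return {
        "request_count": count,
        "bytes_sent_total": sent,
        "bytes_received_total": received,
        "avg_request_bytes": avg_request,
        "avg_response_bytes": avg_response,
        # Decimal MB moved per 1,000 requests, request + response.
        "per_1k_requests_mb": round((avg_request + avg_response) * 1000 / 1e6, 1),
    }
```

Counting bytes before the verbose fields are stripped matters because `body_full` is the only place the response size is still available.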
Cost formula (per-tier rate, current vendor)
rate_usd_per_gb = {
"datacenter": PROXY_DATACENTER_PRICE_USD_PER_GB, # 2.00
"residential": PROXY_RESIDENTIAL_PRICE_USD_PER_GB, # 3.00
}[best_proxy_tier]
cost_per_1k_requests_usd = (
avg_request_bytes + avg_response_bytes
) * 1000 / 1e9 * rate_usd_per_gb
# Walmart product page on datacenter:
# (1854 + 49273) * 1000 / 1e9 * 2.00 = $0.102 per 1k requests
# Same endpoint forced to residential:
# (1854 + 49273) * 1000 / 1e9 * 3.00 = $0.153 per 1k requests
Synthesis prompt impact
The cost_band_low_usd and cost_band_high_usd fields stop being hand-picked. The bandwidth_summary is passed in; the prompt instructs the model to compute the band from observed bytes × the vendor rate, plus a 20% retry headroom.
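For reference, the band can also be computed deterministically instead of asking the model to do the arithmetic. The sketch below assumes the low end is the retry-free cost and the high end adds the 20% headroom; that reading of "band" is an assumption, and `cost_band_usd` is a hypothetical helper.

```python
RETRY_HEADROOM = 0.20  # 20% retry headroom from the prompt instruction


def cost_per_1k_usd(avg_request_bytes: int, avg_response_bytes: int,
                    rate_usd_per_gb: float) -> float:
    """Bytes per request x 1,000 requests, converted to decimal GB, times the rate."""
    return (avg_request_bytes + avg_response_bytes) * 1000 / 1e9 * rate_usd_per_gb


def cost_band_usd(avg_request_bytes: int, avg_response_bytes: int,
                  rate_usd_per_gb: float) -> tuple[float, float]:
    """Return (low, high): low assumes no retries, high adds the headroom."""
    low = cost_per_1k_usd(avg_request_bytes, avg_response_bytes, rate_usd_per_gb)
    high = low * (1 + RETRY_HEADROOM)
    return round(low, 3), round(high, 3)


datacenter_band = cost_band_usd(1854, 49273, 2.00)
residential_band = cost_band_usd(1854, 49273, 3.00)
```

With the Walmart figures from the cost formula this yields roughly $0.102 to $0.123 per 1k requests on datacenter and $0.153 to $0.184 on residential.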
Acceptance
- Every persisted attempt carries request_bytes and response_bytes
- Walmart re-validation report shows a computed cost per 1k requests, not a guessed range
DB schema — append-only validation_runs
Why append-only
A site's defenses change. If today's validation says datacenter works and a re-run tomorrow says datacenter blocked, both outcomes are evidence. Same pattern as llm_evals (T48): every run inserts a row, the latest non-errored row per (scan_id, endpoint_id) is "current."
Schema
CREATE TABLE validation_runs (
  id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
  scan_id uuid NOT NULL REFERENCES scans(id),
  endpoint_id text NOT NULL,
  status text NOT NULL, -- 'complete' | 'errored' | 'partial'
  created_at timestamptz NOT NULL DEFAULT now(),
  duration_ms int NOT NULL,
  validation_data jsonb NOT NULL, -- the rich blob from T51.8
  error_message text
);
CREATE INDEX idx_validation_runs_scan_endpoint
  ON validation_runs (scan_id, endpoint_id, created_at DESC);
Read path
SELECT DISTINCT ON (scan_id, endpoint_id) *
FROM validation_runs
WHERE scan_id = $1
  AND status <> 'errored' -- the latest non-errored row is "current"; partial runs count
ORDER BY scan_id, endpoint_id, created_at DESC;
Acceptance
- Alembic migration creates the table + index
- Re-running validation on a scan inserts a new row, doesn't overwrite
LLM payload compaction + Validation section in report
Two derived projections over validation_data
The persisted blob is the source of truth; the LLM and the report each read a slim derived view. Nothing recomputes from raw attempts at read time — the projections are materialized during validation and stored as separate JSONB keys on the same row (llm_view, report_view).
LLM payload (~250 tok / endpoint, was ~4,600)
{
  "endpoint_id": "ep_017",
  "method": "GET",
  "best": {
    "library": "requests",
    "proxy_tier": "datacenter",
    "elapsed_ms": 320
  },
  "library_matrix": {
    "requests": {"datacenter": "ok:200",
                 "residential": "skipped"},
    "httpx": {"datacenter": "ok:200",
              "residential": "skipped"},
    "curl_cffi": {"datacenter": "ok:200",
                  "residential": "skipped"},
    "cloudscraper": {"phase": "B_skipped"}
  },
  "min_required_headers": ["User-Agent", "Referer"],
  "headers_tested": 5, "headers_skipped": 12,
  "cookies": "warmup_required",
  "rate_limit": {
    "safe_delay_s": 0.75,
    "proxy_rotation_mode": "static",
    "trigger": "429@0.2s/round3"
  },
  "bandwidth": {"per_1k_requests_usd": 0.102}
}
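Deriving the library_matrix part of this payload from the raw attempt log could look roughly like the sketch below. The `"blocked:<status>"` cell format and the `"B_ran"` marker are assumptions made for illustration; only `"ok:200"`, `"skipped"`, and `"B_skipped"` appear in the payload above.

```python
PHASE_A_LIBRARIES = ("requests", "httpx", "curl_cffi")
PROXY_TIERS = ("datacenter", "residential")


def build_library_matrix(attempts: list[dict]) -> dict:
    """Project the attempt log down to one short status string per combo."""
    matrix: dict = {
        lib: {tier: "skipped" for tier in PROXY_TIERS} for lib in PHASE_A_LIBRARIES
    }
    for attempt in attempts:
        if attempt["phase"] != "A":
            continue
        verdict = "blocked" if attempt["is_block"] else "ok"  # "blocked" is an assumed label
        matrix[attempt["library"]][attempt["proxy"]] = f"{verdict}:{attempt['status']}"

    phase_b_ran = any(a["phase"] == "B" for a in attempts)
    matrix["cloudscraper"] = {"phase": "B_ran" if phase_b_ran else "B_skipped"}
    return matrix


attempts = [
    {"phase": "A", "library": "requests", "proxy": "datacenter",
     "status": 200, "is_block": False},
    {"phase": "A", "library": "httpx", "proxy": "datacenter",
     "status": 403, "is_block": True},
]
matrix = build_library_matrix(attempts)
```

Combos with no recorded attempt stay `"skipped"`, which keeps the payload small while still telling the model what was never measured.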
Report section — Validation
Mock below. Same identity as the rest of the report; rendered server-side from report_view.
What the user actually sees.
A four-stat summary panel at the top; one validation card per Bucket A endpoint; raw attempts behind a per-endpoint evidence accordion. Each card has four blocks: library × proxy matrix, header strip, cookie scenarios, rate-limit chart. All rendered live from the new report_view JSON.
Endpoint · GET /orchestra/pdp/graphql/ItemById/<id>
Raw attempts · 28 entries
[
{"phase": "A", "lib": "requests", "proxy": "datacenter",
"status": 200, "elapsed_ms": 320, "req_bytes": 1840, "resp_bytes": 54120},
{"phase": "A", "lib": "httpx", "proxy": "datacenter",
"status": 200, "elapsed_ms": 290, "req_bytes": 1840, "resp_bytes": 54120},
// ... 26 more
]
One blob, two projections.
Every validation run inserts one row into validation_runs with the full attempt log in validation_data. Two derived JSONB keys — llm_view and report_view — are computed during validation and stored alongside. Reads never recompute.
Full validation_data shape (one endpoint, expanded)
{
"endpoint_id": "ep_017",
"url_template": "/orchestra/pdp/graphql/ItemById/<id>",
"method": "GET",
"validation_started_at": "2026-05-13T22:14:08Z",
"validation_duration_ms": 24310,
"proxy_status": "both_tiers_configured", // "direct_ip" | "datacenter_only" | "residential_only" | "both_tiers_configured"
"library_compare": {
"attempts": [
{"phase": "A", "library": "requests", "proxy": "datacenter",
"status": 200, "elapsed_ms": 320, "is_block": false,
"request_bytes": 1840, "response_bytes": 54120},
{"phase": "A", "library": "httpx", "proxy": "datacenter",
"status": 200, "elapsed_ms": 290, "is_block": false,
"request_bytes": 1840, "response_bytes": 54120}
/* ... all attempts, including residential samples + cloudscraper if Phase B fired */
],
"passing_libraries": [
{"library": "requests", "working_tier": "datacenter"},
{"library": "httpx", "working_tier": "datacenter"},
{"library": "curl_cffi", "working_tier": "datacenter"}
],
"best_library": "requests",
"best_proxy_tier": "datacenter"
},
"header_reduction": {
"tested_headers": ["Referer", "Origin", "X-Trace-Id", "Content-Type", "Accept"],
"skipped_by_tier": ["Accept-Encoding", "Cache-Control", "sec-ch-ua", /* 9 total */],
"per_header": {
"Referer": {
"required_under": [{"lib":"requests","proxy":"datacenter"}],
"optional_under": [{"lib":"curl_cffi","proxy":"datacenter"}],
"attempts": [ /* per-attempt log */ ]
}
/* ... one entry per tested header */
}
},
"cookie_dependency": {
"per_scenario": { /* cold | warmup | full */ },
"minimum_scenario": { "requests": "warmup_required", /* per library */ },
"caveat": "point-in-time; anti-bot cookies have TTL"
},
"rate_limit": {
"library_used": "requests",
"proxy_used": "datacenter",
"proxy_rotation_mode": "static",
"rounds": [
{"delay_s": 2.0, "requests_fired": 5, "passes": 5, "fails": 0,
"avg_latency_ms": 412, "trigger": null},
{"delay_s": 0.5, "requests_fired": 5, "passes": 5, "fails": 0},
{"delay_s": 0.2, "requests_fired": 5, "passes": 3, "fails": 2,
"trigger": "http_429_at_req_3"}
],
"estimated_safe_delay_s": 0.75,
"last_safe_round_delay_s": 0.5
},
"bandwidth_summary": {
"request_count": 28,
"bytes_sent_total": 51520,
"bytes_received_total": 1379240,
"avg_request_bytes": 1840,
"avg_response_bytes": 49258,
"per_1k_requests_mb": 51.1,
"per_1k_requests_usd": 0.102 // 51.1 MB × $2.00/GB (datacenter rate)
},
"warnings": []
}
What this still doesn't prove.
Five honest limits baked into the design. Each is surfaced in the report copy so users know what they're looking at.
| Limit | Why it matters | How we surface it |
|---|---|---|
| Spot-check, not validation-at-scale | ~100 requests over 30s ≠ 50,000 requests / hour from a sticky IP | Top-of-section banner with the request count + timestamp; synthesis prompt gets validation_scope: "spot_check" |
| Rotating-residential rate-limit is structurally noisy | Each request is a different IP; target can't rate-limit a single source it never sees again | proxy_rotation_mode = "rotating" flag + measurement caveat in the JSON; synthesis prompt told to recommend ≥ 1.5s regardless |
| Cookie state is point-in-time | Anti-bot cookies have TTLs; "cold works" today may need warmup in 5 minutes | caveat field on the cookie result; pill in the report's header strip |
| Direct-IP fallback when proxies unset | Local CLI runs validate from the developer's home IP; production scrape will differ | Yellow banner at the top of the Validation section; proxy_status = "direct_ip" in JSON |
| Single-attempt header reduction is noisy | One failed removal request promoted to "required" could be flaky, not a true requirement | Cascade naturally re-tests with a different library; cross-library agreement on "required" raises confidence |
Flag, ship, verify, flip.
- Gate the new path behind BROWSER_RECON_USE_T51_VALIDATION=1. Default off. Legacy code path stays. (~5 days)
- Write validation_runs rows alongside the legacy results. (A/B)
- Compare library_matrix + bandwidth.per_1k_requests_usd against today's synthesis output. Sonnet's recommendation should change from "residential (inferred)" to whichever tier actually passed. (verify)
- Remove the BROWSER_RECON_USE_LEGACY_VALIDATION path after 1 production cycle without rollback. (cleanup)
All decisions closed.
No remaining open questions. Recording the closed decisions so engineering doesn't relitigate them at ticket time.
| | Question | Resolution | Applied at |
|---|---|---|---|
| A | Per-tier proxy pricing? | $2.00/GB datacenter, $3.00/GB residential. Split into two env vars. | T51.1 env list · T51.6 cost formula |
| B | Bucket B endpoints: proxy ladder, and inherit scan winner? | Yes proxy. Inherit scan-level (library, proxy_tier) winner from Bucket A; verify with one request per B endpoint; mini-fan-out only on verification failure. | T51.2 "Bucket B endpoints" sub-section |
| C | Partial-completion semantics? | Keep partial. Validation runs ship with explicit per-step status (complete / partial / errored) so downstream consumers know which sub-results to trust. | T51.7 schema (status column already supports it) |
| D | Cloudscraper Phase B: fall through to direct IP if both proxies block? | No. Result isn't actionable for production scraping; endpoint is marked needs_better_proxy_or_browser. | T51.2 Phase B definition |