TASK · T51 · SPEC
Validation Redesign · 2026-05-13 · v1 · draft for review

Stop guessing what works. Measure it through the right proxy.

The current validation layer fires from the user's home IP, picks the lowest-latency library among all that pass, runs every captured header through a one-at-a-time reduction, and takes ~150s per Bucket A endpoint. The fix is a two-axis library × proxy cascade with parallelism, bandwidth tracking, and a persisted JSON shape rich enough to drive a real cost projection rather than a guess.
Author: Claude (drafting) · Lazy (review)
Scope: browser_recon/validation/
Predecessor: T43 (header hoist)
Cost target: −83% wall-clock
Status: Spec · ready for build

Eight sub-tasks. One source of truth: the persisted validation_data blob.

Library and proxy decisions stop being inferred at synthesis time. Validation captures every library × proxy × header × scenario attempt, persists the full attempt log to a new append-only validation_runs table, and exposes two slim derived views — one for the LLM prompt, one for the report's new Validation section. The synthesis prompt's cost_band becomes a real number computed from observed bytes × the configured $/GB.

Wall-clock per Bucket A endpoint: ~150s → ~25s
LLM input tokens from validation: 14k → 1k
Datacenter / residential rates: $2 / $3 per GB
Sub-tasks, shippable in order: 8
01 / Rationale

Three concrete failures in production.

The two Walmart scans on 2026-05-12 exposed every weakness in the current validation flow. The first scan (gpt-5.4 driving synthesis) shipped httpx + no proxy at 0.91 confidence against a site running PerimeterX, Akamai, Imperva and Cloudflare. The second (Sonnet driving synthesis) caught the protections — but its residential proxy recommendation was reasoned from priors, not measured. Validation should be doing this measurement.

What today's validation actually proves, on the Walmart scan
Claim in the report | What was actually measured | Gap
"Use curl_cffi + chrome120" | curl_cffi returned 200 from the developer's residential IP | Datacenter never tested
"Recommend residential proxy" | Nothing; no proxy was used in validation | Pure inference
"Cost $0.40–$1.50 per 1k requests" | Hand-typed default in the synthesis system prompt | Not derived from data
"min_required_headers = [User-Agent, Referer]" | 17 one-at-a-time removal requests over ~26s with curl_cffi only | Library-specific; serial
"Rate-limit safe at 0.75s" | 75 sequential requests at 5 graduated delays, single library | No proxy context
02 / Algorithm overview

Two-axis cascade: library × proxy.

Every sub-test walks the same fallback ladder: cheapest library at its cheapest working proxy first; escalate the proxy before escalating the library; promote to cloudscraper only if the first three libraries all fail with both proxies. Each sub-test caps at 6 attempts.

[Figure 1 diagram, summarized. Phase A: parallel fan-out of 3 libraries (requests, httpx, curl_cffi) × 2 proxies (datacenter, residential). In the example shown, all three return 200 on datacenter (320 ms, 290 ms, 410 ms) and residential is skipped. Preference: requests > httpx > curl_cffi, so the winner is requests + datacenter. If at least one attempt passes, the winning (library, proxy_tier) feeds step 2 (header reduction, cascade on failures only), step 3 (cookie dependency, cascade on failures only), and step 4 (rate-limit probe, single (lib, proxy), no cascade). If all 6 attempts fail, Phase B runs cloudscraper as a last resort on datacenter and residential. Cascade within a single sub-test (steps 2-3 only): passing[0] + working_tier, passing[0] + residential, passing[1] + working_tier, ... up to 6 attempts. Early exit on first pass; the cascade fires only on a failed sub-test.]
FIG. 1 — Validation cascade. Phase A is one parallel fan-out. Phase B fires only when Phase A returns no passing library.

Wall-clock budget per Bucket A endpoint

Step | Today | After T51 | How
1 · Library compare | ~6 s | ~3 s | 3 libs × 2 proxies all in one ThreadPoolExecutor pass
2 · Header reduction | ~26 s | ~4 s | Tier-list skips 9 headers + parallel tests cap=5
3 · Cookie dependency | ~5 s | ~3 s | Parallel across 3 scenarios; cascade on failures only
4 · Rate-limit probe | ~120 s worst | ~15 s worst | 3 rounds × 5 reqs (was 5 × 15)
Total | ~150 s | ~25 s | −83%
03 / Sub-tasks

Eight changes, sequenced.

Each sub-task is independently shippable behind the flag BROWSER_RECON_USE_T51_VALIDATION=1. T51.8 (compaction + report) is the final flip — until it lands the existing payload shape remains.

T51.1

Proxy ladder + 3-sample residential

Foundation · ~1 day
What changes

Read DATACENTER_PROXY and RESIDENTIAL_PROXY from env. Every HTTP call inside library_compare.py threads a proxy_url argument through. Per-library ladder: try datacenter once; on block/fail, fire 3 sequential requests through residential. Residential outcome:

  • 3 / 3 pass → record working_tier = "residential", spot-check passed
  • 2 / 3 pass → record working_tier = "residential", label as flaky
  • 0–1 / 3 pass → library marked blocked under residential too
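The three residential outcomes above can be read as a small classification rule. A minimal sketch follows; the function name and the label strings ("spot_check_passed", "flaky", "blocked") are hypothetical and only illustrate the thresholds, they are not names taken from library_compare.py.

```python
from typing import Optional, Tuple


def classify_residential(passes: int, samples: int = 3) -> Tuple[Optional[str], str]:
    """Map a residential pass count to (working_tier, label).

    Thresholds follow the spec: all samples pass -> stable,
    exactly one miss -> flaky, anything lower -> blocked.
    """
    if not 0 <= passes <= samples:
        raise ValueError("passes must be between 0 and samples")
    if passes == samples:
        return "residential", "spot_check_passed"
    if passes == samples - 1:
        return "residential", "flaky"
    # 0 or 1 of 3: the library is treated as blocked under residential too.
    return None, "blocked"
```

With the default of 3 samples, classify_residential(2) yields ("residential", "flaky") and classify_residential(1) yields (None, "blocked").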
When env vars are unset

Validation runs on the developer's direct IP — it's the user's CLI machine, not a server. The report's Validation section surfaces a No proxies configured banner and the synthesis system prompt is told proxy_status="direct_ip" so its recommendation includes a caveat.

Env vars to wire
DATACENTER_PROXY=http://user:pass@dc.example.com:8080
RESIDENTIAL_PROXY=http://user:pass@residential.example.com:8080
PROXY_DATACENTER_PRICE_USD_PER_GB=2.00
PROXY_RESIDENTIAL_PRICE_USD_PER_GB=3.00
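One way the env wiring could look, as a sketch only. The function name load_proxy_config and the returned dict layout are assumptions; the four proxy_status values are the ones listed in the persisted schema, and the price defaults simply mirror the example values above.

```python
import os
from typing import Mapping, Optional


def load_proxy_config(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read proxy URLs and per-GB prices, and derive proxy_status."""
    env = os.environ if env is None else env
    datacenter = env.get("DATACENTER_PROXY") or None
    residential = env.get("RESIDENTIAL_PROXY") or None

    if datacenter and residential:
        status = "both_tiers_configured"
    elif datacenter:
        status = "datacenter_only"
    elif residential:
        status = "residential_only"
    else:
        status = "direct_ip"  # validation runs from the developer's own IP

    return {
        "proxy_status": status,
        "proxies": {"datacenter": datacenter, "residential": residential},
        "price_usd_per_gb": {
            # Defaults are an assumption for the sketch, not a spec requirement.
            "datacenter": float(env.get("PROXY_DATACENTER_PRICE_USD_PER_GB", "2.00")),
            "residential": float(env.get("PROXY_RESIDENTIAL_PRICE_USD_PER_GB", "3.00")),
        },
    }
```

Passing a plain dict instead of os.environ keeps the function easy to test without touching the process environment.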
Acceptance
  • Walmart re-validation with both env vars set produces working_tier on every passing library
  • Same scan without env vars set produces direct-IP results + "No proxies configured" banner in the report
T51.2

Parallel library compare + cloudscraper fallback

Algorithm · ~1 day
What changes

Replace the serial for lib_name, func in _LIBRARY_FUNCS.items() loop in library_compare.py:351 with a ThreadPoolExecutor(max_workers=6) that fans out 3 libraries × 2 proxy attempts in one wave. Once Phase A returns, pick the winner by preference order, not lowest latency:

# Preference cascade — first passing library wins.
LIBRARY_PREFERENCE = ["requests", "httpx", "curl_cffi"]
LIBRARY_FALLBACK   = ["cloudscraper"]  # Phase B only

passing = [
    lib for lib in LIBRARY_PREFERENCE
    if phase_a[lib].working_tier is not None
]

if passing:
    best = passing[0]                   # requests > httpx > curl_cffi
else:
    best = run_phase_b(LIBRARY_FALLBACK)  # cloudscraper, both proxies
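For reviewers who want to see the fan-out and the preference pick together, here is a self-contained sketch. It assumes a hypothetical fetch(library, proxy_tier) callable that returns True on a pass, and it fires all six combinations unconditionally; the real library functions, block detection, and the 3-sample residential check from T51.1 are not modeled.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Optional, Tuple

LIBRARY_PREFERENCE = ["requests", "httpx", "curl_cffi"]
PROXY_TIERS = ["datacenter", "residential"]  # cheapest tier first

Combo = Tuple[str, str]


def fan_out(fetch: Callable[[str, str], bool]) -> Dict[Combo, bool]:
    """Run every (library, proxy_tier) combination in one parallel wave."""
    combos = [(lib, tier) for lib in LIBRARY_PREFERENCE for tier in PROXY_TIERS]
    with ThreadPoolExecutor(max_workers=6) as pool:
        futures = {combo: pool.submit(fetch, *combo) for combo in combos}
        # result() re-raises any exception from fetch; a real caller
        # would likely catch it and record the attempt as failed.
        return {combo: future.result() for combo, future in futures.items()}


def pick_winner(results: Dict[Combo, bool]) -> Optional[Combo]:
    """First passing library in preference order, at its cheapest passing tier."""
    for lib in LIBRARY_PREFERENCE:
        for tier in PROXY_TIERS:
            if results.get((lib, tier)):
                return lib, tier
    return None  # nothing passed: the caller falls through to Phase B
```

Latency never enters pick_winner, which is the point of the preference cascade: ordering alone decides.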
Why preference order, not latency

A scraper that works under requests is the simplest possible production artifact — fewer deps, more readable starter code. Latency differences inside validation are noise (single sample, 200–400ms variance is normal). Preference order is a deliberate simplicity bias.

Cloudscraper as Phase B only

cloudscraper pulls in js2py, can take 4–15s when it has to solve a challenge, and fails noisily. It fires only when all 6 Phase A attempts fail — meaning the site is genuinely hard and we need a JS-aware tool. On easy sites it never runs.

Bucket B endpoints — inherit scan-level winner, verify cheaply

Bucket A endpoints run the full library × proxy fan-out. Bucket B endpoints do not — they're prerequisites (bootstrap calls, session config), and production code will hit them with the same (library, proxy_tier) the scraper already uses for Bucket A. Running an independent ladder for each B endpoint would waste requests on data we know we won't act on differently.

1. Wait for Bucket A to finish; let scan_winner = (best_library, best_proxy_tier) emerge from the union of A endpoints. (0 req)
2. For each Bucket B endpoint, fire ONE request with scan_winner. Record status + bytes. (1 req)
3. If the verification fails, run a mini Phase A on that endpoint (library × proxy, same algorithm as Bucket A's step 1). Cap at 6 attempts per failed B endpoint. (+up to 6 req on failures only)
4. No header reduction, no cookie scenarios, no rate-limit probe on Bucket B. Reachability + bandwidth only. (0 req)

Expected cost: 1 request per B endpoint on most scans (winner inherits cleanly). The mini-fan-out only fires when a B endpoint has stricter checks than the A endpoints — rare, but the cascade catches it.
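The inherit-then-verify flow for Bucket B might be sketched as below. The fetch(endpoint, library, proxy_tier) callable is hypothetical, the fallback runs serially here for readability (the spec's mini Phase A is parallel), and skipping the already-failed scan_winner in the fallback is an assumption.

```python
from typing import Callable, List, Tuple

Combo = Tuple[str, str]


def verify_bucket_b(
    endpoint: str,
    scan_winner: Combo,
    fetch: Callable[[str, str, str], bool],
    fallback_combos: List[Combo],
    cap: int = 6,
) -> dict:
    """One request with the inherited winner; mini fan-out only on failure."""
    used = 1
    if fetch(endpoint, *scan_winner):
        return {"endpoint": endpoint, "winner": scan_winner, "requests_used": used}

    # Verification failed: walk the remaining combos, capped at `cap` attempts.
    remaining = [combo for combo in fallback_combos if combo != scan_winner]
    for combo in remaining[:cap]:
        used += 1
        if fetch(endpoint, *combo):
            return {"endpoint": endpoint, "winner": combo, "requests_used": used}
    return {"endpoint": endpoint, "winner": None, "requests_used": used}
```

On a clean inherit the function returns after a single request, which matches the expected cost of 1 request per B endpoint.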

Dependency
rye add cloudscraper
Acceptance
  • 3 Phase A libraries × 2 proxies fire in < 2s wall-clock (was ~6s serial)
  • Cloudscraper never appears in library_compare.attempts when any Phase A library passed
  • Cloudscraper appears in library_compare.attempts when Phase A is all-blocked
  • Bucket B reachability completes in < (1.5s × N_b_endpoints) when scan_winner inherits cleanly
T51.3

Header reduction — tier list + parallel + 2-axis cascade

Wall-clock · ~1.5 days
Tier list (skip-without-testing)
Tier | Behavior | Members
always_required | kept by default; never tested | User-Agent
always_optional | dropped without testing | Accept-Encoding, Cache-Control, Pragma, DNT, Upgrade-Insecure-Requests, sec-fetch-*, sec-ch-ua-*
test_individually | actual removal probe | Everything else: Referer, Origin, Content-Type, Authorization, X-*, Accept, Accept-Language
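The tier list lends itself to a small lookup with wildcard matching. A sketch using the standard-library fnmatch module; the function names are hypothetical, matching is case-insensitive by lowercasing first, and a bare "sec-ch-ua" pattern is added as an assumption because "sec-ch-ua-*" alone would not match the unsuffixed header.

```python
from fnmatch import fnmatchcase
from typing import Dict, Iterable, List

ALWAYS_REQUIRED = ["user-agent"]
ALWAYS_OPTIONAL = [
    "accept-encoding", "cache-control", "pragma", "dnt",
    "upgrade-insecure-requests", "sec-fetch-*",
    "sec-ch-ua", "sec-ch-ua-*",  # bare name added as an assumption
]


def classify_header(name: str) -> str:
    """Return the tier a captured header falls into."""
    lowered = name.lower()
    if lowered in ALWAYS_REQUIRED:
        return "always_required"
    if any(fnmatchcase(lowered, pattern) for pattern in ALWAYS_OPTIONAL):
        return "always_optional"
    return "test_individually"


def split_by_tier(headers: Iterable[str]) -> Dict[str, List[str]]:
    """Group captured header names by tier, preserving capture order."""
    tiers: Dict[str, List[str]] = {
        "always_required": [], "always_optional": [], "test_individually": [],
    }
    for name in headers:
        tiers[classify_header(name)].append(name)
    return tiers
```

Only the test_individually bucket costs requests; the other two are resolved without touching the network.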
Parallel + cascade algorithm
1. Filter captured headers through the tier list. Remaining headers → test_individually list (typically 4–8 per endpoint). (0 req)
2. Wave 1 (parallel, max_in_flight = 5): fire each removal test with (passing[0], working_tier[0]). (~1 wave)
3. For each header whose removal failed in Wave 1, escalate proxy: (passing[0], residential). (+1 wave)
4. For each header still failing, escalate library: (passing[1], working_tier[1]). And so on, hard cap = 6 attempts per header. (+up to 4 waves)
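The escalation order in steps 2 to 4 can be written down as a ladder of (library, proxy) pairs. This is a sketch under one assumption not stated in the spec: when a library's working tier is already residential, the duplicate residential rung is dropped rather than retried. The function name is hypothetical.

```python
from typing import List, Tuple

Combo = Tuple[str, str]


def attempt_ladder(passing: List[Combo], cap: int = 6) -> List[Combo]:
    """Build the per-sub-test cascade.

    `passing` holds (library, working_tier) pairs in preference order.
    Each library is tried at its working tier first, then at residential,
    before the cascade moves on to the next library.
    """
    ladder: List[Combo] = []
    for library, working_tier in passing:
        for tier in (working_tier, "residential"):
            combo = (library, tier)
            if combo not in ladder:  # skip the duplicate residential rung
                ladder.append(combo)
    return ladder[:cap]
```

For three libraries that all pass on datacenter the ladder has exactly six rungs, which lines up with the hard cap of 6 attempts per header.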
Result shape (one header)
{
  "Referer": {
    "required_under": [
      {"library": "requests", "proxy": "datacenter"},
      {"library": "requests", "proxy": "residential"}
    ],
    "optional_under": [
      {"library": "curl_cffi", "proxy": "datacenter"}
    ],
    "attempts": [ /* full attempt log, persisted */ ]
  }
}
What the report shows

A two-row strip: REQUIRED (collapsed to a single combo recommendation) and DROPPED. The richer per-library detail lives in the evidence accordion. Rationale: production scrapers want one config, not a decision tree.

Acceptance
  • 15-header endpoint completes header reduction in < 6s wall-clock (was ~26s)
  • Tier-skipped headers never appear as attempts in the persisted blob
T51.4

Cookie dependency — 2-axis cascade

~0.5 day
Algorithm

Three scenarios (cold, warmup, full) fired in parallel with (passing[0], working_tier[0]). Any scenario that fails cascades through the same 6-attempt ladder as T51.3.

Result shape
{
  "per_scenario": {
    "cold":   { "passes_under": [{"library":"curl_cffi","proxy":"datacenter"}], ... },
    "warmup": { "passes_under": [{"library":"requests","proxy":"residential"}], ... },
    "full":   { "passes_under": [...] }
  },
  "minimum_scenario": {
    "requests":     "warmup_required",
    "curl_cffi":    "cold",
    "cloudscraper": "warmup_required"
  },
  "caveat": "point-in-time; anti-bot cookies have TTL"
}
Acceptance
  • 3 scenarios × winning combo fire in < 3s parallel
  • Cascade only fires on failed scenarios
T51.5

Rate-limit probe — shrink + rotation flag

~0.5 day
Numerics
Setting | Today | After T51.5
Rounds | [3.0, 2.0, 1.0, 0.5, 0.2] | [2.0, 0.5, 0.2]
Requests per round | 15 | 5
Max total | 75 | 15
Run policy | always | always (never skipped)
Rotation labelling — the critical addition

If the winning proxy is rotating-residential, the probe fires 5 requests from 5 different IPs, so the target cannot rate-limit by source IP. We still run the probe (per your call), but the result carries an explicit warning so the synthesis prompt does not take the measured delay at face value:

{
  "library_used": "requests",
  "proxy_used": "residential",
  "proxy_rotation_mode": "rotating",        // NEW
  "measurement_caveat": "Cadence measured across rotating IPs. Production single-IP behavior may trigger limits sooner. Treat estimated_safe_delay_s as a lower bound; run extensive tests.",
  "estimated_safe_delay_s": 0.75,
  "rounds": [ ... ]
}
Synthesis prompt addendum
If `proxy_rotation_mode == "rotating"` on a rate_limit result,
prefer a conservative production delay (≥ 1.5 s) in your recommendation
regardless of the measured `estimated_safe_delay_s`. Surface the caveat
in the verdict.
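A sketch of how estimated_safe_delay_s could be derived from the rounds, plus a code mirror of the prompt rule above. The 1.5 multiplier follows the report mock's "last_safe × 1.5" note; returning None when even the slowest round fails, and enforcing the 1.5 s floor in code rather than in the prompt, are assumptions for illustration. Function names are hypothetical.

```python
from typing import List, Optional


def estimate_safe_delay(rounds: List[dict], multiplier: float = 1.5) -> Optional[float]:
    """Last fully passing round's delay times a safety multiplier.

    `rounds` is ordered slowest cadence first; each entry needs
    "delay_s" and "fails". Scanning stops at the first round with a failure.
    """
    last_safe = None
    for rnd in rounds:
        if rnd["fails"] > 0:
            break
        last_safe = rnd["delay_s"]
    if last_safe is None:
        return None  # no round was clean; nothing safe to extrapolate from
    return last_safe * multiplier


def recommended_delay(estimated: Optional[float], rotation_mode: str,
                      floor_s: float = 1.5) -> Optional[float]:
    """Apply the conservative floor when the probe ran over rotating IPs."""
    if estimated is None:
        return None
    if rotation_mode == "rotating":
        return max(estimated, floor_s)
    return estimated
```

For rounds of 2.0 s and 0.5 s passing and 0.2 s failing, the estimate is 0.5 × 1.5 = 0.75 s, and a rotating proxy lifts the recommendation to 1.5 s.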
Acceptance
  • Probe completes in < 15s worst-case
  • proxy_rotation_mode set correctly based on a per-proxy config flag
T51.6

Bandwidth tracking + cost projection

Cost ground-truth · ~0.5 day
Per-request additions

Every library function in library_compare.py already returns status, elapsed_ms, body_full, headers. Add two fields, computed before the verbose ones are stripped:

"request_bytes"  : 1840,   # sum(name+value+4) for headers + len(body)
"response_bytes" : 54120,  # sum of response headers + body
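The byte accounting in the comments above could be implemented roughly as follows. The function name is hypothetical. The +4 per header covers the ": " separator and the trailing CRLF; the request or status line, TLS framing, and any transfer compression are not counted, so this is an approximation of on-the-wire size.

```python
from typing import Mapping


def approx_message_bytes(headers: Mapping[str, str], body: bytes = b"") -> int:
    """Approximate HTTP message size: sum(name + value + 4) + len(body)."""
    header_bytes = sum(
        len(name.encode("utf-8")) + len(value.encode("utf-8")) + 4
        for name, value in headers.items()
    )
    return header_bytes + len(body)
```

The same helper serves both directions: call it once with the request headers and body for request_bytes, and once with the response headers and body for response_bytes.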
Per-endpoint rollup
{
  "bandwidth_summary": {
    "request_count": 38,
    "bytes_sent_total": 70450,
    "bytes_received_total": 1872400,
    "avg_request_bytes": 1854,
    "avg_response_bytes": 49273,
    "per_1k_requests_mb": 51.1
  }
}
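A possible rollup over the persisted attempts, as a sketch. The function name is hypothetical; averages use integer floor division, and per_1k_requests_mb counts both directions, as the cost formula does. Both choices are assumptions about rounding and scope rather than confirmed behavior.

```python
from typing import List


def summarize_bandwidth(attempts: List[dict]) -> dict:
    """Aggregate request_bytes / response_bytes into a bandwidth_summary."""
    count = len(attempts)
    if count == 0:
        raise ValueError("no attempts to summarize")
    sent = sum(a["request_bytes"] for a in attempts)
    received = sum(a["response_bytes"] for a in attempts)
    avg_request = sent // count
    avg_response = received // count
    return {
        "request_count": count,
        "bytes_sent_total": sent,
        "bytes_received_total": received,
        "avg_request_bytes": avg_request,
        "avg_response_bytes": avg_response,
        # Decimal megabytes (1e6 bytes) per 1,000 requests, both directions.
        "per_1k_requests_mb": round((avg_request + avg_response) * 1000 / 1e6, 1),
    }
```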
Cost formula (per-tier rate, current vendor)
rate_usd_per_gb = {
    "datacenter":  PROXY_DATACENTER_PRICE_USD_PER_GB,   # 2.00
    "residential": PROXY_RESIDENTIAL_PRICE_USD_PER_GB,  # 3.00
}[best_proxy_tier]

cost_per_1k_requests_usd = (
    avg_request_bytes + avg_response_bytes
) * 1000 / 1e9 * rate_usd_per_gb

# Walmart product page on datacenter:
#   (1854 + 49273) * 1000 / 1e9 * 2.00 = $0.102 per 1k requests
# Same endpoint forced to residential:
#   (1854 + 49273) * 1000 / 1e9 * 3.00 = $0.153 per 1k requests
Synthesis prompt impact

The cost_band_low_usd and cost_band_high_usd fields stop being hand-picked. The bandwidth_summary is passed in; the prompt instructs the model to compute the band from observed bytes × the vendor rate, plus a 20% retry headroom.
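To make the band arithmetic concrete, here is a sketch that applies the cost formula and then the retry headroom. Reading the band as low = no retries and high = base plus 20% is an assumption about how the prompt will be worded; the function names are hypothetical.

```python
from typing import Tuple


def cost_per_1k_usd(avg_request_bytes: int, avg_response_bytes: int,
                    rate_usd_per_gb: float) -> float:
    """Observed bytes per request, scaled to 1,000 requests, priced per decimal GB."""
    return (avg_request_bytes + avg_response_bytes) * 1000 / 1e9 * rate_usd_per_gb


def cost_band(avg_request_bytes: int, avg_response_bytes: int,
              rate_usd_per_gb: float, retry_headroom: float = 0.20) -> Tuple[float, float]:
    """Return (cost_band_low_usd, cost_band_high_usd), rounded to 3 decimals."""
    base = cost_per_1k_usd(avg_request_bytes, avg_response_bytes, rate_usd_per_gb)
    return round(base, 3), round(base * (1 + retry_headroom), 3)
```

With the Walmart datacenter figures (1854 and 49273 bytes at $2.00/GB), cost_band gives (0.102, 0.123).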

Acceptance
  • Every persisted attempt carries request_bytes and response_bytes
  • Walmart re-validation report shows a computed cost per 1k requests, not a guessed range
T51.7

DB schema — append-only validation_runs

Persistence · ~0.5 day
Why append-only

A site's defenses change. If today's validation says datacenter works and a re-run tomorrow says datacenter blocked, both outcomes are evidence. Same pattern as llm_evals (T48): every run inserts a row, the latest non-errored row per (scan_id, endpoint_id) is "current."

Schema
CREATE TABLE validation_runs (
  id                  uuid          PRIMARY KEY DEFAULT gen_random_uuid(),
  scan_id             uuid          NOT NULL REFERENCES scans(id),
  endpoint_id         text          NOT NULL,
  status              text          NOT NULL,   -- 'complete' | 'errored' | 'partial'
  created_at          timestamptz   NOT NULL DEFAULT now(),
  duration_ms         int           NOT NULL,
  validation_data     jsonb         NOT NULL,   -- the rich blob from T51.8
  error_message       text
);

CREATE INDEX idx_validation_runs_scan_endpoint
  ON validation_runs (scan_id, endpoint_id, created_at DESC);
Read path
SELECT DISTINCT ON (scan_id, endpoint_id) *
FROM validation_runs
WHERE scan_id = $1 AND status <> 'errored'
ORDER BY scan_id, endpoint_id, created_at DESC;
Acceptance
  • Alembic migration creates the table + index
  • Re-running validation on a scan inserts a new row, doesn't overwrite
T51.8

LLM payload compaction + Validation section in report

Surface · ~2 days
Two derived projections over validation_data

The persisted blob is the source of truth; the LLM and the report each read a slim derived view. Nothing recomputes from raw attempts at read time — the projections are materialized during validation and stored as separate JSONB keys on the same row (llm_view, report_view).

LLM payload (~250 tok / endpoint, was ~4,600)
{
  "endpoint_id": "ep_017",
  "method": "GET",
  "best": {
    "library": "requests",
    "proxy_tier": "datacenter",
    "elapsed_ms": 320
  },
  "library_matrix": {
    "requests":     {"datacenter": "ok:200",
                       "residential": "skipped"},
    "httpx":        {"datacenter": "ok:200",
                       "residential": "skipped"},
    "curl_cffi":    {"datacenter": "ok:200",
                       "residential": "skipped"},
    "cloudscraper": {"phase": "B_skipped"}
  },
  "min_required_headers": ["User-Agent", "Referer"],
  "headers_tested": 5, "headers_skipped": 12,
  "cookies": "warmup_required",
  "rate_limit": {
    "safe_delay_s": 0.75,
    "proxy_rotation_mode": "static",
    "trigger": "429@0.2s/round3"
  },
  "bandwidth": {"per_1k_requests_usd": 0.102}
}
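As an illustration of how the compact library_matrix might be projected from the full attempt log: the "ok:200" and "skipped" strings come from the payload above, while the "blocked:<status>" form, the function name, and the last-attempt-wins rule for repeated samples are assumptions. The cloudscraper Phase B entry is left out of the sketch.

```python
from typing import Dict, Iterable, Sequence


def build_library_matrix(
    attempts: Iterable[dict],
    libraries: Sequence[str] = ("requests", "httpx", "curl_cffi"),
    tiers: Sequence[str] = ("datacenter", "residential"),
) -> Dict[str, Dict[str, str]]:
    """Collapse Phase A attempts into one short status string per cell."""
    # Every cell starts as "skipped" and is overwritten by any attempt seen.
    matrix = {lib: {tier: "skipped" for tier in tiers} for lib in libraries}
    for attempt in attempts:
        lib, tier = attempt["library"], attempt["proxy"]
        if lib in matrix and tier in matrix[lib]:
            prefix = "blocked" if attempt.get("is_block") else "ok"
            matrix[lib][tier] = f"{prefix}:{attempt['status']}"
    return matrix
```

Collapsing each attempt to a short string is what keeps the per-endpoint payload small: the verbose attempt fields stay in validation_data and never reach the prompt.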
Report section — Validation

Mock below. Same identity as the rest of the report; rendered server-side from report_view.

04 / Report mock

What the user actually sees.

A four-stat summary panel at the top; one validation card per Bucket A endpoint; raw attempts behind a per-endpoint evidence accordion. Each card has four blocks: library × proxy matrix, header strip, cookie scenarios, rate-limit chart. All rendered live from the new report_view JSON.

REPORT · VALIDATION SECTION · MOCK
Endpoints validated: 4 / 4 passed
Best library × proxy: requests · datacenter
Worst rate delay: 0.75 s
Cost / 1 k req: $0.102
Validated 2026-05-13 22:14Z · 102 reqs · 28 s wall-clock · 1.9 MB total · spot-check, not production-scale
Endpoint · GET /orchestra/pdp/graphql/ItemById/<id>
ep_017 · Bucket A · ~53 KB response · winner: requests + datacenter
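Library | datacenter | residential | winner | latency
requests | ✓ 200 | skipped | winner | 320 ms
httpx | ✓ 200 | skipped | | 290 ms
curl_cffi | ✓ 200 | skipped | | 410 ms
cloudscraper | not run (Phase B skipped) | | |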
REQUIRED: User-Agent · Referer
DROPPED · tested: Origin · Content-Type · X-Trace-Id
SKIPPED · tier: Accept-Encoding · sec-fetch-* · sec-ch-ua-* · DNT · 9 more
COOKIES: warmup_required · point-in-time
Rate-limit probe · requests + datacenter · static IP
2.0s/round 5 / 5 pass
0.5s/round 5 / 5 pass
0.2s/round 429 at req 3
→ Safe production delay: 0.75 s (last_safe × 1.5)
Raw attempts · 28 entries
[
  {"phase": "A", "lib": "requests",  "proxy": "datacenter",
   "status": 200, "elapsed_ms": 320, "req_bytes": 1840, "resp_bytes": 54120},
  {"phase": "A", "lib": "httpx",     "proxy": "datacenter",
   "status": 200, "elapsed_ms": 290, "req_bytes": 1840, "resp_bytes": 54120},
  // ... 26 more
]
05 / Persisted schema

One blob, two projections.

Every validation run inserts one row into validation_runs with the full attempt log in validation_data. Two derived JSONB keys — llm_view and report_view — are computed during validation and stored alongside. Reads never recompute.

Full validation_data shape (one endpoint, expanded)
{
  "endpoint_id": "ep_017",
  "url_template": "/orchestra/pdp/graphql/ItemById/<id>",
  "method": "GET",
  "validation_started_at": "2026-05-13T22:14:08Z",
  "validation_duration_ms": 24310,
  "proxy_status": "both_tiers_configured",    // "direct_ip" | "datacenter_only" | "residential_only" | "both_tiers_configured"

  "library_compare": {
    "attempts": [
      {"phase": "A", "library": "requests", "proxy": "datacenter",
       "status": 200, "elapsed_ms": 320, "is_block": false,
       "request_bytes": 1840, "response_bytes": 54120},
      {"phase": "A", "library": "httpx", "proxy": "datacenter",
       "status": 200, "elapsed_ms": 290, "is_block": false,
       "request_bytes": 1840, "response_bytes": 54120}
      /* ... all attempts, including residential samples + cloudscraper if Phase B fired */
    ],
    "passing_libraries": [
      {"library": "requests",  "working_tier": "datacenter"},
      {"library": "httpx",     "working_tier": "datacenter"},
      {"library": "curl_cffi", "working_tier": "datacenter"}
    ],
    "best_library": "requests",
    "best_proxy_tier": "datacenter"
  },

  "header_reduction": {
    "tested_headers":  ["Referer", "Origin", "X-Trace-Id", "Content-Type", "Accept"],
    "skipped_by_tier": ["Accept-Encoding", "Cache-Control", "sec-ch-ua", /* 9 total */],
    "per_header": {
      "Referer": {
        "required_under": [{"library":"requests","proxy":"datacenter"}],
        "optional_under": [{"library":"curl_cffi","proxy":"datacenter"}],
        "attempts": [ /* per-attempt log */ ]
      }
      /* ... one entry per tested header */
    }
  },

  "cookie_dependency": {
    "per_scenario": { /* cold | warmup | full */ },
    "minimum_scenario": { "requests": "warmup_required", /* per library */ },
    "caveat": "point-in-time; anti-bot cookies have TTL"
  },

  "rate_limit": {
    "library_used": "requests",
    "proxy_used": "datacenter",
    "proxy_rotation_mode": "static",
    "rounds": [
      {"delay_s": 2.0, "requests_fired": 5, "passes": 5, "fails": 0,
       "avg_latency_ms": 412, "trigger": null},
      {"delay_s": 0.5, "requests_fired": 5, "passes": 5, "fails": 0},
      {"delay_s": 0.2, "requests_fired": 5, "passes": 3, "fails": 2,
       "trigger": "http_429_at_req_3"}
    ],
    "estimated_safe_delay_s": 0.75,
    "last_safe_round_delay_s": 0.5
  },

  "bandwidth_summary": {
    "request_count": 28,
    "bytes_sent_total": 51520,
    "bytes_received_total": 1379240,
    "avg_request_bytes": 1840,
    "avg_response_bytes": 49258,
    "per_1k_requests_mb": 51.1,
    "per_1k_requests_usd": 0.102      // 51.1 MB × $2.00/GB (datacenter rate)
  },

  "warnings": []
}
06 / Caveats + risks

What this still doesn't prove.

Five honest limits baked into the design. Each is surfaced in the report copy so users know what they're looking at.

Limit | Why it matters | How we surface it
Spot-check, not validation-at-scale | ~100 requests over 30s ≠ 50,000 requests / hour from a sticky IP | Top-of-section banner with the request count + timestamp; synthesis prompt gets validation_scope: "spot_check"
Rotating-residential rate-limit is structurally noisy | Each request is a different IP; target can't rate-limit a single source it never sees again | proxy_rotation_mode = "rotating" flag + measurement caveat in the JSON; synthesis prompt told to recommend ≥ 1.5s regardless
Cookie state is point-in-time | Anti-bot cookies have TTLs; "cold works" today may need warmup in 5 minutes | caveat field on the cookie result; pill in the report's header strip
Direct-IP fallback when proxies unset | Local CLI runs validate from the developer's home IP; production scrape will differ | Yellow banner at the top of the Validation section; proxy_status = "direct_ip" in JSON
Single-attempt header reduction is noisy | One failed removal request promoted to "required" could be flaky, not a true requirement | Cascade naturally re-tests with a different library; cross-library agreement on "required" raises confidence
07 / Rollout

Flag, ship, verify, flip.

1. Land T51.1–7 behind BROWSER_RECON_USE_T51_VALIDATION=1. Default off. Legacy code path stays. (~5 days)
2. Re-validate the two Walmart scans with the flag on; persist validation_runs rows alongside the legacy results. (A/B)
3. Compare new library_matrix + bandwidth.per_1k_requests_usd against today's synthesis output. Sonnet's recommendation should change from "residential (inferred)" to whichever tier actually passed. (verify)
4. Ship T51.8 (compaction + report). This is the flip: the flag becomes the default; legacy code path scheduled for removal in the following release. (~2 days)
5. Remove BROWSER_RECON_USE_LEGACY_VALIDATION path after 1 production cycle without rollback. (cleanup)
08 / Resolutions

All decisions closed.

No remaining open questions. Recording the closed decisions so engineering doesn't relitigate them at ticket time.

Question | Resolution | Applied at
A · Per-tier proxy pricing? | $2.00/GB datacenter, $3.00/GB residential. Split into two env vars. | T51.1 env list · T51.6 cost formula
B · Bucket B endpoints: proxy ladder, and inherit scan winner? | Yes proxy. Inherit scan-level (library, proxy_tier) winner from Bucket A; verify with one request per B endpoint; mini-fan-out only on verification failure. | T51.2 "Bucket B endpoints" sub-section
C · Partial-completion semantics? | Keep partial. Validation runs ship with explicit per-step status (complete / partial / errored) so downstream consumers know which sub-results to trust. | T51.7 schema (status column already supports it)
D · Cloudscraper Phase B: fall through to direct IP if both proxies block? | No. Result isn't actionable for production scraping; endpoint is marked needs_better_proxy_or_browser. | T51.2 Phase B definition