System Map / 2026-05-19 / v0.3.8 / internal reference

browser-recon

A complete inventory of every feature the tool provides, and the path each one takes through the code.

Codebase: ~59k LOC
CLI / Server: 11.7k / 47.3k
Runtime: Python 3.11+
Stack: FastAPI · PG · S3
LLM: Anthropic · OpenAI · xAI
TL;DR
A thin CLI launches Chrome on the user's machine, captures real network traffic via CDP, then ships the blob to a FastAPI server that runs an eight-stage pipeline — detection, analysis, intent confirmation, bucket filtering, proxy-validated request firing, secret scrubbing, LLM synthesis, and HTML rendering — returning a verified scraping plan with runnable starter code. All proprietary logic lives server-side; the CLI ships ~130 KB of capture and polling glue.
01 CLI Client browser_recon/

The thin client.

A capture-and-poll harness. Launches Chrome via the DevTools Protocol, streams network/DOM/interaction events into a single blob, uploads to the server, then renders a 9-step progress spinner until the report URL is ready.

Subcommands

Command | What it does | Notes
recon scan <url> | Launch ephemeral Chrome, capture network + DOM + interactions via CDP, upload, render live progress, open report. | Costs 1 scan credit on first LLM call.
recon login [--api-key] | Validate rec_live_… key via GET /auth/me, persist to ~/.browser-recon/config.toml. | File chmod 0o600.
recon list | Paginated scan history (table or JSON). | --limit, --offset.
recon stop | Drop .stop sentinel to gracefully stop an in-progress capture. | Watched by cdp_monitor.
recon retry-section | Re-run a failed synthesis subsection (synthesis / notes / difficulty_drivers). | Free, capped 3× per section.
recon rerun-stage | Re-run filter/synthesis stage (free) or full pipeline (1 credit). |
recon llm-eval | Side-by-side comparison of models on captured fixtures. | Admin / T46-T48.

Capture pipeline

launch Chrome (chrome_launcher.py)
→ CDP subscribe + inject (cdp_monitor.py)
→ stream events (requests · cookies · DOM · interactions · websockets)
→ fetch full cookies (Network.getAllCookies)
→ gzip + upload (POST /scans/{id}/capture)
Notes: user browses freely · multi-tab supported · password fields masked client-side · anti-bot cookies truncated to 50-char preview · scan_id pre-allocated
CDP capture path · client-side preprocessing happens before the wire

What gets captured

CAP · 01

Requests

URL, method, headers, post body, status, response headers, response body (truncated to max_response_kb), redirect chain.

cdp_monitor.py · CapturedRequest
CAP · 02

Cookies

Name, domain, path, flags. value_preview truncated to 50 chars; full value lives in _full_value (local-only, stripped by server scrubber).

capture/__init__.py
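
A minimal sketch of the cookie preview split described in CAP · 02, plus the later server-side strip. The helper names and the dict-shaped cookie record are assumptions; only value_preview, _full_value and the 50-char limit come from the description above.

```python
def preview_cookie(cookie: dict, max_chars: int = 50) -> dict:
    """Replace the raw value with a truncated preview plus a local-only full copy."""
    value = cookie.get("value", "")
    out = {k: v for k, v in cookie.items() if k != "value"}
    out["value_preview"] = value[:max_chars]
    out["_full_value"] = value  # needed later for validation / replay
    return out


def strip_full_value(cookie: dict) -> dict:
    """Scrubber side: clear the preview and drop the full value (copy, no mutation)."""
    out = dict(cookie)
    out.pop("_full_value", None)
    out["value_preview"] = ""
    return out
```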
CAP · 03

Interactions

Clicks, inputs, scrolls, submits. Selector + text label + ARIA. Password / CVV / SSN patterns masked in injected JS before leaving Chrome.

analysis/interactions.py
CAP · 04

DOM snapshots

Full outer_html + meta tags + element counts. Only collected in --mode full.

cdp_monitor.py
CAP · 05

WebSocket frames

URL + sent/received frame payloads. The scrubber drops all frames before the scrubbed blob is persisted (frames can carry auth tokens).

cdp_monitor.py
CAP · 06

Local detections

Anti-bot, auth-flow, rate-limit signals computed during capture for warning UX. Server re-runs richer detectors after upload.

browser_recon/detection/

Live progress polling

After upload the CLI polls GET /scans/{id}/events?since=<cursor> on a 1.5s cadence. A rich.live.Live table renders nine steps with ✓ / ✗ / spinner glyphs. Polling stops at terminal status — or pauses at awaiting_confirmation, the post-flow-confirm gate where the user must confirm or abandon before phase 2 spends credits.

# 9-step pipeline rendered in the CLI
capture  detection  analysis  flow_confirm  [GATE]
intent_filter  validation  scrub  synthesis  render
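
The cursor-based polling described above can be sketched as follows. The response shape (events, cursor, status), the terminal status names, and the injected fetch callable are all assumptions made for illustration; the real CLI issues GET /scans/{id}/events?since=<cursor> and renders a table rather than collecting a list.

```python
import time
from typing import Callable

TERMINAL = {"complete", "failed"}        # assumed terminal status names
GATE = "awaiting_confirmation"           # pause point before phase 2


def poll_events(
    fetch: Callable[[int], dict],
    interval_s: float = 1.5,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[str, list[dict]]:
    """Poll until a terminal status or the confirmation gate is reached.

    `fetch(cursor)` is assumed to return
    {"events": [...], "cursor": <int>, "status": <str>}.
    """
    cursor = 0
    seen: list[dict] = []
    while True:
        batch = fetch(cursor)
        seen.extend(batch["events"])
        cursor = batch["cursor"]          # only ask for newer events next time
        if batch["status"] in TERMINAL or batch["status"] == GATE:
            return batch["status"], seen
        sleep(interval_s)
```

Passing fetch and sleep in as parameters keeps the loop testable without a server or real delays.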
02 Server Orchestration browser_recon_server/pipeline_orchestrator.py

Two phases, one gate.

The pipeline runs as a FastAPI BackgroundTask and splits at a user-confirmation boundary. Phase 1 produces a flow verdict and pauses. Phase 2 spends the bulk of the LLM and proxy budget — but only after the user confirms the captured flow actually matches their intent.

Phase 1 (post-upload · ~5-15s)
  detection: anti-bot · auth · pagination
  analysis: inventory · buckets · deps
  flow_confirm: LLM · matches_intent?
  awaiting_confirmation: user must confirm
POST /scans/{id}/confirm
Phase 2 (after confirm · ~60-180s)
  intent_filter: LLM · bucket A/B/C
  validation: real proxies · real HTTP
  scrub: PII / secrets stripped
  synthesis: LLM · combined T16
  replay: live verify
  render: Jinja → HTML
Legend: deterministic · LLM call · human gate · terminal
Per-step session_scope() · events stream incrementally to CLI poll · ThreadPoolExecutor for parallel LLM calls
pipeline_orchestrator · run_pipeline_phase1 / run_pipeline_phase2

Database tables

Group | Table | Purpose
identity | users | Email, role (super_admin/admin/user/guest), tier, credits, waitlist status.
identity | api_keys | One active key per user; SHA-256 hash stored, 9-char preview, last_used_at.
identity | magic_link_tokens | 32-char tokens (hashed). Single-use via UPDATE…WHERE consumed_at IS NULL. 15-min TTL.
identity | dashboard_sessions | HttpOnly cookie backing. 7d sliding expiry, 30d hard cap.
scans | scans | UUID PK. intent_text + clarifying_qa KMS-encrypted. Holds status, bucket_assignment, flow_confirm_result, rerun_counts, expires_at.
scans | reports | 12-char base32 ID. findings JSONB (detection, analysis, validation). synthesis JSONB. Mirrored cost/confidence columns.
scans | scan_events | One row per step transition. Drives CLI live progress polling.
scans | findings | Per-report detection rows: kind, vendor, severity, confidence, evidence JSON.
scans | validation_runs | T51.8: per-endpoint validation result (LLM/report projections).
llm | llm_calls | One row per call: model, provider, token counts, cache read/write tokens, cost, latency, S3 prompt+response paths, retry_count.
llm | llm_experiments | Admin sandbox runs (T23), isolated from llm_calls so cost cards stay clean.
commerce | subscriptions | Stripe state.
commerce | waitlist | Email queue + tweet templates linkage.
commerce | processed_stripe_events | Idempotency guard (event ID as PK).
admin | audit_log | Append-only: scan_view_admin, scan_delete, role_change.
admin | app_config | Singleton-shaped key/value (theme color, public handle, feature flags). 60s read-through cache.
admin | tweet_templates | Curated X copy for waitlist users.
misc | feedback | Per-claim user ratings; daily-unique on (report, claim, ip_hash).

Persistence model

Blob storage

S3 (post-fix #8). Gzipped JSON. Optional KMS server-side encryption via BROWSER_RECON_KMS_KEY_ARN.

Encryption

KMS-backed wrap of Scan.intent_text and each clarifying_qa[*].answer (T14). Plaintext fallback for dev. Decryption on read via load_intent_text().

Session boundaries

Each pipeline step opens a fresh session_scope(): one transaction per step, not per pipeline. ScanEvent rows are therefore committed incrementally, so the CLI poll sees progress in real time.
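
A minimal sketch of the one-transaction-per-step pattern, using sqlite3 as a stand-in for the real database layer. The run_steps helper and the scan_events column names are assumptions; only session_scope() and the per-step commit behaviour come from the description above.

```python
import sqlite3
from contextlib import contextmanager
from typing import Callable, Iterable


@contextmanager
def session_scope(conn: sqlite3.Connection):
    """Commit on success, roll back on error: one transaction per use."""
    try:
        yield conn
        conn.commit()
    except Exception:
        conn.rollback()
        raise


def run_steps(conn: sqlite3.Connection, steps: Iterable[tuple[str, Callable[[], None]]]) -> None:
    for name, step_fn in steps:
        # A fresh scope per step: events from earlier steps are already
        # committed (and visible to pollers) if a later step fails.
        with session_scope(conn) as session:
            step_fn()
            session.execute(
                "INSERT INTO scan_events (step, status) VALUES (?, 'done')", (name,)
            )
```

With a single pipeline-wide transaction, a failure in a late step would roll back every event and the poller would never have seen any of them.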

Concurrency

BackgroundTask + ThreadPoolExecutor for parallel LLM calls. Row-level SELECT FOR UPDATE on Postgres for rerun safety.

03 Analysis Server analysis_server/

Inventory and classify.

Transforms a raw capture into a structured endpoint inventory: site shape, framework hints, deduplicated endpoint groups with response shapes, dependency edges, and four evidence signals for the LLM bucket filter.

Module | Computes | Algorithm
architecture.py | Site shape | Decision ladder over api_count vs page_count (after noise filtering). Labels: page_based / spa / api_driven / hybrid.
framework.py | Front-end framework hints | Regex on HTML markers: __NEXT_DATA__, data-reactroot, ng-version, Vue SSR, <meta name="generator">.
noise.py | Tracker filtering | ~60-domain blocklist (GA, Hotjar, Segment, LinkedIn pixels, consent managers) + OPTIONS preflight drop.
url_templating.py | Endpoint dedup | Collapses ID-shaped path segments → <id>. Heuristics: numeric, hex (8+), UUID, email, long-mixed (≥16 chars + letters + digits + ≥8 unique).
bucket_signals.py | 4 evidence signals | timing (ms to next data call), response size (pixel flag at 100B threshold), set-cookie names (privacy-safe), downstream consumption (count of later endpoints reusing this one's cookies, via exact-name match with boundary anchors).
bucket_filter.py | Post-LLM guardrails | Coverage check (default-to-C) · first-party scoping (eTLD+1 OR ≥5-request host) · upstream-of-A promotion (BFS over dependency graph).
dependency_chain.py | Value-flow edges | Walk response → request value matches. Prune generics (booleans, short strings, <1000 numbers). Cap 100 edges.
flow_segmentation.py | Click → XHR groups | T5: 1.5s correlation window, first-come-first-served claim, unclaimed → background_requests.
headers.py | CORS + replay headers | Per-base-URL CORS summary. Replay-headers fix: prefer captured-cookie inventory over post-scrubbing empty Cookie: header.
response_summary.py | Per-MIME summary | 500-char cap. JSON: top-level keys + leaf placeholders. HTML: title + JSON-LD types + element counts. Safe to run pre-scrub.
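
The url_templating.py heuristics can be sketched as below. The regexes and function names are illustrative assumptions; only the five segment classes (numeric, hex of 8+ chars, UUID, email, long-mixed) and the <id> placeholder come from the table.

```python
import re

_UUID = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$", re.I
)
_HEX = re.compile(r"^[0-9a-f]{8,}$", re.I)
_EMAIL = re.compile(r"^[^@/\s]+@[^@/\s]+\.[^@/\s]+$")


def is_id_segment(seg: str) -> bool:
    """True if a path segment looks like an identifier rather than a route name."""
    if seg.isdigit():
        return True
    if _UUID.match(seg) or _HEX.match(seg) or _EMAIL.match(seg):
        return True
    # Long-mixed: >=16 chars, has letters and digits, >=8 distinct characters.
    return (
        len(seg) >= 16
        and any(c.isalpha() for c in seg)
        and any(c.isdigit() for c in seg)
        and len(set(seg)) >= 8
    )


def template_path(path: str) -> str:
    """Collapse ID-shaped segments so /products/1 and /products/2 dedupe."""
    return "/".join(
        "<id>" if seg and is_id_segment(seg) else seg for seg in path.split("/")
    )
```

For example, template_path("/api/products/12345/reviews") yields "/api/products/<id>/reviews", while "/static/app.js" is left unchanged.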

Bucket taxonomy

Bucket | Meaning | Positive signals | Guardrail
A | Primary user data | JSON >500B · consumed by later endpoints · timing <1s to next data call | eTLD+1 must match target or host had ≥5 captured requests.
B | Session / auth prereq | Sets cookies that A later carries · moderate timing · consumed by few | Any C upstream of A is promoted to B.
C | Noise (trackers, pixels) | Pixel-sized response · fire-and-forget · downstream consumption = 0 | Default for anything the LLM dropped.
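
The upstream-of-A promotion guardrail can be sketched as a breadth-first walk backwards from every bucket-A endpoint. This is an illustration under assumptions: edges are taken to be (producer, consumer) pairs from the dependency graph, and promote_upstream is a hypothetical name.

```python
from collections import deque


def promote_upstream(
    buckets: dict[str, str], edges: list[tuple[str, str]]
) -> dict[str, str]:
    """Promote every C endpoint that feeds (directly or transitively) an A endpoint to B."""
    parents: dict[str, set[str]] = {}
    for producer, consumer in edges:
        parents.setdefault(consumer, set()).add(producer)

    result = dict(buckets)
    queue = deque(ep for ep, bucket in buckets.items() if bucket == "A")
    visited = set(queue)
    while queue:
        node = queue.popleft()
        for parent in parents.get(node, ()):
            if parent in visited:
                continue
            visited.add(parent)
            # Endpoints the LLM never assigned default to C (coverage check).
            if result.get(parent, "C") == "C":
                result[parent] = "B"
            queue.append(parent)
    return result
```

An endpoint that only feeds trackers stays in C; one whose token ends up in a data request is kept as a prerequisite.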
04 Detection Server detection_server/

Anti-bot fingerprinting.

Eight rule modules run in sequence; each produces unified Findings with confidence-weighted evidence. Cookie domain scoping (RFC 6265 + eTLD+1) is the lever that prevents third-party CDN false positives.

Anti-bot vendor coverage

Vendor | Severity | Evidence patterns | Note
Cloudflare | tier 0–6 | cf-ray, __cf_bm, cf_clearance, cf-mitigated, /cdn-cgi/challenge-platform/, 503+body markers | Bot Fight Mode is a heuristic: __cf_bm alone without challenge JS.
DataDome | tier 0–4 | x-datadome, datadome_device_id, "blocked by datadome" body, captcha URL on DD origin | /captcha/ narrowed to DataDome origin only; bare substring was false-positive-heavy.
Akamai BM | presence | _abck, ak_bmsc, bm_sz, bm_sv, x-akamai-session-info | Cookie + header scoped to primary eTLD+1.
PerimeterX | presence | _pxhd, _pxvid, _px3, _pxde, host px-cdn.net |
Imperva / Incapsula | presence | visid_incap_*, incap_ses_*, nlbi_*, x-iinfo, /_incapsula_resource |
Kasada | presence | kpsdk cookie |
Arkose Labs | presence | Vendor-specific URL/script markers | Split from Kasada in Tier A fixes (separate vendors).

Confidence weights

Each piece of evidence contributes a weight; Finding.confidence = min(1.0, Σ weights).

0.30

cookie / html_body

Vendor-specific cookie names (__cf_bm) or body markers (challenge page HTML).

0.25

header / js_file

Vendor headers (cf-mitigated) or distinctive JS file paths.

0.15

url_pattern

Vendor domain in any request URL.

0.10

interaction

User had to solve a challenge.

0.05

status_code · timing · other

Weak signals — 429, suspicious response timing, fallbacks.
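
The weighting scheme above amounts to a capped sum. A minimal sketch, assuming evidence is passed as a list of kind strings (the function and constant names are hypothetical; the weights are the ones listed above):

```python
WEIGHTS = {
    "cookie": 0.30,
    "html_body": 0.30,
    "header": 0.25,
    "js_file": 0.25,
    "url_pattern": 0.15,
    "interaction": 0.10,
}
WEAK_WEIGHT = 0.05  # status_code, timing, and any other kind


def finding_confidence(evidence_kinds: list[str]) -> float:
    """Finding.confidence = min(1.0, sum of per-evidence weights)."""
    total = sum(WEIGHTS.get(kind, WEAK_WEIGHT) for kind in evidence_kinds)
    return min(1.0, total)
```

A vendor cookie plus a vendor header gives 0.30 + 0.25 = 0.55; four cookie-level signals already saturate at 1.0.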

Other detectors

Detector | Signal types | How
auth_flow.py | 5 types | login_endpoint · bearer_token · oauth_redirect (tightened to reject ad-tech) · www_authenticate · token_refresh.
pagination.py | 3 patterns | query_param (≥2 distinct values), cursor, Link: rel="next" header.
rate_limit_signals.py | 2 paths | Status 429 OR *RateLimit-Limit header (legacy X- + IETF draft variants).
graphql.py | 2 paths | "graphql" substring in URL OR JSON body with top-level query key.
block_pages.py | 8 vendor signatures | 10 Cloudflare body markers, plus DataDome / PerimeterX / Imperva / Kasada / Akamai / CAPTCHA / generic access-denied. Confidence as float [0,1].
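
As one example of how small these detectors can be, the two rate-limit paths can be sketched as follows. The signal labels and the suffix match on the header name are assumptions about how *RateLimit-Limit is interpreted.

```python
def rate_limit_signals(status: int, headers: dict[str, str]) -> list[str]:
    """Return the rate-limit signals present on one response."""
    signals = []
    if status == 429:
        signals.append("status_429")
    # Header names are case-insensitive; the suffix match covers both the
    # legacy X-RateLimit-Limit and the IETF-draft RateLimit-Limit spellings.
    if any(name.lower().endswith("ratelimit-limit") for name in headers):
        signals.append("ratelimit_header")
    return signals
```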
05 Validation Server validation_server/

Measure, don't guess.

Real HTTP requests through real proxies. The single feature that makes the tool's recommendation grounded in what worked rather than what an LLM expected to work. Worst case: ~75-100 requests per endpoint, with a mandatory 1.5s delay between every fire.

Cascade architecture

STEP 0 · endpoint selection: ≤5 endpoints · GET only · templated + shape dedup
STEP A · library_compare: requests · httpx · curl_cffi × datacenter + residential
  FAIL → ESCALATE to Phase B fallback: cloudscraper (JS challenge) · 4-15s per challenge
STEP B · header_reduce: drop each header · re-test · 5 in-flight parallel waves
  2-axis cascade: lib × proxy combos · cap 6 attempts per header
STEP C · cookie_dependency: cold · warmup · full · 3 scenarios in parallel
  2-axis cascade: lib × proxy combos · cap 6 attempts per scenario
STEP D · rate_limit_probe: delays [2.0, 0.5, 0.2]s · 5 req/round · 3 rounds
  recovery probe: wait min(Retry-After, 60s) · one verify request
verified plan: (best_library, passing_libraries, best_proxy_tier, required_headers, cookie_scenario, safe_delay_s) → feeds synthesis prompt + starter code
Validation cascade · 1.5s mandatory delay between every request · ~75-100 req/endpoint worst case

Per-validator detail

VAL · A

library_compare

3 libraries × 2 proxy tiers, run in parallel. Datacenter = sticky, 1 IP, 1 request. Residential = rotating, 3 samples (quality: 3/3 pass, 2/3 flaky, ≤1/3 blocked). Best library is picked by preference order, not latency.

library_compare.py
VAL · B

header_reduce

Wave-parallel header-drop test, 5 in-flight. Always-required: User-Agent. Always-optional: sec-fetch-*, sec-ch-ua*, encoding hints. Test set: 4-8 headers, each escalated through lib×proxy cascade.

header_reduce.py
VAL · C

cookie_dependency

3 scenarios fire in parallel: cold (no cookies), warmup (GET homepage → harvest Set-Cookie → replay), full (user-supplied). cookies_required = minimum_scenario != cold.

cookie_dependency.py
VAL · D

rate_limit_probe

Graduated rounds at decreasing delays. Triggers: http_429, block detected, RateLimit-Remaining: 0, throttled (3× baseline latency). Safe delay = prev × 1.5 + 50% margin. Caveat surfaced if winning proxy rotates IPs.

rate_limit_probe.py
VAL · E

proxy_test

If user supplied --proxies, validate each against the first passing endpoint with the winning library.

proxy_test.py
VAL · F

replay

Ground-truth pass after synthesis — recommended endpoints through recommended library to confirm advice works in practice.

replay.py
Block detection is shared across all validators via detect_block() in the detection server. Returns (is_block, block_type, confidence). Types: cloudflare_challenge, datadome_challenge, perimeterx_challenge, akamai_challenge, captcha, rate_limit, forbidden, not_found.
06 Scrubber scrubber/

Strip the secrets before anything else reads the blob.

Runs after validation — validation needs real cookies and auth headers to test against the live site, but the persisted blob (and everything the LLM ever sees) must be sanitised. Deep-copies capture.raw; never mutates it.

Scrub scope by field

Field | Rule | Reason
URL path segments | ID-shaped → <id> | Heuristic: contains @, UUID, 32+ hex, or len ≥16 + letters + digits + ≥8 unique.
Query params | 57-name allowlist → <scrubbed> | token, email, key, password, cc_*, ssn, cvv, iban, passport, etc.
URL fragment | Dropped entirely | Client-only, may hold OAuth implicit tokens.
Headers | Value-only scrub (names preserved) | 27-name allowlist + suffix policy (-token, -auth, -key, -secret, -bearer) + T35.4 token-shape heuristic (≥40 chars matching token regex).
JSON body | Recursive shape replacement | str → "<string:N>" (length kept), bool/num → placeholders, lists capped at 3 elements, parse-fail → drop.
HTML body | Strip text + values | Truncate 100KB. Strip <script> body + value="" + <p>/<span>/<div> text. Keep title and h1-h3.
Binary bodies | Dropped | Images, fonts, opaque payloads.
Cookies | Clear preview + pop _full_value | Full value used only locally for validation/replay (T34.1).
Interactions | Clear input_value | Form fields, sensitive user-typed data.
WebSocket frames | All dropped | Can carry auth tokens, session bootstrap.

Token-matching rule

Both pattern and field name are tokenized on [_\-.\s]+, lowercased, and require a contiguous token sequence match.

# MATCHES
password in client_password               # contiguous
cc_number in customer_cc_number_field     # contiguous

# DOES NOT MATCH
password in passport                      # 'passport' ≠ 'password' token (no substring matching)
cc_number in cc_field_number              # tokens not contiguous
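
The rule can be sketched in a few lines. The function names are hypothetical; the separator class and the contiguous-sequence requirement are the ones stated above.

```python
import re

_SEPARATORS = re.compile(r"[_\-.\s]+")


def tokenize(name: str) -> list[str]:
    """Lowercase and split on underscores, hyphens, dots, and whitespace."""
    return [tok for tok in _SEPARATORS.split(name.lower()) if tok]


def matches(pattern: str, field_name: str) -> bool:
    """True if the pattern's tokens appear as a contiguous run in the field's tokens."""
    p, f = tokenize(pattern), tokenize(field_name)
    if not p:
        return False
    return any(f[i:i + len(p)] == p for i in range(len(f) - len(p) + 1))
```

Matching whole tokens rather than substrings is what keeps password from firing on passport while still catching client_password.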
Rules version history

v1 → v2: Added set-cookie header (response cookies carry session tokens).
v2 → v3: Expanded identity headers (X-User-ID, X-Customer-ID, X-Session-ID) + suffix policy.
v3 → v4: PII-shaped X-* headers (X-Forwarded-For, X-Real-IP, X-Device-ID) + token-shape heuristic.
07 LLM Layer llm/ + prompts/

Provider-agnostic, observability-instrumented.

Prefix-based routing dispatches every prompt to the right provider. The dispatcher records cost, tokens, latency, and S3 prompt dumps per call. Combined T16 synthesis collapsed 3 overlapping 50-89K-token calls into 1.

Routing

Model prefix | Provider | Cache style
claude-* | AnthropicProvider | Explicit cache_control markers. Separate rates for cache_write_5m (1.25× input) and cache_read (0.1× input).
gpt-* · o1 · o3 | OpenAICompatibleProvider | Auto-cache. Discounted cached_input rate; no separate write cost.
grok-* | OpenAICompatibleProvider (xAI base_url) | Same as OpenAI compat.
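
Prefix-based routing of this kind is typically a short lookup. A minimal sketch, assuming a ROUTES table and a route function (both hypothetical names); the provider class names are the ones in the table, returned here as strings with a label for which base URL applies.

```python
# (model prefix, provider name from the routing table, base-URL label)
ROUTES = [
    ("claude-", "AnthropicProvider", "default"),
    ("gpt-", "OpenAICompatibleProvider", "default"),
    ("o1", "OpenAICompatibleProvider", "default"),
    ("o3", "OpenAICompatibleProvider", "default"),
    ("grok-", "OpenAICompatibleProvider", "xai"),
]


def route(model: str) -> tuple[str, str]:
    """Return (provider, base-URL label) for the first matching prefix."""
    for prefix, provider, base in ROUTES:
        if model.startswith(prefix):
            return provider, base
    raise ValueError(f"no provider registered for model {model!r}")
```

Adding a provider that speaks the OpenAI wire format then costs one more row plus its base URL, which fits the small footprint reported for the Grok integration.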

Model selection cascade: env BROWSER_RECON_LLM_MODEL_<PROMPT_NAME> (per-prompt override) → BROWSER_RECON_LLM_MODEL (global) → DEFAULT_MODEL.

Prompts

Prompt | Stage | Output shape
intent_clarifier | Pre-capture | {needs_clarification, clarifying_question, inferred_data_kind, confidence} · 3-round loop, 1 credit on first call.
flow_confirm | Phase 1 gate | {flow_summary, matches_intent, mismatch_reason, closest_match_intent, confidence}
intent_filter | Phase 2 step 4 | Bucket A/B/C lists + rationale.
scan_synthesis | Phase 2 step 7 | Combined T16: {recommendation, verdict, starter_code} in one call.
recommendation · verdict · starter_code | Legacy path | Same shapes split across 3 calls. Active when BROWSER_RECON_USE_LEGACY_SYNTHESIS=1.
notes | Phase 2 step 7 (parallel) | Implementation notes.
difficulty_drivers | Phase 2 step 7 (parallel) | Per-request difficulty assessment.
Why combined synthesis (T16): pre-T16 made 3 separate Sonnet calls with ~50-89K overlapping tokens — Staples smoke test cost ~45¢ instead of the ~12-15¢ target. Combined collapses to a single call. Legacy modules kept on disk; flip back via BROWSER_RECON_USE_LEGACY_SYNTHESIS=1.

Observability

Every LLMClient.run() writes an llm_calls row when scan_id is set: model, provider, input/output tokens, cache read/write tokens, USD cost, latency, error class, retry count, and S3 paths to dumped prompt + response. Silently no-ops in test entry points. One row per run(); retry_count captured on re-attempt (1-shot retry on LLMParseError).

08 Web Layer routes/ · auth/ · renderers/

Two auth models, one user.

CLI uses bearer API keys (stateless). Web uses HttpOnly session cookies with sliding expiry and deterministic CSRF. Magic-link is the only entry point — passwords never exist.

Auth flows

CLI

recon login --email ...
  ↓
POST /auth/login  # {email, client_id}
  ↓
[email] magic link
  ↓
GET /auth/callback?token=...
  ↓
server stashes API key
  (in-memory · 5min TTL)
  ↓
CLI polls POST /auth/poll
  ↓
key written to config.toml
  (chmod 0o600)

Web

GET /dashboard
  ↓
302 /dashboard/login
  ↓
POST /dashboard/login
  ↓
[email] magic link
  ↓
GET /auth/callback?token=...
  ↓
session cookie set
  (HttpOnly · 7d sliding · 30d hard cap)
  ↓
302 /dashboard
  ↓
CSRF token = HMAC(session, secret)
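
A deterministic CSRF token of the form HMAC(session, secret) can be sketched with the standard library. SHA-256 as the digest, hex encoding, and the function names are assumptions; the source only states that the token is an HMAC over the session with a server secret.

```python
import hashlib
import hmac


def csrf_token(session_id: str, secret: bytes) -> str:
    """Derive the token from the session, so nothing extra is stored server-side."""
    return hmac.new(secret, session_id.encode(), hashlib.sha256).hexdigest()


def verify_csrf(session_id: str, secret: bytes, submitted: str) -> bool:
    """Constant-time comparison against the recomputed token."""
    return hmac.compare_digest(csrf_token(session_id, secret), submitted)
```

Because the token is recomputed on every request, it stays valid for the lifetime of the session and needs no storage of its own.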

RBAC matrix

Role | Owns scans | Admin UI | Role mutation | Stripe events
super_admin: read
admin: read-only, read
user: own only
guest

Routes

Router | Notable endpoints
auth.py | POST /auth/signup, POST /auth/login (waitlist-gated), GET /auth/callback, POST /auth/poll (CLI key pickup), GET /auth/me
scans.py | POST /scans (gzipped blob upload), POST /scans/{id}/capture, POST /scans/{id}/confirm, POST /scans/{id}/rerun-prompt, DELETE /scans/{id}
intent.py | POST /scans/draft, POST /scans/{id}/intent, POST /scans/{id}/intent/answer, POST /scans/{id}/intent/skip
reports.py | GET /reports/{id}, GET /r/{id} (alias). 410 if expired.
dashboard.py | GET /dashboard, /dashboard/scans, /dashboard/data, /dashboard/scans/{id}/confirm (flow gate UI)
api_keys.py | One-shot key reveal after Post/Redirect/Get.
admin.py · admin_t55.py | KPI dashboard, cross-user scan list, waitlist queue, tweet templates, theme config, user tier/credit mutation.
evals.py | Admin eval matrix (T48): run batch, rerun cell, status poll.
public_pages.py | / · /pricing · /user-guide · /waitlist · POST /waitlist/join · POST /waitlist/tweet
stripe_webhook.py | POST /webhooks/stripe: sig-verified, idempotent. checkout.session.completed → +10 credits + email API key.
feedback.py | Anonymous per-claim ratings.

Email

Provider abstraction (email/service.py). Configured via BROWSER_RECON_EMAIL_PROVIDER.

Templates live under templates/emails/ as Jinja2 with inline CSS.

09 End-to-End Flow

One full scan, from recon scan to a report URL.

$ recon scan https://walmart.com --template products
   
   ├─ credit pre-flight check (advisory; server is source of truth)
   ├─ ASCII banner · suppressed under --quiet
   ├─ intent clarifier loop  # LLM · 1 credit on first call
        # 3-round cap · skip refund if before first LLM call
   ├─ launch Chrome → CDP monitor  # ephemeral profile by default
        # user browses · clicks · navigates · Ctrl+C to stop
   ├─ fetch full cookies · gzip · upload  # POST /scans/{id}/capture
   
SERVER · Phase 1  # BackgroundTask, ~5-15s
   ├─ detection      # anti-bot, auth, pagination, rate-limit signals
   ├─ analysis       # architecture, framework, inventory, buckets, deps
   └─ flow_confirm   # LLM · status = awaiting_confirmation

CLI  # renders verdict, awaits user confirm/abandon
   
POST /scans/{id}/confirm
   
SERVER · Phase 2  # BackgroundTask, ~60-180s
   ├─ intent_filter  # LLM · Bucket A/B/C + guardrails
   ├─ validation     # real HTTP · real proxies · ~75-100 req/endpoint
                     # library_compare → header_reduce → cookies → rate_limit
   ├─ scrub          # PII/secrets stripped, blob re-uploaded to S3
   ├─ synthesis      # LLM · combined T16 (rec + verdict + code) + notes + drivers
   ├─ replay         # best-effort live verification through recommended library
   └─ render         # Jinja → HTML cached
                       # status=complete · expires_at = retention_for(tier)
   
CLI opens report URL  # /reports/{base32_id}
10 Architecture Decisions

Eight choices that shape the system.

DEC · 01

Thin CLI

All detection, validation, scrubbing, synthesis runs server-side. CLI ships no proprietary heuristics, no proxy credentials, no prompts. ~130KB installed.

DEC · 02

Two-phase gate

Phase 1 ends at awaiting_confirmation. A paid synthesis call is only made after the user confirms the captured flow actually matches intent. User can abandon before most of the budget is spent.

DEC · 03

Validate before scrub

Validation runs on the raw capture (full cookies + auth headers) so it tests against ground truth. The scrubber runs afterwards, so the persisted blob and everything the LLM sees contain no secrets.

DEC · 04

Pre-scrub-safe analysis

Bucket signals and response summaries compute only safe-to-leak shapes (names, counts, lengths). They survive scrubbing intact and don't depend on raw values.

DEC · 05

Cookie scoping via eTLD+1

Anti-bot and bucket guardrails apply RFC 6265 leading-dot + eTLD+1 matching. Eliminates most third-party CDN false positives.

DEC · 06

Per-step DB sessions

One session_scope() per pipeline step, not per pipeline. ScanEvent rows are committed incrementally, so CLI polling sees progress in real time.

DEC · 07

Combined T16 synthesis

3 overlapping 50-89K-token Sonnet calls collapsed into 1. Cost target ~12-15¢/scan (down from ~45¢ pre-T16). Legacy 3-call path kept on disk behind env flag.

DEC · 08

Modular LLM routing

Provider protocol + prefix-based dispatcher. Adding xAI Grok was ~30 lines (provider config + routing entry). Core client untouched.