Task T53 · v0.3.0 release plan · 2026-05-13 · spec draft

From bundled monolith to thin client.

The customer's machine should run only what physically must run there: Chrome and its capture listener. Everything else — detection, analysis, validation, scrubbing, synthesis, rendering — moves server-side. The CLI ships to PyPI as version 0.3.0 and weighs roughly a tenth of what 0.2.x did.
Version: 0.3.0
Distribution: PyPI · public
Install: pipx install browser-recon
Effort: ~5 days
Status: Spec · ready

One ship moves five things at once: secrets, IP, weight, UX, distribution.

Proprietary code (validation, analysis, detection, scrubbing) moves to the server. Operator secrets (proxy URLs) move to AWS Secrets Manager — never the customer's shell. The CLI shrinks to capture + auth + upload + poll. A new scan_events table feeds a live progress display on the CLI. The whole thing publishes to PyPI as browser-recon==0.3.0, installable via pipx install browser-recon.

CLI install size: ~20 MB → ~2 MB
Operator env-var reads on client: 6 → 0
Customer progress feedback: silent → live
Sub-tasks · sequenced: 9
01 / Rationale

Three problems, one architectural fix.

Problem today: Operator proxy credentials need to live in the customer's shell for validation to work: the CLI's _proxy_env() reads os.environ.get("DATACENTER_PROXY").
Surfaced when: Reviewing the Staples 2026-05-13 scan: library_comparison had no datacenter sub-dict because the customer's shell didn't export the var. Setting it there would leak vendor credentials.
Fixed by: Validation moves to the server, reads from AWS Secrets.

Problem today: The wheel ships browser_recon_server/ source by mistake ([tool.hatch.build.targets.wheel] packages lists both).
Surfaced when: Any customer can read your synthesis prompts, scoring rules, detection-vendor heuristics, and admin DB query code.
Fixed by: Drop browser_recon_server from the wheel; the moved subdirs of browser_recon/ also stop shipping.

Problem today: The CLI is silent for 60–120 seconds during the scan because all the work is local but synchronous, and once we move it server-side there's nothing visible at all.
Surfaced when: Every scan. The synchronous "spinner that just spins" UX is acceptable when local; becomes a support magnet once server-side.
Fixed by: scan_events table + GET /scans/<id>/events polling + a rich-driven live spinner per pipeline step.
02 / What changes

Customer-side shrinks; server-side grows.

Same product, different surface. The CLI exits the data-processing business and becomes a capture-and-upload client with a progress display. Everything that does real work — and everything that contains your business logic — runs server-side.

Today · v0.2.5

Bundled monolith

~20 MB install with proprietary detection rules, validation logic, scrubbing rules. Operator secrets travel through the customer's shell.

  • browser_recon/capture/ — Chrome + CDP
  • browser_recon/cli/ — login, prompts
  • browser_recon/validation/ — proxy + library cascade
  • browser_recon/analysis/ — bucket signals, endpoint inventory
  • browser_recon/detection/ — anti-bot rules
  • browser_recon/scrubber.py — cookie/auth scrubbing
  • browser_recon/reporting/markdown_local.py
  • leaked browser_recon_server/ — entire server source

Thin client

~2 MB install, only the non-proprietary glue. Server holds every secret and every rule.

  • browser_recon/capture/ (unchanged): Chrome + CDP
  • browser_recon/cli/ (+poll): login, prompts, polling spinner
  • browser_recon/client.py (new): thin httpx wrapper
  • validation/analysis/detection/scrubber → server
  • markdown_local.py → dropped (T52 download HTML replaces it)
  • browser_recon_server/ → stops shipping
03 / Architecture

One upload, async pipeline, polled progress.

[Figure: architecture diagram]
Client machine (~2 MB install, no secrets, no proprietary code): Chrome (CDP) capture; recon CLI (prompts, auth, poll); rich spinner (live progress).
Client to server: POST /capture, then GET /events (poll).
Server (Render): FastAPI router plus a background task. Each pipeline step calls emit_event:
  7. detection (~50 ms, deterministic)
  8. analysis (~200 ms, deterministic)
  9. intent_filter (~21 s, Claude Sonnet)
  10. validation, T51 cascade (~25 s, HTTP + proxies)
  11. scrub (~50 ms)
  12-13. synthesis + notes + drivers (~40 s, Sonnet + grok-3-mini)
  14. render report (~30 ms, Jinja)
Persistence: scans (Postgres, scan rows as JSONB); scan_events (pipeline timeline, new); validation_runs (append-only, T51.7); S3 + KMS (raw blobs); AWS Secrets (proxy creds, new; supplies proxy URLs to validation runs).
Fig. 1 — Thin client uploads once, polls events. All processing is server-side.

Step-by-step request flow

  1. Customer runs recon scan https://staples.com. CLI launches Chrome, monitors via CDP, accumulates the capture blob locally. (~30–120s capture)
  2. Customer hits Ctrl+C. CLI bundles the capture and sends POST /scans/<id>/capture with the raw payload (cookies + auth headers still real). Server returns 202 Accepted + an events_url. (~1s upload)
  3. Server kicks off the background task, which immediately emits a scan_events row (scan_id, step='detection', status='started'). (< 100ms)
  4. CLI polls GET /scans/<id>/events?since=<cursor> every 1.5s. New events update the rich live spinner per step. (background)
  5. Pipeline walks the 8 stages (detection → analysis → intent_filter → validation → scrub → synthesis + notes + drivers → render). Each stage emits started and complete events. Errors emit errored + message. (~60–120s total)
  6. Final render: complete event fires. CLI stops polling, prints the report URL, exits 0. (done)
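
The polling half of this flow (steps 4 and 6) could look roughly like the sketch below. It is an illustration, not the shipped client. `fetch_events` is a hypothetical callable standing in for the HTTP call to GET /scans/<id>/events, and the response shape ({"events": [...], "cursor": ...}) is assumed from the T53.1 endpoint notes. Treating render's complete event as the stop signal is also an assumption.

```python
import time
from typing import Callable, Iterator, Optional

# Hypothetical response shape: {"events": [...], "cursor": "<ISO timestamp>"}
FetchEvents = Callable[[Optional[str]], dict]


def poll_events(fetch_events: FetchEvents,
                interval: float = 1.5,
                sleep: Callable[[float], None] = time.sleep) -> Iterator[dict]:
    """Yield events until the final render completes or any step errors."""
    cursor: Optional[str] = None
    while True:
        page = fetch_events(cursor)
        for event in page["events"]:
            yield event
            if event["status"] == "errored":
                return
            if event["step"] == "render" and event["status"] == "complete":
                return
        # Keep the old cursor when the server returned nothing newer.
        cursor = page.get("cursor") or cursor
        sleep(interval)
```

Injecting `sleep` keeps the loop testable without waiting 1.5 s per round; the real CLI would pass nothing and get time.sleep.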
04 / Sub-tasks

Nine changes, sequenced.

Sequenced so each step is independently shippable. T53.1 lands first because it's purely additive (live progress for current scans). T53.6 (wheel trim) is the last code change before the PyPI release.

T53.1

scan_events table + emit helper + events endpoint

Polling · ~0.5 day
Schema
CREATE TABLE scan_events (
  id           uuid         PRIMARY KEY DEFAULT gen_random_uuid(),
  scan_id      uuid         NOT NULL REFERENCES scans(id) ON DELETE CASCADE,
  step         text         NOT NULL,
  status       text         NOT NULL,    -- 'started' | 'complete' | 'errored'
  message      text,                       -- human-readable progress text
  metadata     jsonb,                      -- optional: endpoint counts, partials
  created_at   timestamptz  NOT NULL DEFAULT now()
);

CREATE INDEX idx_scan_events_scan_id_time ON scan_events (scan_id, created_at);
Helper
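def emit_event(session, scan_id, step, status="started",
               message="", metadata=None):
    # Record one pipeline event for the CLI's progress display.
    # If ScanEvent is a SQLAlchemy declarative model, `metadata` is a reserved
    # attribute name, so the column needs a differently named model attribute.
    session.add(ScanEvent(
        scan_id=scan_id, step=step, status=status,
        message=message, metadata=metadata or {},
    ))
    # flush() sends the INSERT but does not commit. A polling request on a
    # separate session only sees the row once the caller commits.
    session.flush()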
Endpoint
@router.get("/scans/{scan_id}/events")
def get_events(scan_id: UUID, since: datetime | None = None):
    """Return events strictly newer than `since`, ordered by created_at ASC.

    `since` is the ISO timestamp of the last event the CLI received.
    The response includes a `cursor` (latest created_at) so the CLI can advance.
    """
Acceptance
  • Alembic migration creates the table + index
  • Existing pipeline (today's synchronous synthesis path) calls emit_event at every step boundary; at minimum: detection, analysis, intent_filter, validation, synthesis, render
  • Endpoint returns the new events; since is exclusive
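
The "since is exclusive" contract can be pinned down with a small in-memory sketch. It is not the endpoint implementation: `Event` and `events_since` are hypothetical names, and the real query would run against scan_events.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Event:  # hypothetical in-memory stand-in for a scan_events row
    step: str
    status: str
    created_at: datetime


def events_since(events: list[Event], since: Optional[datetime]):
    """Return (new_events, cursor) with `since` treated as exclusive."""
    ordered = sorted(events, key=lambda e: e.created_at)
    fresh = [e for e in ordered if since is None or e.created_at > since]
    # With nothing new, hand back the caller's cursor unchanged.
    cursor = fresh[-1].created_at if fresh else since
    return fresh, cursor
```

One caveat with a timestamp cursor: two events sharing the same created_at as the cursor would be skipped. In Postgres, now() returns the transaction start time, so rows inserted in one transaction share a timestamp. A monotonically increasing sequence column would avoid that; whether it is needed here depends on how the pipeline commits.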
T53.2

Port proprietary code from CLI to server

Foundation · ~1 day
What moves
Source → Destination
browser_recon/validation/ → browser_recon_server/validation_server/
browser_recon/analysis/ → browser_recon_server/analysis_server/
browser_recon/detection/ → browser_recon_server/detection_server/
browser_recon/scrubber.py → browser_recon_server/scrubber.py

Same code, new import paths. Keep browser_recon/validation/ etc. on disk during the transition so nothing breaks; delete them in T53.6 once the new endpoint (T53.3) is verified end-to-end. Run the existing test suite (3,636 tests) against the copies under their new module names — most tests pass as-is with an import update.

Acceptance
  • Server can import + call the moved modules at their new paths
  • Existing CLI still works (legacy code path intact)
  • Tests duplicated under tests/unit/server_* all green
T53.3

POST /scans/<id>/capture endpoint

New surface · ~0.5 day
Request shape
POST /scans/{scan_id}/capture
Authorization: Bearer rec_live_...
Content-Type: application/json

{
  "target_url": "https://walmart.com",
  "intent_text": "Need to search product, get data and reviews.",
  "starter_template": "products",
  "capture": { /* raw browser capture blob, unscrubbed */ }
}
Response (202 Accepted)
{
  "scan_id": "82f42438-...",
  "status": "queued",
  "events_url": "/scans/82f42438-.../events",
  "report_url": "/r/<slug>"            // known eagerly; only valid after render
}
Server-side flow
  • Validate auth + payload shape
  • Insert scans row with status='processing'
  • Put raw blob in S3 (KMS-encrypted) under captures/<scan_id>.json.gz
  • Queue background task that calls the orchestrator (T53.4)
  • Return 202 with the events_url (and the eager report_url); total latency under 500 ms
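
Two of the server-side steps above, payload-shape validation and building the S3 object, can be sketched with the standard library alone. This assumes all four top-level request fields are required and that the blob is stored as gzip-compressed JSON, as the .json.gz key suggests. `validate_payload` and `capture_object` are hypothetical names, and the actual S3 upload is left out.

```python
import gzip
import json
from uuid import UUID

# Assumed to be mandatory; taken from the request shape above.
REQUIRED_FIELDS = ("target_url", "intent_text", "starter_template", "capture")


def validate_payload(payload: dict) -> list[str]:
    """Return the required top-level fields missing from the request body."""
    return [name for name in REQUIRED_FIELDS if name not in payload]


def capture_object(scan_id: UUID, capture: dict) -> tuple[str, bytes]:
    """Build the S3 key and gzip-compressed body for a raw capture blob."""
    key = f"captures/{scan_id}.json.gz"
    body = gzip.compress(json.dumps(capture).encode("utf-8"))
    return key, body
```

An empty list from validate_payload means the shape check passed; anything else would map to a 4xx response before the scans row is inserted.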
T53.4

Pipeline orchestrator

Glue · ~0.5 day

One function that walks the 8 stages in order, emitting events at each boundary. Pure composition over the modules T53.2 ported.

def run_pipeline(scan_id: UUID, session: Session) -> None:
    capture = load_capture_from_s3(scan_id)

    with step(session, scan_id, "detection"):
        detection = run_detection(capture)
        save_detection(session, scan_id, detection)

    with step(session, scan_id, "analysis"):
        analysis = run_analysis(capture, detection)
        save_analysis(session, scan_id, analysis)

    with step(session, scan_id, "intent_filter"):
        buckets = run_intent_filter(analysis, capture.intent_text)
        save_buckets(session, scan_id, buckets)

    with step(session, scan_id, "validation"):
        proxies = load_proxy_secrets()           # T53.5: AWS Secrets
        validation = run_validation(buckets, proxies)
        save_validation(session, scan_id, validation)

    with step(session, scan_id, "scrub"):
        scrubbed = scrub_capture(capture)
        save_scrubbed_blob(session, scan_id, scrubbed)

    with step(session, scan_id, "synthesis"):
        synthesis, notes, drivers = run_synthesis_orchestrator(
            scan_id, scrubbed, detection, buckets, validation,
        )
        save_synthesis(session, scan_id, synthesis, notes, drivers)

    with step(session, scan_id, "render"):
        render_report(session, scan_id)

    mark_complete(session, scan_id)

The step context manager emits started on enter, complete on clean exit, errored with the exception message on any failure (and re-raises so the scan marks as failed).
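
A minimal sketch of that context manager, assuming the emit_event helper from T53.1. To keep the example runnable without a database, emit_event is replaced here by a stand-in that appends to a list; `EVENTS` is purely illustrative.

```python
from contextlib import contextmanager

EVENTS: list[tuple] = []  # stand-in sink so the sketch runs without a database


def emit_event(session, scan_id, step, status="started", message=""):
    # Simplified stand-in for the T53.1 helper: records instead of inserting.
    EVENTS.append((scan_id, step, status, message))


@contextmanager
def step(session, scan_id, name):
    """Emit started on enter, complete on clean exit, errored on failure."""
    emit_event(session, scan_id, name, "started")
    try:
        yield
    except Exception as exc:
        emit_event(session, scan_id, name, "errored", message=str(exc))
        raise  # re-raise so the caller can mark the scan as failed
    else:
        emit_event(session, scan_id, name, "complete")
```

Because the exception is re-raised, run_pipeline stops at the first failing stage and later stages never emit started.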

T53.5

AWS Secrets Manager for proxy credentials

Security · ~0.5 day
What changes

Currently library_compare.py reads DATACENTER_PROXY and RESIDENTIAL_PROXY from environment variables. After the port (T53.2), the server-side copy reads them from AWS Secrets Manager via the existing boto3 setup.

Implementation
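import json
from functools import lru_cache

import boto3

# AWS_REGION and PROXY_SECRET_ARN are assumed to come from server settings.

@lru_cache(maxsize=1)
def _proxy_secrets() -> dict[str, str]:
    # Cached for the process lifetime; lru_cache has no expiry of its own.
    client = boto3.client("secretsmanager", region_name=AWS_REGION)
    resp = client.get_secret_value(SecretId=PROXY_SECRET_ARN)
    return json.loads(resp["SecretString"])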

def load_proxy_secrets() -> dict[str, str | None]:
    secrets = _proxy_secrets()
    return {
        "datacenter":  secrets.get("DATACENTER_PROXY"),
        "residential": secrets.get("RESIDENTIAL_PROXY"),
        "dc_rotation": secrets.get("DATACENTER_PROXY_ROTATION_MODE", "sticky"),
        "res_rotation": secrets.get("RESIDENTIAL_PROXY_ROTATION_MODE", "rotating"),
    }
IAM

Server's task role gets secretsmanager:GetSecretValue for the proxy secret ARN. Render env now needs only AWS credentials (or OIDC integration if available).

Acceptance
  • Proxy URLs no longer present in Render env
  • load_proxy_secrets() returns the same shape _proxy_env() returns today
  • Secret rotation works without code redeploy (functools.lru_cache has no TTL, so this needs a time-bounded cache or a manual _proxy_secrets.cache_clear() on SIGHUP)
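
functools.lru_cache never expires entries on its own, so the rotation criterion needs either a manual flush or a time-bounded cache. One possible shape for the latter is sketched below. `TTLCached` is a hypothetical helper, not part of the existing code, and the 300-second TTL in the usage note is an arbitrary example value.

```python
import time
from typing import Callable, Optional


class TTLCached:
    """Cache one zero-argument loader's result for `ttl` seconds."""

    def __init__(self, loader: Callable[[], dict], ttl: float,
                 clock: Callable[[], float] = time.monotonic):
        self._loader, self._ttl, self._clock = loader, ttl, clock
        self._value: Optional[dict] = None
        self._expires_at = 0.0

    def get(self) -> dict:
        now = self._clock()
        if self._value is None or now >= self._expires_at:
            self._value = self._loader()      # e.g. the Secrets Manager read
            self._expires_at = now + self._ttl
        return self._value

    def flush(self) -> None:
        """Force the next get() to reload (e.g. from a SIGHUP handler)."""
        self._value = None
```

Wrapping the Secrets Manager read as TTLCached(loader, ttl=300) would bound staleness after a rotation to five minutes while keeping calls far below the rate limit.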
T53.6

Strip moved packages from the CLI wheel

Breaking · ~5 min
pyproject.toml
[tool.hatch.build.targets.wheel]
packages = ["browser_recon"]   # was: ["browser_recon", "browser_recon_server"]
Filesystem
rm -rf browser_recon/validation
rm -rf browser_recon/analysis
rm -rf browser_recon/detection
rm browser_recon/scrubber.py
rm browser_recon/reporting/markdown_local.py

Update browser_recon/cli/main.py to remove imports of the deleted modules. The CLI now physically cannot do validation locally.

Version bump
[project]
name = "browser-recon"
version = "0.3.0"   # was 0.2.5
Acceptance
  • rye build produces a wheel with browser_recon_server absent from RECORD
  • Test suite still green after the deletions (tests for the moved modules now live under tests/unit/server_*)
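
The first acceptance check can be automated, since a wheel is a zip archive. The sketch below lists any members that sit under a path that should no longer ship. `leaked_paths` and `FORBIDDEN` are hypothetical names; the prefixes come from the T53.6 deletions above.

```python
import zipfile
from typing import BinaryIO, Union

# Paths that must not appear in the 0.3.0 wheel (from the T53.6 list).
FORBIDDEN = (
    "browser_recon_server/",
    "browser_recon/validation/",
    "browser_recon/analysis/",
    "browser_recon/detection/",
    "browser_recon/scrubber.py",
)


def leaked_paths(wheel: Union[str, BinaryIO],
                 forbidden: tuple[str, ...] = FORBIDDEN) -> list[str]:
    """List archive members that sit under any forbidden path prefix."""
    with zipfile.ZipFile(wheel) as archive:
        return [name for name in archive.namelist()
                if name.startswith(forbidden)]
```

A release script could fail the build when leaked_paths on the built wheel under dist/ returns a non-empty list.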
T53.7

Thin CLI: send raw capture, poll events, live spinner

UX · ~0.5 day
Three changes
  1. Replace local validation with capture upload. After capture finishes, instead of running validate() locally, the CLI calls POST /scans/<id>/capture with the raw blob.
  2. Add a polling loop that calls GET /scans/<id>/events?since=<cursor> every 1.5 seconds, advancing the cursor each round.
  3. Render a rich.live.Live spinner per step. Each step has a status (pending · running · complete · errored) and a message line.
Mock — what the customer sees
$ recon scan https://walmart.com
browser-recon: scanning https://walmart.com
mode: quick · output: /home/.../walmart.com/2026-05-13T...

Pick a starter template:
> products
Describe what data you want to scrape:
> Need to search product, get data and reviews.

Launching Chrome... browse the site, Ctrl+C when done.
[capture] 936 requests · 58 cookies · 72 endpoints · 106s
Uploading capture... done. scan_id=7ef2f689

detection      complete  2 anti-bot vendors detected
analysis       complete  38 endpoints categorized
intent_filter  complete  Bucket A: 4 · B: 8 · C: 26
validation     running   Validating 4 Bucket A endpoints (2/4) · 18s
scrub          pending
synthesis      pending
render         pending
Acceptance
  • CLI emits no requests besides POST /capture and GET /events after capture
  • Spinner reflects every step transition within 1.5–3 s
  • Errored steps print the error message inline, then the CLI exits with code 1
  • --no-progress flag still runs the scan to completion without the spinner (CI use)
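
The per-step display in change 3 comes down to folding the event stream into one status per step. The sketch below shows that fold plus a plain-text rendering. The rich.live.Live wiring is deliberately left out to keep it dependency-free, the started → running mapping is an assumption, and `fold_events` and `render_lines` are hypothetical names.

```python
STEPS = ["detection", "analysis", "intent_filter", "validation",
         "scrub", "synthesis", "render"]

# Event status (from scan_events) -> display status shown next to each step.
DISPLAY = {"started": "running", "complete": "complete", "errored": "errored"}


def fold_events(events: list[dict]) -> dict[str, tuple[str, str]]:
    """Reduce an ordered event list to {step: (display_status, message)}."""
    board = {name: ("pending", "") for name in STEPS}
    for event in events:
        status = DISPLAY.get(event["status"], "pending")
        board[event["step"]] = (status, event.get("message") or "")
    return board


def render_lines(board: dict[str, tuple[str, str]]) -> list[str]:
    """Format one aligned text line per step, in pipeline order."""
    return [f"{name:<14}{status:<10}{message}".rstrip()
            for name, (status, message) in board.items()]
```

The same board could feed the --no-progress path: print a line only when a step's status changes instead of redrawing the whole block.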
T53.8

PyPI publishing setup

Distribution · ~0.5 day
One-time setup
  1. Create a PyPI account at pypi.org/account/register/
  2. Enable 2FA (required for publishing)
  3. Generate an API token scoped to the browser-recon project (after first manual upload, you can scope it; for the very first upload you need an account-wide token)
  4. Save the token to a password manager and to GitHub Secrets as PYPI_API_TOKEN
  5. (Optional) Create a TestPyPI account at test.pypi.org for dry-run uploads
Build + upload (local)
rye build                              # produces dist/browser_recon-0.3.0-py3-none-any.whl + .tar.gz
twine check dist/*                      # validates metadata before upload
twine upload dist/*                     # prompts for token; uploads to PyPI

# Dry-run via TestPyPI first if you want:
twine upload --repository testpypi dist/*
pipx install --index-url https://test.pypi.org/simple/ browser-recon==0.3.0
Automated release (GitHub Actions)

One workflow file at .github/workflows/release.yml triggers on a version tag, runs the test suite, builds the wheel, and pushes to PyPI:

name: release
on:
  push:
    tags: ["v*"]
jobs:
  publish:
    runs-on: ubuntu-latest
    permissions:
      id-token: write          # for PyPI trusted publishing
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.11" }
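      # rye is installed with its own installer script rather than pip
      - run: curl -sSf https://rye.astral.sh/get | RYE_INSTALL_OPTION="--yes" bash
      - run: echo "$HOME/.rye/shims" >> "$GITHUB_PATH"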
      - run: rye sync && rye run pytest -q
      - run: rye build
      - uses: pypa/gh-action-pypi-publish@release/v1
        with:
          packages-dir: dist/
Trusted publishing (preferred over tokens)

PyPI supports OIDC-based trusted publishing from GitHub Actions — no API token needed. Register your repo + workflow in the PyPI project settings under "Publishing". This is the modern recommended approach.

What customers run
pipx install browser-recon              # preferred — isolated venv
recon login                             # enter API key
recon scan https://walmart.com          # go
Acceptance
  • pipx install browser-recon works on a fresh Linux/Mac/Windows machine
  • recon --version reports 0.3.0
  • Wheel size < 5 MB (target ~2 MB)
  • Tag-driven release pipeline works end-to-end
T53.9

Cutover + back-compat

Migration · ~0.5 day
Versioning headers

Every request from the CLI sends User-Agent: browser-recon/0.3.0 so the server can log version distribution and gracefully reject ancient clients with a clear "please upgrade" message.
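
On the server, the version gate could be as small as the sketch below. The floor value, the decision to let unrecognized User-Agents through, and the names `client_version` and `gate_status` are all assumptions made for illustration.

```python
import re
from typing import Optional

MIN_SUPPORTED = (0, 3, 0)   # assumed floor once legacy clients are retired
_UA = re.compile(r"^browser-recon/(\d+)\.(\d+)\.(\d+)")


def client_version(user_agent: str) -> Optional[tuple[int, ...]]:
    """Extract (major, minor, patch) from 'browser-recon/X.Y.Z', else None."""
    match = _UA.match(user_agent or "")
    return tuple(int(g) for g in match.groups()) if match else None


def gate_status(user_agent: str) -> int:
    """Return 410 for recognized clients below the floor, 200 otherwise."""
    version = client_version(user_agent)
    if version is not None and version < MIN_SUPPORTED:
        return 410
    return 200
```

The parsed tuple is also what the version-distribution log would count, so one helper can serve both the rejection path and the dashboard.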

Update notice on the CLI

CLI hits GET /version on startup. If its version is older than the latest released, prints a non-blocking one-liner:

browser-recon 0.3.0 (a newer version 0.3.1 is available; run `pipx upgrade browser-recon`)
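
A sketch of the comparison behind that notice, assuming plain numeric dotted versions; `parse_version` and `update_notice` are hypothetical names. Pre-release or local version strings would need a real version parser such as packaging.version.Version.

```python
from typing import Optional


def parse_version(text: str) -> tuple[int, ...]:
    """Turn '0.3.1' into (0, 3, 1); assumes purely numeric dotted versions."""
    return tuple(int(part) for part in text.strip().split("."))


def update_notice(current: str, latest: str) -> Optional[str]:
    """Return the one-line notice, or None when the CLI is up to date."""
    if parse_version(current) >= parse_version(latest):
        return None
    return (f"browser-recon {current} (a newer version {latest} is available; "
            "run `pipx upgrade browser-recon`)")
```

Comparing integer tuples rather than strings matters once a component reaches two digits: as strings, "0.10.0" sorts before "0.9.9".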
Legacy CLI back-compat

Customers on v0.2.x still call the old POST /scans/<id>/complete endpoint with their locally-computed validation result. Keep that endpoint alive for one release cycle (until v0.4.0). Customers who haven't upgraded by then get a 410 Gone with the upgrade message.

Migration messaging

One email to all active customers two weeks before deprecating the legacy endpoint. Dashboard banner for super-admin showing the version-distribution count.

05 / PyPI release · how it works end-to-end

From local commit to pipx install.

[Figure: release pipeline] 1. git tag v0.3.0 (local) → 2. git push --tags (GitHub Actions trigger) → 3. CI runs tests (rye sync + pytest) → 4. rye build (wheel + sdist) → 5. publish to PyPI (OIDC trusted publish) → pipx install ...
Fig. 2 — Tag → CI → PyPI → customer install

The minimal first-time setup

  1. PyPI account: register at pypi.org/account/register/; enable 2FA (required to publish).
  2. Reserve the name: first upload claims browser-recon (check pypi.org/project/browser-recon isn't taken; if it is, fall back to a scoped name like recon-scrapers or browser-recon-cli).
  3. Trusted publishing: in PyPI project settings > Publishing, register your GitHub repo + workflow filename. No API token to manage.
  4. Add the release workflow at .github/workflows/release.yml (see T53.8 for the YAML).
  5. Tag and push: git tag v0.3.0 && git push --tags. CI does the rest.
  6. Smoke-test: pipx install browser-recon on a clean machine; run recon --version.

Test the install before tagging

Build locally and install from the wheel before pushing the tag — fastest way to catch packaging bugs:

rye build
pipx install ./dist/browser_recon-0.3.0-py3-none-any.whl --force
recon --version
recon scan https://example.com

If the name is taken

Fall back to a scoped name. The Python package name and the CLI command name can differ — pipx install browser-recon-cli can still install a recon command. Adjust pyproject.toml's [project] name; [project.scripts] stays the same.

06 / Rollout sequence

Ship additive first, breaking last.

  1. T53.1 ships first. Pure additive: adds the events table + polling endpoint, wires emit_event into the existing pipeline. Customers immediately get the live spinner for current scans. Zero risk to the legacy flow. (Day 1)
  2. T53.2 + T53.3 + T53.4 + T53.5 ship together behind feature flag BROWSER_RECON_SERVER_PIPELINE=1. New endpoint exists; old endpoint still works. Internal scans test the server-side flow end-to-end. (Days 2-3)
  3. T53.6 + T53.7 + T53.8 ship together. CLI v0.3.0 wheel built, tested on a clean machine, published to PyPI. Server flag flips to default-on for new uploads. (Day 4)
  4. T53.9 cleans up: version-header logging, update-notice on CLI, dashboard banner for super-admin, email to active customers. (Day 5)
  5. One release cycle later (v0.4.0): remove the legacy POST /scans/<id>/complete endpoint and the back-compat shim. Customers still on 0.2.x get a 410 Gone with the upgrade message. (+ 2 weeks)
07 / Risks + open questions

What to think about before starting.

Risk: PyPI name browser-recon already taken.
Likelihood: Low (check now).
Mitigation: Fall back to browser-recon-cli or similar; CLI command name (recon) unchanged.

Risk: Server-side validation replays cookies the user captured locally. Those cookies were minted against the user's residential IP; firing them through a different proxy IP triggers 412 challenges (we saw this on Walmart).
Likelihood: High.
Mitigation: Known limit, surface it in the recommendation. Cookie warmup remains a user-instructed step in the starter code. No code change required, just accurate copy.

Risk: Render request timeout (100s) on the new POST /capture endpoint if the blob is large.
Likelihood: Medium.
Mitigation: The endpoint returns 202 immediately after S3 upload + DB insert; the long work is in the background task. Capture upload itself is ~1s for a 5 MB blob.

Risk: Background task failure leaves scan in processing forever.
Likelihood: Medium.
Mitigation: Container-restart cleanup sweep marks orphaned processing rows older than 10 minutes as errored on startup. Same pattern as T48's cleanup_stale_running_evals.

Risk: AWS Secrets Manager rate-limit on hot scan paths.
Likelihood: Low.
Mitigation: lru_cache the secret read for the process lifetime; 5,000 GetSecretValue/sec default limit is far above any conceivable scan rate.

Risk: Customer's ~/.recon/config.toml not migrated; the v0.2.x config format may differ.
Likelihood: Low.
Mitigation: Audit before shipping; if the schema differs, ship a one-time migration in v0.3.0 startup that reads the old shape + rewrites the new.
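
The cleanup sweep named as the mitigation for stuck processing scans is simple enough to sketch in memory. This is not T48's cleanup_stale_running_evals, only the same idea. The `updated_at` field and the dict-shaped rows are assumptions; the real sweep would be an UPDATE against the scans table.

```python
from datetime import datetime, timedelta

STALE_AFTER = timedelta(minutes=10)


def sweep_stale(scans: list[dict], now: datetime) -> list[str]:
    """Mark long-running 'processing' scans as errored; return their ids."""
    swept = []
    for scan in scans:
        age = now - scan["updated_at"]
        if scan["status"] == "processing" and age > STALE_AFTER:
            scan["status"] = "errored"
            swept.append(scan["id"])
    return swept
```

Running it at container startup matches the risk description; a healthy pipeline finishes in about two minutes, so the ten-minute threshold leaves a wide margin.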

Open decisions

A. SSE vs polling for the events stream. Polling is what's specced; SSE (Server-Sent Events) is the next-up option if 1.5s lag becomes noticeable. Easy to swap later: same auth, same endpoint shape, different transport. (default polling)
B. rich vs textual for the spinner. rich is the lighter dep (already widely used), gives us Live + spinners. textual is a full TUI framework, which is overkill here. (rich)
C. Synchronous fallback for scripted use. A --wait flag (default true) makes the CLI block on completion; a --no-wait returns immediately with the scan_id so CI scripts can fire-and-poll on their own schedule. (--wait default)
D. Capture size limit. Pre-T53 captures average 2–10 MB. Pick a server-side hard cap (suggest 50 MB) and a CLI-side warning at 25 MB. (50 MB hard)
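
If decision D lands as suggested, the size check is a three-way classification. The sketch below assumes binary megabytes and treats exactly 50 MB as still allowed; both are choices the spec leaves open, and `check_capture_size` is a hypothetical name.

```python
MB = 1024 * 1024
WARN_BYTES = 25 * MB      # CLI-side warning threshold suggested in decision D
HARD_CAP_BYTES = 50 * MB  # server-side hard cap suggested in decision D


def check_capture_size(size_bytes: int) -> str:
    """Classify a capture size as 'ok', 'warn', or 'reject'."""
    if size_bytes > HARD_CAP_BYTES:
        return "reject"
    if size_bytes >= WARN_BYTES:
        return "warn"
    return "ok"
```

The CLI would act on 'warn' before uploading, while the server enforces 'reject' regardless of what the client checked.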
08 / Customer install guide (preview)

Three commands and they're scanning.

pipx install browser-recon       # installs CLI in an isolated venv
recon login                       # enter your API key (saved to ~/.recon/config.toml)
recon scan https://walmart.com    # interactive scan with live progress

That's the entire customer-facing surface. No env vars to set. No proxy credentials to obtain. No Python packages to manage themselves (pipx isolates). Their machine runs Chrome and an HTTP client; everything else is yours.

If pipx isn't installed

python3 -m pip install --user pipx
python3 -m pipx ensurepath        # adds ~/.local/bin to PATH
pipx install browser-recon

Updating

pipx upgrade browser-recon

Uninstalling

pipx uninstall browser-recon