A redesign of the super-admin scan-debug surface — collapsing six tabs, a sandbox, and an eval matrix into a single per-step inspector with in-place re-run. Re-runs are isolated experiments; the original scan stays immutable.
scan_events, llm_calls, validation_runs) — no new persistence required. Re-runs always land as experiments; they never touch the original scan's rows.
GET /debug/scans/{id} with per-step Explorers at /debug/scans/{id}/steps/{idx}. The legacy tabbed debug page (?legacy=1) was retired in P5-T06.
Six tabs, a sandbox, an eval matrix, and an audit log all exist. They each work in isolation. But a Super Admin who wants to ask "what happened during this scan, step by step?" bounces between four routes and still can't see deterministic-step durations or per-validation-request details.
| Surface | What it does | Quality |
|---|---|---|
| /debug/scans/{id} | 6 tabs: overview · capture · buckets · llm calls · evals · audit. Per-tab views of Scan, Report, LlmCall, AuditLog, LlmEval. | comprehensive per-tab. No single timeline. |
| /admin/sandbox/... | Edit + re-run a single LLM prompt against a different model. Side-by-side diff. Daily $5 cap per admin. | scattered: requires navigating away from the scan. |
| /admin/evals/{scan_id} | Matrix: rows = prompts, columns = (model × provider). Each cell = one re-run. | undocumented: purpose unclear from the UI. |
| /admin/scans | Cross-user paginated scans list with filters + sort. | fine: entry point only. |
| /admin/dashboard | 8 KPI cards: volume, cost, success rate, top domains, etc. | fine: aggregate, not per-scan. |
Deterministic steps (detection, analysis, scrub, render) only show as event rows. Duration / S3 / inputs / outputs aren't tabular alongside LLM and validation steps.
The data is there — validation_data.library_comparison records status + elapsed_ms + body_preview + block detection per library × proxy — but the UI shows only the rolled-up endpoint summary.
Sandbox lives at a separate route. Switching models or tweaking a prompt means two tabs open.
Validation fires ~75-100 requests per endpoint but admins can't refire a specific one to check whether the site's defenses changed.
The evals tab has no in-context explanation. Admins have to read the route file to know it's a model-comparison matrix.
Costs live per LLM call but never roll up to "synthesis cost = X" or "this scan's LLM budget by step".
Every redesign decision answers to these three primitives.
A single table of every micro-step. Replaces the 6-tab layout. Includes deterministic steps, LLM calls, validation phases, human gates. One row per micro-step.
One inspector page that renders differently per step type (LLM / validation / deterministic / gate / I/O). Pulls S3 lazily. Surfaces everything we already store.
Every Explorer has a "Re-run" panel. Side-by-side original vs new. Always an experiment — never mutates the production scan. Sandbox + eval-matrix collapse into this.
Replaces the existing overview tab. Renders by joining scan_events (timing), llm_calls (model + cost + S3), and validation_runs (per-endpoint data) on scan_id, ordered by created_at. No new table, no schema change.
| # | STEP | TYPE | STARTED | DURATION | MODEL | COST | STATUS | S3 | |
|---|---|---|---|---|---|---|---|---|---|
| 01 | capture | io | 00:00.0 | 2m 14s | — | — | ok | scans/d97../capture.gz | explore → |
| 02 | detection | deterministic | 02:14.1 | 0.42s | — | — | ok | — | explore → |
| 03 | analysis | deterministic | 02:14.6 | 0.81s | — | — | ok | — | explore → |
| 04 | flow_confirm | llm | 02:15.5 | 3.21s | claude-sonnet-4-6 | $0.0084 | ok | llm-calls/d97../flow_confirm-.. | explore → |
| 05 | [gate] awaiting_confirmation | human | 02:18.7 | 22.4s | — | — | confirmed | — | explore → |
| 06 | intent_filter | llm | 02:41.2 | 4.55s | claude-sonnet-4-6 | $0.0142 | ok | llm-calls/d97../intent_filter-.. | explore → |
| 07 | validation | validation | 02:45.8 | 1m 32s | — | — | partial | 5 endpoints · 387 requests | explore → |
| 08 | scrub | deterministic | 04:17.9 | 0.18s | — | — | ok | scans/d97../scrubbed.gz | explore → |
| 09 | synthesis | llm | 04:18.1 | 14.6s | claude-sonnet-4-6 | $0.1483 | ok | llm-calls/d97../synthesis-.. | explore → |
| 10 | ↳ notes | llm | 04:18.1 | 3.8s | claude-haiku-4-5 | $0.0049 | ok | llm-calls/d97../notes-.. | explore → |
| 11 | ↳ difficulty_drivers | llm | 04:18.1 | 4.2s | claude-haiku-4-5 | $0.0061 | ok | llm-calls/d97../difficulty.. | explore → |
| 12 | replay | validation | 04:33.0 | 4.1s | — | — | ok | 5 endpoints | explore → |
| 13 | render | deterministic | 04:37.2 | 0.84s | — | — | ok | reports/{id}.html | explore → |
The trace shape is computable today. Step durations come from pairing started / complete rows on scan_events. LLM rows nest under their parent step by matching llm_calls.created_at to the enclosing step window. Validation phases collapse into one parent validation row with per-endpoint detail surfaced inside the Explorer.
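The pairing and nesting described above can be sketched in a few lines. This is a minimal illustration, not the production query: the row fields `step`, `status`, and `created_at` are assumed names for the relevant scan_events columns, and the rows are plain dicts rather than ORM objects.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class StepWindow:
    step: str
    started: datetime
    ended: datetime

    @property
    def duration_s(self) -> float:
        return (self.ended - self.started).total_seconds()


def pair_events(events: list[dict]) -> list[StepWindow]:
    """Pair each 'started' row with the next 'complete' row of the same step."""
    open_steps: dict[str, datetime] = {}
    windows: list[StepWindow] = []
    for ev in sorted(events, key=lambda e: e["created_at"]):
        if ev["status"] == "started":
            open_steps[ev["step"]] = ev["created_at"]
        elif ev["status"] == "complete" and ev["step"] in open_steps:
            started = open_steps.pop(ev["step"])
            windows.append(StepWindow(ev["step"], started, ev["created_at"]))
    return windows


def parent_step(call_created_at: datetime, windows: list[StepWindow]) -> Optional[str]:
    """Return the step whose window encloses an llm_calls row, if any."""
    for w in windows:
        if w.started <= call_created_at <= w.ended:
            return w.step
    return None
```

A step that has a started row but no complete row never produces a window, so it would have to be rendered from its status alone.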
Every Explore → opens the same URL shape (/debug/scans/{scan_id}/steps/{step_idx}) but the renderer branches on step type. The renderings differ; the chrome and Re-run panel are shared.
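One way to picture "same URL, different renderer" is a small registry keyed by step type. The sketch below is illustrative only: the function names, the dict-shaped step, and the string output are assumptions, and the real page would render templates rather than strings.

```python
from typing import Callable, Dict

Renderer = Callable[[dict], str]
RENDERERS: Dict[str, Renderer] = {}


def renderer(step_type: str):
    """Register a body renderer for one step type."""
    def register(fn: Renderer) -> Renderer:
        RENDERERS[step_type] = fn
        return fn
    return register


@renderer("llm")
def render_llm(step: dict) -> str:
    return f"LLM view: {step['name']} on {step['model']}"


@renderer("validation")
def render_validation(step: dict) -> str:
    return f"Validation view: {step['name']}"


def render_generic(step: dict) -> str:
    # Fallback for step types without a dedicated renderer.
    return f"Generic view: {step['name']}"


def render_step(step: dict) -> str:
    """Shared chrome wraps a type-specific body."""
    body = RENDERERS.get(step["type"], render_generic)(step)
    return f"[breadcrumb] {body} [re-run panel]"
```

Keeping the chrome in `render_step` means a new step type only needs a body renderer.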
# Loaded from s3 lazily on first view
You are a senior scraping consultant… The user has captured network traffic from a target site. Produce a JSON object with three top-level keys:
- recommendation
- verdict
- starter_code
[truncated · 4,124 tokens · expand ↧]

# 27,838 tokens · 90% cache hit
target_url: https://www.officedepot.com
intent: Get all categories. Then for each category, get all products and then for each product, get data and reviews
detection_findings:
- vendor: akamai_bot_manager
  severity: presence
  confidence: 1.0
  evidence: [_abck, ak_bmsc, bm_sz, bm_sv]
[truncated · 27,838 tokens · expand ↧]
{
"recommendation": {
"primary_library": "curl_cffi",
"confidence": 0.86,
"proxy_type": "residential",
"cost_band": "$0.40-$2.50 per 1K",
"rationale": "Akamai with full sensor stack..."
},
"verdict": { ... },
"starter_code": { ... }
}
[expand raw response ↧]
| # | METHOD · URL | BUCKET | BEST LIB | HEADERS REQ. | COOKIES | RATE | STATUS | |
|---|---|---|---|---|---|---|---|---|
| ep_001 | GET /api/v3/catalog/categories | A | curl_cffi | 5 of 11 | warmup | 2.1 r/s | complete | expand → |
| ep_002 | GET /api/v3/catalog/products/<id> | A | curl_cffi | 5 of 11 | warmup | 1.8 r/s | complete | expand → |
| ep_003 | GET /api/v3/product/<id>/reviews | A | — | — | — | — | blocked all | expand → |
| ep_004 | GET /sb/v3/session/init | B | requests | 2 of 7 | cold | — | complete | expand → |
| LIB | PROXY | STATUS | ELAPSED | BLOCK | BODY PREVIEW | |
|---|---|---|---|---|---|---|
| requests | datacenter | 403 | 184ms | akamai | "Access Denied"… | replay → |
| requests | residential ×3 | 403 · 403 · 403 | avg 311ms | akamai | "Access Denied"… | replay → |
| httpx | datacenter | 403 | 201ms | akamai | "Access Denied"… | replay → |
| httpx | residential ×3 | 403 · 403 · 403 | avg 287ms | akamai | "Access Denied"… | replay → |
| curl_cffi | datacenter | 403 | 219ms | akamai | "Access Denied"… | replay → |
| curl_cffi | residential ×3 | 200 · 200 · 200 | avg 412ms | none | {"categories":[… | replay → |
METHOD GET
URL https://www.officedepot.com/api/v3/catalog/categories
LIB curl_cffi ▾ // editable
PROXY residential ▾ // editable
HEADERS
user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)…
accept: application/json
accept-language: en-US,en;q=0.9
x-request-id: [scrubbed]
x-csrf-token: [scrubbed] // supply value to fire
EXPECT status 200 · ~12 KB JSON · top-level keys: [categories, meta]
| VENDOR | SEVERITY | CONFIDENCE | EVIDENCE |
|---|---|---|---|
| akamai_bot_manager | presence | 1.00 | _abck, ak_bmsc, bm_sz, bm_sv |
| cloudflare | tier 1 | 0.30 | cf-ray header on static assets |
| auth_flow | login_endpoint | 1.00 | POST /account/login |
| pagination | query_param | 1.00 | page, limit on /catalog/* |
The scrub step strips secrets from the captured blob before anything downstream reads it. The Explorer surfaces the manifest deltas so an admin can audit what got redacted (and crucially, what didn't).
| FIELD | BEFORE | AFTER | RULE |
|---|---|---|---|
| cookie._abck | B7vN1f…83 chars | <scrubbed> | cookie_value |
| header.authorization | Bearer eyJ0eXAiOiJKV1Q… | <scrubbed> | header_allowlist |
| header.x-csrf-token | 9f3a-… | <scrubbed> | header_suffix |
| url.query.email | user@example.com | <scrubbed> | url_param_name |
| body.user.password | "swordfish" | "<string:9>" | json_shape |
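The json_shape row in the table above replaces a string with a placeholder that keeps only its type and length. A minimal sketch of that transformation follows. It assumes the rule is applied to a subtree that has already been selected as sensitive, and that non-string scalars pass through unchanged; neither detail is confirmed by the source.

```python
def json_shape(value):
    """Replace string leaves with '<string:N>' placeholders, keeping structure."""
    if isinstance(value, str):
        return f"<string:{len(value)}>"
    if isinstance(value, dict):
        return {key: json_shape(child) for key, child in value.items()}
    if isinstance(value, list):
        return [json_shape(child) for child in value]
    # Assumption: numbers, booleans, and null are left as-is.
    return value
```

Applied to `{"password": "swordfish"}` this yields `{"password": "<string:9>"}`, matching the AFTER column.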
After synthesis the pipeline fires the recommended endpoints through the recommended library to confirm the advice still works against the live site. This is distinct from per-request replay inside the Validation Explorer (§4·b) — that's an admin-triggered diagnostic; this is an automatic verification baked into the pipeline.
| # | METHOD · URL | STATUS | MATCHED SHAPE | REASON | |
|---|---|---|---|---|---|
| ep_001 | GET /api/v3/catalog/categories | 200 | yes | top-level keys match: [categories, meta] | diff → |
| ep_002 | GET /api/v3/catalog/products/<id> | 200 | yes | top-level keys match: [product, related] | diff → |
| ep_004 | GET /sb/v3/session/init | 200 | no | expected [token, expires] · got [token, expires_at, refresh] | diff → |
replay_history cooldown · cost-capped
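The MATCHED SHAPE and REASON columns can be derived from a top-level key comparison. The sketch below is a guess at the simplest version of that check; the function name and the choice to ignore key order are assumptions.

```python
from typing import List, Tuple


def match_shape(expected_keys: List[str], body: dict) -> Tuple[bool, str]:
    """Compare a replayed response's top-level keys against the expected set."""
    got = list(body.keys())
    if set(got) == set(expected_keys):
        return True, f"top-level keys match: [{', '.join(expected_keys)}]"
    return False, f"expected [{', '.join(expected_keys)}] · got [{', '.join(got)}]"
```

For ep_004 in the table, `match_shape(["token", "expires"], response)` would report the renamed `expires_at` key and the extra `refresh` key as a mismatch.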
The Trace renders different visual states depending on the scan's outcome. The admin-scans list (per decision below) filters to terminal states only, but the Explorer route is direct-URL accessible — so we handle the edge case too.
Steps before the failure render normally. The failed step shows an errored pill in danger color with the error message inline (truncated to 200 chars; full text in an evidence accordion). Steps after render as not reached with no timings or costs. Page header: "Scan errored at step 7 (validation)" in danger color.
Steps before cancellation render normally. The cancellation point shows cancelled by <actor> in warn color with timestamp. Subsequent steps not reached. Header: "Scan cancelled at step 9 (synthesis) by siddhant · 14:31".
Common case: synthesis combined call OK, but notes or difficulty_drivers failed. Sub-step marked partial. Parent synthesis row also partial. Trace continues normally past it. Header: "Scan completed · 1 subsection failed".
Hidden from /admin/scans list by default (decision below). If admin reaches it via direct URL, render a placeholder: "Scan in progress at step 7 (validation) — refresh to update" with an auto-refresh meta tag (30s) and a manual refresh button.
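As an illustration of the errored case, the row states and page header can be derived from the index of the failed step. This is a sketch with assumed names and tuple-shaped rows, covering only the errored state.

```python
from typing import List, Tuple

Row = Tuple[str, str, str]  # (step name, status pill, inline message)


def render_errored_trace(
    steps: List[Tuple[str, str]], failed_idx: int, error: str
) -> Tuple[str, List[Row]]:
    """steps holds (name, original status) in trace order; failed_idx is 1-based."""
    rows: List[Row] = []
    for idx, (name, status) in enumerate(steps, start=1):
        if idx < failed_idx:
            rows.append((name, status, ""))               # renders normally
        elif idx == failed_idx:
            rows.append((name, "errored", error[:200]))   # inline, truncated to 200 chars
        else:
            rows.append((name, "not reached", ""))        # no timings or costs
    header = f"Scan errored at step {failed_idx} ({steps[failed_idx - 1][0]})"
    return header, rows
```

The cancelled state would follow the same shape with a different pill and header text.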
Every re-run lands in an experiment table (existing LlmExperiment for LLM steps; proposed validation_replays for validation requests). The original LlmCall / ValidationRun row is immutable. The Explorer reads both and renders them side by side.
The Explorer's Re-run is side-channel only. The scan's report at /reports/{id} shows the original synthesis forever — nothing the admin does in the Explorer changes that. This is intentional: experiments are evidence of what's been tried, not a back-door for replacing production output.
There is a separate user-facing path that does overwrite the production synthesis: POST /scans/{id}/rerun-prompt. That's the existing "free retry on a failed LLM section" feature, capped at 3× per section by Scan.rerun_counts. It's unchanged by this redesign and lives in a different surface (the user's own scan page, not the admin debug page).
| Step type | Re-run knobs | Stored in | Cost attribution |
|---|---|---|---|
| llm | Model picker · edit system/user prompts · n-replicas for variance | llm_experiments (existing) + S3 dump at debugger_data/{exp_id}.json.gz | Per-admin daily $5 cap. Re-run cost goes to the admin's budget, not the scan's total. |
| validation | Library · proxy tier · header edits · payload edits | validation_replays (new; small table, see §07) | Per-admin daily count cap (50/day). Proxy bandwidth + LLM block-detection cost charged to the admin. |
| deterministic / scrub / replay | None: runs current code against original input | Computed-on-fly diff against original output (no row persisted in v1) | Free: local compute only. |
| gate | Force confirm / abandon | Audit log only; affects the scan's state machine | Rate-limited per-admin. |
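The two per-admin caps in the table can be expressed as simple pre-flight checks. The sketch below assumes the LLM cap is checked against an estimated cost before the re-run fires; the source states only the cap values, so the check timing and the function names are assumptions.

```python
from decimal import Decimal

LLM_DAILY_CAP_USD = Decimal("5.00")  # per-admin daily LLM re-run budget
REPLAY_DAILY_CAP = 50                # per-admin daily validation replay count


def can_rerun_llm(spent_today: Decimal, estimated_cost: Decimal) -> bool:
    """Allow the re-run only if it keeps the admin within the daily budget."""
    return spent_today + estimated_cost <= LLM_DAILY_CAP_USD


def can_replay(replays_today: int) -> bool:
    """Allow the replay only if the admin is under the daily count cap."""
    return replays_today < REPLAY_DAILY_CAP
```

`Decimal` avoids float rounding when many sub-cent costs are summed.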
/admin/experiments listing every experiment across every scan.
Experiments are cheap (gzipped JSON in S3 + a small DB row) and they're evidence. Default behaviour is keep. But admins occasionally want to clear noise (failed prompt edits, accidental fires). So:
| Action | Who can do it | Effect | Audit |
|---|---|---|---|
| Soft-delete experiment | Admin who created it | Sets deleted_at. Row renders struck-through with (deleted YYYY-MM-DD) badge for 30 days. Reversible during grace window. | experiment_soft_delete |
| Restore | Admin who deleted it (within grace) | Clears deleted_at. Row reappears. | experiment_restore |
| Hard-delete | Super-admin only | Drops DB row + S3 dump. Irreversible. | experiment_hard_delete with required reason field |
| Automatic cleanup | Sweeper | Same TTL as parent scan. When the scan hard-deletes (retention expiry), experiments cascade-delete with it. | retention_cascade_delete |
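The soft-delete and restore rows above amount to a small state machine on deleted_at. The following sketch shows one possible shape; the `Experiment` class, the `deleted_by` field, and the choice of exceptions are hypothetical, and the returned strings are the audit event names from the table.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

GRACE = timedelta(days=30)


@dataclass
class Experiment:
    id: str
    created_by: str
    deleted_at: Optional[datetime] = None
    deleted_by: Optional[str] = None


def soft_delete(exp: Experiment, actor: str, now: datetime) -> str:
    """Only the creating admin may soft-delete; returns the audit event name."""
    if actor != exp.created_by:
        raise PermissionError("only the creating admin can soft-delete")
    exp.deleted_at, exp.deleted_by = now, actor
    return "experiment_soft_delete"


def restore(exp: Experiment, actor: str, now: datetime) -> str:
    """Only the deleting admin may restore, and only inside the grace window."""
    if exp.deleted_at is None:
        raise ValueError("experiment is not deleted")
    if actor != exp.deleted_by:
        raise PermissionError("only the deleting admin can restore")
    if now - exp.deleted_at > GRACE:
        raise ValueError("grace window elapsed")
    exp.deleted_at, exp.deleted_by = None, None
    return "experiment_restore"
```

Hard-delete and the retention cascade are left out because they involve the S3 dump as well as the row.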
The LLM Explorer (§4·a) shows four buttons — + haiku-4-5, + opus-4-7, + gpt-5, + grok-3 — that fire a re-run with the original prompts unchanged. This replaces the entire eval-matrix UI for the common case: "how would this prompt look on a cheaper / smarter / different-provider model?"
Nothing is deleted. The infrastructure that backs sandbox and evals (LlmExperiment, LlmEval, eval_runner, eval_lock, daily cost cap) is reused as the engine behind the Step Explorer's Re-run panel.
| Today | Tomorrow | What changes |
|---|---|---|
| /admin/sandbox/from-call/{id} (retire route) | Re-run panel on LLM Explorer (§4·a) | Editor opens inline below the step. LlmExperiment rows still get written. |
| /admin/sandbox/{exp_id} (retire route) | Comparison table on LLM Explorer | Every experiment shows up as a row under the step it was forked from. |
| /admin/sandbox/list (repurpose) | /admin/experiments: cross-scan index | Filterable by step, model, admin, scan. Same data, different framing. |
| /admin/evals/{scan_id} (retire matrix) | Quick-compare buttons + comparison table on the LLM Explorer | The "rows = prompts, cols = models" matrix is replaced by per-step + model buttons. |
| /admin/evals index (repurpose) | Cross-scan eval campaigns | The place to run a fixture-based regression eval: same prompt × many scans. In-page explainer above the matrix. |
| llm_calls · llm_experiments · llm_evals tables (keep) | Unchanged | The Explorer reads from all three. |
Today the word eval means two things at once: "compare this scan's prompts across models" (a single-scan question) and "run a fixture-based regression test of our prompts" (a cross-scan question). The redesign splits them:
/admin/evals route, but the landing page explains: "Pick a prompt and a fixture set. Run all current models against it. See cost / quality / drift over time." Plain-language in-page explainer above the matrix.
Most of the data the Explorer needs is already in the DB: scan_events for timing, llm_calls for prompts and cost, validation_runs for per-endpoint data. The Trace and Explorer ship without a single schema change. Only the new admin-triggered features (replay, deterministic re-run) need new tables.
| Data point | Lives in | How the Explorer uses it |
|---|---|---|
| Step start / end times | scan_events.created_at on started / complete rows (Python-side wall-clock) | Pair the rows → step duration. Already correct after T53.11 fix 1a. |
| Every LLM call | llm_calls row + S3 dump at s3_prompt_path / s3_response_path | Trace shows model + cost + latency. Explorer lazy-loads S3 for prompt/response text. |
| LLM cost + tokens | llm_calls.cost_usd, input_tokens, output_tokens, cache_read_tokens, cache_write_tokens | Per-step cost rollup is SUM(cost_usd) WHERE created_at BETWEEN step.started AND step.ended. |
| Per-endpoint validation | validation_runs.validation_data JSONB · llm_view / report_view projections | Per-endpoint table. |
| Per-request validation detail | validation_runs.validation_data.library_comparison: every fire (lib × proxy) records status, elapsed_ms, body_preview, block detection | The Explorer's "expand → per-request matrix" UI is a direct render of this JSON. |
| Capture blob | scan.captured_blob_url (S3 gzipped JSON) | Step 01 (capture) explorer link. |
| Scrubbed blob | Same S3 path, overwritten after scrub step (manifest in scan.scrub_manifest) | Step 08 (scrub) explorer. |
| Rendered report | report.rendered_html (or S3 cache) | Step 13 (render) explorer + sub-page link. |
| Replay confidence | scan.replay_confidence + scan.replay_per_endpoint | Step 12 (replay) explorer. |
| Audit log | audit_log rows where resource_id = scan_id | Sub-page link. |
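The per-step cost rollup in the table is a window-bounded sum. A Python equivalent of that SUM, using the synthesis figures from the example trace, might look like the sketch below; the dict-shaped rows and the function name are assumptions.

```python
from datetime import datetime, timedelta
from decimal import Decimal
from typing import Iterable


def step_cost(calls: Iterable[dict], started: datetime, ended: datetime) -> Decimal:
    """Sum cost_usd over llm_calls rows created inside the step window (inclusive)."""
    return sum(
        (c["cost_usd"] for c in calls if started <= c["created_at"] <= ended),
        Decimal("0"),
    )


# Example rows shaped like the synthesis step and its two nested calls.
t0 = datetime(2026, 1, 1, 0, 4, 18)
calls = [
    {"prompt_name": "synthesis", "cost_usd": Decimal("0.1483"), "created_at": t0},
    {"prompt_name": "notes", "cost_usd": Decimal("0.0049"), "created_at": t0},
    {"prompt_name": "difficulty_drivers", "cost_usd": Decimal("0.0061"), "created_at": t0},
    {"prompt_name": "intent_filter", "cost_usd": Decimal("0.0142"),
     "created_at": t0 - timedelta(minutes=1, seconds=37)},
]
synthesis_total = step_cost(calls, t0, t0 + timedelta(seconds=14.6))
```

Because the window is inclusive, nested calls such as notes and difficulty_drivers count toward their parent step's total.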
validation_replays table
One row per admin-triggered replay attempt. Columns: id, scan_id, endpoint_id, library, proxy_tier, request_headers (jsonb), request_body, status, elapsed_ms, body_preview, fired_by, fired_at, deleted_at. Lets the Explorer show replay history and detect site drift.
llm_calls.parent_step
Nullable TEXT column. When set, the Trace nests the row under that parent step (↳ notes shape). Without it, infer nesting from prompt_name heuristics. Ships either way.
scan_events.event_metadata.duration_ms
Convention, not a column change. On every complete event, write {"duration_ms": N} into the existing JSON column. Saves a pair-join on render. Lazy backfill via on-render compute.
deterministic_reruns table
If we want a persistent log of "ran detection again with current code on 2026-05-30; diff produced 2 changed findings", we need a row type for it. Alternative: compute-on-the-fly each time admin clicks. Decide in P3 (open Q·05).
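The duration_ms convention with an on-render fallback could be read as follows. This is a sketch under assumptions: the complete event is a dict whose `event_metadata` may be missing or empty on older rows, and the matching started timestamp is already known to the caller.

```python
from datetime import datetime, timedelta


def step_duration_ms(complete_event: dict, started_at: datetime) -> int:
    """Prefer the stored duration_ms; otherwise compute it from the event pair."""
    meta = complete_event.get("event_metadata") or {}
    if "duration_ms" in meta:
        return int(meta["duration_ms"])
    # Older rows: fall back to the started/complete pair.
    return (complete_event["created_at"] - started_at) // timedelta(milliseconds=1)
```

Rows written before the convention keep working, which is what makes the backfill optional.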
body_preview truncated to ~500 chars). That's fine: the Explorer reflects what we have, and a "replay" gives the live current response.
High-level shape here. The task-by-task TODO list lives in the companion document: html_docs/admin-ui-rebuild-plan.html.
| Phase | Scope | Ships | Risk |
|---|---|---|---|
| P1 | Step Trace | New /debug/scans/{id} overview view. LLM ledger and audit log become sub-pages. Existing tabs stay accessible behind ?legacy=1 for a release. | low · read-only · joins three tables that all have scan_id indexed. |
| P2 | LLM Explorer | /debug/scans/{id}/steps/{idx} rendering for LLM steps. Lazy S3 fetch. Re-run panel reuses sandbox.py internals as a service call. | low · wraps existing sandbox infra. |
| P3 | Validation Explorer | Same route, validation rendering. Per-endpoint expansion uses validation_runs.validation_data. Replay needs the new validation_replays table + a thin service that wraps the validator's library/proxy pool. | medium · replay fires real HTTP. Cost cap + audit + super-admin gating. |
| P4 | Cleanup | Retire /admin/sandbox/* routes (redirect to step page). Repurpose /admin/evals with in-page explainer. Drop ?legacy=1 escape. | low · mostly route deletion + redirects. |
Parallel to this, an independent track ships the LLM Models Registry — an admin page to manage every model (company, version, pricing, frontier/reasoning flags, usage stats). Documented in the companion build plan.
Every question raised in earlier revisions has been answered. Locked-in decisions below — each shapes one or more build-plan tasks.
| # | Question | Decision |
|---|---|---|
| D · 01 | Replay cost / safety | decided 50 replays per admin per day. No dry-run mode initially — admin can pick datacenter proxy to keep cost low. |
| D · 02 | Scrubbed-value supply | decided Replay form shows [scrubbed] placeholders. Admin types value OR clicks "fire warmup first" to harvest fresh values. Both paths visible. |
| D · 03 | Re-run cost attribution | decided Per-admin account, always. Re-run costs charged to the acting admin's budget — never bleed into the original scan's total. Existing $5/day cap unchanged. |
| D · 04 | Deterministic re-run scope | decided Ship in P3 for detection / analysis / scrub / replay. Useful when rule code has changed. |
| D · 05 | Deterministic re-run persistence | decided Compute-on-fly. No new table. If admin wants a permanent comparison, they screenshot or export. |
| D · 06 | Inline prompt edits | decided Free-form text in v1. Structured diff later if mistakes happen. |
| D · 07 | Permission boundary | decided View routes admin OR owner. Mutation routes (rerun, replay, model edit) super-admin only. |
| D · 08 | Experiment delete behaviour | decided Soft-delete by owner-admin with 30-day grace. Hard-delete super-admin only with required reason. Cascade-delete with parent scan retention. |
| D · 09 | Models registry fallback | decided Keep hardcoded pricing table as compile-time fallback if DB is empty. Migration seeds on first deploy. |
| D · 10 | Scan list filter | decided /admin/scans shows only completed / errored / cancelled scans by default. In-progress scans hidden (toggle to include). Direct URL access still works for any state. |
| D · 11 | Step coverage | decided Build Explorer pages for LLM · validation · deterministic (detection/analysis) · scrub · replay · gate. Skip dedicated Explorer for capture and render — their Trace rows just link to S3 raw / report HTML. |
| D · 12 | Explorer presentation | decided Dedicated full-page route per step (/debug/scans/{id}/steps/{idx}) with sticky breadcrumb back to Trace. Not a modal or drawer — content needs full viewport for LLM and validation views, and deep-linking matters for sharing. |
| D · 13 | Error / partial / cancelled states | decided See §4·g. Steps before failure render normally · failure point in danger color with inline error · subsequent steps as not reached. Partial scans mark failed substeps with partial pill. |
| D · 14 | Async-action notifications | decided Browser Notification API (asks permission on first use, works when tab backgrounded) + in-page toast fallback when permission denied. Triggers: LLM re-run completion (10-15s) · replay completion (2-5s) · model-archive confirmation. |
| D · 15 | Help affordances | decided ? tooltip icon next to every non-obvious panel header + every form field. CSS-only on hover (no JS dep). Same pattern across Trace, Explorer, Registry. |
| D · 16 | Audit coverage extension | decided Existing scan_view_admin stays. Add llm_dump_view (lazy S3 fetch) and replay_form_open (admin opens the replay editor for a non-own scan). Both write to audit_log. |
| D · 17 | Model archive safeguard | decided Archive is blocked if the model is the resolved target for any BROWSER_RECON_LLM_MODEL_* env var or any prompt's default. Admin must unset the env var / change the default first. UI shows which env vars / prompts pin the model. |
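The archive safeguard in D · 17 can be sketched as a lookup that returns everything pinning a model. The version below assumes each BROWSER_RECON_LLM_MODEL_* value is the model id itself (no alias resolution) and that prompt defaults are available as a name-to-model mapping; the function name and the `env:` / `prompt:` labels are hypothetical.

```python
from typing import List, Mapping

ENV_PREFIX = "BROWSER_RECON_LLM_MODEL_"


def archive_blockers(
    model_id: str, env: Mapping[str, str], prompt_defaults: Mapping[str, str]
) -> List[str]:
    """List the env vars and prompt defaults that pin model_id; empty means archivable."""
    env_pins = sorted(
        f"env:{name}"
        for name, value in env.items()
        if name.startswith(ENV_PREFIX) and value == model_id
    )
    prompt_pins = sorted(
        f"prompt:{name}" for name, value in prompt_defaults.items() if value == model_id
    )
    return env_pins + prompt_pins
```

In practice `env` would be `os.environ`; returning the list rather than a boolean gives the UI the pins it needs to display.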
None — every question raised has been resolved. New decisions get logged here as they're made.