Redesign Proposal / scan-debug v2 / 2026-05-19 · rev 2 / super-admin internal

One trace, one explorer, one re-run.

A redesign of the super-admin scan-debug surface — collapsing six tabs, a sandbox, and an eval matrix into a single per-step inspector with in-place re-run. Re-runs are isolated experiments; the original scan stays immutable.

Today: 6 tabs · 3 routes
Proposed: 1 timeline · 1 explorer
Step trace storage: already in DB
New tables: 1 small · validation_replays
Re-runs mutate scan? no · always experiments
Verdict
Build a Step Trace — one table that lists every micro-step the pipeline took, with timing / cost / model / S3 location inline, and an Explore CTA per row. Each Explore opens a step-typed inspector with a built-in Re-run panel. The trace itself is a view over three existing tables (scan_events, llm_calls, validation_runs) — no new persistence required. Re-runs always land as experiments; they never touch the original scan's rows.
IMPLEMENTED · 2026-05-20  This proposal shipped. The Step Trace lives at GET /debug/scans/{id} with per-step Explorers at /debug/scans/{id}/steps/{idx}. The legacy tabbed debug page (?legacy=1) was retired in P5-T06.
01 Today's surface

It's not missing functionality. It's missing connective tissue.

Six tabs, a sandbox, an eval matrix, and an audit log all exist. They each work in isolation. But a Super Admin who wants to ask "what happened during this scan, step by step?" bounces between four routes and still can't see deterministic-step durations or per-validation-request details.

What exists today

Surface | What it does | Quality
/debug/scans/{id} | 6 tabs: overview · capture · buckets · llm calls · evals · audit. Per-tab views of Scan, Report, LlmCall, AuditLog, LlmEval. | comprehensive per-tab. No single timeline.
/admin/sandbox/... | Edit + re-run a single LLM prompt against a different model. Side-by-side diff. Daily $5 cap per admin. | scattered: requires navigating away from the scan.
/admin/evals/{scan_id} | Matrix: rows = prompts, columns = (model × provider). Each cell = one re-run. | undocumented: purpose unclear from the UI.
/admin/scans | Cross-user paginated scans list with filters + sort. | fine: entry point only.
/admin/dashboard | 8 KPI cards: volume, cost, success rate, top domains, etc. | fine: aggregate, not per-scan.

What a Super Admin can't currently do

GAP · 01

See a unified step timeline

Deterministic steps (detection, analysis, scrub, render) only show as event rows. Duration / S3 / inputs / outputs aren't tabular alongside LLM and validation steps.

GAP · 02

Drill into a single validation request

The data is there — validation_data.library_comparison records status + elapsed_ms + body_preview + block detection per library × proxy — but the UI shows only the rolled-up endpoint summary.

GAP · 03

Re-run a step without leaving the scan

Sandbox lives at a separate route. Switching models or tweaking a prompt means two tabs open.

GAP · 04

Replay a real HTTP request

Validation fires ~75-100 requests per endpoint but admins can't refire a specific one to check whether the site's defenses changed.

GAP · 05

Understand what each feature does

The evals tab has no in-context explanation. Admins have to read the route file to know it's a model-comparison matrix.

GAP · 06

See cost broken down by step

Costs live per LLM call but never roll up to "synthesis cost = X" or "this scan's LLM budget by step".

02 Design principle

One trace. One explorer. One re-run.

Every redesign decision answers to these three primitives.

PRIMITIVE · 01

The Trace

A single table of every micro-step. Replaces the 6-tab layout. Includes deterministic steps, LLM calls, validation phases, human gates. One row per micro-step.

PRIMITIVE · 02

The Explorer

One inspector page that renders differently per step type (LLM / validation / deterministic / gate / I/O). Pulls S3 lazily. Surfaces everything we already store.

PRIMITIVE · 03

The Re-run

Every Explorer has a "Re-run" panel. Side-by-side original vs new. Always an experiment — never mutates the production scan. Sandbox + eval-matrix collapse into this.

03 The Step Trace

What the new overview looks like.

Replaces the existing overview tab. Renders by joining scan_events (timing), llm_calls (model + cost + S3), and validation_runs (per-endpoint data) on scan_id, ordered by created_at. No new table, no schema change.

02.1 — SCAN DEBUG
Scan d9758303
https://www.officedepot.com · lazycoder.codes@gmail.com · completed in 4m 38s · total cost $0.187
STEP TRACE
# | STEP | TYPE | STARTED | DURATION | MODEL | COST | STATUS | S3 |
01 | capture | io | 00:00.0 | 2m 14s | - | - | ok | scans/d97../capture.gz | explore →
02 | detection | deterministic | 02:14.1 | 0.42s | - | - | ok | - | explore →
03 | analysis | deterministic | 02:14.6 | 0.81s | - | - | ok | - | explore →
04 | flow_confirm | llm | 02:15.5 | 3.21s | claude-sonnet-4-6 | $0.0084 | ok | llm-calls/d97../flow_confirm-.. | explore →
05 | [gate] awaiting_confirmation | human | 02:18.7 | 22.4s | - | - | confirmed | - | explore →
06 | intent_filter | llm | 02:41.2 | 4.55s | claude-sonnet-4-6 | $0.0142 | ok | llm-calls/d97../intent_filter-.. | explore →
07 | validation | validation | 02:45.8 | 1m 32s | - | - | partial | 5 endpoints · 387 requests | explore →
08 | scrub | deterministic | 04:17.9 | 0.18s | - | - | ok | scans/d97../scrubbed.gz | explore →
09 | synthesis | llm | 04:18.1 | 14.6s | claude-sonnet-4-6 | $0.1483 | ok | llm-calls/d97../synthesis-.. | explore →
10 | ↳ notes | llm | 04:18.1 | 3.8s | claude-haiku-4-5 | $0.0049 | ok | llm-calls/d97../notes-.. | explore →
11 | ↳ difficulty_drivers | llm | 04:18.1 | 4.2s | claude-haiku-4-5 | $0.0061 | ok | llm-calls/d97../difficulty.. | explore →
12 | replay | validation | 04:33.0 | 4.1s | - | - | ok | 5 endpoints | explore →
13 | render | deterministic | 04:37.2 | 0.84s | - | - | ok | reports/{id}.html | explore →
ROLL-UP
LLM cost by step
synthesis (combined T16) | $0.1483 · 79%
intent_filter | $0.0142 · 8%
flow_confirm | $0.0084 · 4%
difficulty_drivers | $0.0061 · 3%
notes | $0.0049 · 3%
total | $0.1869
step duration breakdown
capture (user-driven) | 2m 14s · 48%
validation | 1m 32s · 33%
awaiting_confirmation (gate) | 22.4s · 8%
synthesis + aux | 14.6s · 5%
deterministic (det · scrub · render) | 2.2s · 1%
other | 12.3s · 5%
SUB-PAGES
full report → capture timeline → buckets → audit log → raw json (debug payload) →

The trace shape is computable today. Step durations come from pairing started / complete rows on scan_events. LLM rows nest under their parent step by matching llm_calls.created_at to the enclosing step window. Validation phases collapse into one parent validation row with per-endpoint detail surfaced inside the Explorer.
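
The pairing and nesting described above can be sketched in a few lines. This is an illustration under stated assumptions, not the shipped query: the tuple shapes, the `Step` class, and the rule that an LLM call whose name equals the step name is the step's own call are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Step:
    name: str
    started: datetime
    ended: datetime
    children: list = field(default_factory=list)  # names of nested LLM calls

    @property
    def duration_s(self) -> float:
        return (self.ended - self.started).total_seconds()


def build_trace(events, llm_calls):
    """Pair started/complete events into steps, then nest LLM calls by window.

    events:    iterable of (step_name, phase, created_at),
               phase is "started" or "complete" (assumed shape)
    llm_calls: iterable of (prompt_name, created_at) (assumed shape)
    """
    open_starts = {}
    steps = []
    for name, phase, ts in sorted(events, key=lambda e: e[2]):
        if phase == "started":
            open_starts[name] = ts
        elif phase == "complete" and name in open_starts:
            steps.append(Step(name, open_starts.pop(name), ts))
    steps.sort(key=lambda s: s.started)

    for prompt_name, ts in llm_calls:
        for step in steps:
            # A call nests under the first step whose window encloses it,
            # unless it is that step's own call (assumed: same name).
            if step.started <= ts <= step.ended and prompt_name != step.name:
                step.children.append(prompt_name)
                break
    return steps
```

A step with a started row but no complete row is simply left out here; the real Trace would need to show it as errored or in progress.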

What the Trace replaces: the existing overview, capture, buckets, and llm calls tabs collapse into this one view plus their respective Explorer destinations. The audit tab survives as a link in the sub-pages strip. Evals is retired (see §06).
04 The Step Explorer

One inspector, six renderings.

Every Explore → opens the same URL shape (/debug/scans/{scan_id}/steps/{step_idx}) but the renderer branches on step type. The renderings differ; the chrome and Re-run panel are shared.
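
A minimal sketch of that branching, assuming each Trace row carries a `name` and a `type` as in the mockup in §03. The renderer functions and lookup tables are hypothetical placeholders; scrub and replay get a name-level override because their Trace types (deterministic, validation) are shared with other steps.

```python
def render_llm(step: dict) -> str:
    return f"prompt + response view for {step['name']}"


def render_validation(step: dict) -> str:
    return f"endpoint table for {step['name']}"


def render_deterministic(step: dict) -> str:
    return f"input/output view for {step['name']}"


def render_gate(step: dict) -> str:
    return f"gate decision view for {step['name']}"


def render_scrub(step: dict) -> str:
    return f"redaction manifest for {step['name']}"


def render_replay(step: dict) -> str:
    return f"per-endpoint replay view for {step['name']}"


# Name-level overrides win over the generic per-type rendering.
BY_NAME = {"scrub": render_scrub, "replay": render_replay}
BY_TYPE = {
    "llm": render_llm,
    "validation": render_validation,
    "deterministic": render_deterministic,
    "human": render_gate,
}
# Per D · 11, capture and render have no dedicated Explorer page.
LINK_ONLY = {"capture", "render"}


def pick_renderer(step: dict):
    """Return the rendering function for a step, or None for link-only steps."""
    if step["name"] in LINK_ONLY:
        return None
    renderer = BY_NAME.get(step["name"]) or BY_TYPE.get(step["type"])
    if renderer is None:
        raise LookupError(f"no Explorer rendering for step type {step['type']!r}")
    return renderer
```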

4·a · LLM step (synthesis, intent_filter, flow_confirm, notes, difficulty_drivers)

02.1 — SCAN DEBUG / STEP 09 / SYNTHESIS
Step 09 — synthesis · combined T16 prompt
claude-sonnet-4-6 (anthropic) · 14.6s · $0.1483 · 32,184 in / 4,221 out · cache_read 28,910
TIMING + METADATA
prompt_name: scan_synthesis
prompt_version: v3
started_at: 2026-05-19 14:23:11.412 UTC
ended_at: 2026-05-19 14:23:26.012 UTC
elapsed: 14.6s · server-measured
retry_count: 0
s3_prompt_path: s3://recon-llm/llm-calls/d97../synthesis-2026-05-19T14-23-11Z.json.gz
db_row: llm_calls.id = 7c2f-… · copy curl
PROMPT + RESPONSE
system prompt
# Loaded from s3 lazily on first view
You are a senior scraping consultant…

The user has captured network traffic from a target site.
Produce a JSON object with three top-level keys:
  - recommendation
  - verdict
  - starter_code

[truncated · 4,124 tokens · expand ↧]
user prompt
# 27,838 tokens · 90% cache hit
target_url: https://www.officedepot.com
intent: Get all categories. Then for each category, get
         all products and then for each product, get data
         and reviews

detection_findings:
  - vendor: akamai_bot_manager
    severity: presence
    confidence: 1.0
    evidence: [_abck, ak_bmsc, bm_sz, bm_sv]

[truncated · 27,838 tokens · expand ↧]
parsed response (structured output)
{
  "recommendation": {
    "primary_library": "curl_cffi",
    "confidence": 0.86,
    "proxy_type": "residential",
    "cost_band": "$0.40-$2.50 per 1K",
    "rationale": "Akamai with full sensor stack..."
  },
  "verdict": { ... },
  "starter_code": { ... }
}
[expand raw response ↧]
RE-RUN (always an experiment, never mutates the scan)
experiment with a different model or edited prompt
MODEL
claude-sonnet-4-6 ▾ (original)
QUICK COMPARE
PROMPT EDIT
edit system / user prompts inline · changes diffed vs original ▾
▶ run · est. cost $0.15 · daily budget $4.20 / $5.00
COMPARISONS · 2 previous re-runs
WHEN | MODEL | EDITS | COST | LATENCY | OUTCOME | FIRED BY |
14:31 today | claude-haiku-4-5 | none | $0.0214 | 5.8s | divergent | siddhant | diff →
14:33 today | gpt-5 | none | $0.0892 | 9.2s | aligned | siddhant | diff →

4·b · Validation step

02.1 — SCAN DEBUG / STEP 07 / VALIDATION
Step 07 — validation · 5 endpoints · 387 requests fired
1m 32s · partial · 3 endpoints passed · 2 blocked at all tiers
ENDPOINTS
# | METHOD · URL | BUCKET | BEST LIB | HEADERS REQ. | COOKIES | RATE | STATUS |
ep_001 | GET /api/v3/catalog/categories | A | curl_cffi | 5 of 11 | warmup | 2.1 r/s | complete | expand →
ep_002 | GET /api/v3/catalog/products/<id> | A | curl_cffi | 5 of 11 | warmup | 1.8 r/s | complete | expand →
ep_003 | GET /api/v3/product/<id>/reviews | A | - | - | - | - | blocked all | expand →
ep_004 | GET /sb/v3/session/init | B | requests | 2 of 7 | cold | - | complete | expand →
EXPANDED · ep_001
library_compare · phase A · 6 fires
LIB | PROXY | STATUS | ELAPSED | BLOCK | BODY PREVIEW |
requests | datacenter | 403 | 184ms | akamai | "Access Denied"… | replay →
requests | residential ×3 | 403 · 403 · 403 | avg 311ms | akamai | "Access Denied"… | replay →
httpx | datacenter | 403 | 201ms | akamai | "Access Denied"… | replay →
httpx | residential ×3 | 403 · 403 · 403 | avg 287ms | akamai | "Access Denied"… | replay →
curl_cffi | datacenter | 403 | 219ms | akamai | "Access Denied"… | replay →
curl_cffi | residential ×3 | 200 · 200 · 200 | avg 412ms | none | {"categories":[… | replay →
header_reduce · 8 headers tested · 5 required
user-agent: required (baseline)
accept: required (drop → 403)
accept-language: required
x-request-id: required
x-csrf-token: required (cookies_required also)
sec-fetch-mode: optional (drop → still 200)
referer: optional
accept-encoding: always-optional · not tested
REPLAY A REQUEST (admin · cost-capped · audit-logged)
fire this exact request again, now
METHOD  GET
URL     https://www.officedepot.com/api/v3/catalog/categories
LIB     curl_cffi ▾      // editable
PROXY   residential ▾    // editable
HEADERS
  user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)…
  accept: application/json
  accept-language: en-US,en;q=0.9
  x-request-id: [scrubbed]
  x-csrf-token: [scrubbed]          // supply value to fire

EXPECT  status 200 · ~12 KB JSON · top-level keys: [categories, meta]
▶ fire · edit headers · compare with original · last replay 14m ago · still 200 · 47 / 50 replays today
REPLAY HISTORY · 3 prior fires
WHEN | LIB / PROXY | STATUS | ELAPSED | FIRED BY |
14m ago | curl_cffi / residential | 200 | 418ms | siddhant | diff →
2h ago | curl_cffi / residential | 200 | 402ms | siddhant | diff →
3d ago | curl_cffi / datacenter | 403 | 189ms | siddhant | diff →

4·c · Deterministic step (detection, analysis, scrub, render)

02.1 — SCAN DEBUG / STEP 02 / DETECTION
Step 02 — detection · 8 rule modules · pure code
0.42s · ok · 7 findings produced
INPUT
source: scans/d97../capture.gz · 4.2 MB gzipped
rules_version: v4
artifacts: api_endpoints (164), cookies (37), interactions (22), websockets (0)
OUTPUT · findings
VENDOR | SEVERITY | CONFIDENCE | EVIDENCE
akamai_bot_manager | presence | 1.00 | _abck, ak_bmsc, bm_sz, bm_sv
cloudflare | tier 1 | 0.30 | cf-ray header on static assets
auth_flow | login_endpoint | 1.00 | POST /account/login
pagination | query_param | 1.00 | page, limit on /catalog/*
RE-RUN (experiment · against current code)
run detection again — useful for testing rule changes
▶ re-run detection · uses current code · diff against original output · never overwrites

4·d · Human gate (awaiting_confirmation)

02.1 — SCAN DEBUG / STEP 05 / AWAITING_CONFIRMATION
Step 05 — flow confirmation gate
paused 22.4s · user confirmed
flow_confirm verdict (from step 04)
matches_intent: true · confidence 0.92
flow_summary: User loaded homepage → navigated to "Office Supplies" category → viewed 3 products → opened reviews tab on each
user_decision: confirmed · 22.4s after gate opened
decision_route: POST /scans/d97../confirm

4·e · Scrub step

The scrub step strips secrets from the captured blob before anything downstream reads it. The Explorer surfaces the manifest deltas so an admin can audit what got redacted (and crucially, what didn't).

02.1 — SCAN DEBUG / STEP 08 / SCRUB
Step 08 — scrub · rules version 4
0.18s · ok · 4.2MB capture → 4.1MB scrubbed
REDACTION MANIFEST
cookie_values_stripped: 47
auth_headers_stripped: 23
form_inputs_stripped: 8
url_params_scrubbed: 12
response_bodies_scrubbed: 164
response_bodies_dropped: 22 · binary / opaque
items_kept: requests 164 · cookies 37 · interactions 22
SAMPLE REDACTIONS · what got scrubbed
FIELD | BEFORE | AFTER | RULE
cookie._abck | B7vN1f… 83 chars | <scrubbed> | cookie_value
header.authorization | Bearer eyJ0eXAiOiJKV1Q… | <scrubbed> | header_allowlist
header.x-csrf-token | 9f3a-… | <scrubbed> | header_suffix
url.query.email | user@example.com | <scrubbed> | url_param_name
body.user.password | "swordfish" | "<string:9>" | json_shape
RE-RUN (experiment · against current rules)
re-scrub with current code · check if a new rule catches anything missed
▶ re-run scrub · diffs against original manifest · never overwrites stored blob

4·f · Replay step (pipeline replay, step 12)

After synthesis the pipeline fires the recommended endpoints through the recommended library to confirm the advice still works against the live site. This is distinct from per-request replay inside the Validation Explorer (§4·b) — that's an admin-triggered diagnostic; this is an automatic verification baked into the pipeline.

02.1 — SCAN DEBUG / STEP 12 / REPLAY
Step 12 — replay · best-effort verification
4.1s · ok · 5 endpoints · 4 matched expected shape · 1 differed
PER-ENDPOINT REPLAY
# | METHOD · URL | STATUS | MATCHED SHAPE | REASON |
ep_001 | GET /api/v3/catalog/categories | 200 | yes | top-level keys match: [categories, meta] | diff →
ep_002 | GET /api/v3/catalog/products/<id> | 200 | yes | top-level keys match: [product, related] | diff →
ep_004 | GET /sb/v3/session/init | 200 | no | expected [token, expires] · got [token, expires_at, refresh] | diff →
CONFIDENCE
replay_confidence: 0.80 · 4/5 endpoints matched
replay_cooldown: applied · host last replayed 47s ago (cooldown 60s)
surfaced_on_report: yes · low-confidence banner suppressed
RE-RUN (experiment · live verification now)
fire the replay pass again — useful when checking whether site changed
▶ re-run replay · subject to replay_history cooldown · cost-capped
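
The three checks in the mockup above (shape match, confidence, cooldown) can be sketched as follows. The function names are hypothetical, and the assumption that replay_confidence is a plain matched / total ratio is inferred only from the "0.80 · 4/5" line; the real scoring may weigh endpoints differently.

```python
from datetime import datetime, timedelta


def shape_matches(expected_keys, body: dict) -> bool:
    """True when the response has exactly the expected top-level keys."""
    return set(expected_keys) == set(body)


def replay_confidence(matches) -> float:
    """Share of replayed endpoints whose response shape matched (assumed ratio)."""
    matches = list(matches)
    return sum(matches) / len(matches) if matches else 0.0


def in_cooldown(last_replayed_at: datetime, now: datetime,
                cooldown: timedelta = timedelta(seconds=60)) -> bool:
    """True while the host was replayed more recently than the cooldown."""
    return now - last_replayed_at < cooldown
```

For ep_004 in the mockup, `shape_matches(["token", "expires"], ...)` is false because the live response carries `expires_at` and `refresh` instead of `expires`.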

4·g · Error · partial · cancelled · in-progress states

The Trace renders different visual states depending on the scan's outcome. By default the /admin/scans list shows terminal states only (decision D · 10), but the Explorer route is reachable by direct URL, so the in-progress case is handled too.

STATE · errored

Failure mid-pipeline

Steps before the failure render normally. The failed step shows an errored pill in danger color with the error message inline (truncated to 200 chars; full text in an evidence accordion). Steps after it render as not reached, with no timings or costs. Page header: "Scan errored at step 7 (validation)" in danger color.

STATE · cancelled

User or admin cancelled

Steps before cancellation render normally. The cancellation point shows cancelled by <actor> in warn color with timestamp. Subsequent steps not reached. Header: "Scan cancelled at step 9 (synthesis) by siddhant · 14:31".

STATE · partial

One subsection failed

Common case: synthesis combined call OK, but notes or difficulty_drivers failed. Sub-step marked partial. Parent synthesis row also partial. Trace continues normally past it. Header: "Scan completed · 1 subsection failed".

STATE · in-progress

Direct URL access only

Hidden from /admin/scans list by default (decision below). If admin reaches it via direct URL, render a placeholder: "Scan in progress at step 7 (validation) — refresh to update" with an auto-refresh meta tag (30s) and a manual refresh button.
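
The errored and cancelled rules reduce to one pass over the step list. A minimal sketch, assuming the stored per-step statuses arrive in pipeline order and the stop point is known as a 0-based index; the function name and argument shapes are hypothetical.

```python
def display_states(recorded, scan_status, stop_index=None):
    """Map stored per-step statuses to the pills the Trace shows.

    recorded:    stored statuses, one per step, in pipeline order
    scan_status: "completed", "errored" or "cancelled"
    stop_index:  0-based index of the step where the scan stopped, if any
    """
    if scan_status == "completed" or stop_index is None:
        return list(recorded)
    if scan_status not in ("errored", "cancelled"):
        raise ValueError(f"unexpected scan status: {scan_status!r}")

    shown = []
    for i, status in enumerate(recorded):
        if i < stop_index:
            shown.append(status)          # earlier steps render normally
        elif i == stop_index:
            shown.append(scan_status)     # the stop point gets the pill
        else:
            shown.append("not reached")   # no timings or costs after it
    return shown
```

Partial scans need no special case here: a failed sub-step already carries its own partial status and the Trace continues past it.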

05 Re-run + compare

Always an experiment. Never a mutation.

Every re-run lands in an experiment table (existing LlmExperiment for LLM steps; proposed validation_replays for validation requests). The original LlmCall / ValidationRun row is immutable. The Explorer reads both and renders them side by side.

CRITICAL CONTRACT

The Explorer's Re-run is side-channel only. The scan's report at /reports/{id} shows the original synthesis forever — nothing the admin does in the Explorer changes that. This is intentional: experiments are evidence of what's been tried, not a back-door for replacing production output.

There is a separate user-facing path that does overwrite the production synthesis: POST /scans/{id}/rerun-prompt. That's the existing "free retry on a failed LLM section" feature, capped at 3× per section by Scan.rerun_counts. It's unchanged by this redesign and lives in a different surface (the user's own scan page, not the admin debug page).

Re-run mechanics per step type

Step type | Re-run knobs | Stored in | Cost attribution
llm | Model picker · edit system/user prompts · n-replicas for variance | llm_experiments (existing) + S3 dump at debugger_data/{exp_id}.json.gz | Per-admin daily $5 cap. Re-run cost goes to the admin's budget, not the scan's total.
validation | Library · proxy tier · header edits · payload edits | validation_replays (new, small table; see §07) | Per-admin daily count cap (50/day). Proxy bandwidth + LLM block-detection cost charged to the admin.
deterministic / scrub / replay | None: runs current code against original input | Computed-on-fly diff against original output (no row persisted in v1) | Free: local compute only.
gate | Force confirm / abandon | Audit log only; affects the scan's state machine | Rate-limited per-admin.
Cost attribution rule: any admin-triggered action (LLM re-run, validation replay, scrub re-run, replay re-run) draws from the acting admin's budget, never the original scan's. The scan's "total cost" on the Trace stays frozen at the original pipeline cost — re-runs appear in the comparison table with their own cost lines.
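
A minimal sketch of the two caps and the attribution rule, using the numbers stated above ($5 per admin per day for LLM re-runs, 50 replays per admin per day). The `AdminUsage` class and function names are hypothetical; the point is that a re-run only ever touches the admin's counters, never the scan's total.

```python
from dataclasses import dataclass
from decimal import Decimal

DAILY_LLM_CAP_USD = Decimal("5.00")   # per-admin daily LLM re-run budget
DAILY_REPLAY_CAP = 50                 # per-admin daily validation replays


@dataclass
class AdminUsage:
    llm_spend_usd: Decimal = Decimal("0")
    replays_fired: int = 0


def can_rerun_llm(usage: AdminUsage, est_cost_usd: Decimal) -> bool:
    """True if the estimated cost still fits under today's cap."""
    return usage.llm_spend_usd + est_cost_usd <= DAILY_LLM_CAP_USD


def can_replay(usage: AdminUsage) -> bool:
    return usage.replays_fired < DAILY_REPLAY_CAP


def charge_llm_rerun(usage: AdminUsage, cost_usd: Decimal) -> None:
    """Book a re-run against the acting admin; the scan total is not an input."""
    if not can_rerun_llm(usage, cost_usd):
        raise PermissionError("daily LLM re-run budget exhausted")
    usage.llm_spend_usd += cost_usd
```

With the figures from the LLM Explorer mockup ($4.20 spent, $0.15 estimated), the re-run is allowed and leaves $0.65 of budget.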

Every test is stored and visible

Delete policy

Experiments are cheap (gzipped JSON in S3 + a small DB row) and they're evidence. Default behaviour is keep. But admins occasionally want to clear noise — failed prompt edits, accidental fires. So:

Action | Who can do it | Effect | Audit
Soft-delete experiment | Admin who created it | Sets deleted_at. Row renders struck-through with (deleted YYYY-MM-DD) badge for 30 days. Reversible during grace window. | experiment_soft_delete
Restore | Admin who deleted it (within grace) | Clears deleted_at. Row reappears. | experiment_restore
Hard-delete | Super-admin only | Drops DB row + S3 dump. Irreversible. | experiment_hard_delete with required reason field
Automatic cleanup | Sweeper | Same TTL as parent scan. When the scan hard-deletes (retention expiry), experiments cascade-delete with it. | retention_cascade_delete
UX intent. No "x" button on the experiment row. Delete is behind an explicit "manage" action so admins don't lose history with one accidental click.
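
The permission checks behind the delete policy can be sketched as below. This assumes a 30-day grace window measured from deleted_at and a role string of "super_admin"; both names and the function signatures are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional

GRACE = timedelta(days=30)  # soft-delete grace window from the policy table


def can_soft_delete(created_by: str, actor: str) -> bool:
    """Only the admin who created the experiment may soft-delete it."""
    return actor == created_by


def can_restore(deleted_at: Optional[datetime], deleted_by: str,
                actor: str, now: datetime) -> bool:
    """Restore is limited to the deleting admin, inside the grace window."""
    if deleted_at is None:
        return False
    return actor == deleted_by and now - deleted_at <= GRACE


def can_hard_delete(actor_role: str, reason: str) -> bool:
    """Hard-delete needs a super-admin and a non-empty reason for the audit row."""
    return actor_role == "super_admin" and bool(reason.strip())
```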

Quick-compare shortcut

The LLM Explorer (§4·a) shows four buttons — + haiku-4-5, + opus-4-7, + gpt-5, + grok-3 — that fire a re-run with the original prompts unchanged. This replaces the entire eval-matrix UI for the common case: "how would this prompt look on a cheaper / smarter / different-provider model?"

06 Where sandbox + evals go

Existing surfaces, redistributed.

Nothing is deleted. The infrastructure that backs sandbox and evals (LlmExperiment, LlmEval, eval_runner, eval_lock, daily cost cap) is reused as the engine behind the Step Explorer's Re-run panel.

Today | Tomorrow | What changes
/admin/sandbox/from-call/{id} [retire route] | Re-run panel on LLM Explorer (§4·a) | Editor opens inline below the step. LlmExperiment rows still get written.
/admin/sandbox/{exp_id} [retire route] | Comparison table on LLM Explorer | Every experiment shows up as a row under the step it was forked from.
/admin/sandbox/list [repurpose] | /admin/experiments: cross-scan index | Filterable by step, model, admin, scan. Same data, different framing.
/admin/evals/{scan_id} [retire matrix] | Quick-compare buttons + comparison table on the LLM Explorer | The "rows = prompts, cols = models" matrix is replaced by per-step + model buttons.
/admin/evals index [repurpose] | Cross-scan eval campaigns | The place to run a fixture-based regression eval: same prompt × many scans. In-page explainer above the matrix.
llm_calls · llm_experiments · llm_evals tables [keep] | Unchanged | The Explorer reads from all three.

What "evals" actually means after the redesign

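Today the word eval means two things at once: "compare this scan's prompts across models" (a single-scan question) and "run a fixture-based regression test of our prompts" (a cross-scan question). The redesign splits them: the single-scan comparison moves to the Quick-compare buttons and comparison table on the LLM Explorer, and the cross-scan regression eval stays under /admin/evals as eval campaigns.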

07 Storage model

The Step Trace is a view, not a new table.

Most of the data the Explorer needs is already in the DB: scan_events for timing, llm_calls for prompts and cost, validation_runs for per-endpoint data. The Trace and Explorer ship without a single schema change. Only admin-triggered validation replay needs a new table (validation_replays); deterministic re-runs are computed on the fly and persist nothing in v1 (D · 05).

7·a · What's already stored

Data point | Lives in | How the Explorer uses it
Step start / end times | scan_events.created_at on started / complete rows (Python-side wall-clock) | Pair the rows → step duration. Already correct after T53.11 fix 1a.
Every LLM call | llm_calls row + S3 dump at s3_prompt_path / s3_response_path | Trace shows model + cost + latency. Explorer lazy-loads S3 for prompt/response text.
LLM cost + tokens | llm_calls.cost_usd, input_tokens, output_tokens, cache_read_tokens, cache_write_tokens | Per-step cost rollup is SUM(cost_usd) WHERE created_at BETWEEN step.started AND step.ended.
Per-endpoint validation | validation_runs.validation_data JSONB · llm_view / report_view projections | Per-endpoint table.
Per-request validation detail | validation_runs.validation_data.library_comparison: every fire (lib × proxy) records status, elapsed_ms, body_preview, block detection | The Explorer's "expand → per-request matrix" UI is a direct render of this JSON.
Capture blob | scan.captured_blob_url (S3 gzipped JSON) | Step 01 (capture) explorer link.
Scrubbed blob | Same S3 path, overwritten after scrub step (manifest in scan.scrub_manifest) | Step 08 (scrub) explorer.
Rendered report | report.rendered_html (or S3 cache) | Step 13 (render) explorer + sub-page link.
Replay confidence | scan.replay_confidence + scan.replay_per_endpoint | Step 12 (replay) explorer.
Audit log | audit_log rows where resource_id = scan_id | Sub-page link.
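
The per-step cost rollup named in the table is a windowed sum. A sketch in Python rather than SQL, with hypothetical tuple shapes, applying the BETWEEN rule literally (both window ends inclusive):

```python
from datetime import datetime, timedelta
from decimal import Decimal


def cost_by_step(steps, llm_calls):
    """Sum llm_calls.cost_usd inside each step's [started, ended] window.

    steps:     iterable of (name, started, ended) (assumed shape)
    llm_calls: iterable of (created_at, cost_usd) (assumed shape)
    """
    totals = {}
    for name, started, ended in steps:
        totals[name] = sum(
            (cost for created_at, cost in llm_calls if started <= created_at <= ended),
            Decimal("0"),
        )
    return totals
```

Applied literally, this rule counts calls that start inside a parent step's window (the ↳ rows in the Trace) toward the parent's sum. Reporting them as separate lines, as the roll-up mockup does, would need an extra key such as the optional llm_calls.parent_step column or the prompt name.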

7·b · What we add

ADD · 01 · required for P3

validation_replays table

One row per admin-triggered replay attempt. Columns: id, scan_id, endpoint_id, library, proxy_tier, request_headers (jsonb), request_body, status, elapsed_ms, body_preview, fired_by, fired_at, deleted_at. Lets the Explorer show replay history and detect site drift.
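
As an illustration of the row shape and the drift check it enables, here is a sketch using a plain dataclass. The column names come from the list above; the Python types and the `status_drift` helper are assumptions, not the shipped schema.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class ValidationReplay:
    id: str
    scan_id: str
    endpoint_id: str
    library: str
    proxy_tier: str
    request_headers: dict          # jsonb in the table
    request_body: Optional[str]
    status: int
    elapsed_ms: int
    body_preview: str
    fired_by: str
    fired_at: datetime
    deleted_at: Optional[datetime] = None


def status_drift(history):
    """Return (older, newer) pairs where the same endpoint, library and
    proxy tier came back with a different HTTP status on a later replay."""
    last_seen = {}
    drift = []
    for row in sorted(history, key=lambda r: r.fired_at):
        if row.deleted_at is not None:
            continue  # soft-deleted replays are ignored
        key = (row.endpoint_id, row.library, row.proxy_tier)
        prev = last_seen.get(key)
        if prev is not None and prev.status != row.status:
            drift.append((prev, row))
        last_seen[key] = row
    return drift
```

Keying on library and proxy tier matters: in the replay history mockup, the 403 came from a datacenter proxy and the 200s from residential, which is a configuration difference rather than site drift.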

ADD · 02 · optional

llm_calls.parent_step

Nullable TEXT column. When set, the Trace nests the row under that parent step (↳ notes shape). Without it, infer nesting from prompt_name heuristics. Ships either way.

ADD · 03 · optional

scan_events.event_metadata.duration_ms

Convention, not a column change. On every complete event, write {"duration_ms": N} into the existing JSON column. Saves a pair-join on render. Lazy backfill via on-render compute.
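
A sketch of the read side of this convention, including the on-render fallback for events written before it existed. The dict shapes are assumptions: each event is taken to expose `created_at` and an optional `event_metadata` mapping.

```python
from datetime import datetime, timedelta
from typing import Optional


def step_duration_ms(complete_event: dict,
                     started_event: Optional[dict] = None) -> Optional[int]:
    """Prefer the stored duration_ms; otherwise compute it from the pair."""
    meta = complete_event.get("event_metadata") or {}
    if "duration_ms" in meta:
        return int(meta["duration_ms"])
    if started_event is None:
        return None  # no stored value and nothing to pair with
    delta = complete_event["created_at"] - started_event["created_at"]
    # Integer division by one millisecond avoids float rounding.
    return delta // timedelta(milliseconds=1)
```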

ADD · 04 · maybe

deterministic_reruns table

If we want a persistent log of "ran detection again with current code on 2026-05-30, diff produced 2 changed findings", we need a row type for it. Alternative: compute-on-the-fly each time admin clicks. Decided in D · 05: compute on the fly, no new table in v1.

What we do not need to add: response bodies aren't currently persisted for validation requests (only body_preview truncated to ~500 chars). That's fine — the Explorer reflects what we have, and a "replay" gives the live current response.
08 Build sequence

Four phases. Each ships standalone.

High-level shape here. The task-by-task TODO list lives in the companion document: html_docs/admin-ui-rebuild-plan.html.

Phase | Scope | Ships | Risk
P1 | Step Trace | New /debug/scans/{id} overview view. LLM ledger and audit log become sub-pages. Existing tabs stay accessible behind ?legacy=1 for a release. | low · read-only · joins three tables that all have scan_id indexed.
P2 | LLM Explorer | /debug/scans/{id}/steps/{idx} rendering for LLM steps. Lazy S3 fetch. Re-run panel reuses sandbox.py internals as a service call. | low · wraps existing sandbox infra.
P3 | Validation Explorer | Same route, validation rendering. Per-endpoint expansion uses validation_runs.validation_data. Replay needs the new validation_replays table + a thin service that wraps the validator's library/proxy pool. | medium · replay fires real HTTP. Cost cap + audit + super-admin gating.
P4 | Cleanup | Retire /admin/sandbox/* routes (redirect to step page). Repurpose /admin/evals with in-page explainer. Drop ?legacy=1 escape. | low · mostly route deletion + redirects.

Parallel to this, an independent track ships the LLM Models Registry — an admin page to manage every model (company, version, pricing, frontier/reasoning flags, usage stats). Documented in the companion build plan.

09 Decisions log

What we've already settled.

Every question raised in earlier revisions has been answered. Locked-in decisions below — each shapes one or more build-plan tasks.

# | Question | Decision
D · 01 | Replay cost / safety | [decided] 50 replays per admin per day. No dry-run mode initially; admin can pick datacenter proxy to keep cost low.
D · 02 | Scrubbed-value supply | [decided] Replay form shows [scrubbed] placeholders. Admin types value OR clicks "fire warmup first" to harvest fresh values. Both paths visible.
D · 03 | Re-run cost attribution | [decided] Per-admin account, always. Re-run costs charged to the acting admin's budget and never bleed into the original scan's total. Existing $5/day cap unchanged.
D · 04 | Deterministic re-run scope | [decided] Ship in P3 for detection / analysis / scrub / replay. Useful when rule code has changed.
D · 05 | Deterministic re-run persistence | [decided] Compute-on-fly. No new table. If admin wants a permanent comparison, they screenshot or export.
D · 06 | Inline prompt edits | [decided] Free-form text in v1. Structured diff later if mistakes happen.
D · 07 | Permission boundary | [decided] View routes admin OR owner. Mutation routes (rerun, replay, model edit) super-admin only.
D · 08 | Experiment delete behaviour | [decided] Soft-delete by owner-admin with 30-day grace. Hard-delete super-admin only with required reason. Cascade-delete with parent scan retention.
D · 09 | Models registry fallback | [decided] Keep hardcoded pricing table as compile-time fallback if DB is empty. Migration seeds on first deploy.
D · 10 | Scan list filter | [decided] /admin/scans shows only completed / errored / cancelled scans by default. In-progress scans hidden (toggle to include). Direct URL access still works for any state.
D · 11 | Step coverage | [decided] Build Explorer pages for LLM · validation · deterministic (detection/analysis) · scrub · replay · gate. Skip dedicated Explorer for capture and render: their Trace rows just link to S3 raw / report HTML.
D · 12 | Explorer presentation | [decided] Dedicated full-page route per step (/debug/scans/{id}/steps/{idx}) with sticky breadcrumb back to Trace. Not a modal or drawer: content needs full viewport for LLM and validation views, and deep-linking matters for sharing.
D · 13 | Error / partial / cancelled states | [decided] See §4·g. Steps before failure render normally · failure point in danger color with inline error · subsequent steps as not reached. Partial scans mark failed substeps with partial pill.
D · 14 | Async-action notifications | [decided] Browser Notification API (asks permission on first use, works when tab backgrounded) + in-page toast fallback when permission denied. Triggers: LLM re-run completion (10-15s) · replay completion (2-5s) · model-archive confirmation.
D · 15 | Help affordances | [decided] ? tooltip icon next to every non-obvious panel header + every form field. CSS-only on hover (no JS dep). Same pattern across Trace, Explorer, Registry.
D · 16 | Audit coverage extension | [decided] Existing scan_view_admin stays. Add llm_dump_view (lazy S3 fetch) and replay_form_open (admin opens the replay editor for a non-own scan). Both write to audit_log.
D · 17 | Model archive safeguard | [decided] Archive is blocked if the model is the resolved target for any BROWSER_RECON_LLM_MODEL_* env var or any prompt's default. Admin must unset the env var / change the default first. UI shows which env vars / prompts pin the model.
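
The archive safeguard in D · 17 can be sketched as a lookup of what pins a model. This assumes each BROWSER_RECON_LLM_MODEL_* variable holds the model name directly and that prompt defaults are available as a name → model mapping; any alias resolution the real system performs is not modelled, and the function names are hypothetical.

```python
ENV_PREFIX = "BROWSER_RECON_LLM_MODEL_"


def pins_for_model(model: str, env: dict, prompt_defaults: dict):
    """Return (env var names, prompt names) that currently pin `model`."""
    env_pins = sorted(
        key for key, value in env.items()
        if key.startswith(ENV_PREFIX) and value == model
    )
    prompt_pins = sorted(
        prompt for prompt, default in prompt_defaults.items() if default == model
    )
    return env_pins, prompt_pins


def can_archive(model: str, env: dict, prompt_defaults: dict) -> bool:
    """Archiving is allowed only when nothing pins the model."""
    env_pins, prompt_pins = pins_for_model(model, env, prompt_defaults)
    return not env_pins and not prompt_pins
```

Returning the pin lists rather than a bare boolean lets the UI show which env vars and prompts need changing first. In production the `env` argument would typically be `os.environ`.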

Still genuinely open

None — every question raised has been resolved. New decisions get logged here as they're made.