Step-by-step plan to ship the scan-debug redesign + a new LLM Models Registry. Each task is atomic (one PR). Click any task to mark it done — state persists in localStorage.
?legacy=1 tabbed debug page and the sandbox / eval-matrix templates. The checkboxes below are localStorage-driven and may not reflect this — treat this banner as the source of truth.
?legacy=1 escape until stable. P4 ships the LLM Models Registry independently (parallel-safe). P5 retires the old sandbox + eval-matrix routes. Total scope ≈ half the admin surface — but split into atomic PRs that each ship safely on their own.
Replace the existing tabbed overview with a unified Trace. Read-only. Joins scan_events, llm_calls, validation_runs on scan_id. No schema change. Legacy tabs stay accessible via ?legacy=1.
- services/trace_builder.py — join scan_events + llm_calls + validation_runs → ordered list of step dicts. Pure function, unit-testable in isolation.
- started / complete rows on scan_events → per-step duration. Handle errored + missing-complete edge cases.
- SUM(llm_calls.cost_usd) within step window. Return total + per-step breakdown.
- notes + difficulty_drivers nest under synthesis by name match. No new column needed yet.
- GET /debug/scans/{id}/trace.json — returns the trace dict. Admin OR owner. Cache key = (scan_id, scan.updated_at).
- templates/debug_scan_trace.html — main trace table with all columns (#, step, type, started, duration, model, cost, status, s3, explore CTA).
- GET /debug/scans/{id} to render the new template by default. Old tabbed template kept on disk.
- ?legacy=1 query param escape — falls back to existing 6-tab template for one release cycle.
- trace_builder output shape against fixture · cost rollup arithmetic · errored-step edge case.
- /admin/scans list to terminal states only by default (complete/errored/cancelled). Add toggle "include in-progress" for super-admin debugging. Direct URL access to in-progress scan still works.
- ? icon component (CSS-only hover) usable across Trace / Explorer / Registry. Define markup + style once; reuse in every phase. No JS.

/debug/scans/{id} and sees all 9 pipeline steps in a single table with durations, costs, S3 paths, and Explore CTAs. Sub-pages strip links to audit log / capture timeline / buckets / report. Explore CTAs land on a 404 placeholder until P2. The /admin/scans list excludes in-progress scans by default. Help tooltips are scaffolded for later phases to apply.
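The duration pairing and cost rollup described for the Trace can be sketched as a pure function. This is a minimal illustration, not the planned trace_builder: the row shapes (`step`, `kind`, `at`, `cost_usd` keys), the `build_trace` name, and the choice to leave calls unattributed when a step has no closing row are all assumptions.

```python
from datetime import datetime
from decimal import Decimal


def build_trace(events, llm_calls):
    """Pair started/complete rows into steps and roll up LLM cost per step.

    events:    dicts with hypothetical keys step, kind ('started' |
               'complete' | 'errored') and at (datetime).
    llm_calls: dicts with hypothetical keys at (datetime) and cost_usd (Decimal).
    """
    # Group timestamps by step name; dict order keeps steps in first-seen order.
    by_step = {}
    for ev in sorted(events, key=lambda e: e["at"]):
        by_step.setdefault(ev["step"], {})[ev["kind"]] = ev["at"]

    steps = []
    for name, marks in by_step.items():
        started = marks.get("started")
        ended = marks.get("complete") or marks.get("errored")
        if "errored" in marks:
            status = "errored"
        elif "complete" in marks:
            status = "complete"
        else:
            status = "incomplete"  # missing-complete edge case

        has_window = started is not None and ended is not None
        duration = (ended - started).total_seconds() if has_window else None
        # Assumption: only calls inside a closed [started, ended] window count.
        cost = sum(
            (c["cost_usd"] for c in llm_calls
             if has_window and started <= c["at"] <= ended),
            Decimal("0"),
        )
        steps.append({
            "step": name,
            "started": started,
            "duration_s": duration,
            "status": status,
            "cost_usd": cost,
        })

    total = sum((s["cost_usd"] for s in steps), Decimal("0"))
    return {"steps": steps, "total_cost_usd": total}
```

Because the function takes plain lists and returns a plain dict, it can be tested against a fixture without a database, which is the property the plan asks for.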
Per-step inspector for LLM calls. Surfaces system + user + parsed response (lazy S3). Re-run panel reuses sandbox.py's run_experiment internals as a service. Comparison table accumulates every re-run forever.
- GET /debug/scans/{id}/steps/{step_idx} — load step from trace_builder, switch render on step type.
- templates/debug_step_llm.html — main LLM step inspector.
- /debug/scans/{id}/llm-calls/{call_id}/dump endpoint, renders system + user + response.
- LlmExperiment rows where source matches this step's llm_call.id.
- POST /debug/scans/{id}/steps/{step_idx}/rerun — wraps existing sandbox.run_experiment. Super-admin only. Cost-cap enforced.
- sandbox_diff.html renderer.
- audit_log row with action='step_rerun', metadata {model, edits_flag, cost_usd}.
- LlmExperiment row · cost-cap denial returns 429 · audit row written.
- llm_dump_view row when admin lazy-fetches an S3 dump for a non-own scan. Cap at 1 row per (admin, call_id, hour) to avoid log spam.
- ? help tooltips (from P1-T17) to every LLM Explorer panel header + form field. Copy lives in a single tooltips.py dict for translation-ready editing.

Surface the per-request data hiding in validation_data.library_comparison. Add an admin-triggered replay that fires real HTTP through the validator's library/proxy pool. Cost-capped, audited, super-admin gated.
- validation_replays table — id, scan_id (FK), endpoint_id, library, proxy_tier, request_headers (JSON), request_body, status, elapsed_ms, body_preview, fired_by, fired_at, deleted_at.
- parse_library_comparison() — turns validation_data.library_comparison dict into UI-friendly rows (per lib × proxy with status, elapsed, block detection, body preview).
- templates/debug_step_validation.html — main validation step inspector.
- replay_runner.py — fires one HTTP via existing library pool (requests / httpx / curl_cffi) through specified proxy tier. Returns same envelope shape as library_compare.
- [scrubbed] placeholders; admin types replacement OR clicks "fire warmup first to harvest fresh values."
- ReplayCapExceeded → 429.
- POST /admin/scans/{id}/replay-validation — super-admin only. Validates inputs, calls replay_runner, persists validation_replays row.
- audit_log with action='validation_replay', metadata {endpoint_id, library, proxy_tier, status, cost_estimate}.
- deleted_at. Restore within 30 days. Hard-delete super-admin only.
- parse_library_comparison handles partial / blocked / errored cases.
- templates/debug_step_scrub.html rendering the manifest deltas (counts of stripped cookies / auth headers / form inputs / URL params / response bodies). See §4·e in redesign doc.
- templates/debug_step_replay.html rendering per-endpoint replay outcomes from scan.replay_per_endpoint + scan.replay_confidence. Diff against captured response shape. See §4·f.
- replay_form_open action when admin opens the replay editor for a non-own scan. Apply ? tooltips to validation / scrub / replay panels.

Today, model pricing lives hardcoded in llm/pricing.py and provider routing in llm/routing.py. Moving them to a DB-backed admin UI means: change a price without a deploy, archive a discontinued model with one click, see which models the tool actually uses, and let the LLM Explorer's "+ model" quick-compare buttons stay in sync automatically.
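To make "change a price without a deploy" concrete, here is a minimal sketch of DB-backed rates behind a short in-process cache with a hardcoded fallback. Everything named here is hypothetical: the `ModelResolver` class, the `load_rows` callable standing in for a DB query, the `estimate_cost` helper, and the 60-second TTL chosen for the sketch. It is not the existing llm/pricing.py API.

```python
import time
from decimal import Decimal

# Hypothetical hardcoded table, used only when the DB returns no rows.
FALLBACK_RATES = {
    "example-model": {"input": Decimal("3.00"), "output": Decimal("15.00")},
}


class ModelResolver:
    """Caches per-model rates in process; reloads after ttl_seconds."""

    def __init__(self, load_rows, ttl_seconds=60.0, clock=time.monotonic):
        self._load_rows = load_rows  # callable returning a list of row dicts
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache = None
        self._loaded_at = 0.0

    def rates(self):
        now = self._clock()
        if self._cache is None or now - self._loaded_at >= self._ttl:
            loaded = {
                row["model_id"]: {
                    "input": row["input_price_per_million"],
                    "output": row["output_price_per_million"],
                }
                for row in self._load_rows()
            }
            # An empty DB (first deploy) falls back to the hardcoded table.
            self._cache = loaded or dict(FALLBACK_RATES)
            self._loaded_at = now
        return self._cache

    def bust(self):
        """Drop the cache so the next read reloads (call after an admin write)."""
        self._cache = None


def estimate_cost(resolver, model_id, input_tokens, output_tokens):
    """Cost in USD from per-million-token rates."""
    rate = resolver.rates()[model_id]
    return (rate["input"] * input_tokens
            + rate["output"] * output_tokens) / Decimal(1_000_000)
```

With this shape, an admin edit only has to write the row and call `bust()`; readers that miss the bust still pick up the new rate once the TTL expires.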
| COMPANY | MODEL ID | VER | FLAGS | IN $/M | OUT $/M | CACHE R | CALLS 30D | COST 30D | TESTS 30D | STATUS | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Anthropic | claude-opus-4-7 | 4.7 | F R | $15.00 | $75.00 | $1.50 | 347 | $28.41 | 52 | active | edit |
| Anthropic | claude-sonnet-4-6 | 4.6 | — | $3.00 | $15.00 | $0.30 | 4,221 | $184.20 | 21 | active | edit |
| Anthropic | claude-haiku-4-5 | 4.5 | — | $1.00 | $5.00 | $0.10 | 1,883 | $21.40 | 14 | active | edit |
| Anthropic | claude-3-5-sonnet | 3.5 | — | $3.00 | $15.00 | $0.30 | 0 | $0.00 | 0 | discontinued | edit |
| OpenAI | gpt-5 | 2025-01 | F | $5.00 | $15.00 | $0.50 | 89 | $2.18 | 89 | active | edit |
| OpenAI | o3 | — | F R | $60.00 | $240.00 | — | 3 | $1.84 | 3 | active | edit |
| xAI | grok-3 | 3 | F | $2.00 | $10.00 | — | 17 | $0.42 | 17 | active | edit |
- llm_models table. Columns: id, provider_key, company, model_id (unique), display_name, version, is_frontier, is_reasoning, is_discontinued, is_active, supports_structured_output, supports_caching, supports_vision, context_window_tokens, max_output_tokens, input_price_per_million, output_price_per_million, cached_input_price_per_million, cache_read_price_per_million, cache_write_price_per_million, released_at, discontinued_at, notes, created_at, updated_at.
- llm/pricing.py + llm/routing.py. Test round-trip (seed → load → arithmetic matches).
- llm/model_resolver.py — load models from DB with 60s in-process cache (mirror pattern in app_config.py). Falls back to hardcoded defaults if DB empty (first-deploy safety).
- llm/pricing.py:compute_cost(model, usage) now reads rates from resolver. Hardcoded table kept as fallback only.
- llm/routing.py:resolve_provider(model) reads provider_key from resolver. Prefix matching kept as fallback only.
- GET /admin/llm-models — list page with filters (company, status, frontier/reasoning) and sort.
- POST /admin/llm-models — create new model. Super-admin only. Validates uniqueness on model_id.
- POST /admin/llm-models/{id} — update model. Bust resolver cache on write.
- POST /admin/llm-models/{id}/archive — set is_discontinued=true, is_active=false, discontinued_at=now(). Bust cache.
- POST /admin/llm-models/{id}/restore — un-archive. Same shape inverted.
- calls_30d + cost_30d from llm_calls; tests_30d from llm_experiments + llm_evals per model. Window: 30 days rolling.
- templates/admin_llm_models.html — list page with mockup shape (company, model_id, version, flag pills, prices, usage counts, status, edit CTA).
- templates/admin_llm_model_edit.html — edit/create form with every field. Save / Archive / Cancel actions.
- audit_log — action='llm_model_create' | 'llm_model_update' | 'llm_model_archive' | 'llm_model_restore'. Metadata captures field deltas on update.
- llm_calls rows · canary deploy with rollback plan.
- is_active=false on a model, check: (a) it isn't the value of any BROWSER_RECON_LLM_MODEL_* env var, (b) it isn't the default for any prompt in the routing table. Block the archive with a 409 + structured error listing which pins block it. UI surfaces this as "unset BROWSER_RECON_LLM_MODEL_SCAN_SYNTHESIS first".
- ? help tooltips to registry list columns + edit form fields. Define what each pricing field means (per-million tokens, cache rate variants).

/admin/llm-models, sees every model with company / version / pricing / status / usage counts. Can click Edit on Sonnet 4.6, change input_price_per_million from $3.00 to $2.50, save — and the next LLM call costs the new rate (no deploy). Archive attempts on a pinned model are blocked with a clear error. Archived non-pinned models hide from routing but remain on the list with a discontinued pill.
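The archive safeguard (block when a model is pinned by a BROWSER_RECON_LLM_MODEL_* env var or is a prompt default) could look roughly like the sketch below. The function names, the `prompt_defaults` mapping, and the error payload shape are assumptions for illustration; only the env var prefix, the two checks, and the 409 status come from the plan.

```python
ENV_PREFIX = "BROWSER_RECON_LLM_MODEL_"


def find_archive_blockers(model_id, environ, prompt_defaults):
    """List every pin that references model_id.

    environ:         mapping of env var name -> value.
    prompt_defaults: hypothetical mapping of prompt name -> default model_id,
                     standing in for the routing table.
    """
    blockers = []
    # (a) any BROWSER_RECON_LLM_MODEL_* env var set to this model
    for name, value in sorted(environ.items()):
        if name.startswith(ENV_PREFIX) and value == model_id:
            blockers.append({"kind": "env_var", "name": name})
    # (b) any prompt whose default model is this model
    for prompt, default_model in sorted(prompt_defaults.items()):
        if default_model == model_id:
            blockers.append({"kind": "prompt_default", "name": prompt})
    return blockers


def archive_response(model_id, environ, prompt_defaults):
    """Return (status_code, body) for an archive attempt."""
    blockers = find_archive_blockers(model_id, environ, prompt_defaults)
    if blockers:
        # Structured error so the UI can name each pin to unset first.
        return 409, {
            "error": "model_pinned",
            "model_id": model_id,
            "blocked_by": blockers,
        }
    return 200, {
        "model_id": model_id,
        "is_active": False,
        "is_discontinued": True,
    }
```

In a real handler, `environ` would presumably be `os.environ`; passing it in explicitly keeps the check testable without touching the process environment.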
Once the new Explorer covers every case sandbox + evals covered, the old routes 302 to the new home. The ?legacy=1 escape hatch comes off. The eval matrix's purpose gets renamed and documented.
- /admin/sandbox/from-call/{id} → 302 to /debug/scans/{scan_id}/steps/{step_idx} (lookup scan_id from LlmCall).
- /admin/sandbox/{exp_id} → 302 to step detail with experiment focused (deep link).
- /admin/sandbox/list → repurpose as /admin/experiments: cross-scan filterable index of every re-run.
- /admin/evals/{scan_id} matrix → 302 to step page (the quick-compare buttons replace the matrix for single-scan use).
- /admin/evals repurpose: cross-scan fixture-based regression UI. Add in-page explainer paragraph at the top: "Pick a prompt + fixture set; run all current models against it; see cost / quality / drift over time."
- ?legacy=1 escape from /debug/scans/{id}. Delete legacy tabbed template debug_scan.html + its partials.

?legacy=1 no longer works. Single source of truth for scan debugging is the Explorer.
View-only routes: admin OR owner. Re-run + replay + model mutation: super-admin only. Wrap every new route in require_role. Confirm in Q·07 of the redesign doc.
New action names: step_rerun · validation_replay · llm_model_create · llm_model_update · llm_model_archive · llm_model_restore · llm_dump_view · replay_form_open · experiment_soft_delete · experiment_hard_delete. Metadata captures the action's deltas.
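The plan caps llm_dump_view audit rows at one per (admin, call_id, hour). A minimal sketch of that dedup follows, under two assumptions: "hour" means the UTC clock hour rather than a rolling 60 minutes, and an in-memory set stands in for whatever uniqueness check the real audit_log would use. The `DumpViewAuditor` class is hypothetical.

```python
from datetime import datetime, timezone


class DumpViewAuditor:
    """Writes at most one llm_dump_view row per (admin, call_id, UTC hour)."""

    def __init__(self):
        self._seen = set()  # stand-in for a DB uniqueness check
        self.rows = []      # stand-in for audit_log inserts

    def record(self, admin_id, call_id, at):
        """Return True if a row was written, False if it was deduplicated.

        `at` must be a timezone-aware datetime.
        """
        bucket = at.astimezone(timezone.utc).replace(
            minute=0, second=0, microsecond=0
        )
        key = (admin_id, call_id, bucket)
        if key in self._seen:
            return False
        self._seen.add(key)
        self.rows.append({
            "action": "llm_dump_view",
            "admin_id": admin_id,
            "call_id": call_id,
            "at": at,
        })
        return True
```

Bucketing by clock hour keeps the key deterministic, at the price of allowing two rows a few minutes apart when a view straddles the top of the hour.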
LLM re-runs: existing $5/admin/day (from sandbox). Validation replays: new 50/scan/day cap. Both raise specific exceptions → 429 with a clear message ("daily cap reached, resets at midnight UTC").
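As an illustration of the cap-to-429 flow, the sketch below tracks the $5/admin/day re-run budget keyed by UTC date, so the counter resets at midnight UTC. The `DailySpendCap` class, the `RerunCapExceeded` name, and the response body shape are hypothetical; an in-memory dict stands in for persistent storage.

```python
from datetime import datetime, timezone
from decimal import Decimal


class RerunCapExceeded(Exception):
    """Raised when a charge would push an admin past the daily limit."""


class DailySpendCap:
    def __init__(self, limit_usd=Decimal("5.00")):
        self.limit_usd = limit_usd
        self._spent = {}  # (admin_id, UTC date) -> Decimal spent so far

    def charge(self, admin_id, cost_usd, now):
        """Record a spend, or raise if it would exceed today's cap.

        `now` must be a timezone-aware datetime.
        """
        day = now.astimezone(timezone.utc).date()
        key = (admin_id, day)
        spent = self._spent.get(key, Decimal("0"))
        if spent + cost_usd > self.limit_usd:
            raise RerunCapExceeded("daily cap reached, resets at midnight UTC")
        self._spent[key] = spent + cost_usd
        return self._spent[key]


def to_http(exc):
    """Map a cap exception to a (status_code, body) pair."""
    return 429, {"detail": str(exc)}
```

Keying on the UTC date means no reset job is needed: yesterday's entries are simply never read again.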
P1-T12 builds a persisted-scan fixture covering all 9 step types. P2–P5 reuse it. Every Explorer page has a smoke test against this fixture.
Two new tables (validation_replays, llm_models) + one optional column (llm_calls.parent_step). Run on a prod-copy DB before any prod migration. Both new tables get a rollback path.
Replay API: success rate, p95 latency, cap-hit rate. Re-run API: experiment success rate, retry rate. Models registry: cache hit rate on resolver. Wire to existing admin dashboard cards.
| # | Topic | Decision | Phase |
|---|---|---|---|
| D · 01 | Replay cost cap | 50 replays per admin per day. No dry-run mode v1 — admin can pick datacenter proxy to keep cost low. | P3 |
| D · 02 | Scrubbed-value supply | Form shows [scrubbed] placeholders; admin types value OR clicks "fire warmup first." | P3 |
| D · 03 | Re-run cost attribution | Per-admin account. Never bleeds into original scan total. $5/day shared cap. | P2 + P3 |
| D · 04 | Deterministic re-run scope | Ship in P3 for detection / analysis / scrub / replay. | P3 |
| D · 05 | Deterministic re-run persistence | Compute-on-fly. No new table. Admin exports or screenshots if needed. | P3 |
| D · 06 | Prompt edit format | Free-form text in v1. | P2 |
| D · 07 | Permission boundary | View admin+owner · mutation super-admin only. | all |
| D · 08 | Experiment delete | Soft-delete by owner-admin (30d grace). Hard-delete super-admin only with reason. | P2 + P3 |
| D · 09 | Models registry fallback | Hardcoded pricing kept as compile-time fallback if DB empty. | P4 |
| D · 10 | Scan list filter | /admin/scans shows terminal-state only (complete/errored/cancelled). Toggle to include in-progress. | P1 |
| D · 11 | Step coverage | LLM · validation · deterministic · scrub · replay · gate get Explorers. Capture + render: no dedicated Explorer (link to S3 / report directly). | P2 · P3 |
| D · 12 | Explorer presentation | Dedicated full-page route per step. Sticky breadcrumb. No drawer / modal. | P2 onward |
| D · 13 | Error states | Failed step danger pill + inline error · subsequent steps not reached. Partial scans mark substeps. In-progress placeholder for direct URL access. | P3 |
| D · 14 | Notifications | Browser Notification API + in-page toast fallback. Triggers on re-run / replay / model-archive completion. | P2 · P3 |
| D · 15 | Help affordances | ? tooltip pattern defined in P1-T17. Applied per-phase. | all |
| D · 16 | Audit coverage | Add llm_dump_view + replay_form_open actions on top of existing scan_view_admin. | P2 · P3 |
| D · 17 | Model archive safeguard | Blocked if model is pinned via env var or prompt default. UI surfaces which pins block it. | P4 |
Source: scan-debug-redesign.html · §09. All items locked. Adjust here when new questions arise.