Build Plan / admin UI rebuild / 2026-05-19 / ~50% of admin surface

Admin UI rebuild · phased TODO.

Step-by-step plan to ship the scan-debug redesign + a new LLM Models Registry. Each task is atomic (one PR). Click any task to mark it done — state persists in localStorage.

Phases: 5
Tasks: atomic
Migrations: 2 new tables
Retired routes: 3 surfaces
Companion doc
STATUS: shipped 2026-05-20. All 5 phases complete. The Step Trace, per-step Explorers (LLM · Validation · Replay · Scrub), the LLM Models Registry, and the Experiments page are live. P5 retired the ?legacy=1 tabbed debug page and the sandbox / eval-matrix templates. The checkboxes below are localStorage-driven and may not reflect this; treat this banner as the source of truth.
Verdict
Ship in five phases. P1–P3 build the new scan-debug surface (Trace · LLM Explorer · Validation Explorer + Replay) — each ships standalone with a ?legacy=1 escape until stable. P4 ships the LLM Models Registry independently (parallel-safe). P5 retires the old sandbox + eval-matrix routes. Total scope ≈ half the admin surface — but split into atomic PRs that each ship safely on their own.
P1 · Step Trace overview · ~1 week · low risk

Show every micro-step in one table.

Replace the existing tabbed overview with a unified Trace. Read-only. Joins scan_events, llm_calls, validation_runs on scan_id. No schema change. Legacy tabs stay accessible via ?legacy=1.

// tasks 0 / 17 complete
P1-T01 [M] Service services/trace_builder.py: join scan_events + llm_calls + validation_runs → ordered list of step dicts. Pure function, unit-testable in isolation.
P1-T02 [S] Pair started / complete rows on scan_events → per-step duration. Handle errored + missing-complete edge cases.
P1-T03 [S] Cost rollup: SUM(llm_calls.cost_usd) within step window. Return total + per-step breakdown.
P1-T04 [S] Nest sub-prompts under parent step. Heuristic v1: notes + difficulty_drivers nest under synthesis by name match. No new column needed yet.
P1-T05 [S] Backend route GET /debug/scans/{id}/trace.json: returns the trace dict. Admin OR owner. Cache key = (scan_id, scan.updated_at).
P1-T06 [M] Template templates/debug_scan_trace.html: main trace table with all columns (#, step, type, started, duration, model, cost, status, s3, explore CTA).
P1-T07 [S] Template: "LLM cost by step" rollup panel (right column of main view).
P1-T08 [S] Template: "Duration breakdown" rollup panel.
P1-T09 [S] Template: sub-pages link strip (audit log · capture timeline · buckets · raw JSON · full report).
P1-T10 [S] Wire GET /debug/scans/{id} to render the new template by default. Old tabbed template kept on disk.
P1-T11 [S] Add ?legacy=1 query param escape: falls back to existing 6-tab template for one release cycle.
P1-T12 [M] Test fixture: persisted scan covering all 9 step types (capture → render) for integration tests.
P1-T13 [M] Tests: trace_builder output shape against fixture · cost rollup arithmetic · errored-step edge case.
P1-T14 [S] Smoke test: template renders without 500 for fixture scan.
P1-T15 [S] Filter /admin/scans list to terminal states only by default (complete/errored/cancelled). Add toggle "include in-progress" for super-admin debugging. Direct URL access to in-progress scan still works.
P1-T16 [M] Search + filter controls on the Trace: filter by step type (llm / validation / deterministic / gate / io) · by status (ok / errored / partial / not_reached) · cost threshold · free-text search across step names + S3 paths. Client-side only (no backend round-trip).
P1-T17 [S] Help-tooltip pattern: ? icon component (CSS-only hover) usable across Trace / Explorer / Registry. Define markup + style once; reuse in every phase. No JS.
Acceptance
A super-admin opens /debug/scans/{id} and sees all 9 pipeline steps in a single table with durations, costs, S3 paths, and Explore CTAs. Sub-pages strip links to audit log / capture timeline / buckets / report. Explore CTAs land on a 404 placeholder until P2. The /admin/scans list excludes in-progress scans by default. Help tooltips are scaffolded for later phases to apply.
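The pairing and rollup described in P1-T02 and P1-T03 can be sketched as a pure function. This is a minimal illustration, not the shipped trace_builder: the event and call dict shapes (`step`, `kind`, `at`, `started_at`) are assumptions, and mapping a missing complete row to the `partial` status is a guess at the intended edge-case handling.

```python
from datetime import datetime, timedelta


def build_steps(events, llm_calls):
    """Pair started/complete rows per step, then sum LLM cost inside each window."""
    open_steps = {}
    steps = []
    for ev in sorted(events, key=lambda e: e["at"]):
        name = ev["step"]
        if ev["kind"] == "started":
            open_steps[name] = ev["at"]
        elif ev["kind"] in ("complete", "errored") and name in open_steps:
            started = open_steps.pop(name)
            steps.append({
                "step": name,
                "started": started,
                "ended": ev["at"],
                "duration_s": (ev["at"] - started).total_seconds(),
                "status": "ok" if ev["kind"] == "complete" else "errored",
            })
    # A started row with no closing row: duration is unknown (assumed "partial").
    for name, started in open_steps.items():
        steps.append({"step": name, "started": started, "ended": None,
                      "duration_s": None, "status": "partial"})
    steps.sort(key=lambda s: s["started"])
    for s in steps:
        window_end = s["ended"] or datetime.max
        s["cost_usd"] = round(
            sum(c["cost_usd"] for c in llm_calls
                if s["started"] <= c["started_at"] <= window_end), 6)
    total = round(sum(s["cost_usd"] for s in steps), 6)
    return {"steps": steps, "total_cost_usd": total}
```

Keeping the function free of database access is what makes the fixture-based tests in P1-T13 cheap: the caller loads rows, and the builder only orders, pairs, and sums them.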
P2 · LLM Explorer + re-run · ~1 week · low risk · depends on P1

Inspect any LLM call. Re-run with a different model in one click.

Per-step inspector for LLM calls. Surfaces system + user + parsed response (lazy S3). Re-run panel reuses sandbox.py's run_experiment internals as a service. Comparison table accumulates every re-run forever.

// tasks 0 / 16 complete
P2-T01 [S] Backend route GET /debug/scans/{id}/steps/{step_idx}: load step from trace_builder, switch render on step type.
P2-T02 [M] Template templates/debug_step_llm.html: main LLM step inspector.
P2-T03 [M] Lazy S3 fetch panel: hits existing /debug/scans/{id}/llm-calls/{call_id}/dump endpoint, renders system + user + response.
P2-T04 [S] Template: timing + metadata panel (prompt_name, version, started_at, ended_at, elapsed, retry_count, S3 paths, copy-curl button).
P2-T05 [S] Service: list LlmExperiment rows where source matches this step's llm_call.id.
P2-T06 [S] Template: "Comparisons" table listing every prior re-run (when, model, edits flag, cost, latency, outcome pill, fired-by, diff CTA).
P2-T07 [M] Backend route POST /debug/scans/{id}/steps/{step_idx}/rerun: wraps existing sandbox.run_experiment. Super-admin only. Cost-cap enforced.
P2-T08 [M] Template: re-run panel with model picker (from registry · see P4) + editable system/user prompts + estimated cost line + fire button.
P2-T09 [S] Template: quick-compare buttons (1-click re-run with same prompts, different model). Reads model list from registry; pre-P4 use hardcoded.
P2-T10 [S] Backend route: experiment detail page (or modal) for inline diff. Reuse existing sandbox_diff.html renderer.
P2-T11 [S] Audit log: every re-run writes an audit_log row with action='step_rerun', metadata {model, edits_flag, cost_usd}.
P2-T12 [M] Tests: rerun handler creates LlmExperiment row · cost-cap denial returns 429 · audit row written.
P2-T13 [S] Smoke test: LLM step page renders for fixture scan + each LLM step.
P2-T14 [M] Browser Notification on re-run completion. Ask permission on first re-run; falls back to in-page toast when permission denied. Single shared notification helper reused in P3.
P2-T15 [S] Audit extension: write an llm_dump_view row when admin lazy-fetches an S3 dump for a non-own scan. Cap at 1 row per (admin, call_id, hour) to avoid log spam.
P2-T16 [S] Apply ? help tooltips (from P1-T17) to every LLM Explorer panel header + form field. Copy lives in a single tooltips.py dict for translation-ready editing.
Acceptance
Super-admin clicks Explore on an LLM step, sees prompts and parsed response, picks a different model from the dropdown, hits Run, sees the new result appear as a row in the Comparisons table. Original scan untouched. Browser notification fires when the re-run finishes. Audit log shows the re-run action and any S3 dump views.
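The llm_dump_view cap from P2-T15 (one row per admin, call_id, and hour) can be sketched as a bucket key plus a seen-set. The sketch assumes "hour" means a UTC clock hour, not a rolling 60 minutes, and uses a hypothetical in-memory class in place of the real audit_log table.

```python
from datetime import datetime, timezone


def dump_view_key(admin_id, call_id, at):
    """Bucket key: (admin, call_id, start of the UTC clock hour)."""
    hour = at.astimezone(timezone.utc).replace(minute=0, second=0, microsecond=0)
    return (admin_id, call_id, hour)


class DumpViewAudit:
    """In-memory stand-in for audit_log (hypothetical, for illustration only)."""

    def __init__(self):
        self.rows = []
        self._seen = set()

    def record(self, admin_id, call_id, at):
        """Write at most one llm_dump_view row per bucket; return True if written."""
        key = dump_view_key(admin_id, call_id, at)
        if key in self._seen:
            return False  # already logged in this hour, skip to avoid log spam
        self._seen.add(key)
        self.rows.append({"action": "llm_dump_view", "admin_id": admin_id,
                          "call_id": call_id, "at": at})
        return True
```

In a database-backed version the same key would more likely be a unique index on (admin_id, call_id, hour bucket), so that concurrent requests cannot both insert.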
P3 · Validation Explorer + replay · ~2 weeks · medium risk · depends on P1

Drill into every validation request. Replay any one.

Surface the per-request data hiding in validation_data.library_comparison. Add an admin-triggered replay that fires real HTTP through the validator's library/proxy pool. Cost-capped, audited, super-admin gated.

// tasks 0 / 25 complete
P3-T01 [S] Migration: new validation_replays table with id, scan_id (FK), endpoint_id, library, proxy_tier, request_headers (JSON), request_body, status, elapsed_ms, body_preview, fired_by, fired_at, deleted_at.
P3-T02 [M] Service parse_library_comparison(): turns validation_data.library_comparison dict into UI-friendly rows (per lib × proxy with status, elapsed, block detection, body preview).
P3-T03 [M] Template templates/debug_step_validation.html: main validation step inspector.
P3-T04 [S] Template: endpoint list (per-endpoint summary row, expand-on-click).
P3-T05 [M] Template: expanded endpoint with per-request matrix (lib × proxy × status × elapsed × block × body preview · replay CTA).
P3-T06 [S] Template: header_reduce panel (which headers required vs optional, with drop-then-test results).
P3-T07 [S] Template: cookie scenario panel (cold / warmup / full with status per scenario).
P3-T08 [S] Template: rate_limit_probe panel (per-round delays, triggers, safe delay estimate, caveat).
P3-T09 [M] Service replay_runner.py: fires one HTTP request via existing library pool (requests / httpx / curl_cffi) through specified proxy tier. Returns same envelope shape as library_compare.
P3-T10 [M] Scrubbed-value handling: replay form shows [scrubbed] placeholders; admin types replacement OR clicks "fire warmup first to harvest fresh values."
P3-T11 [S] Service: per-scan-per-day replay cost cap (proposed 50 replays/scan/day). Raises ReplayCapExceeded → 429.
P3-T12 [M] Backend route POST /admin/scans/{id}/replay-validation: super-admin only. Validates inputs, calls replay_runner, persists validation_replays row.
P3-T13 [M] Template: replay panel with pre-populated form (method, URL, lib, proxy, headers, body), editable fields + fire button + estimated cost.
P3-T14 [S] Template: replay history table (per-endpoint, newest first, diff CTA against original).
P3-T15 [S] Audit: every replay writes audit_log with action='validation_replay', metadata {endpoint_id, library, proxy_tier, status, cost_estimate}.
P3-T16 [S] Soft-delete: replays support deleted_at. Restore within 30 days. Hard-delete super-admin only.
P3-T17 [M] Tests: parse_library_comparison handles partial / blocked / errored cases.
P3-T18 [M] Tests: replay_runner returns expected envelope · cost cap enforced · scrubbed-value substitution.
P3-T19 [S] Smoke test: validation step page renders for blocked endpoint + clean endpoint + partial endpoint.
P3-T20 [M] Scrub Explorer: template templates/debug_step_scrub.html rendering the manifest deltas (counts of stripped cookies / auth headers / form inputs / URL params / response bodies). See §4·e in redesign doc.
P3-T21 [M] Scrub Explorer sample-redactions table: read the scrubbed blob from S3, diff against the unscrubbed blob (if still cached locally), show before/after pairs for ~10 representative redactions. Re-run scrub against current code as a deterministic re-run (no row persisted).
P3-T22 [M] Replay Explorer (pipeline step 12, distinct from per-request replay action): template templates/debug_step_replay.html rendering per-endpoint replay outcomes from scan.replay_per_endpoint + scan.replay_confidence. Diff against captured response shape. See §4·f.
P3-T23 [M] Error / partial / cancelled / in-progress state rendering. Per §4·g: failed-step danger pill + inline error + not reached downstream · cancellation marker · in-progress placeholder with auto-refresh meta tag.
P3-T24 [S] Reuse browser-notification helper from P2-T14 for replay completion events.
P3-T25 [S] Audit extension: replay_form_open action when admin opens the replay editor for a non-own scan. Apply ? tooltips to validation / scrub / replay panels.
Acceptance
Super-admin clicks Explore on validation step, sees every request fired per endpoint, expands one to see the (lib × proxy) matrix with status / latency / block detection. Clicks replay on any individual fire, supplies the scrubbed token, fires it, sees the live response appended to the history. Cost-capped after 50/day. Scrub and pipeline-replay step pages render their respective summaries. Failed and cancelled scans render the right state shapes.
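The scrubbed-value step in P3-T10 amounts to swapping `[scrubbed]` placeholders for values the admin typed and reporting which fields are still unfilled. The sketch below assumes headers arrive as a flat name → value dict; the function name and return shape are illustrative, not taken from replay_runner.

```python
SCRUBBED = "[scrubbed]"


def fill_scrubbed(headers, replacements):
    """Return (filled_headers, missing_names) without mutating the input dict."""
    filled, missing = {}, []
    for name, value in headers.items():
        if SCRUBBED in value and name in replacements:
            # Admin supplied a fresh value for this placeholder.
            filled[name] = value.replace(SCRUBBED, replacements[name])
        else:
            if SCRUBBED in value:
                missing.append(name)  # still scrubbed: the form should block firing
            filled[name] = value
    return filled, missing
```

A replay handler could refuse to fire while `missing` is non-empty, which is the point where the "fire warmup first" option becomes useful.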
P4 · LLM Models Registry (NEW) · ~1 week · low risk · parallel-safe (no dependencies)

Manage every model from one page.

Today, model pricing lives hardcoded in llm/pricing.py and provider routing in llm/routing.py. Moving them to a DB-backed admin UI means: change a price without a deploy, archive a discontinued model with one click, see which models the tool actually uses, and let the LLM Explorer's "+ model" quick-compare buttons stay in sync automatically.
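The pricing arithmetic that moves from the hardcoded table into the registry is a per-million-token calculation. The sketch below is a stand-alone illustration, not the real compute_cost(model, usage): it takes the rates directly, and the usage keys (`input_tokens`, `output_tokens`, `cache_read_tokens`) are assumed names.

```python
from decimal import Decimal

MILLION = Decimal(1_000_000)


def cost_usd(rates, usage):
    """USD cost from per-million-token rates and token counts."""
    total = (
        Decimal(usage.get("input_tokens", 0)) * rates["input_price_per_million"]
        + Decimal(usage.get("output_tokens", 0)) * rates["output_price_per_million"]
        + Decimal(usage.get("cache_read_tokens", 0))
        * rates.get("cache_read_price_per_million", Decimal(0))
    ) / MILLION
    # Keep six decimal places so sub-cent calls do not round to zero.
    return total.quantize(Decimal("0.000001"))


# Rates as shown for claude-sonnet-4-6 in the registry mockup.
sonnet = {
    "input_price_per_million": Decimal("3.00"),
    "output_price_per_million": Decimal("15.00"),
    "cache_read_price_per_million": Decimal("0.30"),
}
example = cost_usd(sonnet, {"input_tokens": 10_000, "output_tokens": 2_000,
                            "cache_read_tokens": 50_000})
```

Using Decimal instead of float keeps the pre- and post-refactor comparison in P4-T16 exact; whether the existing code stores prices as Decimal is not stated in this plan.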

04.1 · ADMIN / LLM MODELS
LLM Models registry
12 active · 3 archived · last sync 4m ago
FILTERS: all (15) · anthropic (5) · openai (4) · xai (2) · google (1) · discontinued (3) · frontier only · reasoning only · + add model
COMPANY | MODEL ID | VER | FLAGS | IN $/M | OUT $/M | CACHE R | CALLS 30D | COST 30D | TESTS 30D | STATUS |
Anthropic | claude-opus-4-7 | 4.7 | F R | $15.00 | $75.00 | $1.50 | 347 | $28.41 | 52 | active | edit
Anthropic | claude-sonnet-4-6 | 4.6 | - | $3.00 | $15.00 | $0.30 | 4,221 | $184.20 | 21 | active | edit
Anthropic | claude-haiku-4-5 | 4.5 | - | $1.00 | $5.00 | $0.10 | 1,883 | $21.40 | 14 | active | edit
Anthropic | claude-3-5-sonnet | 3.5 | - | $3.00 | $15.00 | $0.30 | 0 | $0.00 | 0 | discontinued | edit
OpenAI | gpt-5 | 2025-01 | F | $5.00 | $15.00 | $0.50 | 89 | $2.18 | 89 | active | edit
OpenAI | o3 | - | F R | $60.00 | $240.00 | - | 3 | $1.84 | 3 | active | edit
xAI | grok-3 | 3 | F | $2.00 | $10.00 | - | 17 | $0.42 | 17 | active | edit
LEGEND: F = frontier · R = reasoning · active = available for routing · discontinued = archived, no longer available
// tasks 0 / 19 complete
P4-T01 [M] Migration: new llm_models table. Columns: id, provider_key, company, model_id (unique), display_name, version, is_frontier, is_reasoning, is_discontinued, is_active, supports_structured_output, supports_caching, supports_vision, context_window_tokens, max_output_tokens, input_price_per_million, output_price_per_million, cached_input_price_per_million, cache_read_price_per_million, cache_write_price_per_million, released_at, discontinued_at, notes, created_at, updated_at.
P4-T02 [M] Migration data seed: insert every model from existing llm/pricing.py + llm/routing.py. Test round-trip (seed → load → arithmetic matches).
P4-T03 [M] Service llm/model_resolver.py: load models from DB with 60s in-process cache (mirror pattern in app_config.py). Falls back to hardcoded defaults if DB empty (first-deploy safety).
P4-T04 [M] Refactor llm/pricing.py: compute_cost(model, usage) now reads rates from resolver. Hardcoded table kept as fallback only.
P4-T05 [M] Refactor llm/routing.py: resolve_provider(model) reads provider_key from resolver. Prefix matching kept as fallback only.
P4-T06 [S] Backend route GET /admin/llm-models: list page with filters (company, status, frontier/reasoning) and sort.
P4-T07 [S] Backend route POST /admin/llm-models: create new model. Super-admin only. Validates uniqueness on model_id.
P4-T08 [S] Backend route POST /admin/llm-models/{id}: update model. Bust resolver cache on write.
P4-T09 [S] Backend route POST /admin/llm-models/{id}/archive: set is_discontinued=true, is_active=false, discontinued_at=now(). Bust cache.
P4-T10 [S] Backend route POST /admin/llm-models/{id}/restore: un-archive. Same shape inverted.
P4-T11 [M] Service: aggregation queries for calls_30d + cost_30d from llm_calls; tests_30d from llm_experiments + llm_evals per model. Window: 30 days rolling.
P4-T12 [M] Template templates/admin_llm_models.html: list page with mockup shape (company, model_id, version, flag pills, prices, usage counts, status, edit CTA).
P4-T13 [M] Template templates/admin_llm_model_edit.html: edit/create form with every field. Save / Archive / Cancel actions.
P4-T14 [S] Audit: every model mutation writes audit_log with action='llm_model_create' | 'llm_model_update' | 'llm_model_archive' | 'llm_model_restore'. Metadata captures field deltas on update.
P4-T15 [S] Topbar nav: add "Models" link to admin navigation. Visible to admin + super-admin.
P4-T16 [M] Tests: pricing arithmetic identical pre/post refactor on every existing model · DB-empty fallback works · aggregation counts match raw SQL.
P4-T17 [M] Deploy: migration runs cleanly on prod copy of DB · old hardcoded compute matches new DB-backed compute for the last 30 days of llm_calls rows · canary deploy with rollback plan.
P4-T18 [M] Archive safeguard. Before allowing is_active=false on a model, check: (a) it isn't the value of any BROWSER_RECON_LLM_MODEL_* env var, (b) it isn't the default for any prompt in the routing table. Block the archive with a 409 + structured error listing which pins block it. UI surfaces this as "unset BROWSER_RECON_LLM_MODEL_SCAN_SYNTHESIS first".
P4-T19 [S] Apply ? help tooltips to registry list columns + edit form fields. Define what each pricing field means (per-million tokens, cache rate variants).
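The resolver behaviour in P4-T03 (60s in-process cache, hardcoded fallback when the DB is empty, cache bust on write) can be sketched as below. The class name, the injected loader, and the `FALLBACK_MODELS` entry are hypothetical; the real service mirrors a pattern in app_config.py that this plan does not show.

```python
import time

# Placeholder for the hardcoded defaults kept as first-deploy fallback.
FALLBACK_MODELS = {
    "example-model": {"provider_key": "example", "input_price_per_million": 1.0},
}


class ModelResolver:
    """Caches the model table in process and reloads it after ttl_seconds."""

    def __init__(self, load_from_db, ttl_seconds=60, clock=time.monotonic):
        self._load = load_from_db      # callable returning {model_id: row}
        self._ttl = ttl_seconds
        self._clock = clock            # injectable so tests can control time
        self._cache = None
        self._loaded_at = None

    def models(self):
        now = self._clock()
        if self._cache is None or now - self._loaded_at >= self._ttl:
            rows = self._load()
            # Empty DB (first deploy): fall back to the hardcoded table.
            self._cache = rows if rows else dict(FALLBACK_MODELS)
            self._loaded_at = now
        return self._cache

    def bust(self):
        """Called by the write routes so edits are visible immediately."""
        self._cache = None
```

Note that an in-process cache is only busted in the worker that handled the write; other workers would pick up a price change within the TTL, which is one reading of "the next LLM call costs the new rate".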
Acceptance
Super-admin opens /admin/llm-models, sees every model with company / version / pricing / status / usage counts. Can click Edit on Sonnet 4.6, change input_price_per_million from $3.00 to $2.50, save — and the next LLM call costs the new rate (no deploy). Archive attempts on a pinned model are blocked with a clear error. Archived non-pinned models hide from routing but remain on the list with a discontinued pill.
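The pinned-model check behind the blocked archive (P4-T18) can be sketched as a scan over BROWSER_RECON_LLM_MODEL_* env vars plus the prompt defaults. The `prompt_defaults` mapping and the error payload shape are assumptions; only the env var prefix and the 409 status come from this plan.

```python
import os

ENV_PREFIX = "BROWSER_RECON_LLM_MODEL_"


def archive_blockers(model_id, prompt_defaults, environ=None):
    """List every pin that references model_id, as 'env:NAME' or 'prompt:NAME'."""
    environ = os.environ if environ is None else environ
    pins = [f"env:{key}" for key, value in sorted(environ.items())
            if key.startswith(ENV_PREFIX) and value == model_id]
    pins += [f"prompt:{prompt}" for prompt, model in sorted(prompt_defaults.items())
             if model == model_id]
    return pins


def archive_response(model_id, prompt_defaults, environ=None):
    """Return (status_code, body): 409 with the blocking pins, else 200."""
    pins = archive_blockers(model_id, prompt_defaults, environ)
    if pins:
        return 409, {"error": "model_pinned", "model_id": model_id,
                     "blocked_by": pins}
    return 200, {"archived": model_id}
```

Returning the pin names in the body is what lets the UI render a message such as "unset BROWSER_RECON_LLM_MODEL_SCAN_SYNTHESIS first".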
P5 · Cleanup + retirement · ~2-3 days · low risk · depends on P1–P4 stable

Retire the old surfaces. Repurpose what stays.

Once the new Explorer covers every case sandbox + evals covered, the old routes 302 to the new home. The ?legacy=1 escape hatch comes off. The eval matrix's purpose gets renamed and documented.

// tasks 0 / 8 complete
P5-T01 [S] Route /admin/sandbox/from-call/{id} → 302 to /debug/scans/{scan_id}/steps/{step_idx} (lookup scan_id from LlmCall).
P5-T02 [S] Route /admin/sandbox/{exp_id} → 302 to step detail with experiment focused (deep link).
P5-T03 [M] Route /admin/sandbox/list → repurpose as /admin/experiments: cross-scan filterable index of every re-run.
P5-T04 [S] Route /admin/evals/{scan_id} matrix → 302 to step page (the quick-compare buttons replace the matrix for single-scan use).
P5-T05 [M] Route /admin/evals repurpose: cross-scan fixture-based regression UI. Add in-page explainer paragraph at the top: "Pick a prompt + fixture set; run all current models against it; see cost / quality / drift over time."
P5-T06 [S] Remove ?legacy=1 escape from /debug/scans/{id}. Delete legacy tabbed template debug_scan.html + its partials.
P5-T07 [S] Update internal documentation: redirect any internal notes / runbooks pointing at retired routes.
P5-T08 [S] Final smoke pass: open every old URL, confirm 302 lands on right new URL · open new Explorer for fixture scan covering every step type.
Acceptance
Every legacy route either 302s to its new home or has been repurposed with an in-page explainer. No dead UI. No 404. ?legacy=1 no longer works. Single source of truth for scan debugging is the Explorer.
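The P5-T01 redirect can be sketched as a pure lookup, which also gives the smoke pass in P5-T08 something to assert against. The `call_index` mapping (call id → scan id and step index) stands in for the LlmCall lookup and is hypothetical; the other legacy routes are left out because their exact targets are not spelled out here.

```python
import re

FROM_CALL = re.compile(r"/admin/sandbox/from-call/([^/]+)")


def legacy_redirect(path, call_index):
    """Return (302, new_url) for a known legacy sandbox URL, else None."""
    match = FROM_CALL.fullmatch(path)
    if not match:
        return None  # not a route this sketch handles
    location = call_index.get(match.group(1))
    if location is None:
        return None  # unknown call id: let the caller decide how to respond
    scan_id, step_idx = location
    return 302, f"/debug/scans/{scan_id}/steps/{step_idx}"
```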
Cross-cutting concerns

Things that apply to every phase.

CROSS · 01 · AUTH

Permission boundaries

View-only routes: admin OR owner. Re-run + replay + model mutation: super-admin only. Wrap every new route in require_role. Confirm in Q·07 of the redesign doc.
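The boundary above reduces to two predicates. This is a sketch under assumptions: the role strings and the user and scan dict shapes are invented for illustration, the real require_role signature is not shown in this plan, and treating super-admin as a superset of admin for viewing is an inference.

```python
VIEW_ROLES = {"admin", "super_admin"}  # assumed role names


def can_view(user, scan):
    """View-only routes: admin OR owner of the scan."""
    return user["role"] in VIEW_ROLES or user["id"] == scan["owner_id"]


def can_mutate(user):
    """Re-run, replay, and model mutation: super-admin only."""
    return user["role"] == "super_admin"
```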

CROSS · 02 · AUDIT

Audit every mutation

New action names: step_rerun · validation_replay · llm_model_create · llm_model_update · llm_model_archive · llm_model_restore · experiment_soft_delete · experiment_hard_delete. Metadata captures the action's deltas.
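Capturing deltas in the metadata can be sketched as a before/after diff over the edited fields. The `from` / `to` layout and the row shape are assumptions for illustration, not the existing audit_log schema.

```python
def field_deltas(before, after):
    """Map each changed field to its old and new value; unchanged fields are dropped."""
    return {
        key: {"from": before.get(key), "to": after.get(key)}
        for key in sorted(set(before) | set(after))
        if before.get(key) != after.get(key)
    }


def audit_row(action, actor_id, before, after):
    """Build an audit row whose metadata holds only what changed."""
    return {"action": action, "actor_id": actor_id,
            "metadata": field_deltas(before, after)}
```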

CROSS · 03 · COST

Cost caps

LLM re-runs: existing $5/admin/day (from sandbox). Validation replays: new 50/scan/day cap. Both raise specific exceptions → 429 with a clear message ("daily cap reached, resets at midnight UTC").
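Both caps share one shape: a counter keyed by subject and UTC day that raises once the limit would be exceeded. The sketch below is in-memory and uses a generic `CapExceeded` exception as a stand-in for ReplayCapExceeded and its LLM counterpart; the dollar cap is tracked in cents to avoid float drift.

```python
from datetime import datetime, timedelta, timezone


class CapExceeded(Exception):
    """Stand-in for the specific cap exceptions; maps to HTTP 429."""
    status_code = 429


class DailyCap:
    def __init__(self, limit):
        self.limit = limit
        self._used = {}  # (key, utc_date) -> amount used that day

    def charge(self, key, amount, now):
        """Record usage for key, or raise CapExceeded without recording it."""
        day = now.astimezone(timezone.utc).date()
        used = self._used.get((key, day), 0)
        if used + amount > self.limit:
            reset = datetime.combine(day + timedelta(days=1),
                                     datetime.min.time(), tzinfo=timezone.utc)
            raise CapExceeded(f"daily cap reached, resets at {reset.isoformat()}")
        self._used[(key, day)] = used + amount


replay_cap = DailyCap(limit=50)    # 50 replays per scan per day
rerun_cap = DailyCap(limit=500)    # $5 per admin per day, in cents
```

Keying on the UTC date makes the reset fall at midnight UTC with no scheduled job: a new day simply starts a new counter.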

CROSS · 04 · TESTS

Test fixture

P1-T12 builds a persisted-scan fixture covering all 9 step types. P2–P5 reuse it. Every Explorer page has a smoke test against this fixture.

CROSS · 05 · MIGRATIONS

Migration safety

Two new tables (validation_replays, llm_models) + one optional column (llm_calls.parent_step). Run on a prod-copy DB before any prod migration. Both new tables get a rollback path.

CROSS · 06 · OBSERVABILITY

What we measure

Replay API: success rate, p95 latency, cap-hit rate. Re-run API: experiment success rate, retry rate. Models registry: cache hit rate on resolver. Wire to existing admin dashboard cards.

Dependency map

What blocks what.

P1 · Step Trace · read-only · 17 tasks
P2 · LLM Explorer + re-run · 16 tasks
P3 · Val Explorer + replay · 25 tasks
P4 · LLM Models Registry · independent · 19 tasks · parallel-safe
P5 · Cleanup · retire old · 8 tasks
Critical path: P1 → P2 → P3 → P5 (≈ 4-5 weeks sequential)
P4 runs parallel from week 1 (+1 week of its own)

Build order recommendation

  1. Week 1: Kick off P1 + P4 in parallel. P1 unblocks P2/P3; P4 unblocks the model picker UI in P2.
  2. Week 2: P1 ships. Start P2. P4 continues.
  3. Week 3: P2 + P4 ship. Start P3.
  4. Week 4: P3 finishes (replay is the longest pole).
  5. Week 5: P3 ships. P5 cleanup. Internal docs.
Decisions log

Everything previously open is now answered.

# | Topic | Decision | Phase
D · 01 | Replay cost cap | 50 replays per admin per day. No dry-run mode v1; admin can pick datacenter proxy to keep cost low. | P3
D · 02 | Scrubbed-value supply | Form shows [scrubbed] placeholders; admin types value OR clicks "fire warmup first." | P3
D · 03 | Re-run cost attribution | Per-admin account. Never bleeds into original scan total. $5/day shared cap. | P2 + P3
D · 04 | Deterministic re-run scope | Ship in P3 for detection / analysis / scrub / replay. | P3
D · 05 | Deterministic re-run persistence | Compute-on-fly. No new table. Admin exports or screenshots if needed. | P3
D · 06 | Prompt edit format | Free-form text in v1. | P2
D · 07 | Permission boundary | View admin+owner · mutation super-admin only. | all
D · 08 | Experiment delete | Soft-delete by owner-admin (30d grace). Hard-delete super-admin only with reason. | P2 + P3
D · 09 | Models registry fallback | Hardcoded pricing kept as compile-time fallback if DB empty. | P4
D · 10 | Scan list filter | /admin/scans shows terminal-state only (complete/errored/cancelled). Toggle to include in-progress. | P1
D · 11 | Step coverage | LLM · validation · deterministic · scrub · replay · gate get Explorers. Capture + render: no dedicated Explorer (link to S3 / report directly). | P2 · P3
D · 12 | Explorer presentation | Dedicated full-page route per step. Sticky breadcrumb. No drawer / modal. | P2 onward
D · 13 | Error states | Failed step danger pill + inline error · subsequent steps not reached. Partial scans mark substeps. In-progress placeholder for direct URL access. | P3
D · 14 | Notifications | Browser Notification API + in-page toast fallback. Triggers on re-run / replay / model-archive completion. | P2 · P3
D · 15 | Help affordances | ? tooltip pattern defined in P1-T17. Applied per-phase. | all
D · 16 | Audit coverage | Add llm_dump_view + replay_form_open actions on top of existing scan_view_admin. | P2 · P3
D · 17 | Model archive safeguard | Blocked if model is pinned via env var or prompt default. UI surfaces which pins block it. | P4

Source: scan-debug-redesign.html · §09. All items locked. Adjust here when new questions arise.