Metadata-Version: 2.4
Name: probat-mcp
Version: 0.1.0
Summary: Visual UI snip tool for A/B variant generation — drag to select a component, get design variants instantly
License: MIT
Keywords: a/b testing,design,mcp,playwright,ui
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Requires-Python: >=3.11
Requires-Dist: httpx>=0.27.0
Requires-Dist: mcp[cli]>=1.0.0
Requires-Dist: playwright>=1.40.0
Description-Content-Type: text/markdown

<p align="center">
  <img src="images/orpheus_logo.png" alt="Orpheus Logo" width="200">
</p>

# Orpheus — Technical Design Document

**Version:** 1.0
**Date:** 2026-03-19
**Status:** Draft

---

## Table of Contents

1. Executive Summary
2. Problem Statement
3. System Overview
4. Component Design
5. Data Models
6. Retrieval Pipeline
7. Learning-to-Rank System
8. Attribution System
9. Human-in-the-Loop Flow
10. Multi-Metric Ingestion
11. Corpus Design
12. Corpus Quality Validation
13. Corpus Improvement Loops
14. Integration Architecture (PostHog, GA4)
15. Limitations and Known Tradeoffs
16. React SDK Expansion Plan
17. One-Week MVP
18. 4-Week Build Plan (Compressed Rollout)
19. Project Structure

---

## 1. Executive Summary

Design decisions in UI/UX work are currently made through a combination of intuition, tribal knowledge, and periodic A/B testing — but these three information sources rarely talk to each other. A designer proposes a variant, engineers run the test, results go into a spreadsheet, and the next designer starts from scratch. The A/B UI Variant Recommender closes this loop: it generates design variants informed by what has actually worked, improves from every test result and user selection it observes, and always explains exactly why it proposed what it proposed.

The system operates as an MCP (Model Context Protocol) server, integrating directly into AI coding tools like Claude Code, Cursor, and Windsurf. When a developer or designer asks for UI variants for a hero section, pricing table, or CTA button, the system retrieves the top candidates per domain from a curated corpus (style, color, typography, layout, UX patterns), anchors Variant 1 on the `ui-reasoning.csv` expert recommendation for that product/page context, then selects Variants 2 and 3 by maximizing embedding distance from already-chosen docs — guaranteeing visual diversity without pigeonholing into pre-defined strategy buckets. Strategy labels (e.g. "bold", "layered") are assigned post-hoc as descriptive hypotheses, not used to constrain retrieval. A learning-to-rank model trained on selection events, edit events, and A/B outcomes re-ranks future recommendations. Every variant ships with full attribution tracing back to specific corpus documents and experiments. The customer selects one, optionally edits it, runs the A/B test, and the result flows back into the model.

Three non-negotiable requirements drive every architectural decision: ideation quality (the variants must be genuinely good starting points), adaptability (the system must improve from outcomes), and referenceability (every recommendation must be traceable to specific evidence a human can read and evaluate).

---

## 2. Problem Statement

### Requirements

**R1 — Good ideation**
Given a query describing a UI component and context, the system must return n variants that are meaningfully different from each other and represent genuinely good design starting points. "Good" means: consistent with industry best practices, appropriate for the product type and page context, and diverse enough that running an A/B test between them would yield actionable learning.

**R2 — Adaptability**
Recommendations must improve over time as A/B test results and user selections accumulate. A customer who has run 50 experiments should receive materially better recommendations than on day 1. The learning must be sample-efficient — A/B tests are expensive and slow (days to weeks per experiment).

**R3 — Referenceability**
Every variant must be accompanied by an explanation that a developer or PM can evaluate. "The model thinks this is good" is not acceptable. Acceptable: "This pattern has been selected 8 times over alternatives in similar contexts, and experiment exp-291 (acme-corp, checkout CTA, +23% CVR) contributes 0.18 weight to this recommendation."

### Why Existing Tools Don't Solve This

- **Design tools** (Figma, Framer): great for visual design, no connection to conversion data
- **A/B testing tools** (Optimizely, VWO): measure variants, don't generate them
- **AI coding tools** (Copilot, Cursor): generate UI code with no design intelligence
- **Component libraries** (shadcn, MUI): solve consistency, not ideation or optimization

No existing tool owns the gap between "what should I try" and "what actually worked."

---

## 3. System Overview

![Orpheus architecture](images/orpheus-architecture.svg)

```
┌──────────────────────────────────────────────────────────────────────────────┐
│                              ORPHEUS MCP SERVER                              │
│                                                                              │
│  ┌────────────────────────────────────────────────────────────────────────┐  │
│  │              RETRIEVAL — per domain, unconstrained                     │  │
│  │                                                                        │  │
│  │  Query ──┬── BM25 (sparse) ──┐                                        │  │
│  │          └── Dense (ChromaDB)─┴── RRF Fusion → top 8 per domain       │  │
│  │                                                                        │  │
│  │  Domains: style · color · typography · landing · ux                   │  │
│  │  One shared embedding per request (single OpenAI call)                 │  │
│  └────────────────────────┬───────────────────────────────────────────────┘  │
│                           │  top 8 candidates × 5 domains                   │
│  ┌────────────────────────▼───────────────────────────────────────────────┐  │
│  │                   CORPUS-ANCHORED COMPOSITION                          │  │
│  │                                                                        │  │
│  │  ┌─────────────────────────────────────────────────────┐               │  │
│  │  │  VARIANT 1 — ui-reasoning.csv anchor                │               │  │
│  │  │                                                     │               │  │
│  │  │  lookup(product_type, page_context)                 │               │  │
│  │  │    → Style_Priority, Color_Mood, Typography_Mood    │               │  │
│  │  │  select_best_match(candidates, reasoning_keywords)  │               │  │
│  │  │    → 1 doc per domain                               │               │  │
│  │  └──────────────────────┬──────────────────────────────┘               │  │
│  │                         │ used_doc_ids = {v1 doc ids}                  │  │
│  │  ┌──────────────────────▼──────────────────────────────┐               │  │
│  │  │  VARIANT 2 — max embedding distance from V1          │               │  │
│  │  │                                                     │               │  │
│  │  │  for each domain:                                   │               │  │
│  │  │    argmax cosine_distance(candidate, v1_doc)        │               │  │
│  │  │    where candidate.doc_id not in used_doc_ids       │               │  │
│  │  └──────────────────────┬──────────────────────────────┘               │  │
│  │                         │ used_doc_ids += {v2 doc ids}                 │  │
│  │  ┌──────────────────────▼──────────────────────────────┐               │  │
│  │  │  VARIANT 3 — max distance from V1 and V2            │               │  │
│  │  │                                                     │               │  │
│  │  │  for each domain:                                   │               │  │
│  │  │    argmax min(dist(c, v1), dist(c, v2))             │               │  │
│  │  │    where candidate.doc_id not in used_doc_ids       │               │  │
│  │  └──────────────────────┬──────────────────────────────┘               │  │
│  │                         │                                               │  │
│  │  ┌──────────────────────▼──────────────────────────────┐               │  │
│  │  │  POST-HOC LABELING                                  │               │  │
│  │  │  assign nearest strategy label to each bundle       │               │  │
│  │  │  (descriptive only — does not affect selection)     │               │  │
│  │  │  apply brand_constraints (filter + override)        │               │  │
│  │  └──────────────────────┬──────────────────────────────┘               │  │
│  └────────────────────────┬───────────────────────────────────────────────┘  │
│                           │ 3 ComposedVariants + full attribution            │
│  ┌────────────────────────▼───────────────────────────────────────────────┐  │
│  │                  LTR RE-RANKING (bundle-level)                         │  │
│  │                                                                        │  │
│  │  Features: style_doc_win_rate, color_doc_win_rate,                    │  │
│  │            style_color_combo_win_rate, context_similarity              │  │
│  │                                                                        │  │
│  │  LambdaRank → re-order 3 bundles by predicted win probability         │  │
│  │  (global baseline → per-customer model as data accumulates)            │  │
│  └────────────────────────┬───────────────────────────────────────────────┘  │
│                           ▼                                                  │
│              3 ranked ComposedVariants with full attribution                 │
└───────────────────────────┬──────────────────────────────────────────────────┘
                            │
            ┌───────────────┼───────────────┐
            ▼               ▼               ▼
     ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
     │  VS Code    │ │  Web App    │ │  MCP Tool   │
     │  Webview    │ │  Dashboard  │ │  Response   │
     │  (engineer) │ │  (PM/design)│ │  (API)      │
     └──────┬──────┘ └──────┬──────┘ └─────────────┘
            │               │
            ▼               ▼
  ┌──────────────────────────────────────────────────┐
  │                 FEEDBACK SIGNALS                 │
  │                                                  │
  │  1. select  → selection_events  (bundle-level)   │
  │              pairwise labels, weight 0.3         │
  │                                                  │
  │  2. edit    → edit_events  (domain-level) ◄───── │─ most granular signal
  │     swap_doc:  original_doc → replacement_doc    │
  │     freeform:  original_doc → corpus gap flagged │
  │     partial:   same doc, minor tweak (0.3× wt)  │
  │                                                  │
  │  3. A/B     → ab_results  (CVR, return_7d, ...)  │
  │              staged: day 0 / day 7 / day 30      │
  │                                                  │
  │  4. SDK     → probat_events  (rage, dead clicks) │
  │  5. heatmap → heatmap_clicks                     │──► Supabase
  └────────────────────┬─────────────────────────────┘
                       │
          ┌────────────┴────────────┐
          ▼                         ▼
  ┌──────────────────┐   ┌────────────────────────┐
  │  NIGHTLY WIN RATE │   │  LTR RETRAINING        │
  │  UPDATE           │   │                        │
  │                   │   │  selection  ──┐        │
  │  update           │   │  edits      ──┼─► ltr_ │
  │  global_win_rate  │   │  ab_results ──┘  training_examples
  │  in ChromaDB      │   │                        │
  │  metadata per doc │   │  Confidence decay:     │
  │                   │   │  half-life = 90 days   │
  └──────────────────┘   └────────────────────────┘
```
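
The diagram states a confidence-decay half-life of 90 days but not the shape of the curve. A minimal sketch, assuming plain exponential decay applied to a training example's weight; the function name `decayed_weight` and its parameters are hypothetical and do not come from the codebase:

```python
from datetime import datetime, timedelta, timezone

HALF_LIFE_DAYS = 90.0  # value taken from the diagram above


def decayed_weight(base_weight: float, recorded_at: datetime, now: datetime) -> float:
    """Halve an example's weight every HALF_LIFE_DAYS (assumed exponential decay)."""
    age_days = max((now - recorded_at).total_seconds() / 86400.0, 0.0)
    return base_weight * 0.5 ** (age_days / HALF_LIFE_DAYS)


now = datetime(2026, 3, 19, tzinfo=timezone.utc)
fresh = decayed_weight(1.0, now, now)                                   # no decay yet
one_half_life = decayed_weight(1.0, now - timedelta(days=90), now)      # half the weight
selection_180d = decayed_weight(0.3, now - timedelta(days=180), now)    # a quarter of 0.3
```

Under this assumption a selection event (base weight 0.3) that is two half-lives old contributes a quarter of its original weight to retraining.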

---

## 4. Component Design

### 4.1 Corpus (Knowledge Base)

**What it is:** A collection of ~2,000 prose documents describing design patterns, serialized from CSV databases covering 10 domains (style, color, typography, charts, landing patterns, product types, UX guidelines, icons, React performance, app interface guidelines).

**Why prose, not raw CSV:** Embedding models produce higher-quality vectors when they see coherent language. `"Style: Glassmorphism. Keywords: frosted glass, transparent panels, blur. Best For: SaaS dashboards, fintech apps."` embeds better than `"Glassmorphism,frosted glass,transparent,blur,SaaS"`.
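
The serialization step can be sketched as follows. This is an illustration only: `row_to_prose` is a hypothetical helper, and the column names are inferred from the example string above rather than from the actual CSV files.

```python
import csv
import io


def row_to_prose(row: dict[str, str]) -> str:
    """Serialize one CSV row as 'Column: value.' sentences, skipping empty cells."""
    parts = []
    for column, value in row.items():
        value = (value or "").strip()
        if value:
            parts.append(f"{column}: {value.rstrip('.')}.")
    return " ".join(parts)


# Hypothetical excerpt of a styles CSV (header row plus one data row)
sample_csv = io.StringIO(
    "Style,Keywords,Best For\n"
    'Glassmorphism,"frosted glass, transparent panels, blur","SaaS dashboards, fintech apps"\n'
)
docs = [row_to_prose(row) for row in csv.DictReader(sample_csv)]
print(docs[0])
```

Keeping the column name next to each value gives the embedding model a labeled sentence instead of a bare comma-separated list.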

**Why not a relational DB:** The documents are the retrieval unit. You need embeddings per document, BM25 scores per document, and LTR features per document. A flat document store (ChromaDB) is the correct abstraction.

**Representation decision:** Hybrid. Each document contains:
- Abstract design principles (for semantic embedding quality)
- Concrete implementation hints per stack (Tailwind classes, CSS variables)
- Synonym expansion (for BM25 lexical coverage)

This is better than pure abstract (too vague to implement) or pure code (not generalizable across stacks).

### 4.2 Retrieval + Composition Pipeline

**Purpose:** Retrieve the top candidate docs per domain from the corpus, then compose them into three visually distinct, hypothesis-driven variant bundles. No LLM is involved in the design decisions.

**Critical distinction:** A variant is NOT a single retrieved document. A variant is a **composed combination** of 5 design decisions (style + color + typography + layout + UX pattern) assembled under a testable hypothesis. Retrieval finds the candidate pool; composition selects one doc per domain per variant, guaranteeing diversity across all three bundles.

**Why corpus-anchored, not strategy-constrained:** Pre-defined strategy buckets (e.g. trust/urgency/simplicity) pigeonhole retrieval — "bold" may be wrong for fintech, "trust" and "simplicity" may both retrieve Minimalism docs and produce identical variants. Instead, Variant 1 is anchored on `ui-reasoning.csv` (161 rows of curated expert recommendations by product/page type), and Variants 2 and 3 are selected by maximizing embedding distance from already-chosen docs. Strategy labels are assigned post-hoc as human-readable hypotheses. The corpus — not a hardcoded bucket — determines what "different" means for a given query.

```
Step 1: Retrieve top 8 candidates per domain (shared embedding, 1 OpenAI call)
Step 2: Variant 1 — retrieve best matching ui_reasoning corpus doc via BM25 + dense
        (same pipeline, domain_filter="ui_reasoning"), parse its prose to extract
        Style Priority / Color Mood / Typography Mood, then keyword-score candidates
Step 3: Variant 2 — for each domain, pick the candidate most distant in embedding
        space from Variant 1's doc (not in used_doc_ids)
Step 4: Variant 3 — for each domain, pick the candidate most distant from both
        Variant 1 and Variant 2 (not in used_doc_ids)
Step 5: Assign post-hoc label (e.g. "bold", "layered") to each bundle by nearest
        strategy embedding — descriptive only, does not affect selection
  → LTR ranks bundles by predicted win probability
  → Attribution traces each bundle back to its source docs
```
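
Steps 3 and 4 can be sketched for a single domain as below. This is a toy illustration under stated assumptions, not the project's `compose_variants()`: `pick_most_distant` is a hypothetical helper, candidates are plain `(doc_id, embedding)` tuples, and the embeddings are 2-dimensional stand-ins for the real 3072-dimensional vectors.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity; assumes non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def pick_most_distant(candidates, chosen_embeddings, used_doc_ids):
    """Return the unused candidate that maximizes its minimum distance to every chosen embedding."""
    pool = [c for c in candidates if c[0] not in used_doc_ids]
    return max(pool, key=lambda c: min(cosine_distance(c[1], e) for e in chosen_embeddings))


style_candidates = [
    ("style_a", [1.0, 0.0]),
    ("style_b", [0.9, 0.1]),
    ("style_c", [0.0, 1.0]),
    ("style_d", [0.7, 0.7]),
]
v1 = style_candidates[0]            # stands in for the ui-reasoning anchor pick
used_doc_ids = {v1[0]}

v2 = pick_most_distant(style_candidates, [v1[1]], used_doc_ids)          # Step 3
used_doc_ids.add(v2[0])
v3 = pick_most_distant(style_candidates, [v1[1], v2[1]], used_doc_ids)   # Step 4
```

With a single chosen embedding the `min` reduces to the plain distance, so the same helper covers both Variant 2 and Variant 3. Here `style_b` is never picked because it sits almost on top of the Variant 1 doc.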

**Why hybrid (BM25 + dense):** They fail in opposite directions.
- BM25 misses semantic equivalence: "frosted glass" doesn't match "glassmorphism"
- Dense embeddings miss exact technical terms: `backdrop-filter` might not rank a glassmorphism doc above a general blur doc
- Hybrid retrieval consistently outperforms either alone by 5-15% on technical queries

**Why RRF over score normalization:** BM25 scores are unbounded floats, while cosine similarities are bounded (between -1 and 1). Averaging them requires arbitrary scale normalization. RRF uses only rank positions: rank 1 is rank 1 regardless of scale. Documents ranked highly by both methods score highest.

**Why Cohere rerank:** Both BM25 and dense embeddings score the query and the document independently: BM25 through term statistics, embeddings through a bi-encoder. The cross-encoder (Cohere) reads query and document together, so it can tell which specific phrases in the query correspond to which parts of the document. This joint scoring is qualitatively different and usually more accurate. It runs only on the top 20 candidates to control latency and cost.

**Why MMR for candidate diversity:** Without diversity enforcement, retrieved candidates might cluster around one style. MMR penalizes candidates similar to already-selected ones, ensuring each strategy's search returns diverse building blocks.
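
A minimal sketch of MMR selection, assuming the standard formulation `score = λ · relevance − (1 − λ) · max_similarity_to_selected`. The function name `mmr_select`, the tuple layout, and the default `lambda_ = 0.7` are illustrative assumptions; the document does not specify a trade-off value.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr_select(candidates, k: int, lambda_: float = 0.7) -> list[str]:
    """candidates: list of (doc_id, relevance, embedding). Returns k doc_ids in pick order."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def mmr_score(c):
            # Redundancy is the highest similarity to anything already picked
            redundancy = max((cosine_similarity(c[2], s[2]) for s in selected), default=0.0)
            return lambda_ * c[1] - (1.0 - lambda_) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return [doc_id for doc_id, _, _ in selected]


candidates = [
    ("glass_1", 0.95, [1.0, 0.0]),
    ("glass_2", 0.90, [0.99, 0.14]),   # near-duplicate of glass_1
    ("minimal_1", 0.80, [0.0, 1.0]),
]
diverse = mmr_select(candidates, k=2)                      # penalizes the near-duplicate
relevance_only = mmr_select(candidates, k=2, lambda_=1.0)  # no diversity penalty
```

With the penalty active the second pick is `minimal_1` even though `glass_2` has the higher relevance score; with `lambda_=1.0` the selection collapses to plain relevance order.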

**Why deterministic composition over LLM composition:** LLM composition has critical failure modes: fake diversity ("minimalist + clean" vs. "simple + modern" = same variant), incoherent bundles (brutalist typography + glassmorphism cards), vague hypotheses ("this is more engaging"), and hallucinated patterns not in the corpus. Deterministic strategy-based composition avoids all of these — strategies are pre-defined to differ on specific axes, each strategy has internally consistent moods, and every attribute maps directly to a search result. It's also faster (<100ms vs. 1-3s), free (no LLM call), deterministic (same input = same output), and attribution is automatic. This mirrors the approach in the existing `DesignSystemGenerator` which already does rule-based multi-domain search → priority scoring → deterministic assembly.

### 4.3 Learning-to-Rank (LambdaRank)

**Purpose:** Re-rank retrieved candidates based on what has worked for this customer in similar contexts.

**Why LTR over signal blending (the alternative):** Signal blending assigns per-attribute win rates independently (glassmorphism: 0.4 win rate, minimalism: 0.7 win rate). LTR learns that glassmorphism + dark background + subtle animation is a coherent package that wins together — it models attribute interactions. LTR also directly optimizes a ranking objective (NDCG) rather than a heuristic score formula.
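
For reference, the ranking objective itself can be computed in a few lines. This is a sketch of the NDCG metric only, not of LambdaRank training, and it assumes the common `2^label − 1` gain with a `log2` position discount; the helper names are hypothetical.

```python
import math


def dcg(labels: list[float]) -> float:
    """Discounted cumulative gain over labels listed in ranked order."""
    return sum((2 ** rel - 1) / math.log2(pos + 2) for pos, rel in enumerate(labels))


def ndcg(labels_in_ranked_order: list[float]) -> float:
    """DCG of the given order divided by the DCG of the ideal (sorted) order."""
    ideal = dcg(sorted(labels_in_ranked_order, reverse=True))
    return dcg(labels_in_ranked_order) / ideal if ideal > 0 else 0.0


# Three bundles; label 1.0 = won the A/B test, 0.0 = lost
winner_first = ndcg([1.0, 0.0, 0.0])
winner_last = ndcg([0.0, 0.0, 1.0])
```

Ranking the winning bundle first scores 1.0, while ranking it last scores 0.5, which is the gap the re-ranker is trained to close.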

**Why LambdaRank specifically over Bayesian optimization:** BO optimizes a single function with sample efficiency — excellent for hyperparameter tuning. But it cannot generate candidates (only select between fixed arms), its GP posterior is not interpretable (kills R3), and it doesn't handle categorical design attributes natively without extensions (CoCaBO). LambdaRank is interpretable (feature importance = which design attributes matter), referenceable (training examples are specific experiments), and fits your data volume (hundreds of training examples, not thousands).

**Why LTR over contextual bandits (LinUCB):** LinUCB requires pre-defined arms, but the design space here is open-ended because the corpus keeps growing. LinUCB's UCB scores are also not user-facing explanations, whereas LTR feature importance is.

### 4.4 Attribution System

**Purpose:** Satisfy R3. Every recommendation must be traceable to specific evidence.

**Why stored at generation time:** Attribution cannot be reconstructed after the fact. The LTR model weights, the specific experiments that contributed, the retrieval scores at each stage — these change as the model retrains. Store them when the recommendation is generated or lose them permanently.

### 4.5 Human-in-the-Loop

**Purpose:** Capture selection and edit signals as free training data for LTR.

**Key insight:** Selection events are pairwise LTR labels. Customer picks B over A → `(context, B_attrs, A_attrs): B > A`. These arrive before any A/B test runs and cost nothing to collect. Edit signals are preference labels at attribute granularity. Both improve the LTR model without requiring additional A/B experiments.
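
Expanding one selection event into pairwise labels might look like the sketch below. The dict keys mirror the `selection_events` columns defined in Section 5.3; the `PairwiseExample` class and `selection_to_pairs` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class PairwiseExample:
    context: dict
    preferred_variant_id: str
    other_variant_id: str
    weight: float


def selection_to_pairs(event: dict) -> list[PairwiseExample]:
    """Emit one 'selected > rejected' pair per rejected variant."""
    context = {
        "component_type": event["component_type"],
        "page_context": event["page_context"],
    }
    return [
        PairwiseExample(context, event["selected_variant_id"], rejected, event.get("weight", 0.3))
        for rejected in event["rejected_variant_ids"]
    ]


event = {
    "component_type": "cta_button",
    "page_context": "checkout",
    "selected_variant_id": "v2",
    "rejected_variant_ids": ["v1", "v3"],
    "weight": 0.3,
}
pairs = selection_to_pairs(event)
```

A single selection among three variants therefore yields two training pairs, each at the reduced selection weight.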

---

## 5. Data Models

### 5.1 ChromaDB Collection: `design_knowledge`

```python
# One document per CSV row, serialized as prose
{
  "id":        "style_styles_3",           # domain_file_rowindex
  "document":  "Style: Glassmorphism. Type: General. Keywords: frosted glass...",
  "embedding": [...],                      # text-embedding-3-large, 3072-dim
  "metadata": {
    "domain":           "style",
    "source_file":      "styles.csv",
    "row_index":        3,
    "style_category":   "Glassmorphism",
    "complexity":       "Medium",          # Low / Medium / High
    "performance":      "Good",
    "accessibility":    "Medium",
    "product_type":     "",                # populated for color + product domains
    "severity":         "",                # populated for ux + react + web domains
    "global_win_rate":  0.0,               # updated from A/B outcomes
    "global_appearance_count": 0,
  }
}
```

### 5.2 Supabase (Postgres) — Existing Tables Reused

Orpheus builds on the existing Probat Supabase schema. These tables are already in production and need no changes:

| Existing Table | Orpheus Role |
|---|---|
| `app_users` | Customer identity (replaces design doc's `customers`) |
| `experiment_runs` | Top-level experiment record |
| `experiment_configs` | Variant weights, status, activation |
| `experiments` | Individual variants (`json_variant`, `react_variant`, `component`, `platform`) |
| `experiment_proposals` | Maps to Orpheus queries (`site_url`, `component`, `payload`) |
| `probat_events` | SDK events — impressions, clicks, conversions (generic event table) |
| `probat_assignments` | Sticky variant assignments per visitor |
| `experiment_metrics` | Time-series metrics with `metric_name`, `metric_value`, `dimensions` |
| `metric_definitions` | **Extensible metric registry** — already has `name`, `description`, `unit`, `is_higher_better`. New metrics = new row, no migration. |
| `heatmap_clicks` | Click coordinates, element info, session tracking |
| `heatmap_cursor_movements` | Cursor movement tracking |

### 5.3 Supabase (Postgres) — New Orpheus Tables

These tables extend the existing schema for Orpheus-specific functionality:

```sql
-- Hard constraints on design recommendations — applied as post-retrieval overrides
-- token_type: 'primary_color' | 'secondary_color' | 'cta_color' |
--             'font_heading' | 'font_body' | 'border_radius' |
--             'forbidden_style' | 'required_style'
CREATE TABLE brand_tokens (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id         UUID NOT NULL REFERENCES app_users(id) ON DELETE CASCADE,
    token_type      TEXT NOT NULL,
    token_value     TEXT NOT NULL,
    created_at      TIMESTAMPTZ DEFAULT now()
);
CREATE INDEX idx_brand_tokens_user ON brand_tokens(user_id, token_type);

-- Soft guidance — applied as LTR feature weights
CREATE TABLE soft_guidance (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id         UUID NOT NULL REFERENCES app_users(id) ON DELETE CASCADE,
    guidance_text   TEXT NOT NULL,
    weight_delta    REAL NOT NULL DEFAULT 1.0,  -- multiplier on relevant features
    created_at      TIMESTAMPTZ DEFAULT now()
);

-- Customer optimization goals for composite scoring
CREATE TABLE optimization_goals (
    user_id          UUID PRIMARY KEY REFERENCES app_users(id) ON DELETE CASCADE,
    primary_metric   TEXT NOT NULL DEFAULT 'cvr',
    secondary_metric TEXT,
    guardrail_json   JSONB  -- {"bounce_rate": {"max": 0.60}, "rage_click_rate": {"max": 0.05}}
);

-- Orpheus variant attribution — stored at generation time, links to existing experiments
CREATE TABLE orpheus_variants (
    id                UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    experiment_run_id UUID NOT NULL REFERENCES experiment_runs(id) ON DELETE CASCADE,
    variant_id        TEXT NOT NULL,
    hypothesis        TEXT NOT NULL,          -- testable claim
    attributes_json   JSONB NOT NULL,         -- full attribute dict
    source_docs_json  JSONB,                  -- [doc_id, ...] from retrieval
    attribution_json  JSONB,                  -- full attribution at generation time
    was_selected      BOOLEAN DEFAULT FALSE,
    was_edited        BOOLEAN DEFAULT FALSE,
    edit_distance     INTEGER DEFAULT 0,
    edited_attrs_json JSONB,                  -- post-edit attributes (for attribution rule)
    created_at        TIMESTAMPTZ DEFAULT now()
);
CREATE INDEX idx_orpheus_variants_run ON orpheus_variants(experiment_run_id);

-- Pairwise LTR training data from selection events
-- selected_variant_id was preferred over each rejected_variant_id
CREATE TABLE selection_events (
    id                   UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    experiment_run_id    UUID NOT NULL REFERENCES experiment_runs(id) ON DELETE CASCADE,
    user_id              UUID NOT NULL REFERENCES app_users(id) ON DELETE CASCADE,
    component_type       TEXT,
    page_context         TEXT,
    selected_variant_id  TEXT NOT NULL,
    rejected_variant_ids JSONB NOT NULL,       -- JSON array of variant IDs
    weight               REAL DEFAULT 0.3,     -- weaker than A/B result
    recorded_at          TIMESTAMPTZ DEFAULT now()
);

-- Edit events — domain-level preference signals (more granular than selection events)
-- edit_type: 'swap_doc'  — user replaced the corpus doc for a domain
--                          original_doc_id → replacement_doc_id (both known)
--                          signal: negative label on original, positive on replacement
--            'freeform'  — user typed their own value, not in corpus
--                          original_doc_id → NULL (corpus gap flagged)
--                          signal: negative label on original, gap candidate logged
--            'partial'   — user kept the doc but tweaked a parameter (e.g. blur intensity)
--                          signal: soft negative at 0.3× weight
CREATE TABLE edit_events (
    id                    UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    experiment_run_id     UUID NOT NULL REFERENCES experiment_runs(id) ON DELETE CASCADE,
    user_id               UUID NOT NULL REFERENCES app_users(id) ON DELETE CASCADE,
    variant_id            TEXT NOT NULL,
    domain                TEXT NOT NULL,         -- 'style' | 'color' | 'typography' | 'landing' | 'ux'
    edit_type             TEXT NOT NULL,         -- 'swap_doc' | 'freeform' | 'partial'
    original_doc_id       TEXT NOT NULL,         -- corpus doc that was replaced
    replacement_doc_id    TEXT,                  -- NULL for freeform edits
    original_attrs_json   JSONB NOT NULL,        -- full bundle before edit
    edited_attrs_json     JSONB NOT NULL,        -- full bundle after edit
    edit_distance         INTEGER NOT NULL,      -- total fields changed across all domains
    confidence_weight     REAL NOT NULL,         -- 0.8 (≤2 edits), 0.4 (≤5), 0.1 (>5)
    page_type             TEXT,
    industry              TEXT,
    recorded_at           TIMESTAMPTZ DEFAULT now()
);

-- Auto-apply patterns: repeated edits of same attribute
CREATE TABLE auto_apply_patterns (
    user_id         UUID NOT NULL REFERENCES app_users(id) ON DELETE CASCADE,
    attribute_key   TEXT NOT NULL,
    preferred_value TEXT NOT NULL,
    frequency       INTEGER NOT NULL,
    last_seen       TIMESTAMPTZ DEFAULT now(),
    PRIMARY KEY (user_id, attribute_key)
);

-- Staged A/B results — composite scores per stage
-- Metrics themselves live in experiment_metrics (existing table)
-- This table stores the computed composite scores and winner determination
CREATE TABLE ab_results (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    experiment_run_id UUID NOT NULL REFERENCES experiment_runs(id) ON DELETE CASCADE,
    variant_id      TEXT NOT NULL,
    stage           INTEGER NOT NULL DEFAULT 1,  -- 1, 2, or 3
    impressions     INTEGER,
    conversions     INTEGER,
    metrics_json    JSONB,           -- snapshot of all metrics for this stage
    composite_score REAL,            -- weighted composite across metrics
    is_winner       BOOLEAN DEFAULT FALSE,
    test_confidence REAL,
    recorded_at     TIMESTAMPTZ DEFAULT now()
);

-- LTR training examples (derived from selection_events + ab_results)
CREATE TABLE ltr_training_examples (
    id                      UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id                 UUID NOT NULL REFERENCES app_users(id) ON DELETE CASCADE,
    component_type          TEXT,
    page_context            TEXT,
    query                   TEXT,
    context_features_json   JSONB NOT NULL,    -- all feature values for context
    variant_features_json   JSONB NOT NULL,    -- all feature values for variant
    label                   REAL NOT NULL,     -- relevance score: 0.0 to 1.0
    weight                  REAL DEFAULT 1.0,
    source                  TEXT,  -- 'selection' | 'ab_result_stage1' | 'ab_result_stage2' | 'ab_result_stage3'
    created_at              TIMESTAMPTZ DEFAULT now()
);
CREATE INDEX idx_ltr_user ON ltr_training_examples(user_id, component_type, page_context);
```
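
The `confidence_weight` tiers noted in the `edit_events` comment (0.8 for ≤2 edits, 0.4 for ≤5, 0.1 for >5) can be sketched as a small mapping. The function name is hypothetical, and how these tiers combine with the 0.3× factor for `partial` edits is not specified here, so that factor is left out.

```python
def edit_confidence_weight(edit_distance: int) -> float:
    """Map the number of changed fields to a confidence weight for the edit signal."""
    if edit_distance < 0:
        raise ValueError("edit_distance must be non-negative")
    if edit_distance <= 2:
        return 0.8   # small edit: the original bundle was close to what the user wanted
    if edit_distance <= 5:
        return 0.4
    return 0.1       # heavy rewrite: weak evidence about any single attribute


weights = [edit_confidence_weight(n) for n in (0, 2, 3, 5, 6)]
```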

**Key integration points with existing schema:**
- `experiment_runs.id` is the FK for all Orpheus tables (not a separate experiment system)
- Individual metrics flow through `experiment_metrics` (existing) → `ab_results` (new) computes composite scores
- `metric_definitions` (existing) is the extensible registry — new metrics = `INSERT INTO metric_definitions` + adapter, no migration
- `probat_events` (existing) captures SDK events — Orpheus reads from it for rage clicks, dead clicks, form abandonment
- `heatmap_clicks` (existing) feeds frustration signals into composite scoring
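
A guardrail check against `optimization_goals.guardrail_json` could be sketched as below. Only `"max"` bounds appear in the schema comment; support for `"min"` is an assumption, as are the function name `violated_guardrails` and the choice to skip metrics that were not reported.

```python
import json


def violated_guardrails(metrics: dict[str, float], guardrails: dict[str, dict]) -> list[str]:
    """Return the metric names whose observed value breaks a configured bound."""
    violations = []
    for name, bounds in guardrails.items():
        value = metrics.get(name)
        if value is None:
            continue  # metric not reported for this variant; not counted as a violation here
        if "max" in bounds and value > bounds["max"]:
            violations.append(name)
        elif "min" in bounds and value < bounds["min"]:  # "min" is an assumed extension
            violations.append(name)
    return violations


guardrails = json.loads('{"bounce_rate": {"max": 0.60}, "rage_click_rate": {"max": 0.05}}')
observed = {"bounce_rate": 0.55, "rage_click_rate": 0.08}
violations = violated_guardrails(observed, guardrails)
```

A variant with any violation could then be excluded from winner determination regardless of its composite score, though that policy is a design choice this sketch does not settle.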

### 5.4 ComposedVariant Dataclass

The `DesignVariant` name is retired. The composed bundle is now `ComposedVariant` — one instance per variant returned to the user. The individual retrieved doc is a `RetrievedDoc` (what the retrieval pipeline returns).

```python
from dataclasses import dataclass

@dataclass
class RetrievedDoc:
    """A single doc retrieved from ChromaDB. Output of RetrievalPipeline."""
    doc_id:     str
    text:       str
    metadata:   dict
    embedding:  list[float]   # stored embedding from ChromaDB (no extra API call)
    bm25_rank:  int | None
    dense_rank: int | None
    rrf_score:  float

@dataclass
class ComposedVariant:
    """A fully assembled design variant bundle. Output of compose_variants()."""
    variant_id:  str
    label:       str   # post-hoc strategy label: "minimal" | "bold" | "layered"
                       # assigned by nearest strategy embedding, NOT used for selection
    hypothesis:  str   # human-readable testable claim derived from label

    # One doc per domain — selected by ui-reasoning.csv anchor (V1) or max embedding distance (V2, V3)
    docs: dict[str, RetrievedDoc]  # domain → doc: style, color, typography, landing, ux

    # Attribution — stored at generation time, never reconstructed later
    source_doc_ids:           list[str]    # [doc_id for each domain]
    selection_method:         str          # "ui_reasoning_anchor" | "max_distance_v2" | "max_distance_v3"
    reasoning_rule_used:      dict | None  # the ui-reasoning.csv row that drove V1 (None for V2, V3)
    retrieval_scores:         dict         # {doc_id: {bm25_rank, dense_rank, rrf_score}}
    ltr_feature_importance:   dict         # {feature_name: weight} — populated after LTR built
    contributing_experiments: list[dict]   # [{experiment_id, customer, result, weight}] — post-LTR
    corpus_basis:             list[str]    # human-readable: "Style: Minimalism — style_styles_0"
    confidence:               float        # composite 0.0-1.0
    low_confidence_flag:      bool

    # Constraint and auto-apply tracking
    brand_applied:      bool
    auto_applied_attrs: list[str]   # attributes auto-applied from user's edit patterns
    flags:              list[str]   # warnings (forbidden style, low confidence, cold start, etc.)
```

---

## 6. Retrieval Pipeline

### 6.1 Stage 1 — Hybrid Recall

**BM25 (sparse retrieval):**
```python
# rank_bm25, BM25Okapi, k1=1.5 b=0.75 (defaults)
# Index built from tokenized prose documents at server startup
# Corpus: ~2000 documents, fits in memory, rebuilds in <200ms

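# doc_ids: corpus ids in the same order as the indexed documents (assumed name)
tokens = query.lower().split()
bm25_scores = bm25_index.get_scores(tokens)   # one score per indexed document
top_idx = sorted(range(len(bm25_scores)), key=lambda i: bm25_scores[i], reverse=True)[:100]
bm25_results = [(doc_ids[i], bm25_scores[i]) for i in top_idx]   # (doc_id, score)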
```

**Dense retrieval:**
```python
# ChromaDB with cosine similarity metric
# Query embedded with text-embedding-3-large (3072 dims)
# Pre-computed document embeddings stored in ChromaDB

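# openai_client = openai.OpenAI() is assumed to be created once at startup
query_embedding = openai_client.embeddings.create(
    model="text-embedding-3-large", input=query
).data[0].embedding  # ~200ms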
dense_results = chroma.query(
    query_embeddings=[query_embedding],
    n_results=100,
    where={"domain": {"$in": relevant_domains}}  # optional domain pre-filter
)
# Returns (doc_id, cosine_distance) → convert to (doc_id, 1 - distance)
```

### 6.2 Stage 2 — RRF Fusion

```python
K = 60  # standard constant — dampens top-position advantage

def rrf(bm25_ranked, dense_ranked):
    scores = {}
    for rank, (doc_id, _) in enumerate(bm25_ranked, start=1):
        scores[doc_id] = scores.get(doc_id, 0) + 1.0 / (K + rank)
    for rank, (doc_id, _) in enumerate(dense_ranked, start=1):
        scores[doc_id] = scores.get(doc_id, 0) + 1.0 / (K + rank)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

# A document ranked #1 by BM25 and #2 by dense scores:
# 1/(60+1) + 1/(60+2) = 0.01639 + 0.01613 = 0.03252
# Beats a document ranked #1 by only one method:
# 1/(60+1) + 0 = 0.01639
```

Returns top 20 by RRF score.
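
As a small worked example of the fusion step, the sketch below runs the same reciprocal-rank formula over two toy ranked lists. The document ids and scores are made up for illustration; only the ranks matter to RRF.

```python
K = 60  # same constant as above

def rrf(bm25_ranked, dense_ranked, k=K):
    """Fuse two ranked lists of (doc_id, score) by reciprocal rank."""
    scores = {}
    for ranked in (bm25_ranked, dense_ranked):
        for rank, (doc_id, _) in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

# Hypothetical ids: "style_a" is returned by both retrievers, the others by one.
bm25_ranked = [("style_a", 12.1), ("style_b", 9.4)]
dense_ranked = [("style_c", 0.91), ("style_a", 0.88)]

fused = rrf(bm25_ranked, dense_ranked)
top_ids = [doc_id for doc_id, _ in fused[:20]]  # Stage 2 keeps the top 20
```

A document found by both retrievers outranks documents found by only one, even when neither retriever ranked it first in isolation.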

### 6.3 Stage 3 — Cohere Cross-Encoder Rerank

```python
# Reads query + document together (cross-encoder)
# Qualitatively different from bi-encoder: understands specific phrase relationships
# Run on top 20 only — ~300ms, ~$0.001 per call

response = cohere.rerank(
    model="rerank-v3.5",
    query=query,
    documents=[doc_texts[doc_id] for doc_id in top20_ids],  # avoid shadowing the builtin id()
    top_n=8
)
# Returns (doc_id, relevance_score) sorted descending
# Fallback: if Cohere unavailable, use RRF order[:8]
```

Cache results by hash(query + doc_ids) with 5-minute TTL to reduce cost during development.
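
A minimal sketch of such a cache, assuming an in-process dict is enough during development. The names `cached_rerank`, `rerank_cache_key`, and `rerank_fn` are hypothetical. A SHA-256 digest is used instead of Python's built-in `hash()`, because string hashes are salted per process and would not survive a restart or a shared cache.

```python
import hashlib
import time

TTL_SECONDS = 300  # 5-minute TTL from the text
_cache = {}        # key -> (expires_at, value)

def rerank_cache_key(query, doc_ids):
    """Stable key over the query and the candidate ids (order-sensitive)."""
    raw = query + "\x1f" + "\x1f".join(doc_ids)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

def cached_rerank(query, doc_ids, rerank_fn, now=time.monotonic):
    """Return a fresh cached result, otherwise call rerank_fn and store it."""
    key = rerank_cache_key(query, doc_ids)
    hit = _cache.get(key)
    if hit is not None and hit[0] > now():
        return hit[1]
    value = rerank_fn(query, doc_ids)
    _cache[key] = (now() + TTL_SECONDS, value)
    return value
```

Here `rerank_fn` would wrap the Cohere call shown above; passing `now` explicitly makes the expiry logic testable without sleeping.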

### 6.4 Stage 4 — LTR Pre-filter (Individual Docs)

```python
# Optional: LTR scores individual docs before strategy composition to boost candidate pool quality
# Lightweight pass — filters out docs with historically poor performance for this customer
# The main LTR re-ranking happens at Stage 7 on composed variant bundles

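import numpy as np

# One feature dict per candidate; extract_features is defined in section 7.1.
# Assumes ltr_model.predict accepts a list of feature dicts and returns one score per doc.
features = [
    extract_features(query, doc, customer_id, component_type, page_context)
    for doc in candidates
]
ltr_scores = ltr_model.predict(features)  # per-customer model or global
# Remove bottom 25% by LTR score: don't compose with docs that have consistently lost
percentile_25 = np.percentile(ltr_scores, 25)
candidates = [doc for doc, score in zip(candidates, ltr_scores) if score > percentile_25]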
```

### 6.5 Stage 5 — MMR Diversity Selection

```python
# Maximal Marginal Relevance
# λ=0.5 balances relevance vs diversity
# Similarity computed on embedding vectors (already in memory from dense retrieval)

def mmr(candidates, embeddings, n, lambda_=0.5):
    selected = []
    remaining = list(candidates)

    for _ in range(min(n, len(remaining))):  # never select more docs than exist
        best, best_score = None, float('-inf')
        for doc_id, relevance in remaining:
            if not selected:
                score = relevance
            else:
                max_sim = max(cosine(embeddings[doc_id], embeddings[s])
                              for s in selected)
                score = lambda_ * relevance - (1 - lambda_) * max_sim
            if score > best_score:
                best, best_score = doc_id, score
        selected.append(best)
        remaining = [(d, r) for d, r in remaining if d != best]

    return selected
```

`λ=1.0` → pure relevance (top-k). `λ=0.0` → pure diversity. `λ=0.5` is the default. Expose as a parameter — a customer requesting 2 variants may want higher relevance (λ=0.7), a customer requesting 5 wants more diversity (λ=0.4).
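
To illustrate how λ changes the outcome, here is a self-contained sketch with three toy documents. The embeddings and relevance scores are invented, and `cosine` is a plain-Python helper standing in for whatever similarity function the pipeline actually uses.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mmr(candidates, embeddings, n, lambda_=0.5):
    selected = []
    remaining = list(candidates)
    for _ in range(min(n, len(remaining))):
        def score(item):
            doc_id, relevance = item
            if not selected:
                return relevance
            max_sim = max(cosine(embeddings[doc_id], embeddings[s]) for s in selected)
            return lambda_ * relevance - (1 - lambda_) * max_sim
        best = max(remaining, key=score)[0]
        selected.append(best)
        remaining = [(d, r) for d, r in remaining if d != best]
    return selected

# Toy data: "a" and "b" are near-duplicates, "c" points in another direction.
embeddings = {"a": [1.0, 0.0], "b": [0.99, 0.14], "c": [0.0, 1.0]}
candidates = [("a", 0.9), ("b", 0.85), ("c", 0.6)]

by_relevance = mmr(candidates, embeddings, n=2, lambda_=1.0)
balanced = mmr(candidates, embeddings, n=2, lambda_=0.5)
```

With λ=1.0 the near-duplicate "b" is picked second because only relevance counts; with λ=0.5 its similarity to "a" is penalized and the less relevant but different "c" is picked instead.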

### 6.6 Stage 6 — Corpus-Anchored Composition

The retrieval pipeline (stages 1-5) provides the candidate pool. This stage selects one doc per domain per variant, producing three bundles that are guaranteed to differ visually — without pre-defining which directions to explore.

#### 6.6.1 Why Not Pre-Defined Strategy Buckets

The previous approach used hardcoded strategies (trust/urgency/simplicity) to bias retrieval, with a `STRATEGY_APPLICABILITY` lookup table mapping page context to strategy names. This has two critical failure modes:

1. **Convergence** — "trust" and "simplicity" both target Minimalism docs. In a corpus with 86 style docs, two strategies frequently retrieve the same top doc. The `select_best_match` fallback doesn't prevent visually similar results.

2. **Pigeonholing** — For fintech checkout, "bold" may be actively wrong. Pre-assigning strategies from a lookup table generates variants a designer would never ship, wasting A/B test budget on directions the corpus itself knows are inappropriate.

The correct mental model: strategies are a *label* for what you found, not a *constraint* on what you look for.

#### 6.6.2 The Anchor: `ui_reasoning` Corpus Domain

`ui-reasoning.csv` (161 rows, one per product/page category) is serialized into 161 prose docs and embedded into ChromaDB alongside all other corpus domains. Each doc looks like:

```
UI Reasoning: Fintech/Crypto. Style Priority: Minimalism + Accessible & Ethical.
Color Mood: Navy + Trust Blue + Gold. Typography Mood: Professional + Trustworthy.
Recommended Pattern: Trust & Authority. Key Effects: Smooth state transitions.
```

Finding the right rule is a **retrieval problem**, not a string matching problem. `select_variant_1` calls `pipeline.query_with_embedding(domain_filter="ui_reasoning", n_results=1)` — the same BM25 + dense pipeline used for every other domain. The top result is the best matching rule for this query. Its prose text is then parsed to extract `Style Priority`, `Color Mood`, `Typography Mood` etc., which become the keyword hints for `_select_best_match`.

**Why retrieval over CSV string matching:** The old approach used exact/partial/keyword string matching on `UI_Category` (e.g. "B2B SaaS checkout"). This is fragile — it misses semantic equivalents ("enterprise software" doesn't match "SaaS (General)"). The retrieval pipeline handles this naturally: "enterprise software checkout" embeds close to "SaaS (General)" in vector space and BM25 catches lexical overlap. The same system that finds the best style doc finds the best rule doc.

```python
def select_variant_1(candidates_per_domain, product_type, page_context, pipeline, embedding):
    # Find best matching rule via retrieval — not string matching
    rule_candidates = pipeline.query_with_embedding(
        f"{product_type} {page_context}",
        embedding,                        # shared embedding, no extra OpenAI call
        domain_filter="ui_reasoning",
        n_results=1,
    )
    rule_fields = _parse_rule_doc(rule_candidates[0].text)
    # rule_fields = {"Style Priority": "Minimalism + Flat Design", "Color Mood": "Trust blue + Accent contrast", ...}

    docs = {}
    for domain, candidates in candidates_per_domain.items():
        keywords = _extract_keywords_from_rule(rule_fields, domain)
        docs[domain] = _select_best_match(candidates, keywords)
        # _select_best_match: +10 metadata style_category match, +3 any metadata, +1 prose text
    return docs, rule_fields
```
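
The implementation of `_parse_rule_doc` is not shown here. One possible sketch, assuming every rule doc follows the `Label: value.` prose layout of the example above, splits on sentence boundaries that are followed by a new label. The function name `parse_rule_doc` and the regex are illustrative assumptions, not the project's actual code.

```python
import re

# A field ends at a period followed by whitespace and the next "Label:".
_FIELD_BOUNDARY = re.compile(r"\.\s+(?=[A-Z][A-Za-z ]*:)")

def parse_rule_doc(text):
    """Turn 'Label: value. Label: value.' prose into {label: value}."""
    fields = {}
    for segment in _FIELD_BOUNDARY.split(text.strip()):
        label, sep, value = segment.partition(":")
        if sep:
            fields[label.strip()] = value.strip().rstrip(".")
    return fields

rule_text = (
    "UI Reasoning: Fintech/Crypto. Style Priority: Minimalism + Accessible & Ethical.\n"
    "Color Mood: Navy + Trust Blue + Gold. Typography Mood: Professional + Trustworthy.\n"
    "Recommended Pattern: Trust & Authority. Key Effects: Smooth state transitions."
)
rule_fields = parse_rule_doc(rule_text)
style_keywords = rule_fields["Style Priority"].split(" + ")
```

Splitting a field on `" + "` yields the individual keyword hints that a matcher such as `_select_best_match` could score against candidate metadata.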

#### 6.6.3 Variants 2 and 3: Maximum Embedding Distance

After Variant 1 claims its docs, Variants 2 and 3 are chosen to maximize visual diversity. "Diversity" is defined in embedding space — not by category name or keyword. A doc is diverse if it is semantically far from already-chosen docs.

```python
def select_by_max_distance(candidates, chosen_docs, used_doc_ids):
    """
    For each domain, pick the candidate most distant in embedding space
    from all already-chosen docs. Enforces uniqueness across variants.
    """
    docs = {}
    for domain, domain_candidates in candidates.items():
        chosen_emb = chosen_docs[domain].embedding  # from ChromaDB

        best_doc, best_dist = None, -1.0
        for c in domain_candidates:
            if c.doc_id in used_doc_ids:
                continue
            dist = 1.0 - cosine_similarity(c.embedding, chosen_emb)
            if dist > best_dist:
                best_dist = dist
                best_doc = c

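        if best_doc is None:
            # Every candidate in this domain is already used by an earlier variant
            raise ValueError(f"no unused candidate left for domain {domain!r}")
        docs[domain] = best_doc
        used_doc_ids.add(best_doc.doc_id)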
    return docs

def compose_variants(query, product_type, page_context, customer_context):
    # Step 1: retrieve top 8 per domain (one shared embedding)
    embedding = pipeline._embed(query)
    candidates = {
        domain: pipeline.query_with_embedding(query, embedding, domain_filter=domain, n_results=8)
        for domain in ["style", "color", "typography", "landing", "ux"]
    }

    used_doc_ids = set()

    # Step 2: Variant 1 — ui-reasoning.csv anchor
    v1_docs, rule_fields = select_variant_1(candidates, product_type, page_context, pipeline, embedding)
    used_doc_ids.update(d.doc_id for d in v1_docs.values())

    # Step 3: Variant 2 — max distance from V1
    v2_docs = select_by_max_distance(candidates, v1_docs, used_doc_ids)
    used_doc_ids.update(d.doc_id for d in v2_docs.values())

    # Step 4: Variant 3 — max distance from V1 and V2 combined
    combined_chosen = {
        domain: [v1_docs[domain], v2_docs[domain]] for domain in candidates
    }
    v3_docs = select_by_max_distance_from_multiple(candidates, combined_chosen, used_doc_ids)

    # Step 5: Apply brand constraints, assign post-hoc labels, build attribution
    variants = []
    for docs in [v1_docs, v2_docs, v3_docs]:
        docs = apply_brand_constraints(customer_context, docs)
        label = assign_strategy_label(docs)  # nearest strategy embedding
        variants.append(ComposedVariant(
            label=label,
            hypothesis=STRATEGY_HYPOTHESES[label],
            docs=docs,
            source_doc_ids=[d.doc_id for d in docs.values()],
        ))

    return variants
```
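
`select_by_max_distance_from_multiple` is called above but not defined. A possible sketch follows, assuming "distance from V1 and V2 combined" means the distance to the *nearest* already-chosen doc (a max-min rule); averaging the distances would be an equally plausible reading. The `Doc` dataclass is a hypothetical stand-in for `RetrievedDoc`.

```python
import math
from dataclasses import dataclass

@dataclass
class Doc:  # hypothetical stand-in for RetrievedDoc
    doc_id: str
    embedding: list

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def select_by_max_distance_from_multiple(candidates, chosen_docs, used_doc_ids):
    """Per domain, pick the unused candidate whose nearest chosen doc is farthest away."""
    docs = {}
    for domain, domain_candidates in candidates.items():
        best_doc, best_dist = None, -1.0
        for c in domain_candidates:
            if c.doc_id in used_doc_ids:
                continue
            # Distance to the closest of the already-chosen docs for this domain
            dist = min(1.0 - cosine_similarity(c.embedding, chosen.embedding)
                       for chosen in chosen_docs[domain])
            if dist > best_dist:
                best_doc, best_dist = c, dist
        if best_doc is None:
            raise ValueError(f"no unused candidate left for domain {domain!r}")
        docs[domain] = best_doc
        used_doc_ids.add(best_doc.doc_id)
    return docs

# Toy example: V1 and V2 already hold two orthogonal style docs.
v1, v2 = Doc("s1", [1.0, 0.0]), Doc("s2", [0.0, 1.0])
pool = {"style": [v1, v2, Doc("s3", [1.0, 0.1]), Doc("s4", [-1.0, -1.0])]}
v3_docs = select_by_max_distance_from_multiple(pool, {"style": [v1, v2]}, {"s1", "s2"})
```

The max-min rule prevents Variant 3 from landing right next to one prior variant just because it is very far from the other.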

**Example output:**

```
Query: "checkout button for B2B fintech"
ui-reasoning.csv match: "Fintech/Crypto" → Style: Minimalism + Accessible & Ethical,
                                            Color: Navy + Trust Blue + Gold

Variant 1 (label: "minimal", assigned post-hoc):
  Style:      Minimalism (Flat Design)     ← closest to reasoning keywords
  Color:      Navy + Trust blue            ← closest to "Navy + Trust Blue + Gold"
  Typography: Professional, clear, 16px   ← closest to "Professional + Trustworthy"
  Layout:     Hero + Features + Trust CTA ← closest to "Trust & Authority" pattern
  UX:         Security badges, SOC2       ← closest to trust-focused UX guidelines

Variant 2 (label: "layered", assigned post-hoc):
  Style:      Glassmorphism               ← most distant from Minimalism in embedding space
  Color:      Gradient aurora + accent    ← most distant from Navy + Trust Blue
  Typography: Modern geometric            ← most distant from "clear readable"
  Layout:     Card-based layered sections ← most distant from linear feature list
  UX:         Animated transitions        ← most distant from static trust signals

Variant 3 (label: "bold", assigned post-hoc):
  Style:      Vibrant & Block-based       ← most distant from both V1 and V2
  Color:      High contrast primary       ← most distant from both prior colors
  Typography: Bold impactful headline     ← most distant from both prior typography
  Layout:     Above-fold CTA, sticky     ← most distant from both prior layouts
  UX:         Single action, urgency cue ← most distant from both prior UX docs
```

All three variants are fully attributed. V1 is grounded in expert knowledge for this context. V2 and V3 are the corpus's answer to "what's most different?" — not a hardcoded bucket.

#### 6.6.4 Post-Hoc Strategy Labels

Strategy labels exist for human readability and hypothesis writing — not for retrieval. After composition, each bundle is labeled by finding the nearest strategy profile in embedding space:

```python
STRATEGY_PROFILES = {
    "minimal": "clean whitespace minimalism flat design system fonts single CTA",
    "bold":    "vibrant high contrast motion-driven energetic warm colors bold typography",
    "layered": "glassmorphism depth frosted gradient modern geometric premium craft",
}

# Pre-compute at startup (3 OpenAI calls, cached forever)
STRATEGY_EMBEDDINGS = {k: embed(v) for k, v in STRATEGY_PROFILES.items()}

STRATEGY_HYPOTHESES = {
    "minimal": "Reducing cognitive load and distraction increases conversion — single focal point, maximum whitespace, one action.",
    "bold":    "High visual energy and contrast drives immediate action — warm palette, impactful type, above-fold CTA.",
    "layered": "Premium visual craft signals quality and builds trust — depth, gradients, and refined type create a memorable first impression.",
}

def assign_strategy_label(docs):
    bundle_text = " ".join(d.text for d in docs.values())
    bundle_emb = embed(bundle_text)
    return max(STRATEGY_EMBEDDINGS, key=lambda k: cosine_similarity(bundle_emb, STRATEGY_EMBEDDINGS[k]))
```

The label tells the user *what direction this variant represents*. It does not determine what docs were selected.

#### 6.6.5 Adaptation Over Time

On Day 0, Variant 1 is pure `ui-reasoning.csv` and Variants 2–3 are pure embedding distance. As selection and edit data accumulates, the selection step blends in win rates:

```python
# After enough data (≥20 selection events for this context):
score = distance_from_chosen * 0.6 + doc.win_rate * 0.4
```

Docs that have appeared in winning bundles get a preference boost while diversity is still enforced. Once LTR has ≥30 labeled examples, it takes over re-ranking entirely. The corpus-anchored composition stays fixed — LTR operates on the output order, not the selection logic.
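
A minimal sketch of that switch-over, assuming distance and win rate are both on a comparable 0-1 scale. The function name `selection_score` and the event counter are hypothetical; the 20-event threshold and the 0.6/0.4 weights come from the text above.

```python
MIN_SELECTION_EVENTS = 20  # threshold from the text

def selection_score(distance_from_chosen, win_rate, n_selection_events):
    """Pure distance on Day 0, then a 60/40 blend of distance and win rate."""
    if n_selection_events < MIN_SELECTION_EVENTS:
        return distance_from_chosen
    return 0.6 * distance_from_chosen + 0.4 * win_rate

# Two hypothetical candidates: one far but rarely winning, one closer but proven.
far_unproven = {"distance": 0.9, "win_rate": 0.1}
near_proven = {"distance": 0.7, "win_rate": 0.8}

def pick(n_events):
    scored = {
        "far_unproven": selection_score(far_unproven["distance"], far_unproven["win_rate"], n_events),
        "near_proven": selection_score(near_proven["distance"], near_proven["win_rate"], n_events),
    }
    return max(scored, key=scored.get)

day_zero_choice = pick(0)
later_choice = pick(25)
```

Before the threshold the most distant doc wins; once enough events exist, a doc with a strong track record can overtake it while distance still carries most of the weight.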

#### 6.6.6 Why Not LLM Composition

| | LLM composition | Corpus-anchored composition |
|---|---|---|
| Fake diversity | Common ("clean" vs "minimal") | Impossible — enforced by embedding distance |
| Incoherent bundles | Possible (brutalist + glassmorphism) | Prevented — each doc is independently the best match for its domain |
| Hallucination | LLM can invent patterns not in corpus | Can't — only selects from retrieved candidates |
| Vague hypotheses | "this is more engaging" | Pre-written per strategy label, specific and testable |
| Speed | 1-3s (LLM call) | <200ms (5 domain searches + distance scoring) |
| Cost per call | $0.01-0.05 | ~$0.001 (1 embedding call) |
| Deterministic | No | Yes — same input, same output |
| Attribution | Hard (LLM may ignore docs) | Automatic — every doc maps to a retrieval score |

**Role of LLM:** Used downstream to **generate code/markup** for each variant (React, Tailwind, etc.), but the design decisions — which style, colors, layout — are deterministic.

### 6.7 Stage 7 — LTR Re-ranking of Composed Variants

After composition, LTR re-ranks the composed variant bundles (not individual docs) by predicted win probability. Features are extracted per-bundle:

```python
from statistics import mean

def extract_bundle_features(variant_bundle, customer_id, component_type, page_context):
    source_docs = list(variant_bundle.docs.values())  # one RetrievedDoc per domain
    return {
        # Aggregate retrieval quality across source docs
        "avg_cohere_score":     mean(doc.cohere_score for doc in source_docs),
        "min_cohere_score":     min(doc.cohere_score for doc in source_docs),

        # Label-level signals (strategy label is post-hoc — treated as a feature, not a constraint)
        "strategy_label":               variant_bundle.label,  # "minimal" / "bold" / "layered"
        "global_label_win_rate":        float,  # cross-customer win rate for this label

        # Attribute-level signals (aggregated across the bundle's source docs)
        "style_doc_win_rate":           float,  # global_win_rate from ChromaDB metadata
        "color_doc_win_rate":           float,
        "typo_doc_win_rate":            float,
        "style_color_combo_win_rate":   float,  # how often this style+color pairing wins globally
        "avg_doc_win_rate":             float,  # mean across all 5 domain docs

        # Context match
        "component_type_match":  bool,  # do source docs target this component type?
        "page_context_match":    bool,
    }
```

LTR scores full variant bundles, not individual building blocks. It can therefore learn patterns such as "bold-labeled bundles win on checkout pages for this customer" and rank those bundles higher.

---

## 7. Learning-to-Rank System

### 7.1 Feature Engineering

Features per (context, candidate) pair:

```python
def extract_features(query, candidate_doc, customer_id, component_type, page_context):
    return {
        # Retrieval quality signals
        "bm25_score":           float,   # raw BM25 score
        "dense_score":          float,   # cosine similarity
        "rrf_score":            float,   # RRF merged score
        "cohere_score":         float,   # cross-encoder relevance

        # Document attributes (categorical → one-hot or embedding)
        "domain":               str,     # style / color / typography / etc.
        "style_category":       str,     # Glassmorphism / Minimalism / etc.
        "complexity":           str,     # Low / Medium / High
        "performance":          str,
        "accessibility":        str,

        # Customer historical signals for this document's attributes
        "customer_style_win_rate":    float,  # historical win rate for this style
        "customer_style_appearances": int,    # how many times tested
        "customer_style_confidence":  float,  # decayed confidence score

        # Context match signals
        "component_type_match": float,   # did this doc appear in experiments with same component_type?
        "page_context_match":   float,   # same for page_context
        "query_domain_match":   float,   # does doc domain match detected query domain?

        # Global signals
        "global_win_rate":      float,   # win rate across all customers
        "global_appearances":   int,

        # Temporal
        "days_since_last_win":  float,   # recency of wins
    }
```

### 7.2 Training Data Format

```python
# Pairwise format for LambdaRank
# group = all candidates for one query/context
# label = relevance score (higher = more relevant/preferred)

# From selection event: customer picked B over A
# → B gets label 1.0, each rejected variant gets label 0.0
# → weight = 0.3 (selection is weaker signal than A/B result)

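# From A/B result (winner):
# → winning variant: label = 1
# → losing variant: label = 0
# → weight = normalized_cvr_lift * test_confidence (typically 0.5-1.0)
# LightGBM's lambdarank objective requires integer labels, so test_confidence
# scales the example weight rather than the label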

# From A/B result (staged):
# Stage 1 (day 0): immediate metrics, weight_multiplier = 0.6
# Stage 2 (day 7): 7-day metrics, weight_multiplier = 0.25
# Stage 3 (day 30): 30-day metrics, weight_multiplier = 0.15

import lightgbm as lgb

train_data = lgb.Dataset(
    X,           # feature matrix (n_examples × n_features)
    label=y,     # relevance labels
    group=groups, # query groups (how many examples per query)
    weight=w,    # per-example weights
)

model = lgb.train(
    params={
        "objective":     "lambdarank",
        "metric":        "ndcg",
        "ndcg_eval_at":  [3, 5],
        "num_leaves":    31,
        "learning_rate": 0.05,
    },
    train_set=train_data,
    num_boost_round=100,
)
```
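
The `group` argument expects the *size* of each query group, in the same order as the rows of `X`. The sketch below shows one way to assemble `X`, `y`, `groups`, and `w` from selection events using plain lists. The event shape (a dict of per-variant feature rows plus the selected id) is an assumption made for illustration.

```python
SELECTION_WEIGHT = 0.3  # selection is a weaker signal than an A/B result

def selection_event_to_rows(variant_features, selected_id):
    """variant_features: {variant_id: [feature values]} for one query/context."""
    rows, labels, weights = [], [], []
    for variant_id, feats in variant_features.items():
        rows.append(feats)
        labels.append(1 if variant_id == selected_id else 0)
        weights.append(SELECTION_WEIGHT)
    return rows, labels, weights

def build_dataset_arrays(events):
    """events: list of (variant_features, selected_id), one per query group."""
    X, y, w, groups = [], [], [], []
    for variant_features, selected_id in events:
        rows, labels, weights = selection_event_to_rows(variant_features, selected_id)
        X.extend(rows)
        y.extend(labels)
        w.extend(weights)
        groups.append(len(rows))  # group size, not a group id
    return X, y, groups, w

# Two hypothetical selection events with made-up two-feature rows.
events = [
    ({"A": [0.2, 0.5], "B": [0.7, 0.1], "C": [0.4, 0.4]}, "B"),
    ({"A": [0.9, 0.3], "B": [0.1, 0.8]}, "A"),
]
X, y, groups, w = build_dataset_arrays(events)
```

The resulting lists can be passed to `lgb.Dataset` after converting `X` to a 2-D array; the sum of `groups` must equal the number of rows.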

### 7.3 Model Architecture

**Per-customer vs global:**

- New customer (< 20 training examples): use global model, no per-customer model
- Growing customer (20-100 examples): global model with customer feature vector appended
- Mature customer (> 100 examples): customer-specific model fine-tuned from global

**Retraining cadence:** nightly batch job. LambdaRank on ~2000 examples trains in seconds. No streaming update needed.

**Cold start (global baseline):**
```python
# Global model trained on all customers' data combined
# Customer features zeroed out for new customers
# Produces recommendations informed by aggregate patterns
# Gradually shifts to customer-specific as data accumulates
```

### 7.4 Hierarchical Fallback

When a customer-specific model doesn't have enough data for a specific context:

```
customer + component_type + page_context  (min 20 examples)
  → customer + component_type             (min 10 examples)
    → customer only                       (min 5 examples)
      → global baseline
```

If a customer has strong signal for "checkout CTA" but nothing for "pricing hero", the pricing hero query falls back to the global model while the checkout CTA query uses the customer-specific model.
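
The fallback chain can be sketched as a lookup over example counts. The function name `resolve_model_scope` and the shape of `example_counts` (scope tuple to number of training examples) are assumptions; the minimum counts are the ones listed in the chain above.

```python
# (fields that make up the scope key, minimum training examples)
FALLBACK_LEVELS = [
    (("customer", "component_type", "page_context"), 20),
    (("customer", "component_type"), 10),
    (("customer",), 5),
]

def resolve_model_scope(example_counts, customer, component_type, page_context):
    """Return the most specific scope with enough examples, else the global baseline."""
    context = {
        "customer": customer,
        "component_type": component_type,
        "page_context": page_context,
    }
    for fields, min_examples in FALLBACK_LEVELS:
        key = tuple(context[f] for f in fields)
        if example_counts.get(key, 0) >= min_examples:
            return key
    return ("global",)

# Hypothetical counts for three customers.
example_counts = {
    ("acme", "cta-button", "checkout"): 25,
    ("acme", "cta-button"): 30,
    ("acme",): 40,
    ("globex", "hero"): 12,
    ("globex",): 15,
    ("initech",): 2,
}
full_scope = resolve_model_scope(example_counts, "acme", "cta-button", "checkout")
partial_scope = resolve_model_scope(example_counts, "globex", "hero", "pricing")
cold_scope = resolve_model_scope(example_counts, "initech", "hero", "pricing")
```

The returned tuple can then key a model registry, so each query transparently uses the most specific model that has enough data behind it.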

### 7.5 Confidence Decay

```python
import math

HALF_LIFE_DAYS = 90.0

def decayed_label(original_label, days_since_experiment):
    decay = math.pow(0.5, days_since_experiment / HALF_LIFE_DAYS)
    return original_label * decay

# A 90-day-old experiment contributes 50% of its original weight
# A 180-day-old experiment contributes 25%
# Prevents stale signals from dominating after rebrand or strategy shift
```

---

## 8. Attribution System

### 8.1 What Gets Stored

At generation time, before returning variants to the user:

```python
attribution = {
    "retrieval": {
        "bm25_rank":    {doc_id: rank for each candidate},
        "dense_rank":   {doc_id: rank for each candidate},
        "rrf_score":    {doc_id: score},
        "cohere_score": {doc_id: score},
    },
    "ltr": {
        "model_version":     "2026-03-19-acme-corp",
        "feature_importance": {
            "customer_style_win_rate": 0.28,
            "cohere_score":            0.22,
            "component_type_match":    0.18,
            "global_win_rate":         0.12,
            # ...
        },
        "top_contributing_experiments": [
            {
                "experiment_id":   "exp-291",
                "customer_id":     "acme-corp",  # anonymized for other customers
                "component_type":  "cta-button",
                "page_context":    "checkout",
                "result":          "+23% CVR",
                "style":           "minimalism",
                "weight_contributed": 0.18,
                "days_ago":        14
            },
            # up to 5 experiments
        ],
        "selection_contributions": 8,   # how many selection events trained this
    },
    "corpus_basis": [
        "Minimalism: clean whitespace, no decorative elements, high contrast CTA",
        "Checkout patterns: single primary action, trust signals near CTA",
        "B2B SaaS: professional typography, data-forward layout",
    ],
    "confidence":          0.84,
    "low_confidence_flag": False,
    "cold_start_flag":     False,
}
```

### 8.2 User-Facing Explanation

```
Variant 1 — Label: minimal (ui-reasoning.csv anchor for "Fintech/Crypto + checkout")
Confidence: 84%

Hypothesis: Reducing cognitive load and distraction increases conversion — single
focal point, maximum whitespace, one action. Grounded in expert recommendation for
fintech checkout contexts.

Why this combination was recommended:
• Variant 1 is anchored on ui-reasoning.csv row "Fintech/Crypto":
  Style_Priority=Minimalism+Accessible & Ethical, Color_Mood=Navy+Trust Blue+Gold
• Minimalism style doc selected 8 times over alternatives in similar fintech contexts
• Experiment exp-291 (+23% CVR, checkout CTA, 14 days ago) used the same style doc
• Navy+Trust Blue color palette has 67% win rate on B2B fintech checkout globally
• LTR model weights: doc win rate (28%), combo win rate (22%), context match (18%)

Design basis (corpus-anchored selection):
• Style: Minimalism (Accessible & Ethical) — style_styles_0 — closest to ui-reasoning keywords
• Color: Navy + Trust Blue + Gold — color_colors_14 — closest to "Navy + Trust Blue + Gold"
• Typography: Professional, clear hierarchy — typography_typography_3 — closest to "Professional + Trustworthy"
• Layout: Hero + Features + Trust CTA — landing_landing_2 — closest to "Trust & Authority" pattern
• UX: Security badges, SOC2, low-commitment CTA — ux_ux-guidelines_11

⚠ Animation was auto-adjusted to 'none' based on your 4 previous swap_doc edits.
```

### 8.3 Referenceability for Single Mode

When `mode="single"`, use Wilson lower bound to pick the most conservative confident recommendation:

```python
from math import sqrt
from scipy.stats import norm

def wilson_lower_bound(wins, total, confidence=0.95):
    if total == 0:
        return 0.0
    # Lower bound of the two-sided Wilson score interval for a binomial proportion
    z = norm.ppf(1 - (1 - confidence) / 2)
    p_hat = wins / total
    center = p_hat + z * z / (2 * total)
    margin = z * sqrt(p_hat * (1 - p_hat) / total + z * z / (4 * total * total))
    return (center - margin) / (1 + z * z / total)

# Combine with LTR score
# final = 0.7 * ltr_score + 0.3 * wilson_lower_bound(wins, total)
# Wilson penalizes high win rates based on tiny samples

---

## 9. Human-in-the-Loop Flow

### 9.1 Selection Signal

```
Customer receives 3 variants: A (Glassmorphism), B (Minimalism), C (Flat Design)
Customer picks B

→ log_selection(selected=B, rejected=[A, C])

Pairwise training examples created:
  (context, B_features, A_features) → B > A, weight=0.3
  (context, B_features, C_features) → B > C, weight=0.3

Written to ltr_training_examples table
Included in next nightly retraining
```

### 9.2 Edit Signal

Edit events are **more granular than selection events** — they produce domain-level pairwise labels, not just bundle-level ones. Three edit types carry different signal weight.

**Type 1: swap_doc** — user replaced the corpus doc for a specific domain (cleanest signal)
```
Customer picks Variant 1 (Minimalism bundle) but swaps the color doc:
  domain: "color"
  original_doc_id: "color_colors_4"  (Navy + Trust Blue)
  replacement_doc_id: "color_colors_22"  (Vibrant gradient)

→ log_edit(domain="color", edit_type="swap_doc",
           original_doc_id="color_colors_4",
           replacement_doc_id="color_colors_22")

Domain-level pairwise labels created:
  (context, domain=color, replacement_doc) > (context, domain=color, original_doc)
  weight = 0.8 (if total edit_distance ≤ 2)

Corpus doc win rates updated:
  color_colors_22.global_win_rate ++
  color_colors_4.global_win_rate not credited (was rejected)
```

**Type 2: freeform** — user typed their own value, not in corpus (corpus gap signal)
```
Customer edits color to #1A3A5C — not in any corpus doc

→ log_edit(domain="color", edit_type="freeform",
           original_doc_id="color_colors_4",
           replacement_doc_id=NULL)

Signals written:
  Negative label on color_colors_4 for this context
  Gap candidate logged: {domain: "color", page_type: "checkout",
                         industry: "fintech", value: "#1A3A5C", count: 1}

Nightly job: if gap candidate appears ≥3 times across different users
  → flag as corpus gap for human review
  → if confirmed, add new corpus row and re-embed
```

**Type 3: partial** — user kept the doc but tweaked a parameter (soft negative)
```
Customer keeps Glassmorphism style but reduces blur intensity

→ log_edit(domain="style", edit_type="partial",
           original_doc_id="style_styles_7",
           replacement_doc_id="style_styles_7")  # same doc

Signals written:
  Soft negative on style_styles_7 at 0.3× weight
  (doc was right direction, not perfectly calibrated)
```

**Auto-apply patterns:**
```
edit_events table updated
auto_apply_patterns checked:
  animation edits for acme-corp: 4 times → add to auto_apply_patterns
  border_radius edits: 2 times → not yet (threshold = 3)
```
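
The threshold check can be sketched with a counter over logged edits. The event shape and the function name `auto_apply_attributes` are hypothetical, and the sketch assumes an attribute qualifies once its edit count reaches the threshold of 3.

```python
from collections import Counter

AUTO_APPLY_THRESHOLD = 3  # threshold from the example above

def auto_apply_attributes(edit_events, customer_id):
    """Attributes this customer has edited at least AUTO_APPLY_THRESHOLD times."""
    counts = Counter(
        e["attribute"] for e in edit_events if e["customer_id"] == customer_id
    )
    return sorted(attr for attr, n in counts.items() if n >= AUTO_APPLY_THRESHOLD)

# Mirrors the example: 4 animation edits and 2 border_radius edits for acme-corp.
edit_events = (
    [{"customer_id": "acme-corp", "attribute": "animation"}] * 4
    + [{"customer_id": "acme-corp", "attribute": "border_radius"}] * 2
    + [{"customer_id": "other-co", "attribute": "border_radius"}] * 5
)
patterns = auto_apply_attributes(edit_events, "acme-corp")
```

Counting per customer keeps one customer's habits from being auto-applied to another customer's variants.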

### 9.3 A/B Result Attribution Rule

**Critical:** always use the final edited attributes for signal attribution, never the original recommendation.

```python
# Wrong — credits recommendation that wasn't actually tested
ingest_ab_result(variant_attributes=original_recommendation_attrs, winner=True)

# Correct — credits what was actually in production
ingest_ab_result(variant_attributes=final_edited_attrs, winner=True)
```

This prevents false credit accumulation. A customer who completely redesigns a variant (edit_distance > 5) before testing should not significantly boost the original recommendation's signal, even if the redesigned variant wins. The `edit_confidence_weight` term handles this:

```python
weight = normalized_cvr_lift * test_confidence * edit_confidence_weight(edit_distance)
# edit_confidence_weight: ≤2 edits=0.8, ≤5=0.4, >5=0.1
```
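
Written out as a function, the tiers in the comment above look like this. The name `ab_result_weight` is a hypothetical wrapper around the weight formula; the tier boundaries and values are the ones stated in the comment.

```python
def edit_confidence_weight(edit_distance):
    """Down-weight results from variants that were heavily edited before testing."""
    if edit_distance <= 2:
        return 0.8
    if edit_distance <= 5:
        return 0.4
    return 0.1

def ab_result_weight(normalized_cvr_lift, test_confidence, edit_distance):
    return normalized_cvr_lift * test_confidence * edit_confidence_weight(edit_distance)

# Same test outcome, different amounts of editing before the test ran.
lightly_edited = ab_result_weight(1.0, 0.9, edit_distance=1)
heavily_edited = ab_result_weight(1.0, 0.9, edit_distance=7)
```

A heavily redesigned winner still contributes a training signal, but at one eighth of the weight of a lightly edited one.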

---

## 10. Multi-Metric Ingestion

### 10.1 Staged Approach

```
Day 0 — Stage 1 (60% of final signal):
  Metrics: CVR, bounce_rate, scroll_depth, rage_click_rate
  Available immediately from PostHog/GA4
  Composite score computed, written to ltr_training_examples at 0.6× weight
  Stages 2 and 3 scheduled

Day 7 — Stage 2 (25% of final signal):
  Metrics: return_7d, activation_rate, support_ticket_rate
  Check: does stage 2 direction match stage 1?
  If reversal detected: write correction, flag for review
  Written at 0.25× weight

Day 30 — Stage 3 (15% of final signal):
  Metrics: retention_d30, feature_adoption
  Final signal written at 0.15× weight
  Experiment marked fully_ingested = 1
```
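
A small sketch of the schedule and the reversal check. It assumes "Day 0" is the day the experiment result is first ingested and that a reversal means the stage 1 and stage 2 composite scores have opposite signs; both function names are hypothetical.

```python
from datetime import date, timedelta

# (stage, day offset, weight multiplier) from the schedule above
STAGES = [
    (1, 0, 0.60),
    (2, 7, 0.25),
    (3, 30, 0.15),
]

def schedule_stages(day_zero):
    """Return (stage, due_date, weight) for each ingestion stage."""
    return [
        (stage, day_zero + timedelta(days=offset), weight)
        for stage, offset, weight in STAGES
    ]

def is_reversal(stage1_score, stage2_score):
    """True when the two composite scores point in opposite directions."""
    return stage1_score * stage2_score < 0

schedule = schedule_stages(date(2026, 3, 19))
```

The three multipliers sum to 1.0, so an experiment that completes all stages contributes its full weight exactly once.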

### 10.2 Composite Score

```python
PAGE_INTENT = {
    "checkout":   "conversion",   # lower time_to_convert = better
    "landing":    "conversion",
    "pricing":    "conversion",
    "dashboard":  "efficiency",   # lower task_completion_time = better
    "onboarding": "engagement",   # higher time on page = better (up to a point)
    "docs":       "engagement",
}
```

### Extensible Metric Registry (Critical Design Principle)

**Never hardcode metric names into composite score, LTR features, or any downstream logic.**
All metric handling must loop over the registry — nothing in the system should know specific
metric names. This ensures new metrics from new data sources can be added with zero code changes.

```python
# ──────────────────────────────────────────────────────────────
# METRIC_REGISTRY is the single source of truth for all metrics.
# Adding a new metric from ANY source = one dict entry + one adapter.
# No migrations, no composite score changes, no LTR pipeline changes.
# ──────────────────────────────────────────────────────────────

METRIC_REGISTRY = {
    # Stage 1 — Day 0
    "cvr":              {"weight": +0.30, "direction": "up",   "stage": 1, "source": "posthog"},
    "bounce_rate":      {"weight": -0.15, "direction": "down", "stage": 1, "source": "posthog"},
    "scroll_depth":     {"weight": +0.05, "direction": "up",   "stage": 1, "source": "posthog"},
    "rage_click_rate":  {"weight": -0.10, "direction": "down", "stage": 1, "source": "posthog"},

    # Stage 2 — Day 7
    "return_7d":        {"weight": +0.20, "direction": "up",   "stage": 2, "source": "posthog"},
    "activation_rate":  {"weight": +0.15, "direction": "up",   "stage": 2, "source": "posthog"},

    # Stage 3 — Day 30
    "retention_d30":    {"weight": +0.15, "direction": "up",   "stage": 3, "source": "posthog"},
    "feature_adoption": {"weight": +0.10, "direction": "up",   "stage": 3, "source": "posthog"},

    # ── Future sources: just add rows ──
    # "nps_score":         {"weight": +0.10, "direction": "up",   "stage": 2, "source": "delighted"},
    # "support_tickets":   {"weight": -0.08, "direction": "down", "stage": 2, "source": "intercom"},
    # "mrr_expansion":     {"weight": +0.20, "direction": "up",   "stage": 3, "source": "stripe"},
}
```

**To add a new metric from a new data source, you do exactly 3 things:**

1. **Add a source adapter** — one class that fetches the metric value:
```python
class IntercomAdapter:
    """Each source adapter implements one method."""
    def get_metrics(self, experiment_id: str, date_range: tuple) -> dict:
        # Returns: {"support_tickets": 14}
        # Keys must match METRIC_REGISTRY keys
        tickets = self.client.conversations.search(experiment_id, date_range)
        return {"support_tickets": len(tickets)}
```

2. **Add a row to METRIC_REGISTRY** — weight, direction, stage, source name.

3. **Nothing else.** Everything downstream reads from the registry:

```python
def composite_score(metrics: dict, baseline_metrics: dict, page_intent: str) -> float:
    """Loops over registry — never references specific metric names."""
    score = 0.0
    intent_filter = PAGE_INTENT_METRIC_FILTER.get(page_intent, list(METRIC_REGISTRY.keys()))

    for metric_name, config in METRIC_REGISTRY.items():
        if metric_name not in intent_filter:
            continue
        value = metrics.get(metric_name)
        baseline = baseline_metrics.get(metric_name)
        if value is None or baseline is None or baseline == 0:
            continue

        normalized = (value - baseline) / abs(baseline)
        normalized = max(-1.0, min(1.0, normalized))

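        # Direction alone sets the sign: flip "down" metrics (lower = better)
        # and apply the weight's magnitude. Registry weights for "down" metrics
        # are already negative, so multiplying by the signed weight after the
        # flip would reward regressions.
        if config["direction"] == "down":
            normalized *= -1

        score += abs(config["weight"]) * normalized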
    return score
```

**Why this matters for LTR:** New metrics become new features in training data automatically. LightGBM handles missing values natively, so older training examples that lack the new metric remain valid as long as the absent value is encoded as missing (NaN) rather than 0. The model learns the new signal's importance as data accumulates, with no changes to the retraining pipeline.

**Why `metrics_json` in signals.db is a JSON blob, not fixed columns:** Adding a new metric never requires a database migration. The schema `metrics_json TEXT` stores `{"cvr": 0.12, "bounce_rate": 0.45, "support_tickets": 3}` — any key-value pair, any source, any time.
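As a minimal sketch of the schema-free storage described above, the snippet below writes two metric payloads with different keys into the same `metrics_json TEXT` column and reads them back. The table name `ab_results`, the other columns, the helper names, and the in-memory SQLite connection are assumptions for illustration; only the `metrics_json TEXT` column comes from the design.

```python
import json
import sqlite3

# In-memory database for illustration only. Table and column names other than
# metrics_json are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ab_results (experiment_id TEXT, variant_id TEXT, metrics_json TEXT)"
)

def store_metrics(experiment_id: str, variant_id: str, metrics: dict) -> None:
    """Serialize any metric dict into the metrics_json column."""
    conn.execute(
        "INSERT INTO ab_results VALUES (?, ?, ?)",
        (experiment_id, variant_id, json.dumps(metrics)),
    )

def load_metrics(experiment_id: str, variant_id: str) -> dict:
    """Read the blob back as a dict; a missing row yields an empty dict."""
    row = conn.execute(
        "SELECT metrics_json FROM ab_results WHERE experiment_id = ? AND variant_id = ?",
        (experiment_id, variant_id),
    ).fetchone()
    return json.loads(row[0]) if row else {}

# An older row without the new metric and a newer row with it coexist
# without any schema change.
store_metrics("exp-1", "v1", {"cvr": 0.12, "bounce_rate": 0.45})
store_metrics("exp-2", "v1", {"cvr": 0.10, "bounce_rate": 0.50, "support_tickets": 3})
```

Because `composite_score` skips any metric whose value is `None`, rows written before a metric existed simply contribute nothing for that metric.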

### 10.3 Extended SDK Metric Registry

The full set of signals available across PostHog and GA4, organized by availability and what they measure. Not all metrics are used in every composite score — each has a `page_intent` scope that controls when it's relevant.

#### Interaction Quality (Stage 1 — Day 0)

| Metric | Source | Signal Type | Page Intent | Weight |
|--------|--------|-------------|-------------|--------|
| `cvr` | PostHog/GA4 | Primary outcome | conversion | +0.30 |
| `bounce_rate` | PostHog/GA4 | Exit quality | all | -0.15 |
| `scroll_depth` | PostHog | Engagement depth | engagement | +0.05 |
| `rage_click_rate` | PostHog | Frustration | all | -0.10 |
| `dead_click_rate` | PostHog | Confusion (click w/ no response) | all | -0.08 |
| `u_turn_rate` | PostHog | Immediate regret (back within 5s) | conversion | -0.07 |
| `time_to_first_interaction` | PostHog | Above-fold clarity | landing | -0.05 (lower = better) |
| `cta_click_rate` | PostHog | Primary CTA engagement | conversion | +0.12 |
| `form_start_rate` | PostHog | Intent signal | conversion | +0.08 |
| `form_abandonment_rate` | PostHog | Friction signal | conversion | -0.12 |
| `error_encounter_rate` | PostHog | Technical friction | all | -0.10 |
| `hover_time_above_fold` | PostHog | Visual engagement | landing | +0.04 |
| `exit_rate` | GA4 | Page-level exit (vs bounce = session) | all | -0.08 |
| `avg_session_duration` | GA4 | Engagement breadth | engagement | +0.06 |
| `pages_per_session` | GA4 | Exploration depth | engagement | +0.05 |

#### Retention and Activation (Stage 2 — Day 7)

| Metric | Source | Signal Type | Page Intent | Weight |
|--------|--------|-------------|-------------|--------|
| `return_7d` | PostHog | Early retention | all | +0.20 |
| `activation_rate` | PostHog | Completed key action (setup, first purchase, etc.) | onboarding | +0.15 |
| `support_ticket_rate` | PostHog/CRM | Confusion / UX failure | all | -0.10 |
| `nps_survey_response` | PostHog | Qualitative satisfaction proxy | all | +0.08 |
| `trial_to_paid_cvr` | PostHog | Business-critical | saas | +0.25 |
| `upgrade_click_rate` | PostHog | Expansion intent | saas/pricing | +0.12 |
| `new_vs_returning_split` | GA4 | Audience quality signal (segment only, not scored) | — | segmentation |
| `traffic_source_cvr_delta` | GA4 | Paid vs organic CVR gap (flag if variant only works for warm users) | conversion | audit |

#### Long-term Value (Stage 3 — Day 30)

| Metric | Source | Signal Type | Page Intent | Weight |
|--------|--------|-------------|-------------|--------|
| `retention_d30` | PostHog | Deep retention | all | +0.15 |
| `feature_adoption` | PostHog | Product engagement breadth | saas/dashboard | +0.10 |
| `expansion_mrr_delta` | PostHog/billing | Revenue expansion | saas | +0.20 |
| `churn_rate_delta` | PostHog/billing | Retention health | saas | -0.20 |
| `ltv_proxy` | derived | 30d revenue ÷ acquisition cost | ecommerce/saas | +0.15 |
| `user_cohort_d30_retention` | GA4 | Cohort-level stickiness | all | +0.12 |

#### Page-Intent-Specific Routing

Not every metric should contribute to every composite score. Apply this routing table before computing:

```python
PAGE_INTENT_METRIC_FILTER = {
    "conversion": [
        "cvr", "bounce_rate", "rage_click_rate", "dead_click_rate",
        "u_turn_rate", "cta_click_rate", "form_start_rate",
        "form_abandonment_rate", "time_to_first_interaction",
        "return_7d", "trial_to_paid_cvr",
    ],
    "engagement": [
        "scroll_depth", "avg_session_duration", "pages_per_session",
        "hover_time_above_fold", "rage_click_rate", "dead_click_rate",
        "return_7d", "nps_survey_response",
    ],
    "efficiency": [
        "task_completion_time",  # lower = better, flip sign
        "dead_click_rate", "error_encounter_rate",
        "support_ticket_rate", "form_abandonment_rate",
    ],
    "onboarding": [
        "activation_rate", "time_to_first_interaction",
        "form_abandonment_rate", "support_ticket_rate",
        "feature_adoption", "return_7d",
    ],
    "retention": [
        "retention_d30", "churn_rate_delta", "expansion_mrr_delta",
        "feature_adoption", "nps_survey_response", "ltv_proxy",
    ],
}
```

Metrics not in the active `page_intent` filter are collected but excluded from composite scoring for that experiment. They're still stored and available for segmentation.

#### Guardrail-Only Metrics

These are never scored positively — they only trigger disqualification when thresholds are breached:

```python
GUARDRAIL_DEFAULTS = {
    "rage_click_rate":      {"max": 0.08},   # >8% rage clicks = disqualify
    "dead_click_rate":      {"max": 0.12},
    "error_encounter_rate": {"max": 0.03},   # >3% JS errors = disqualify
    "form_abandonment_rate":{"max": 0.75},   # >75% abandonment = disqualify
    "churn_rate_delta":     {"max": 0.02},   # variant increases churn >2% = disqualify
}
```

These guardrails apply globally. Customers can tighten them but not loosen past the defaults.
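One way to enforce the tighten-only rule is sketched below. The function names are hypothetical, and treating a guardrail that has no global default (such as `bounce_rate`) as customer-defined and accepted as-is is an assumption, not something the design specifies.

```python
GUARDRAIL_DEFAULTS = {
    "rage_click_rate":       {"max": 0.08},
    "dead_click_rate":       {"max": 0.12},
    "error_encounter_rate":  {"max": 0.03},
    "form_abandonment_rate": {"max": 0.75},
    "churn_rate_delta":      {"max": 0.02},
}

def effective_guardrails(customer_overrides: dict) -> dict:
    """Merge customer guardrails with the defaults; overrides may only tighten."""
    merged = {name: dict(rule) for name, rule in GUARDRAIL_DEFAULTS.items()}
    for name, rule in customer_overrides.items():
        default_max = merged.get(name, {}).get("max")
        if default_max is None:
            # Assumption: metrics without a global default are accepted as-is.
            merged[name] = {"max": rule["max"]}
        else:
            # A looser customer value is clamped back to the default.
            merged[name] = {"max": min(default_max, rule["max"])}
    return merged

def violated_guardrails(metrics: dict, guardrails: dict) -> list:
    """Return the guardrail metrics whose observed value exceeds its max."""
    return [
        name
        for name, rule in guardrails.items()
        if metrics.get(name) is not None and metrics[name] > rule["max"]
    ]
```

A variant would be disqualified whenever `violated_guardrails` returns a non-empty list, regardless of its composite score.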

### 10.4 Customer Optimization Goals

Customers can override default metric weights:

```python
# register_customer() accepts:
optimization_goal = {
    "primary_metric":   "retention_d30",  # up-weight this
    "secondary_metric": "activation_rate",
    "guardrails": {
        "bounce_rate":     {"max": 0.60},  # disqualify if exceeded
        "rage_click_rate": {"max": 0.05},
    }
}
```

Primary metric gets 50% weight, secondary gets 30%, remaining 20% split across others. Guardrail violations disqualify the variant regardless of other scores.
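The 50/30/20 rule can be sketched as follows. The function name is hypothetical, and splitting the remaining 20% evenly is an assumption: the text above does not say whether the split is even or proportional to the default weights. The returned values are magnitudes; the sign still comes from each metric's `direction`.

```python
def apply_optimization_goal(default_weights: dict, primary: str, secondary: str) -> dict:
    """Return weight magnitudes: 0.50 primary, 0.30 secondary, 0.20 shared by the rest.

    Assumption: the remaining 20% is split evenly across the other metrics.
    """
    if primary == secondary:
        raise ValueError("primary and secondary metrics must differ")
    for name in (primary, secondary):
        if name not in default_weights:
            raise ValueError(f"unknown metric: {name}")

    others = [m for m in default_weights if m not in (primary, secondary)]
    weights = {primary: 0.50, secondary: 0.30}
    for metric in others:
        weights[metric] = 0.20 / len(others)
    return weights

# Example with four registry metrics (default weights taken from the tables above).
weights = apply_optimization_goal(
    {"cvr": 0.30, "bounce_rate": -0.15, "retention_d30": 0.15, "activation_rate": 0.15},
    primary="retention_d30",
    secondary="activation_rate",
)
```

With only the primary and secondary metrics registered, the weights sum to 0.80; whether to renormalize in that case is left open here.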

---

## 11. Corpus Design

### 11.1 Starting Corpus

Migrated from ui-ux-pro-max-skill CSV databases:

| Domain | Source File | Rows | Notes |
|--------|-------------|------|-------|
| style | styles.csv | 67 | Core visual design patterns |
| color | colors.csv | 96 | Full semantic token sets |
| typography | typography.csv | 57 | Font pairings + Google Fonts URLs |
| chart | charts.csv | 25 | Chart type recommendations |
| landing | landing.csv | 24 | Page structure + CTA patterns |
| product | products.csv | 161 | Product type → style mappings |
| ux | ux-guidelines.csv | 99 | UX best practices + anti-patterns |
| icons | icons.csv | 50 | Icon libraries |
| react | react-performance.csv | 30 | React/Next.js patterns |
| web | app-interface.csv | 20 | App interface guidelines |
| **component_patterns** | **new** | **0→50+** | **Component-level actionable patterns** |

`google-fonts.csv` (1923 rows) is a lookup table — never embedded, queried by name only.

### 11.2 New Domain: `component_patterns`

The existing CSV data is **system-level** (design system for an app). You need **component-level** patterns for the A/B testing use case (specific, actionable, at the button/card/hero granularity).

Each row:
```
component_type:    checkout-cta-button
page_context:      checkout
pattern_name:      High-Contrast Single CTA
description:       Single primary CTA with high contrast ratio (7:1+), no competing
                   secondary actions visible, trust signals (lock icon, SSL text)
                   immediately below button.
implementation:    bg-orange-500 text-white font-semibold px-8 py-4 rounded-lg
                   cursor-pointer hover:opacity-90 transition duration-200
                   + trust badge below
best_for:          B2B SaaS checkout, e-commerce payment, subscription signup
win_signals:       High CVR on checkout pages, low rage clicks, high scroll completion
anti_patterns:     Multiple CTAs competing, ghost button as primary, missing trust signals
source:            Design best practices + aggregated A/B patterns
```

This domain is built incrementally from winning A/B variants and session replay analysis.

### 11.3 Prose Serialization

Each row is serialized as labeled prose for better embedding quality:

```python
def serialize_style_row(row):
    synonyms = SYNONYM_MAP.get(row["Style Category"].lower(), [])
    synonym_text = f"Also known as: {', '.join(synonyms)}." if synonyms else ""
    return (
        f"Style: {row['Style Category']}. "
        f"Type: {row['Type']}. "
        f"Keywords: {row['Keywords']}. "
        f"Best For: {row['Best For']}. "
        f"Effects and Animation: {row['Effects & Animation']}. "
        f"Performance: {row['Performance']}. "
        f"Accessibility: {row['Accessibility']}. "
        f"CSS Keywords: {row['CSS/Technical Keywords']}. "
        f"AI Prompt: {row['AI Prompt Keywords']}. "
        f"{synonym_text}"
    )

SYNONYM_MAP = {
    "glassmorphism": [
        "frosted glass", "glass effect", "blur panel", "translucent surface",
        "frosted panel", "glass morphism", "backdrop blur"
    ],
    "minimalism": [
        "minimal", "clean design", "whitespace heavy", "simple ui",
        "flat clean", "stripped back", "uncluttered"
    ],
    "neumorphism": [
        "soft ui", "embossed", "pressed button", "soft shadow ui",
        "skeuomorphic soft", "inset shadow"
    ],
    # ... all 67 styles
}
```

### 11.4 Representation: Hybrid Abstract + Concrete

Each document includes:
1. Abstract design principles (for semantic search quality)
2. Tailwind class examples (for developer actionability)
3. CSS variable names (for design token integration)

```
Style: Glassmorphism.
Principles: Creates depth through translucency; UI elements appear to float
  above a blurred background; communicates modernity and layering.
Tailwind implementation: backdrop-blur-md bg-white/10 border border-white/20
  rounded-xl shadow-lg
CSS variables: --glass-bg: rgba(255,255,255,0.1); --glass-blur: 12px;
  --glass-border: rgba(255,255,255,0.2)
Best for: SaaS dashboards, fintech, modern landing pages with image backgrounds.
Avoid: Low-contrast text on glass; purely decorative blur without purpose.
```

---

## 12. Corpus Quality Validation

### 12.1 Coverage Matrix

Before launch, map every cell of:
```
product_type × component_type × page_context
```

Each cell must have ≥1 relevant document. Use a script to detect gaps:

```python
PRODUCT_TYPES = ["saas", "fintech", "ecommerce", "healthcare", "gaming", ...]
COMPONENT_TYPES = ["hero", "cta", "pricing-card", "navbar", "modal", "form", ...]
PAGE_CONTEXTS = ["landing", "checkout", "dashboard", "onboarding", "settings", ...]

# Walk every cell of the matrix, including page context
for product in PRODUCT_TYPES:
    for component in COMPONENT_TYPES:
        for context in PAGE_CONTEXTS:
            query = f"{product} {component} {context}"
            results = retrieval_pipeline(query, top_k=5)
            if not results or max_score(results) < COVERAGE_THRESHOLD:
                print(f"GAP: {product} × {component} × {context}")
```

Target: zero gaps for the top 20 product types × top 10 component types.

### 12.2 Offline Evaluation Test Set

Curate 50-100 (query, expected_doc_ids) pairs manually:

```python
TEST_QUERIES = [
    {
        "query": "glassmorphism fintech dashboard hero section",
        "relevant_docs": ["style_styles_3", "color_colors_12", "landing_landing_7"],
        "irrelevant_docs": ["style_styles_45"],  # brutalism — wrong style
    },
    {
        "query": "checkout CTA button high contrast B2B",
        "relevant_docs": ["ux_ux_guidelines_8", "style_styles_1", "landing_landing_3"],
    },
    {
        "query": "mobile bottom navigation max items",
        "relevant_docs": ["ux_ux_guidelines_24"],
    },
    # ... 50+ more
]
```

**Metrics:**

```
Recall@5:  fraction of relevant docs appearing in top 5 results
           Target: ≥ 0.80 before launch

MRR (Mean Reciprocal Rank): mean over queries of 1/rank_of_first_relevant_result
           Target: ≥ 0.70

NDCG@5:    normalized discounted cumulative gain, accounts for rank position
           Target: ≥ 0.75

Precision@3: of top 3 results, fraction that are relevant
           Target: ≥ 0.65
```

```python
import math
from statistics import mean

def evaluate_corpus(test_queries, pipeline):
    recall_scores, precision_scores, mrr_scores, ndcg_scores = [], [], [], []

    for test in test_queries:
        results = pipeline(test["query"], top_k=5)
        result_ids = [r.doc_id for r in results]
        relevant = set(test["relevant_docs"])

        # Recall@5: share of the relevant docs that made the top 5
        hits = sum(1 for r in result_ids if r in relevant)
        recall_scores.append(hits / len(relevant))

        # Precision@3: share of the top 3 results that are relevant
        top3_hits = sum(1 for r in result_ids[:3] if r in relevant)
        precision_scores.append(top3_hits / 3)

        # MRR: reciprocal rank of the first relevant result (0 if none)
        for rank, doc_id in enumerate(result_ids, 1):
            if doc_id in relevant:
                mrr_scores.append(1.0 / rank)
                break
        else:
            mrr_scores.append(0.0)

        # NDCG@5 with binary gains
        gains = [1 if r in relevant else 0 for r in result_ids]
        dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains))
        ideal = sum(1 / math.log2(i + 2) for i in range(min(len(relevant), 5)))
        ndcg_scores.append(dcg / ideal if ideal > 0 else 0.0)

    return {
        "Recall@5":    mean(recall_scores),
        "Precision@3": mean(precision_scores),
        "MRR":         mean(mrr_scores),
        "NDCG@5":      mean(ndcg_scores),
    }
```

### 12.3 Vocabulary Gap Detection

Queries that score poorly on BM25 but should be relevant indicate vocabulary gaps:

```python
def detect_vocabulary_gaps(test_queries, bm25_index):
    gaps = []
    for test in test_queries:
        bm25_results = bm25_index.search(test["query"], top_k=10)
        relevant_bm25_scores = [
            score for doc_id, score in bm25_results
            if doc_id in test["relevant_docs"]
        ]
        if not relevant_bm25_scores or max(relevant_bm25_scores) < 0.5:
            gaps.append({
                "query": test["query"],
                "expected": test["relevant_docs"],
                "diagnosis": "BM25 miss — add synonyms or enrich document vocabulary"
            })
    return gaps
```

For each gap: identify which query tokens are absent from the relevant document, add them as synonyms.
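The token comparison can be sketched as below. The function name is hypothetical, and the lowercase alphanumeric tokenizer is an assumption; in practice it should match whatever tokenizer the BM25 index uses.

```python
import re

def missing_query_tokens(query: str, document_text: str) -> list:
    """Return query tokens that never appear in the document, in query order."""
    def tokenize(text: str) -> list:
        # Assumption: lowercase alphanumeric tokens, no stemming.
        return re.findall(r"[a-z0-9]+", text.lower())

    doc_tokens = set(tokenize(document_text))
    seen, missing = set(), []
    for token in tokenize(query):
        if token not in doc_tokens and token not in seen:
            seen.add(token)
            missing.append(token)
    return missing

candidates = missing_query_tokens(
    "frosted glass fintech hero",
    "Style: Glassmorphism. Best for: fintech dashboards.",
)
```

Each returned token is a candidate synonym for the relevant document's `SYNONYM_MAP` entry; a human should still decide which ones belong.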

### 12.4 Inter-Rater Reliability

For the top 20 most common queries, have 2+ human raters independently grade retrieval results as relevant/irrelevant. Measure Cohen's kappa:

```python
from sklearn.metrics import cohen_kappa_score

# Rater 1 grades: [1, 1, 0, 1, 0, ...]  (1=relevant, 0=irrelevant)
# Rater 2 grades: [1, 0, 0, 1, 1, ...]
kappa = cohen_kappa_score(rater1_labels, rater2_labels)
# Target: κ ≥ 0.70 (substantial agreement)
# Below 0.70 means your relevance definition is ambiguous
```

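Low kappa means raters disagree on what counts as relevant. Either the relevance definition is ambiguous, or the corpus documents are: they don't clearly belong to one design concept. Tighten the grading rubric first, then rewrite or split the documents that still divide raters.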

### 12.5 Embedding Quality Check

Visualize document clusters to verify semantic coherence:

```python
import numpy as np
import umap
import matplotlib.pyplot as plt

# Fetch all embeddings from ChromaDB
all_embeddings = chroma.get(include=["embeddings", "metadatas"])
embeddings_matrix = np.array(all_embeddings["embeddings"])
domains = [m["domain"] for m in all_embeddings["metadatas"]]

# Reduce to 2D
reducer = umap.UMAP(n_components=2, random_state=42)
embeddings_2d = reducer.fit_transform(embeddings_matrix)

# Plot — same-domain documents should cluster together
plt.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1],
            c=[domain_colors[d] for d in domains], alpha=0.6)
```

**What good looks like:** documents from the same domain cluster together; style documents are near each other; ux documents form their own cluster; there's some overlap between related domains (product ↔ color, style ↔ typography).

**What bad looks like:** all documents mixed randomly, no cluster structure — indicates prose serialization is too uniform (all docs sound the same) or embeddings are low quality.

### 12.6 LTR Offline Evaluation

Before deploying a new LTR model, evaluate on held-out experiments:

```python
# Hold out 20% of experiments as test set (by time — use oldest 80% for train)
# Predict rankings on test set, compare to actual A/B outcomes

# Metric: NDCG@3 on test set
# Target: NDCG@3 > 0.72 before deploying customer-specific model

# Reversal rate: fraction of test experiments where LTR ranked the loser above winner
# Target: reversal rate < 20%
```
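A minimal sketch of the time-based holdout and the reversal-rate check follows. The record fields (`completed_at`, `winner_score`, `loser_score`) and both function names are assumptions for illustration; the scores stand for the LTR model's predicted scores for the actual A/B winner and loser.

```python
def time_based_split(experiments: list, test_fraction: float = 0.2):
    """Train on the oldest experiments, test on the newest test_fraction."""
    if not experiments:
        return [], []
    ordered = sorted(experiments, key=lambda e: e["completed_at"])
    n_test = max(1, round(len(ordered) * test_fraction))
    cut = len(ordered) - n_test
    return ordered[:cut], ordered[cut:]

def reversal_rate(test_experiments: list) -> float:
    """Fraction of test experiments where the model scored the loser above the winner."""
    if not test_experiments:
        return 0.0
    reversals = sum(
        1 for e in test_experiments if e["loser_score"] > e["winner_score"]
    )
    return reversals / len(test_experiments)
```

Splitting by time rather than at random keeps later outcomes from leaking into training, which matters because the corpus and the model both change as experiments accumulate.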

---

## 13. Corpus Improvement Loops

### Loop 1 — Query Log Gap Analysis (Weekly)

```python
# Identify queries where customers heavily edited the variants (average edit_distance > 4)
failing_queries = db.query("""
    SELECT e.query, e.component_type, e.page_context, AVG(ev.edit_distance) as avg_edit
    FROM experiments e
    JOIN edit_events ev ON e.experiment_id = ev.experiment_id
    GROUP BY e.experiment_id
    HAVING AVG(ev.edit_distance) > 4
    ORDER BY avg_edit DESC
    LIMIT 50
""")
# Each row = a corpus gap. Review and add missing documents.
```

### Loop 2 — Winning Variant Extraction (After Each Batch of 10 Experiments)

```python
# Find A/B winners with moderate edit distance (customer improved on your rec)
winning_edited_variants = db.query("""
    SELECT sv.attributes_json, sv.edited_attrs_json, ar.composite_score,
           e.component_type, e.page_context
    FROM shown_variants sv
    JOIN ab_results ar ON sv.experiment_id = ar.experiment_id
      AND sv.variant_id = ar.variant_id
    WHERE ar.is_winner = 1
      AND sv.edit_distance BETWEEN 2 AND 5  -- improved but not rebuilt
      AND ar.composite_score > 0.15         -- meaningful lift
    ORDER BY ar.composite_score DESC
    LIMIT 20
""")
# Review each — abstract into a component_patterns row
```

### Loop 3 — Document Win Rate Tracking (Monthly)

```python
# Track which corpus documents appear in winning vs losing variants
doc_performance = db.query("""
    SELECT json_each.value as doc_id,
           AVG(ar.is_winner) as win_rate,
           COUNT(*) as appearances
    FROM shown_variants sv,
         json_each(sv.source_docs_json) as json_each
    JOIN ab_results ar ON sv.experiment_id = ar.experiment_id
      AND sv.variant_id = ar.variant_id
    WHERE ar.impressions >= 500
    GROUP BY doc_id
    HAVING appearances >= 5
    ORDER BY win_rate ASC
""")
# Bottom 10% by win_rate → review for revision or deprecation
# Top 10% → promote with boosted global_win_rate in ChromaDB metadata
```

### Loop 4 — Session Replay Review (After Each PostHog Sync)

```python
# Flag losing variant sessions with high frustration signals
sessions_to_review = posthog.get_recordings(filters={
    "experiment_loser": True,
    "rage_clicks__gte": 3,
    "duration__lte": 20,  # bounced quickly
})
# Human reviews these sessions
# Each observed UI failure → new row in ux-guidelines.csv
# "Users clicked the ghost button expecting it to be the primary action"
# → Anti-pattern: ghost button as primary CTA on conversion pages
```

### Loop 5 — Synonym Expansion (After Each Vocabulary Gap Analysis)

```python
# Queries that hit embedding but miss BM25 = vocabulary gap
dense_hits_bm25_miss = [
    q for q in test_queries
    if dense_recall(q) > 0.8 and bm25_recall(q) < 0.4
]
# For each: identify missing tokens, add to SYNONYM_MAP
# Rebuild BM25 index (fast, < 1 minute)
```

### Loop 6 — Quarterly Trend Review

Structured review of:
- Dribbble/Behance: new design patterns gaining traction
- Major product redesigns (Notion, Linear, Vercel, Stripe)
- New CSS capabilities (container queries, CSS nesting, etc.)
- Updated accessibility guidelines (WCAG 2.2, WCAG 3.0 draft)

Each trend producing ≥3 distinct component variations earns a new corpus row.

---

## 14. Integration Architecture

### 14.1 PostHog

```python
class PostHogIntegration:
    def get_experiment_metrics(self, experiment_id) -> dict:
        """Returns: cvr, bounce_rate, scroll_depth, rage_click_rate,
        avg_time_to_convert, funnel_drop_off_step."""
        ...

    def get_7day_metrics(self, experiment_id) -> dict:
        """Returns: return_7d, activation_rate, support_ticket_rate."""
        ...

    def get_30day_metrics(self, experiment_id) -> dict:
        """Returns: retention_d30, feature_adoption."""
        ...

    def get_flagged_sessions(self, experiment_id, variant="loser") -> list:
        """Returns: session IDs with rage_clicks >= 3 and duration <= 20s.
        Used for corpus improvement Loop 4."""
        ...

    def get_funnel_drop_off(self, funnel_id) -> dict:
        """Returns: drop-off rate per step.
        Used to suggest the highest-impact page for recommendation."""
        ...
```

PostHog's native experiment system makes this the simplest integration: each PostHog experiment ID maps 1:1 to an Orpheus `experiment_id`.

### 14.2 GA4

```python
class GA4Integration:
    def get_engagement_metrics(self, page_path, date_range) -> dict:
        """Returns: engagement_rate, avg_session_duration,
        scrolled_users_pct, bounce_rate."""
        ...

    def get_traffic_source_breakdown(self, page_path) -> dict:
        """Returns: CVR by channel (organic, paid, direct, social).
        Used to segment signals: paid search users respond differently."""
        ...

    def get_funnel_report(self, funnel_steps) -> dict:
        """Returns: drop-off at each step.
        24-48hr lag vs PostHog real-time."""
        ...

    def get_user_segment_performance(self, segment_dimension) -> dict:
        """Returns: conversion by user property (plan_type, company_size, etc.)."""
        ...
```

GA4 has no native A/B experiment tracking (Google Optimize was sunset in September 2023). Customers must push experiment data as custom events for attribution. This requires customer-side instrumentation, so document it clearly as a setup requirement.

**GA4 unique value:** traffic source segmentation. If paid search users (cold audience) respond differently to design variants than direct users (warm audience), signals should be stored with a `traffic_segment` dimension:

```python
update_ltr_training_example(
    ...,
    context_features={
        "traffic_segment": "paid_search",
        # other features...
    }
)
```

---

## 15. Limitations and Known Tradeoffs

### 15.1 Feedback Loop Dependency

Adaptability (R2) depends on customers completing the SDK integration and running A/B tests regularly. Historically, 60-70% of B2B customers in API-dependent products don't complete instrumentation. Day 1 value must be strong from corpus alone. Treat adaptability as a retention feature, not an acquisition feature.

**Mitigation:** ensure global baseline (cross-customer signals) provides meaningful lift over pure retrieval from day 1.

### 15.2 Attribution Granularity vs Privacy

Contributing experiment data shown in attribution may reveal competitive information if customers share the same global baseline. Anonymize customer IDs in cross-customer attribution ("8 experiments in similar contexts" not "acme-corp experiment exp-291").
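A sketch of the anonymization rule is shown below. The function name and record fields are hypothetical, and showing the viewer's own experiment IDs while collapsing everyone else's into a count is an assumption about the desired behavior.

```python
def anonymized_attribution(contributing: list, viewer_customer_id: str) -> str:
    """Summarize contributing experiments without exposing other customers."""
    own_ids = [
        e["experiment_id"]
        for e in contributing
        if e["customer_id"] == viewer_customer_id
    ]
    other_count = len(contributing) - len(own_ids)

    parts = []
    if own_ids:
        # Assumption: a customer may see IDs of its own experiments.
        parts.append("your experiments " + ", ".join(own_ids))
    if other_count:
        noun = "experiment" if other_count == 1 else "experiments"
        parts.append(f"{other_count} {noun} in similar contexts")
    return " and ".join(parts) if parts else "no contributing experiments"
```

Very small counts can still identify a competitor in a niche vertical, so a minimum-count threshold before showing any cross-customer number may be worth considering.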

### 15.3 Corpus Staleness

Design trends shift. The existing CSV data was last curated for general UI generation, not A/B testing component patterns. Quarterly reviews and winning variant extraction mitigate this but require ongoing editorial work.

### 15.4 LTR Cold Start

New customers see global model recommendations. The global model is trained on the aggregate of all customers' data, so it may not reflect niche product types (e.g., IoT dashboard, AR/VR interface) that are underrepresented in the customer base. Coverage matrix validation catches systematic gaps; emerging niches are a known blind spot.

### 15.5 Multi-Metric Composite Score Is Opinionated

The composite score formula embeds value judgments (CVR matters more than scroll depth). Customers with different optimization goals must configure this explicitly — don't assume defaults are right. Surface low-confidence warnings when composite score is driven by few metrics.

### 15.6 Cohere Rerank Latency and Cost

~300ms latency per query, ~$0.001 per 20-document rerank. At scale (10,000 queries/day), this is $10/day — negligible. At 100,000 queries/day, $100/day — still manageable but worth monitoring. Implement fallback to RRF order if Cohere is unavailable.
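The fallback and the cost arithmetic can be sketched as follows. The `rerank` argument is a hypothetical callable standing in for the Cohere call (no Cohere SDK signature is assumed), and both function names are illustrative.

```python
from typing import Callable, Sequence

def rerank_with_fallback(
    query: str,
    rrf_ordered_docs: Sequence[str],
    rerank: Callable[[str, Sequence[str]], Sequence[str]],
    top_n: int = 3,
) -> list:
    """Use the reranker when it responds; otherwise keep the RRF order."""
    try:
        return list(rerank(query, rrf_ordered_docs))[:top_n]
    except Exception:
        # Reranker unavailable or timed out: fall back to the RRF order.
        return list(rrf_ordered_docs)[:top_n]

def daily_rerank_cost(queries_per_day: int, cost_per_rerank: float = 0.001) -> float:
    """Daily spend in dollars at the per-rerank cost quoted above."""
    return queries_per_day * cost_per_rerank
```

Catching every exception is deliberately broad for a sketch; a production version would likely log the failure and catch only timeout and transport errors.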

---

## 16. React SDK Expansion Plan

The current React SDK tracks CTR only. This section defines the phased metric expansion required for the composite score and LTR training pipeline.

### SDK Week 1 — Frustration + Engagement Signals

Add via DOM event listeners, no backend changes:

| Metric | Implementation | Why first |
|--------|---------------|-----------|
| `rage_click_rate` | 3+ clicks within 500ms on same element | Highest-signal negative metric, cheap to implement |
| `dead_click_rate` | click with no DOM mutation within 300ms (MutationObserver) | Catches broken/confusing UI that CTR misses |
| `scroll_depth` | IntersectionObserver on 25/50/75/100% markers | Tells you if users even see the variant |

These three + existing CTR give a complete Stage 1 signal set.
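The rage-click rule from the table (3+ clicks within 500ms on the same element) can be illustrated offline. The SDK itself runs in the browser; this Python sketch only shows the detection logic over a recorded click log, and the function name and the `(timestamp_ms, element_id)` event shape are assumptions.

```python
from collections import defaultdict

def count_rage_click_bursts(clicks, min_clicks: int = 3, window_ms: int = 500) -> int:
    """Count bursts of min_clicks or more clicks on one element within window_ms.

    clicks: iterable of (timestamp_ms, element_id) pairs, in any order.
    """
    by_element = defaultdict(list)
    for timestamp_ms, element_id in clicks:
        by_element[element_id].append(timestamp_ms)

    bursts = 0
    for timestamps in by_element.values():
        timestamps.sort()
        i = 0
        while i + min_clicks - 1 < len(timestamps):
            if timestamps[i + min_clicks - 1] - timestamps[i] <= window_ms:
                bursts += 1
                # Consume every click inside this burst's window so one
                # frantic sequence is counted once.
                j = i + min_clicks
                while j < len(timestamps) and timestamps[j] - timestamps[i] <= window_ms:
                    j += 1
                i = j
            else:
                i += 1
    return bursts
```

`rage_click_rate` would then be the share of sessions with at least one burst; the choice of denominator (sessions vs. page views) is not fixed by the table above.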

### SDK Week 2 — Form + CTA Micro-Funnel

| Metric | Implementation | Why |
|--------|---------------|-----|
| `form_start_rate` | `focus` event on first form field | Intent signal |
| `form_abandonment_rate` | started (focus) but never submitted | Friction detection |
| `cta_click_rate` | click on elements with `data-variant-cta` attribute | Primary conversion action, distinct from general CTR |

Convention: SDK consumers tag their primary CTA with `data-variant-cta` and variant wrapper with `data-variant-id`.

### SDK Week 3 — Return Visit + Session Linkage

| Metric | Implementation | Why |
|--------|---------------|-----|
| `return_7d` | localStorage flag on first visit, check on return within 7d window | Stage 2 anchor metric, measures if design drives re-engagement |
| `session_id` linkage | Generate UUID per session (tab open → 30min idle timeout), attach to all events | Enables `pages_per_session` and `avg_session_duration` computation downstream |

This is the hardest week — requires persistence and identity stitching across page loads.
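The 30-minute idle rule for session linkage can be sketched as below. This is server-side Python for illustration only (the SDK would do the equivalent in the browser); the function name and the input shape, ascending epoch-second timestamps for one browser tab, are assumptions.

```python
import uuid

IDLE_TIMEOUT_S = 30 * 60  # 30-minute idle timeout from the table above

def assign_session_ids(event_timestamps: list) -> list:
    """Return one session ID per event, starting a new session after an idle gap.

    event_timestamps: ascending epoch seconds for a single browser tab.
    """
    session_ids = []
    current_id = None
    last_ts = None
    for ts in event_timestamps:
        if last_ts is None or ts - last_ts > IDLE_TIMEOUT_S:
            current_id = str(uuid.uuid4())
        session_ids.append(current_id)
        last_ts = ts
    return session_ids
```

With a session ID on every event, `pages_per_session` and `avg_session_duration` reduce to a group-by on that ID downstream.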

### SDK Week 4 — Package + Instrument

1. Bundle SDK, write integration guide (< 10 min setup for customer)
2. Dogfood on 1 pilot customer
3. `data-variant-id` attribute convention documented and enforced
4. Events batch-flushed every 5s (or on `beforeunload`) to minimize network overhead
5. SDK size budget: < 8KB gzipped

**End state:** 9 metrics (CTR + 8 new), covering Stage 1 fully and Stage 2 partially. Enough for composite score + LTR training.

**Deferred to month 2:** retention/churn/LTV (requires billing integration), NPS (separate surface), Next.js/Vue/Svelte ports (event logic is framework-agnostic, only hooks wrapper changes).

---

## 17. One-Week MVP

### Goal

Validate one thing: **does retrieval + diversity give customers better variant ideas than they'd come up with themselves?**

No LTR, no A/B ingestion, no PostHog, no customer accounts, no staged metrics. Just: describe a component → get 3 diverse, attributed variants → customer says "I'd test this" or "this is useless."

### What you build

| Day | Deliverable |
|-----|-------------|
| Day 1 | **Ground truth construction.** Pick 15 companies (5 B2B SaaS, 5 e-commerce, 5 marketplace/fintech). Screenshot 3 pages each (landing, pricing, checkout). Decompose into attributes. Write 30-35 queries with expected results → `ground_truth.csv`. |
| Day 2 | **Corpus validation + fixes.** Run ground truth queries against existing CSV. Grade top 5 per query. Compute Recall@5, Precision@3, MRR. Fix gaps: add missing CSV rows, rewrite noisy rows, add synonyms. Write prose serializer. Write 10 `component_patterns` rows. Re-run, confirm ≥75% usefulness. |
| Day 3 | **Retrieval pipeline.** OpenAI embedding + ChromaDB ingestion. BM25 index. Single MCP tool: `recommend_variants(component, page_type, industry)`. BM25 + dense → RRF → MMR per strategy. Deterministic strategy-based composition: 3 strategies × multi-domain search → `select_best_match` per domain → assemble variant bundles. Returns 3 variants with source attribution. Skip Cohere rerank. |
| Day 4 | **Self-test.** Run 5 ground truth queries end-to-end through MCP. Verify results make sense visually. Fix any obvious ranking issues. |
| Day 5 | **User validation.** Put it in front of 3-5 people. Collect feedback. |

### What you skip

| Component | Why safe to skip |
|-----------|-----------------|
| LTR / LambdaRank | No training data yet |
| Cohere rerank | RRF + MMR is sufficient to validate the concept |
| Customer accounts / Supabase tables | Everyone gets global corpus |
| React SDK | Manual feedback via Slack/form at 5 users |
| PostHog / GA4 integration | No experiments running yet |
| Staged metric ingestion | Nothing to ingest |
| Edit signal loop | Ask verbally what they'd change |
| Multimodal embeddings | Text-only proves the core thesis |

### Validation protocol

Run a blind head-to-head with each test user:

```
For each query, show:
  Column A: Orpheus MVP's 3 variants (unlabeled)
  Column B: Raw Claude's 3 variants for the same prompt (unlabeled)

Ask:
  1. Which column would you A/B test? (A / B / both / neither)
  2. What's missing that you expected to see?
  3. Did the explanation make sense? (yes / no)
```

- Question 1 validates R1 (ideation quality)
- Question 2 reveals corpus gaps
- Question 3 validates R3 (referenceability)

**5 users × 3 queries each = 15 data points.**

### Pass/fail criteria

| Signal | Pass | Fail |
|--------|------|------|
| "I'd A/B test 2+ of 3 variants" | ≥60% of responses | <60% → corpus needs work |
| Users prefer Orpheus over raw Claude | Win or tie on majority of queries | Lose consistently → rethink approach |
| "Explanation makes sense" | ≥80% yes | <80% → attribution format needs rework |
| "Nothing missing" or minor gaps | ≤3 unique gap themes across all users | >3 themes → corpus has systematic holes |

**If MVP passes:** proceed to 4-week full build. You also get 15 selection signals (which variants they picked) — first LTR training data for free.

**If MVP fails:** fix corpus and retrieval before adding any complexity. LTR cannot save bad candidates.

---

## 18. 4-Week Build Plan (Compressed Rollout)

This is the original 10-week rollout compressed to 4 weeks. Key tradeoffs: corpus validation runs in parallel with infra (not gated), PostHog/GA4 is deferred to post-launch, and hardening is continuous rather than phased.

### Week 1 — Corpus + Retrieval Pipeline

**Day 1: Ground truth construction (before touching any code)**
1. Pick 15 companies across 3 verticals:
   - B2B SaaS (5): Stripe, Linear, Vercel, Notion, Clerk
   - E-commerce (5): Shopify storefront, Allbirds, Glossier, Apple Store, Warby Parker
   - Marketplace/Fintech (5): Airbnb, Coinbase, Robinhood, Lemonade, Wise
2. Screenshot 3 pages per company (landing, pricing/product, signup/checkout) → 45 screenshots
3. Decompose each screenshot into attributes using template:
   `Company | Page | Style | Colors | Typography | CTA | Layout | Trust signals | Notable pattern`
4. Group decompositions into 20-25 queries with expected results (flip: "what would a customer type to get this?")
5. Add 10 Dribbble/Awwwards edge-case queries (glassmorphism, brutalism, bento grid, etc.)
6. Final deliverable: `ground_truth.csv` — 30-35 rows, columns: `query | expected_style | expected_color | expected_layout | expected_ux | notes`

**Day 2: Corpus prep (parallelize with infra)**
7. Run all 30-35 ground truth queries against existing CSV using `search.py`, grade top 5 results per query (1=relevant, 0=not)
8. Compute Recall@5, Precision@3, MRR — identify gap categories (missing content vs. noisy ranking vs. vocabulary mismatch)
9. Fix gaps: add missing rows to CSVs, rewrite noisy rows, add synonyms for vocabulary mismatches
10. Write prose serializer for all 10 CSV domains
11. Synonym expansion for top 30 design terms (informed by step 8 vocabulary gaps)
12. Write 10 `component_patterns` rows (prioritize patterns found in ground truth that have no corpus match)
13. Re-run ground truth queries, confirm ≥75% usefulness
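
The three metrics in step 8 can be computed with a few lines of plain Python. The sketch below is illustrative only: it assumes each graded query is reduced to a ranked list of doc IDs plus a set of relevant doc IDs, and the function names and sample IDs are hypothetical rather than taken from `evaluate.py`.

```python
from typing import Sequence


def recall_at_k(ranked: Sequence[str], relevant: set[str], k: int = 5) -> float:
    """Share of the relevant docs that show up in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def precision_at_k(ranked: Sequence[str], relevant: set[str], k: int = 3) -> float:
    """Share of the top k slots filled by a relevant doc."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def reciprocal_rank(ranked: Sequence[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0 if none was retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(runs: list[tuple[list[str], set[str]]]) -> dict[str, float]:
    """Average the three metrics over (ranked_ids, relevant_ids) pairs."""
    n = len(runs)
    return {
        "recall@5": sum(recall_at_k(r, rel, 5) for r, rel in runs) / n,
        "precision@3": sum(precision_at_k(r, rel, 3) for r, rel in runs) / n,
        "mrr": sum(reciprocal_rank(r, rel) for r, rel in runs) / n,
    }


# Two graded queries with made-up doc IDs.
runs = [
    (["a", "b", "c", "d", "e"], {"b", "z"}),
    (["x", "y", "q"], {"x"}),
]
metrics = evaluate(runs)
print(metrics)
```

A low Recall@5 with a reasonable MRR tends to point at missing content, while the reverse pattern tends to point at noisy ranking, which matches the gap categories in step 8.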

**Days 3-4: Retrieval infra**
14. OpenAI embedding + ChromaDB ingestion script
15. BM25 index builder
16. Minimal MCP server shell: one tool (`recommend_variants`), dense-only, returns top 3

**Day 5: Fusion + composition stack**
17. RRF fusion (BM25 + dense)
18. Cohere rerank + 5-min TTL cache + fallback to RRF order
19. `_embed()` + `query_with_embedding()` methods on `RetrievalPipeline` (share one embedding across all domain searches)
20. `select_variant_1()` — best match per domain using `ui-reasoning.csv` anchor keywords (`_find_reasoning_rule` + `_select_best_match`)
21. `compose_variants()` — retrieve top 8 per domain → V1 via reasoning anchor → V2/V3 via max embedding distance → post-hoc label assignment → assemble `ComposedVariant` bundles with attribution
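
Step 17 (RRF fusion) is small enough to sketch directly. This is a minimal illustration, not the project's `fusion.py`: the function name `rrf_fuse` is hypothetical, and the smoothing constant `k=60` is the commonly used default, assumed here because this section does not specify one.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: each list contributes 1 / (k + rank) per doc."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Made-up doc IDs: one BM25 ranking and one dense ranking for the same query.
bm25_ranking = ["a", "b", "c"]
dense_ranking = ["b", "c", "d"]
fused = rrf_fuse([bm25_ranking, dense_ranking])
print(fused)
```

Because RRF uses only ranks, BM25 scores and cosine similarities never need to be normalized against each other. The same fused order can serve as the fallback when the rerank call in step 18 fails.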

**Gate:** Recall@5 ≥ 0.80, MRR ≥ 0.70 on test set. 3 composed variants for "checkout button B2B SaaS" differ on ≥3 of 5 axes (diversity spot-check). If not met, fix corpus over weekend before proceeding.
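
The diversity spot-check in the gate can be automated. The sketch below makes two assumptions that this section does not state: the five axes are the five corpus domains (style, color, typography, layout, UX), and "differ on ≥3 of 5 axes" is read pairwise, so every pair of variants must differ on at least three axes. The dict-based variant shape and doc IDs are hypothetical.

```python
from itertools import combinations

# Assumed axes: one per corpus domain.
AXES = ("style", "color", "typography", "layout", "ux")


def axes_differing(a: dict[str, str], b: dict[str, str]) -> int:
    """Count axes on which two variants reference different corpus docs."""
    return sum(1 for axis in AXES if a.get(axis) != b.get(axis))


def passes_diversity_gate(variants: list[dict[str, str]], min_axes: int = 3) -> bool:
    """True when every pair of variants differs on at least min_axes axes."""
    return all(
        axes_differing(a, b) >= min_axes for a, b in combinations(variants, 2)
    )


v1 = {"style": "s1", "color": "c1", "typography": "t1", "layout": "l1", "ux": "u1"}
v2 = {"style": "s2", "color": "c2", "typography": "t1", "layout": "l2", "ux": "u1"}
v3 = {"style": "s3", "color": "c1", "typography": "t3", "layout": "l3", "ux": "u3"}
near_copy = {"style": "s1", "color": "c1", "typography": "t1", "layout": "l1", "ux": "u3"}

print(passes_diversity_gate([v1, v2, v3]))         # pairwise differences: 3, 4, 5
print(passes_diversity_gate([v1, v2, near_copy]))  # near_copy differs from v1 on 1 axis
```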

### Week 2 — Customer Layer + LTR Foundation

**Days 1-2: Customer data model (Supabase)**
1. Supabase migrations for new Orpheus tables: `brand_tokens`, `soft_guidance`, `optimization_goals`, `orpheus_variants`, `selection_events`, `edit_events`, `auto_apply_patterns`, `ab_results`, `ltr_training_examples`
2. SQLAlchemy models for new tables (extend existing `models.py` in Gushi repo)
3. `register_customer` tool — brand tokens, color constraints, typography → writes to `brand_tokens` table
4. Brand constraint application in retrieval (filter + boost)
5. `ComposedVariant` + `RetrievedDoc` dataclasses with full attribution fields

**Days 3-5: LTR system**
6. Feature extraction pipeline — bundle-level features (§6.7): aggregate retrieval scores, hypothesis category win rate, attribute win rates, context match
7. LambdaRank training script (lightgbm)
8. Generate synthetic training data: simulate 50 experiments from corpus
9. Train global baseline model
10. LTR re-ranking integrated at bundle level (after strategy composition, §6.7)
11. Attribution populated: feature importance + source docs per variant bundle

**Gate:** end-to-end call returns 3 hypothesis-driven, attributed variants respecting brand constraints. NDCG@3 > 0.65 on synthetic held-out set.
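
For the NDCG@3 part of the gate, a hand-rolled check is useful for sanity-testing whatever the training library reports. This sketch assumes graded relevance labels and the exponential gain form `2^rel - 1`; the label values below are made up.

```python
import math


def dcg_at_k(relevances: list[int], k: int) -> float:
    """Discounted cumulative gain over the first k positions."""
    return sum(
        (2**rel - 1) / math.log2(position + 2)
        for position, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(relevances: list[int], k: int = 3) -> float:
    """DCG of the model's ordering divided by DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Relevance labels listed in the order the model ranked the three bundles.
model_order = [1, 0, 2]
score = ndcg_at_k(model_order, k=3)
print(round(score, 4))
```

The gate value would then be the mean of `ndcg_at_k` over all held-out synthetic queries, compared against 0.65.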

### Week 3 — Human-in-the-Loop + Signals

**Days 1-2: Selection + edit loop**
1. `log_selection` tool → pairwise training examples to `selection_events` (Supabase)
2. `log_edit` tool → diff extraction, edit_distance, confidence weight → `edit_events` (Supabase)
3. Edit attribution rule enforced (use edited attrs for A/B credit, not original)
4. Auto-apply trigger: 3+ repeated edits above threshold → `auto_apply_patterns` + corpus candidate
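
Step 1 turns one selection into pairwise training examples. A minimal sketch, assuming the selected variant is treated as preferred over every other variant shown in the same response; the `PairwiseExample` class and its field names are hypothetical and do not reflect the `selection_events` schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairwiseExample:
    query_id: str
    preferred: str   # variant the customer selected
    rejected: str    # variant shown alongside it but not selected
    source: str = "selection"


def selection_to_pairs(
    query_id: str, shown: list[str], selected: str
) -> list[PairwiseExample]:
    """Emit one (selected > other) pair for each unselected variant."""
    if selected not in shown:
        raise ValueError(f"{selected!r} was not among the shown variants")
    return [
        PairwiseExample(query_id, preferred=selected, rejected=other)
        for other in shown
        if other != selected
    ]


pairs = selection_to_pairs("q1", shown=["v1", "v2", "v3"], selected="v2")
for pair in pairs:
    print(pair)
```

With three variants per recommendation, each selection yields two pairs, so the 15 MVP selections mentioned earlier would produce 30 pairwise examples under this assumption.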

**Days 3-4: A/B ingestion**
5. `ingest_ab_result` tool — reads from `experiment_metrics` (existing), writes composite to `ab_results` (new)
6. Composite score computation with `METRIC_WEIGHTS` + `PAGE_INTENT_METRIC_FILTER`
7. Stage 2/3 scheduled jobs (day 7, day 30) → Supabase edge functions or cron
8. Nightly LTR retraining job (cron) — reads from `ltr_training_examples`

**Day 5: SDK wiring**
9. Wire React SDK events → `log_selection` / `ingest_ab_result` endpoints
10. Verify full loop: recommend → select → edit → ingest → recommend again → verify ranking shift

**Gate:** full feedback loop verified end-to-end with simulated customer data.

### Week 4 — Multi-Metric + Ship

**Days 1-2: Composite score expansion**
1. `PAGE_INTENT_METRIC_FILTER` routing
2. `GUARDRAIL_DEFAULTS` enforcement (disqualify on breach)
3. Customer optimization goals (primary/secondary metric override)
4. Reversal detection between stages

**Days 3-4: Integration stubs + hardening**
5. PostHog integration — experiment metrics pull (Stage 1 only, stages 2-3 post-launch)
6. GA4 stub — engagement metrics + traffic source segmentation (full integration post-launch)
7. Offline evaluation suite (automated, runs on corpus change)
8. Vocabulary gap detection script

**Day 5: Ship**
9. SDK packaged + integration guide
10. MCP server deployed for pilot customer
11. Attribution format validated with pilot (R3 spot-check)

**Gate:** pilot customer can install SDK, run experiment, see recommendation shift. All three requirements (R1/R2/R3) verified.

### What's deferred to month 2

| Item | Why deferred |
|------|-------------|
| Full PostHog Stage 2/3 pull automation | Needs 30 days of real data to test |
| GA4 full integration | Requires customer-side custom event setup |
| Embedding cluster visualization (UMAP) | Nice-to-have diagnostics, not blocking |
| Next.js/Vue/Svelte SDK ports | React SDK proves the event schema first |
| Cohen's kappa inter-rater validation | Need 2+ raters, recruit post-launch |
| Win rate tracking by document | Need real experiment volume |

---

## 19. Project Structure

### 1-Week MVP

Minimal — just enough to validate retrieval + diversity:

```
orpheus/
├── README.md
├── images/
├── corpus/
│   ├── csv/                    # Raw CSV data (10 domains)
│   ├── prose/                  # Serialized prose documents
│   └── ground_truth.csv        # 30-35 validation queries
├── scripts/
│   ├── serialize_corpus.py     # CSV → prose serializer
│   ├── ingest.py               # Embed + load into ChromaDB
│   └── evaluate.py             # Recall@5, Precision@3, MRR
├── server/
│   ├── __init__.py
│   ├── mcp_server.py           # MCP shell — single tool
│   ├── retrieval.py            # BM25 + dense + RRF + MMR
│   └── config.py               # API keys, model names
├── chroma_data/                # ChromaDB persistent storage
├── requirements.txt
└── pyproject.toml
```

No LTR, no customer state, no SDK. One tool (`recommend_variants`), flat structure, no premature nesting.

### 4-Week Full Build

Grows from the MVP — same roots, more modules:

```
orpheus/
├── README.md
├── images/
├── corpus/
│   ├── csv/                        # Raw CSV data
│   ├── prose/                      # Serialized prose docs
│   ├── component_patterns/         # Component pattern docs
│   ├── ground_truth.csv
│   └── validation/
│       ├── gap_analysis.py         # Query log gap detection
│       └── quality_scores.py       # Coverage & freshness checks
├── scripts/
│   ├── serialize_corpus.py
│   ├── ingest.py
│   ├── evaluate.py
│   └── synthetic_data.py           # Generate synthetic experiments for LTR
├── server/
│   ├── __init__.py
│   ├── mcp_server.py               # MCP entry point — all tools registered here
│   ├── config.py
│   ├── retrieval/
│   │   ├── __init__.py
│   │   ├── bm25.py
│   │   ├── dense.py                # Embedding search
│   │   ├── fusion.py               # RRF
│   │   ├── rerank.py               # Cohere cross-encoder + cache
│   │   ├── diversity.py            # MMR
│   │   └── composer.py             # compose_variants() — corpus-anchored composition (§6.6)
│   │                               #   STRATEGY_PROFILES, STRATEGY_HYPOTHESES, assign_strategy_label()
│   │                               #   select_variant_1() via ui-reasoning.csv anchor
│   │                               #   select_by_max_distance() for V2 and V3
│   ├── ltr/
│   │   ├── __init__.py
│   │   ├── features.py             # Bundle-level feature extraction (§6.7) — strategy win rates, attribute win rates
│   │   ├── train.py                # LambdaRank (lightgbm)
│   │   ├── predict.py              # Score variant bundles
│   │   └── retrain_job.py          # Nightly cron entry point
│   ├── customers/
│   │   ├── __init__.py
│   │   ├── brand_constraints.py    # Filter + boost logic
│   │   └── supabase.py             # Supabase queries: brand_tokens, soft_guidance, optimization_goals
│   ├── signals/
│   │   ├── __init__.py
│   │   ├── selection.py            # log_selection → pairwise labels → selection_events table
│   │   ├── edits.py                # log_edit → diff extraction → edit_events table
│   │   ├── ab_ingestion.py         # ingest_ab_result → reads experiment_metrics, writes ab_results
│   │   ├── composite_score.py      # METRIC_WEIGHTS, PAGE_INTENT_METRIC_FILTER, guardrails
│   │   └── supabase.py             # Supabase queries for all signal tables
│   ├── attribution/
│   │   ├── __init__.py
│   │   ├── builder.py              # Build ComposedVariant with full attribution
│   │   └── explanation.py          # User-facing explanation generator
│   └── integrations/
│       ├── __init__.py
│       ├── posthog.py              # Stage 1 experiment pull
│       └── ga4.py                  # Stub
├── sdk/
│   ├── package.json
│   ├── src/
│   │   ├── index.ts                # SDK entry point
│   │   ├── events.ts               # CTR, rage_click, dead_click, scroll_depth
│   │   ├── forms.ts                # form_start, form_abandonment, cta_click
│   │   ├── sessions.ts             # session_id, return_7d
│   │   └── flush.ts                # Batch flush + beforeunload
│   └── tsconfig.json
├── supabase/
│   └── migrations/
│       └── orpheus_tables.sql      # New Orpheus tables (§5.3)
├── chroma_data/
├── models/                         # Trained LTR model artifacts
├── tests/
│   ├── test_retrieval.py
│   ├── test_composition.py         # Corpus-anchored composition + embedding-distance diversity tests
│   ├── test_ltr.py
│   ├── test_signals.py
│   └── test_attribution.py
├── requirements.txt
└── pyproject.toml
```

### Key Decisions

1. **`server/retrieval/` split from monolith** — MVP has one `retrieval.py`; by Week 2 you're adding LTR scoring and corpus-anchored composition, so the pipeline stages need their own files. `composer.py` orchestrates per-domain retrieval → ui-reasoning.csv anchor → max-distance selection → post-hoc label assignment (§6.6). There is no `strategies.py` — strategy labels are post-hoc and live as constants inside `composer.py`.

2. **`server/customers/` and `server/signals/` are separate** — different concerns, same Supabase instance. Customers = brand constraints (applied during retrieval). Signals = feedback loop (consumed by LTR training). Both access Supabase via their own `supabase.py` query modules.

3. **`server/attribution/` is its own module** — attribution is stored at generation time and never reconstructed (§4.4). Keeping it separate enforces that boundary.

4. **`sdk/` is a separate TypeScript package** — it ships to customers as an npm package with its own build. Lives in the repo but is independently deployable.

5. **`corpus/validation/` not `scripts/`** — corpus quality checks are ongoing (Loop 1, §13), not one-off scripts.

6. **`supabase/migrations/`** — all new Orpheus tables live in a single migration file. Existing Probat tables (`app_users`, `experiment_runs`, `experiment_metrics`, `probat_events`, etc.) are referenced via FK but never modified.
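
The max-distance selection step that `composer.py` performs for V2 and V3 (decision 1) can be illustrated with a greedy farthest-point pick. The function name `select_by_max_distance` appears in the tree above, but the signature here is an assumption, as is the choice to score each candidate by its minimum cosine distance to the docs already chosen. The embeddings are toy 2-D vectors.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity; 0 for identical direction, 2 for opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def select_by_max_distance(
    candidates: dict[str, list[float]], chosen: list[str], n: int = 2
) -> list[str]:
    """Greedily pick n docs, each maximizing its distance to the nearest chosen doc."""
    picked = list(chosen)
    for _ in range(n):
        remaining = [doc for doc in candidates if doc not in picked]
        if not remaining:
            break
        best = max(
            remaining,
            key=lambda doc: min(
                cosine_distance(candidates[doc], candidates[p]) for p in picked
            ),
        )
        picked.append(best)
    return picked[len(chosen):]


# "anchor" stands in for the V1 doc chosen via the ui-reasoning.csv rule.
embeddings = {
    "anchor": [1.0, 0.0],
    "near": [0.9, 0.1],
    "ortho": [0.0, 1.0],
    "opposite": [-1.0, 0.0],
}
v2_v3 = select_by_max_distance(embeddings, chosen=["anchor"], n=2)
print(v2_v3)
```

Using the minimum distance to already-chosen docs keeps V3 from landing next to V2; summing distances instead would be a reasonable alternative with different tie behavior.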

### Migration Path (MVP → Full Build)

| When | What changes |
|------|-------------|
| MVP Day 5 | `server/retrieval.py` splits into `server/retrieval/*.py` (incl. `composer.py`); add `_embed()` + `query_with_embedding()` methods |
| Week 2 Day 1 | Run `supabase/migrations/orpheus_tables.sql`. Add `server/customers/`, `server/signals/` |
| Week 2 Day 3 | Add `server/ltr/` with bundle-level features |
| Week 3 | Add `server/attribution/`, wire signals to Supabase |
| Week 4 | Add `server/integrations/`, `sdk/` |

No rewrites — single files promote into modules as complexity demands.

---

## Summary

The architecture satisfies all three requirements:

**R1 (Good ideation):** Hybrid retrieval (BM25 + dense + RRF) per domain, followed by corpus-anchored composition: Variant 1 from `ui-reasoning.csv` expert knowledge, Variants 2–3 by maximum embedding distance with uniqueness enforcement. Strategy labels are post-hoc descriptors, not retrieval constraints — the corpus determines what "different" means for each query. No LLM in the design decision loop — deterministic, fast, fully attributed.

**R2 (Adaptability):** Three feedback layers that activate progressively as data accumulates. Layer 1: selection events → bundle-level pairwise labels (free, immediate). Layer 2: edit events → domain-level pairwise labels, split by type (swap_doc, freeform, partial); freeform edits additionally surface corpus gaps. Layer 3: A/B results → CVR-weighted labels, staged at day 0/7/30. LambdaRank trains on all three sources nightly. `global_win_rate` in ChromaDB metadata updates nightly so retrieval is adaptive even before LTR is built.

**R3 (Referenceability):** Attribution stored at generation time in `ComposedVariant`. Every recommendation traceable to specific corpus doc IDs, the `ui-reasoning.csv` row that anchored V1, retrieval scores per stage, LTR feature importance, and contributing experiments. User-facing explanation generated from stored attribution — never reconstructed after the fact.

The biggest execution risk is corpus quality. Every other component amplifies the corpus — it cannot compensate for gaps in it. Week 1 corpus validation is the most important gate in the entire rollout.
