Mechanism: Cross-trajectory ensembling. Run N=4 INDEPENDENT main ReAct agents in parallel on the same question (fresh context and fresh trajectory each). A JUDGE LLM stage then receives all N (answer, evidence-summary) tuples and selects the final answer using two structural signals: (a) cross-run AGREEMENT: answers that multiple agents converged on from different trajectories are weighted up; (b) constraint COVERAGE: answers whose cited evidence explicitly satisfies more of the question's atomic constraints are weighted up. The judge must either pick one of the N answers or return "abstain → re-run with cross-run notes"; it may NOT invent a free-form answer.
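The selection rule above can be sketched as a small scoring function. This is a minimal illustration, not the judge used in the experiment: the `Candidate` type, the `normalize` helper, the equal weights, and the `min_score` abstain threshold are all hypothetical, and the real judge is an LLM rather than a fixed formula.

```python
from collections import Counter
from dataclasses import dataclass

ABSTAIN = "ABSTAIN"  # stands for "abstain -> re-run with cross-run notes"


@dataclass
class Candidate:
    answer: str
    satisfied: frozenset  # ids of atomic constraints the cited evidence explicitly supports


def normalize(answer: str) -> str:
    """Hypothetical answer canonicalization: lowercase and collapse whitespace."""
    return " ".join(answer.lower().split())


def judge(candidates, n_constraints, w_agree=1.0, w_cover=1.0, min_score=1.0):
    """Return one of the candidate answers, or ABSTAIN. Never invents an answer."""
    votes = Counter(normalize(c.answer) for c in candidates)
    best_answer, best_score = ABSTAIN, float("-inf")
    for c in candidates:
        # (a) agreement: fraction of the OTHER runs that reached the same answer
        agreement = (votes[normalize(c.answer)] - 1) / max(len(candidates) - 1, 1)
        # (b) coverage: fraction of atomic constraints backed by cited evidence
        coverage = len(c.satisfied) / n_constraints if n_constraints else 0.0
        score = w_agree * agreement + w_cover * coverage
        if score > best_score:
            best_answer, best_score = c.answer, score
    # Weak support on both signals triggers the abstain path (assumed threshold).
    return best_answer if best_score >= min_score else ABSTAIN
```

Because the return value is always either a member of the candidate set or `ABSTAIN`, the "must pick one of N" restriction is enforced structurally rather than by prompt wording.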
Hypothesis: Self-correction inside one trajectory shares the model's confirmation bias (root insight). N independent trajectories should produce largely uncorrelated errors; even when individual accuracy is 50%, agreement between 2 of 4 runs strongly indicates correctness, because independent wrong answers rarely coincide on the same entity, and disagreement gives the judge a real signal. This converts the bias problem into an aggregation problem, which is solvable with a structural (not free-form) judge.
Observable: B_dev improves by ≥ 10pp over the trunk. Specifically, on questions where any individual run gets the answer right, the ensemble should preserve it; the lift comes from N=4 runs sampling at least one correct trajectory more often than the trunk's single shot.
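As a rough check on the arithmetic behind this expectation, the sketch below computes how often 4 runs contain at least one, or at least two, correct trajectories. It assumes the runs are fully independent with 50% per-run accuracy, which is the hypothesis's idealization rather than a measured property; correlated failures (all 4 runs missing the same entity) would lower both numbers, so these are best read as upper bounds.

```python
from math import comb


def p_at_least(k: int, n: int, p: float) -> float:
    """P(at least k of n independent runs are correct), per-run accuracy p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


p_any = p_at_least(1, 4, 0.5)  # 1 - (1/2)^4 = 0.9375
p_two = p_at_least(2, 4, 0.5)  # 1 - 1/16 - 4/16 = 0.6875
print(f"at least one correct run: {p_any:.4f}")
print(f"at least two correct runs: {p_two:.4f}")
```

The gap between these idealized rates and the single-shot 50% is the headroom the ensemble is meant to capture; how much of it is realized depends on how correlated the runs actually are.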
Conflicts: Counters node 2 (sequential same-context falsifier) by replacing self-critique with population-level aggregation: the asymmetry that the root insight demanded comes from N independent contexts, not from a single second pass. Counters node 4 (decompose-and-intersect destroyed conjunctive signal) by keeping the FULL question intact in every sub-run; only the aggregation step differs.
Insight:
**Synthesis (Node 5: Cross-trajectory ensembling)**
Independent N=4 ensembling with a structural judge works, but the binding constraint is **candidate-set retrieval coverage**, not judge sophistication or trajectory diversity:
1. **Diversity via personas backfires (5.1):** Style-constrained agents waste budget on ill-fitting actions, weakening every candidate. BrowseComp's ceiling is tool-reachability, not action-distribution variety — keep agents capable, not stylistically varied.
2. **Summary→evidence judge upgrade gives only marginal lift (5.2, 5.3):** When all 4 runs miss the entity, no amount of judge verification recovers it. Run-to-run variance (~±5pp) swamps small parser/prompt fixes on 40-Q dev.
3. **Judge override with its own tool budget is the real unlock (5.4, +62.5%):** Allowing the judge to *exceed* the candidate set, gated by constraint-PASS checks, fired on ~10% of questions and broke the correlated-failure ceiling. This validates that "must pick one of N" was the true bottleneck.
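The override gate in item 3 can be sketched as follows. The control flow here is an assumption made for illustration: the `Proposal` type, the `judge_search` callable (standing in for the judge spending its own tool budget), and the PASS/FAIL/UNKNOWN labels are hypothetical, and the actual 5.4 implementation may order or gate these steps differently.

```python
from dataclasses import dataclass, field


@dataclass
class Proposal:
    answer: str
    checks: dict = field(default_factory=dict)  # constraint id -> "PASS" | "FAIL" | "UNKNOWN"


def all_pass(checks: dict, constraint_ids) -> bool:
    """True only if every atomic constraint is explicitly marked PASS."""
    return bool(constraint_ids) and all(checks.get(cid) == "PASS" for cid in constraint_ids)


def select_with_override(candidates, constraint_ids, judge_search, fallback):
    """Return (answer, source), where source is 'candidate', 'override', or 'fallback'."""
    # 1. Prefer an existing candidate that passes every constraint check.
    for c in candidates:
        if all_pass(c.checks, constraint_ids):
            return c.answer, "candidate"
    # 2. Otherwise the judge may search beyond the candidate set.
    proposal = judge_search()
    # 3. Accept the out-of-set answer only if it clears the same structural gate.
    if proposal is not None and all_pass(proposal.checks, constraint_ids):
        return proposal.answer, "override"
    return fallback, "fallback"
```

Gating the override on the same constraint checks keeps the judge from free-form invention: an out-of-set answer is accepted only when it clears the bar that no candidate cleared.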
**Actionable takeaway:** Ensemble for *coverage*, not diversity. Invest compute in (a) more capable individual agents and (b) a tool-empowered judge with override authority + structural constraint gating — not in persona variation or prompt-only judges.
Result: N=4 independent ReAct agents with an o3 structural judge scored 22/40 = 55% on B_dev, +5pp over the 50% trunk baseline but below the +10pp target, at ~4x wallclock.
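To put the +5pp in context, the sketch below estimates the sampling noise of an accuracy measured on 40 questions. It is a rough one-sample check, not a full significance test: it assumes the 50% trunk baseline corresponds to 20/40 on the same dev set and treats that baseline as fixed.

```python
from math import sqrt


def binomial_se(p: float, n: int) -> float:
    """Standard error of an accuracy estimate p measured on n questions."""
    return sqrt(p * (1 - p) / n)


n = 40
baseline = 20 / n  # assumed: 50% trunk baseline on the same 40 questions
ensemble = 22 / n  # reported result
se = binomial_se(baseline, n)  # about 0.079, i.e. roughly 7.9pp
lift = ensemble - baseline     # 0.05, i.e. 2 questions
print(f"standard error: {se * 100:.1f}pp, lift: {lift * 100:.1f}pp, ratio: {lift / se:.2f}")
```

Under these assumptions the 2-question lift is smaller than one standard error, so a 40-question dev set cannot by itself distinguish a +5pp gain from noise.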
Branch: research/skill_20cycles/5/mechanism-cross-trajectory-ensembling-run-n-4-inde