skill_20cycles

browsecomp — 10h06m30s — 9 experiments, 1 merged
CWD: /Users/eri/py_code/browsecomp
Git: main @ 8d9b176
Trunk: research/skill_20cycles/trunk
Config: research_config.yaml
Test Set (Primary Metric): Baseline 32.7%, Final 50.7% (+18.0pp)
Dev Set (Iteration): Baseline 50.0%, Final 62.5% (+12.5pp)
Experiments
done: 5, merged: 1, pruned: 3
Score by node: Baseline 50.0%, Trunk 62.5%, 5.4 62.5%, 5.2 57.5%, 5 55.0%, 2 52.5%, 5.3 52.5%, 5.1 47.5%, 3 45.0%, 1 42.5%, 4 25.0%
ROOT done Optimize BrowseComp search agent accuracy. The current pipeline is ...
Optimize BrowseComp search agent accuracy. The current pipeline is a minimal ReAct loop (single_agent_gpt.py, GPTAgent + SearchTool + VisitTool, GPT_SYSTEM_PROMPT in Gizmo/prompts/system_prompt.py). The evaluator (o3) runs on bc_val (40 questions) for B_dev iteration and bc_test (300 questions) for milestone verification — re-run the baseline yourself to establish the current B_dev score, do not assume any prior number. BrowseComp questions are adversarial multi-hop queries with 4–6 simultaneous constraints; the substring-match grader in run_eval.py accepts any answer string containing the gold (or vice versa). Do NOT modify run_eval.py or the data files.
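The grading rule described above can be illustrated with a minimal sketch. This is not the code in run_eval.py: the function name `substring_match` and the strip/lowercase normalization are assumptions made for illustration only.

```python
def substring_match(prediction: str, gold: str) -> bool:
    """Hypothetical bidirectional substring check.

    Accepts when the gold string appears inside the prediction or the
    prediction appears inside the gold string. The normalization used here
    (strip + lowercase) is an assumption; the real grader may differ.
    """
    pred = prediction.strip().lower()
    ref = gold.strip().lower()
    if not pred or not ref:
        # An empty string is a substring of everything, so reject it explicitly.
        return False
    return ref in pred or pred in ref
```

Under this rule an index-style answer such as "Candidate 2" is graded wrong even if the referenced candidate is correct, because neither string contains the other.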
Insight:
# Global Research Insight: BrowseComp Agent Improvements

**The binding constraint is candidate retrieval coverage, not reasoning/judging sophistication.** Across all interventions, accuracy was bottlenecked by whether the correct entity ever entered the candidate pool; downstream reasoning or judging that is restricted to that pool cannot recover a missing entity.

**What didn't work:**
- **Prompt-only self-correction** (structured belief tables, abstain gates): the model dresses up its first commitment without revising it (42.5%).
- **Same-model adversarial falsifiers**: shared failure modes reproduce trunk errors (52.5%).
- **Type-anchored enumeration via Wikipedia lists**: long-tail targets aren't on list/category pages; empty plans waste budget (45%).
- **Question decomposition into independent sub-questions**: destroys the *conjunctive* disambiguation that makes BrowseComp answers findable (25%).
- **Persona-diversified ensembles**: stylistic constraints weaken individual agents; diversity ≠ coverage (47.5%).

**What worked:**
- **Ensembling for coverage + a judge with its own tool budget and override authority**, gated by structural constraint-PASS checks (62.5%). This breaks the "must pick one of N retrieved candidates" ceiling on ~10% of questions.

**Actionable principle:** Spend compute on (a) more capable individual agents and (b) a tool-empowered, constraint-gated judge that can *exceed* the candidate set — not on prompt scaffolding, persona diversity, decomposition, or same-model critics.
1 pruned 42.5% (-20.0) Mechanism: Restructure the agent loop around an explicit constraint...
Mechanism: Restructure the agent loop around an explicit constraint checklist + multi-candidate beam — parse the question into an enumerated list of atomic constraints up-front; maintain ≥2 candidate answers as a structured belief state; after each tool round update a per-candidate × per-constraint pass/fail/unknown table; only finalize when one candidate has all constraints passed and no surviving alternative passes equally. Hypothesis: 9/20 baseline failures are confidently-wrong entities that satisfy 2-3 constraints; forcing the agent to evaluate ALL constraints against MULTIPLE candidates structurally prevents premature commitment (the bottleneck CLASS named in Q1). Observable: B_dev ≥ +10pp, with the qualitative shift that "wrong-entity" failures convert to either correct or "no answer" rather than confident-wrong (e.g. Tony Blair, Emma Stone, Lil Mosey style errors disappear). Conflicts: none — attacks an unexplored axis (no prior node).
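The finalize gate in this mechanism can be sketched as a small data structure. This is an illustrative sketch, not the branch's implementation: the names `Status` and `finalize` and the table layout are hypothetical.

```python
from enum import Enum
from typing import Dict, List, Optional


class Status(Enum):
    PASS = "pass"
    FAIL = "fail"
    UNKNOWN = "unknown"


def finalize(table: Dict[str, List[Status]]) -> Optional[str]:
    """Return a candidate only if it is the sole one passing every constraint.

    `table` maps each candidate answer to one Status per atomic constraint
    (hypothetical layout). Returns None when no candidate, or more than one
    candidate, passes all constraints, which means "keep investigating".
    """
    fully_passing = [
        name
        for name, cells in table.items()
        if cells and all(cell is Status.PASS for cell in cells)
    ]
    return fully_passing[0] if len(fully_passing) == 1 else None
```

In the experiment the table lived only in the prompt, so nothing outside the model enforced a gate like this one.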
Insight:
A prompt-only 'structured belief state' is insufficient to overcome the model's bias toward committing to the first plausible candidate—it just dresses up the same wrong commitment in a PASS/FAIL table. The per-turn table also eats into the token budget, reducing useful tool rounds. Enforcing the gate likely requires a separate judge model.
[Pruned: Prompt-only structured belief state regressed (-7.5pp); superseded by 5.4 ensemble.]
Result: Restructured agent loop with constraint checklist + multi-candidate beam + finalize gate scored 42.5% (17/40) on B_dev, −7.5pp vs the 50% baseline; only 4/23 failures shifted to refusals while 19/23 remained confident wrong-entity answers.
Branch: research/skill_20cycles/1/mechanism-restructure-the-agent-loop-around-an-exp
2 done 52.5% (-10.0) Mechanism: Add a post-hoc adversarial falsification stage — once th...
Mechanism: Add a post-hoc adversarial falsification stage — once the main ReAct loop produces a tentative answer, spawn a "Devil's Advocate" subroutine that (a) re-reads the question's constraints, (b) explicitly searches for ALTERNATIVE candidate answers and for counter-evidence on each constraint of the current answer, (c) returns a verdict {confirm, replace-with-X, abstain}. If non-confirm, the loop re-enters investigation with the falsifier's notes injected as context. Hypothesis: The trunk's failure mode is overconfidence in the first plausible match; an adversarial pass operating with a contrary objective (find a better answer / break this one) breaks the confirmation-bias loop without trusting the same agent to second-guess itself. Observable: B_dev ≥ +10pp; specifically, cases where the baseline produced a confident wrong entity now either flip to correct or produce abstentions rather than confident-wrong predictions. Conflicts: none — attacks the verification axis with a separate-role agent, orthogonal to representation-level fixes (sibling 1).
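The control flow of the falsification stage might look roughly like the sketch below. The callables `investigate` and `falsify` stand in for the main ReAct loop and the Devil's Advocate subroutine; their signatures and the `max_reentries` cap are assumptions, not taken from the branch.

```python
from typing import Callable, Optional, Tuple

# (verdict, detail): verdict is "confirm", "replace", or "abstain";
# detail carries the falsifier's notes or proposed replacement.
Verdict = Tuple[str, Optional[str]]


def run_with_falsifier(
    question: str,
    investigate: Callable[[str, str], str],
    falsify: Callable[[str, str], Verdict],
    max_reentries: int = 2,  # assumed cap; the source does not state one
) -> str:
    """Run the main loop, then re-enter it whenever the falsifier objects."""
    notes = ""
    answer = investigate(question, notes)
    for _ in range(max_reentries):
        verdict, detail = falsify(question, answer)
        if verdict == "confirm":
            return answer
        # Non-confirm: inject the falsifier's notes and investigate again.
        notes = f"Falsifier verdict: {verdict}; notes: {detail}"
        answer = investigate(question, notes)
    return answer
```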
Insight:
An adversarial falsifier using the same model and tools mostly confirms the original answer, since shared failure modes (unreachable evidence, same constraint-checking) reproduce the trunk's conclusions. Breaking confirmation bias likely requires a different model/tool budget or a stricter abstain-default policy.
Result: Falsifier stage achieved 52.5% (21/40) vs 50.0% baseline on B_dev — only +2.5pp, well below the +10pp target, with just one net-positive flip.
Branch: research/skill_20cycles/2/mechanism-add-a-post-hoc-adversarial-falsification
3 pruned 45.0% (-17.5) Mechanism: Add a "type-anchored enumeration" retrieval stage — at t...
Mechanism: Add a "type-anchored enumeration" retrieval stage — at the start of each task, an enumerator LLM call extracts (entity_type, ≤2 hard discriminative constraints) from the question and issues list-style queries ("list of <type> <hard-filter>", "site:en.wikipedia.org intitle:list", category-page lookups). The returned page is parsed into a candidate-name population (≥10 names when possible) which is injected into the main ReAct agent's context as "Candidate population: [...]". The agent is instructed to sieve candidates against ALL constraints rather than guess. Hypothesis: ~half of baseline failures are long-tail entities (Sankomota, Ellison Bay, Morris Museum, Air City, Clare Wigfall, Ross Andel) that conjunctive keyword search never surfaces. Pre-fetching a candidate POPULATION via list/index pages converts the problem from open-ended search to closed-set filtering, where the agent's verification ability is sufficient. Observable: B_dev ≥ +10pp; specifically, several gold answers from the obscure-entity wrong list (Sankomota, Ellison Bay, Morris Museum, Air City) flip to correct because they now appear in the candidate set. Conflicts: none — both prior siblings (1,2) attacked the verification axis; this attacks an unexplored retrieval-action-space axis. Counters root-insight requirement that interventions introduce asymmetry: the enumerator uses different query patterns than the trunk's reformulation queries.
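A minimal sketch of the enumerator's plan parsing and query construction follows. The JSON field names (`entity_type`, `constraints`) and both function names are hypothetical; only the query patterns come from the description above.

```python
import json
from typing import List, Optional, Tuple


def parse_plan(raw: Optional[str]) -> Optional[Tuple[str, List[str]]]:
    """Parse the enumerator's JSON plan; return None for empty or invalid plans."""
    try:
        plan = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return None
    if not isinstance(plan, dict) or not plan.get("entity_type"):
        return None  # an empty plan means this question skips enumeration
    constraints = list(plan.get("constraints") or [])
    return plan["entity_type"], constraints[:2]  # at most two hard constraints


def build_list_queries(entity_type: str, hard_filters: List[str]) -> List[str]:
    """Build list-style queries from the type and up to two hard filters."""
    filters = " ".join(hard_filters[:2])
    return [
        f"list of {entity_type} {filters}".strip(),
        f"site:en.wikipedia.org intitle:list {entity_type} {filters}".strip(),
    ]
```

Returning None for an empty plan makes the skipped-retrieval case explicit, so it can be counted rather than silently falling through.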
Insight:
Type-anchored enumeration fails when (1) Wikipedia list/category pages don't actually contain the long-tail targets, and (2) o3 via Chat Completions frequently returns empty JSON plans, so half of questions skip retrieval entirely. On the questions where the enumerator did fire, it added latency and tokens without recovering the targeted entities, which reduced accuracy.
[Pruned: Type-anchored enumeration regressed (-5pp); superseded by 5.4.]
Result: Enumerator produced candidate populations for only 12/40 questions; none contained the targeted obscure entities (Sankomota, Ellison Bay, Morris Museum, Air City, etc.). B_dev accuracy dropped from 50% (20/40) baseline to 45% (18/40).
Branch: research/skill_20cycles/3/mechanism-add-a-type-anchored-enumeration-retrieva
4 pruned 25.0% (-37.5) Mechanism: Decompose-and-intersect orchestration — a planner LLM ca...
Mechanism: Decompose-and-intersect orchestration — a planner LLM call splits the question into K independent atomic sub-questions, each of which has a SET-valued answer (e.g. "actresses born in cities whose flag has a legendary creature", "stars of late-2010s American TV series"). Spawn K parallel sub-agents, EACH with its own fresh context (no shared trajectory => no shared confirmation bias). Each sub-agent returns a list of ≥5 candidate answers. The orchestrator computes the set intersection; the singleton (or smallest) survivor is the final answer. If empty, relax the lowest-confidence sub-question and retry. Hypothesis: BrowseComp questions are conjunctions of atomic constraints; running them as ONE narrative biases the agent toward the first hypothesis that satisfies a few constraints. Solving each constraint INDEPENDENTLY in isolated contexts produces uncorrelated candidate lists; the intersection is forced to be the answer (analogy: SAT solver, constraint propagation). Observable: B_dev ≥ +10pp; qualitative shift = wrong-entity errors drop because no single sub-agent ever sees enough context to "rationalize" a wrong candidate, and intersection rejects partial matches. Conflicts: none on the orchestration axis. Counters node 2 (single sequential falsifier shared model bias) by enforcing cross-context isolation as the source of asymmetry root-insight requires; counters node 1 (in-context table = same model) by moving the structure into pipeline orchestration rather than prompting.
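The intersect-and-relax step can be sketched as below. Treating "relax" as dropping the lowest-confidence sub-question entirely is an assumption, as are the function name and the parallel-list inputs.

```python
from typing import List, Set


def intersect_with_relaxation(
    candidate_sets: List[Set[str]], confidences: List[float]
) -> Set[str]:
    """Intersect per-sub-question candidate sets, relaxing until non-empty.

    `confidences[i]` is the (hypothetical) confidence in sub-question i.
    When the intersection is empty, the lowest-confidence sub-question is
    dropped and the intersection is retried.
    """
    # Indices ordered from lowest to highest confidence.
    active = sorted(range(len(candidate_sets)), key=lambda i: confidences[i])
    while active:
        survivors = set.intersection(*(candidate_sets[i] for i in active))
        if survivors:
            return survivors
        active = active[1:]  # relax: drop the least trusted sub-question
    return set()
```

When each sub-question matches a very large population and each sub-agent returns only a handful of names, the first intersection is usually empty and the relaxation loop discards constraints until little disambiguating signal is left.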
Insight:
Decomposing BrowseComp questions into independent set-valued sub-questions destroys the conjunctive disambiguation signal: individual constraints are often non-enumerable and match thousands of entities, so intersections are empty and the verifier picks wrong candidates. Cross-context isolation was achieved but harmful — joint constraints are what makes the answer findable.
[Pruned: Decompose-and-intersect destroyed conjunctive signal (-25pp); superseded by 5.4.]
Result: Decompose-and-intersect orchestration scored 25% (10/40) on B_dev, halving the vanilla ReAct trunk baseline of 50% (20/40).
Branch: research/skill_20cycles/4/mechanism-decompose-and-intersect-orchestration-a
5 done 55.0% (-7.5) Mechanism: Cross-trajectory ensembling — run N=4 INDEPENDENT main R...
Mechanism: Cross-trajectory ensembling — run N=4 INDEPENDENT main ReAct agents in parallel on the same question (fresh context, fresh trajectory each), then a JUDGE LLM stage receives all N (answer, evidence-summary) tuples and selects the final answer using two structural signals: (a) cross-run AGREEMENT — answers that multiple agents converged to from different trajectories are weighted up; (b) constraint coverage — evidence cited explicitly satisfies more of the question's atomic constraints. The judge is required to either pick one of the N answers OR pick "abstain → re-run with cross-run notes" (NOT free-form invent). Hypothesis: Self-correction inside one trajectory shares the model's confirmation bias (root insight). N independent trajectories produce uncorrelated answer distributions; even when individual accuracy is 50%, agreement-of-2-out-of-4 strongly indicates correctness, and disagreement gives the judge a real signal. This converts the bias problem into an aggregation problem, which is solvable with a structural (not free-form) judge. Observable: B_dev ≥ +10pp; specifically, on questions where any individual run got the answer right, the ensemble should preserve it; the lift comes from N=4 sampling at least one correct trajectory more often than the trunk's single shot. Conflicts: Counters node 2 (sequential same-context falsifier) by replacing self-critique with population-level aggregation — the asymmetry root-insight demanded comes from N independent contexts, not from a single second pass. Counters node 4 (decompose-and-intersect destroyed conjunctive signal) by keeping the FULL question intact in every sub-run — only aggregation differs.
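One way to picture the structural judge is as a scoring rule over the N (answer, evidence) tuples. In the experiment the judge is an LLM stage, so the sketch below is only an analogy: the additive score, the weights, the abstain threshold, and the name `structural_judge` are all assumptions.

```python
from collections import defaultdict
from typing import List, Optional, Tuple


def structural_judge(
    runs: List[Tuple[str, int]],
    total_constraints: int,
    agreement_weight: float = 1.0,  # assumed weighting
    coverage_weight: float = 1.0,   # assumed weighting
    abstain_below: float = 1.0,     # assumed threshold
) -> Optional[str]:
    """Pick among existing answers by agreement and constraint coverage.

    `runs` holds (answer, number_of_constraints_covered) per agent.
    Returns the normalized winning answer, or None to signal abstain/re-run.
    The judge never invents an answer outside `runs`.
    """
    votes = defaultdict(int)
    best_coverage = defaultdict(float)
    for answer, covered in runs:
        key = answer.strip().lower()
        votes[key] += 1
        best_coverage[key] = max(best_coverage[key], covered / total_constraints)
    if not votes:
        return None

    def score(key: str) -> float:
        agreement = votes[key] / len(runs)
        return agreement_weight * agreement + coverage_weight * best_coverage[key]

    best = max(votes, key=score)
    return best if score(best) >= abstain_below else None
```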
Insight:
**Synthesis (Node 5: Cross-trajectory ensembling)**

Independent N=4 ensembling with a structural judge works, but the binding constraint is **candidate-set retrieval coverage**, not judge sophistication or trajectory diversity:

1. **Diversity via personas backfires (5.1):** Style-constrained agents waste budget on ill-fitting actions, weakening every candidate. BrowseComp's ceiling is tool-reachability, not action-distribution variety — keep agents capable, not stylistically varied.

2. **Summary→evidence judge upgrade gives only marginal lift (5.2, 5.3):** When all 4 runs miss the entity, no amount of judge verification recovers it. Run-to-run variance (~±5pp) swamps small parser/prompt fixes on 40-Q dev.

3. **Judge override with its own tool budget is the real unlock (5.4, 62.5%, +12.5pp over the 50% baseline):** Allowing the judge to *exceed* the candidate set, gated by constraint-PASS checks, fired on ~10% of questions and broke the correlated-failure ceiling. This validates that "must pick one of N" was the true bottleneck.

**Actionable takeaway:** Ensemble for *coverage*, not diversity. Invest compute in (a) more capable individual agents and (b) a tool-empowered judge with override authority + structural constraint gating — not in persona variation or prompt-only judges.
Result: N=4 independent ReAct agents with an o3 structural judge scored 22/40 = 55% on B_dev, +5pp over the 50% trunk baseline but below the +10pp target, at ~4x wallclock.
Branch: research/skill_20cycles/5/mechanism-cross-trajectory-ensembling-run-n-4-inde
5.1 done 47.5% (-15.0) Mechanism: Heterogeneous role-diversified ensemble — extend node 5'...
Mechanism: Heterogeneous role-diversified ensemble — extend node 5's N-agent ensemble by giving each agent a DIFFERENT system-prompt persona that biases its action distribution toward a complementary search strategy. Concretely 4 personas: (R1) "Type-anchored enumerator" — first 3 turns must be list/category queries (Wikipedia lists, "list of X", site:en.wikipedia.org intitle:list); (R2) "Contradiction hunter" — every search MUST include negative/exclusionary terms targeting the question's most distinctive constraint; (R3) "Document spelunker" — prefer visit over search, exhaust each promising URL with deep reads + reference-following; (R4) "Lateral re-phraser" — required to issue queries in ≥2 languages and via paraphrases that drop the most popular keyword. Same judge stage as node 5 receives all 4 (answer, evidence) tuples. Hypothesis: Node 5's +5pp ceiling is set by correlated cross-run failures: identical persona → identical search distribution → all 4 trajectories starve on the same long-tail evidence. Forcing structurally different action distributions raises the union-coverage probability; even if individual personas are slightly weaker, P(at least one persona surfaces the gold entity) rises sharply because their failure sets are uncorrelated by construction. Observable: B_dev ≥ +10pp over baseline (≥60%); qualitatively, gold entities currently unreachable for all 4 homogeneous runs (Sankomota, Ellison Bay, Morris Museum) appear in at least one persona's evidence trace. Conflicts: Counters node 4 (decompose-and-intersect destroyed conjunctive signal) by keeping the FULL question intact in every persona — only the SEARCH POLICY is diversified, not the question. Counters node 3 (single-shot enumerator failed because Wikipedia coverage was thin) by combining the enumerator strategy with 3 alternative strategies so its failures are recoverable. Counters node 5's correlation ceiling by introducing structural diversity rather than relying on temperature-driven variance.
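The union-coverage argument in this hypothesis rests on an independence assumption, which a short calculation makes explicit. The function name `p_any_hit` and the probabilities used in the usage example are illustrative, not measured values.

```python
from math import prod
from typing import Iterable


def p_any_hit(per_agent_hit_probs: Iterable[float]) -> float:
    """P(at least one agent surfaces the gold entity), assuming independence.

    Computed as 1 - product(1 - p_i). If agent failures are correlated,
    the true value is lower than this estimate.
    """
    return 1.0 - prod(1.0 - p for p in per_agent_hit_probs)


# Four independent agents at 50% each vs. four weaker agents at 40% each.
print(round(p_any_hit([0.5] * 4), 4))  # 0.9375
print(round(p_any_hit([0.4] * 4), 4))  # 0.8704
```

The estimate is only as good as the independence assumption: it says nothing about whether a judge can identify the one correct candidate, and it overstates coverage when all agents fail on the same unreachable evidence.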
Insight:
Diversifying ensemble agents via persona-specific search policies hurt rather than helped: constrained personas wasted budget on style-mandated actions ill-suited to many questions, producing individually weaker candidates that the judge couldn't reliably filter. Confirms that BrowseComp's ceiling is retrieval-tool reachability, not action-distribution diversity.
Result: Heterogeneous 4-persona ensemble scored 47.5% (19/40), below the 50% trunk baseline and 7.5pp below node-5's homogeneous 55% ensemble.
Branch: research/skill_20cycles/5.1/mechanism-heterogeneous-role-diversified-ensemble
5.2 done 57.5% (-5.0) Mechanism: Evidence-grounded judge with constraint cross-check — ke...
Mechanism: Evidence-grounded judge with constraint cross-check — keep node 5's homogeneous N=4 ensemble, but upgrade the judge stage from "summary-only" to "evidence-grounded with verification budget". The judge: (1) builds the atomic-constraint checklist, (2) for EACH candidate answer it has access to that run's full tool-call trace AND can issue its own additional search/visit calls (capped at 8) to verify any UNKNOWN cells; (3) selects only the candidate with strictly highest verified-PASS count; ties → most-supported by independent verification, never by run-count plurality alone. Hypothesis: Node 5's +5pp ceiling persists because the judge sees only short answer-summaries and falls back to plurality voting — when 3/4 runs converge on a confident-wrong entity (correlated failure), plurality picks wrong. Giving the judge its own verification tool budget breaks plurality dominance: it can independently check whether the disagreeing 4th candidate actually satisfies the rare/distinctive constraint that the plurality fails on. Observable: B_dev ≥ +10pp over baseline (≥60%); qualitatively, cases where the correct answer was the MINORITY vote among the 4 runs (i.e. only 1 run got it right, but it was right) now flip from wrong→correct because the judge can independently verify constraints that distinguish them. Conflicts: Sibling 5.1 attacks correlation by diversifying agent SEARCH POLICIES; this attacks the same root-insight-mandated asymmetry by giving the JUDGE asymmetric capability (its own tool budget, evidence access). Different lever on the same problem; both could be combined later.
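The selection rule and the capped verification budget can be sketched as follows. The dictionary keys, the `check` callable, and both function names are hypothetical; only the cap of 8 calls and the ordering rule (verified-PASS count first, independent support as tie-break, run count ignored) come from the description above.

```python
from typing import Callable, Dict, List

MAX_JUDGE_TOOL_CALLS = 8  # cap stated in the mechanism


def verify_unknowns(
    unknown_cells: List[str],
    check: Callable[[str], bool],
    budget: int = MAX_JUDGE_TOOL_CALLS,
) -> Dict[str, bool]:
    """Spend at most `budget` tool calls resolving UNKNOWN constraint cells."""
    return {cell: check(cell) for cell in unknown_cells[:budget]}


def pick_candidate(candidates: List[dict]) -> str:
    """Choose by verified-PASS count, breaking ties by independent support.

    Each candidate is a dict with hypothetical keys 'answer',
    'verified_pass', and 'independent_support'. Run-count plurality is
    deliberately not part of the sort key.
    """
    best = max(
        candidates,
        key=lambda c: (c["verified_pass"], c["independent_support"]),
    )
    return best["answer"]
```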
Insight:
Evidence-grounded judge with its own tool budget gives only a small lift over summary-judge ensemble; the dominant ceiling is candidate-set retrieval (when all 4 runs miss the entity, no judge verification can recover it). A parsing bug where the judge outputs 'Candidate N' literally also costs several percent.
Result: 57.5% (23/40) on B_dev, +2.5pp over node 5 baseline (55%) and +7.5pp over trunk (50%), missing the ≥60% target.
Branch: research/skill_20cycles/5.2/mechanism-evidence-grounded-judge-with-constraint
5.3 done 52.5% (-10.0) Mechanism: Fix the judge-output extraction bug in node 5.2 — when t...
Mechanism: Fix the judge-output extraction bug in node 5.2 — when the judge agent emits "FINAL_ANSWER: Candidate N" instead of the verbatim text, parse with regex `Candidate\s*(\d+)` and look up that run's prediction. Also tighten JUDGE_AGENT_PROMPT with an explicit example: "WRONG: 'FINAL_ANSWER: Candidate 2' / RIGHT: 'FINAL_ANSWER: Sankomota'". Keep the rest of 5.2 (evidence-grounded judge with 8-tool budget) unchanged. Hypothesis: 5.2's reported 57.5% score is artificially depressed by an output-format slip — the judge often selected the right candidate but referred to it by index, which then failed substring grading. Fixing the parser converts those silent successes into recorded successes (subagent estimated "several percent" lift), pushing 5.2 over the +10pp merge threshold. Observable: B_dev ≥ 60% (≥+10pp); failure-detail diff vs 5.2 should show questions previously graded wrong with prediction="Candidate N" now graded correct with prediction=the actual candidate. Conflicts: none — directly fixes a defect in sibling 5.2 without changing the mechanism. Validated by 5.2's own failure analysis.
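The extraction fix can be sketched as below. The regex `Candidate\s*(\d+)` is the one named in the mechanism; the function name, the `FINAL_ANSWER:` line parsing, the 1-based indexing, and the requirement that the answer consist solely of the index reference are assumptions.

```python
import re
from typing import List, Optional

_CANDIDATE_RE = re.compile(r"Candidate\s*(\d+)")


def extract_final_answer(judge_text: str, predictions: List[str]) -> Optional[str]:
    """Extract the judge's answer, resolving 'Candidate N' to that run's text.

    `predictions[i]` is the answer produced by run i+1 (1-based indexing is
    assumed). Returns None when the judge text has no FINAL_ANSWER line.
    """
    match = re.search(r"FINAL_ANSWER:\s*(.+)", judge_text)
    if not match:
        return None
    answer = match.group(1).strip()
    by_index = _CANDIDATE_RE.fullmatch(answer)
    if by_index:
        i = int(by_index.group(1)) - 1
        if 0 <= i < len(predictions):
            return predictions[i]
    return answer  # already verbatim, or an index we cannot resolve
```

Empty judge output still yields None here, so a parser of this shape cannot help in cases where the judge emitted no text at all.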
Insight:
The parser fix was correctly applied but rarely triggered on this 40-question dev set — the judge mostly emitted verbatim text already, and the two extraction failures were empty-judge-text cases the fix can't address. Run-to-run ensemble variance (~±5pp) dominates any small lift from the fix.
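A rough yardstick for how noisy a 40-question dev set is comes from the binomial standard error. This is a sketch under the assumption that questions are independent draws; it measures sampling noise, which is related to but not the same as the run-to-run variance reported above.

```python
from math import sqrt


def accuracy_std_error(p: float, n: int) -> float:
    """Binomial standard error of an accuracy estimate p measured on n questions."""
    return sqrt(p * (1.0 - p) / n)


def points_per_question(n: int) -> float:
    """Percentage points that a single question is worth on an n-question set."""
    return 100.0 / n


print(round(accuracy_std_error(0.55, 40) * 100, 1))  # 7.9 (percentage points)
print(points_per_question(40))                       # 2.5
```

On this scale the gap between 5.2 (23/40) and 5.3 (21/40) is two questions, well inside one standard error.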
Result: Achieved 52.5% (21/40) on dev, below node 5.2's reported 57.5% and well below the +10pp merge threshold of 60%.
Branch: research/skill_20cycles/5.3/mechanism-fix-the-judge-output-extraction-bug-in-n
5.4 merged 62.5% (0.0) Mechanism: Judge-with-override — extend 5.2's evidence-grounded jud...
Mechanism: Judge-with-override — extend 5.2's evidence-grounded judge so that, after independently verifying constraints for the N=4 candidates, if NO candidate has all constraints PASS the judge is permitted to keep searching/visiting (extended budget = 20 tool rounds) and emit a NEW candidate answer it discovered itself. The judge-agent is given explicit "fresh answer" authority with the requirement that its own answer must show ALL atomic constraints PASS in its trajectory before being emitted; otherwise it falls back to the highest-coverage existing candidate. Apply the verbatim-extraction fix from 5.3. Hypothesis: The current ensemble ceiling is set by the rule "judge must pick from N=4 candidates" (root insight: when all 4 runs miss the long-tail entity, no aggregation recovers). Allowing the judge to act as a fifth agent — armed with the question + the 4 failed candidates' constraint-failure analysis, which is asymmetric information none of the original agents had — opens the action space exactly when it's needed (correlated-failure cases). This breaks the retrieval ceiling without adding a free-form invention path. Observable: B_dev ≥ 60% (≥+10pp); qualitatively, on questions where 5/5.2 returned a confident-wrong entity (e.g. Sankomota, Ellison Bay, Morris Museum cases), the judge's override should emit the gold via its own search. Override-rate trace metric: how many questions where the judge produced a non-candidate answer. Conflicts: Counters root insight's "candidate-set retrieval ceiling" claim — by adding the judge's OWN tool budget and FREEDOM to emit non-candidate answers, we directly attack the retrieval ceiling (rather than working around it). Differs from 5.2: 5.2 forced "must pick from candidates"; this lifts that constraint precisely when it bites.
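The override gate can be sketched as a three-step decision. The function name, the tuple layout, and the `search_for_new` callable (standing in for the judge's own extended search) are hypothetical; the 20-round budget, the all-constraints-PASS requirement, and the fallback to the highest-coverage candidate follow the description above.

```python
from typing import Callable, List, Optional, Tuple

Candidate = Tuple[str, List[bool]]  # (answer, per-constraint PASS flags)


def judge_with_override(
    candidates: List[Candidate],
    search_for_new: Callable[[int], Optional[Candidate]],
    max_rounds: int = 20,
) -> Tuple[Optional[str], bool]:
    """Return (answer, override_fired)."""
    # 1. Prefer an existing candidate that passes every constraint.
    for answer, passes in candidates:
        if passes and all(passes):
            return answer, False
    # 2. Otherwise let the judge search; accept only a fully passing answer.
    fresh = search_for_new(max_rounds)
    if fresh is not None:
        answer, passes = fresh
        if passes and all(passes):
            return answer, True
    # 3. Fall back to the existing candidate with the highest coverage.
    if not candidates:
        return None, False
    best = max(candidates, key=lambda c: sum(c[1]))
    return best[0], False
```

Counting how often the second element is True gives the override-rate trace metric mentioned in the observable.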
Insight:
Granting the judge override authority with its own tool budget and constraint-PASS gating breaks the candidate-set retrieval ceiling on correlated-failure cases. The override fired on 4/40 questions (10%) and B_dev rose to 62.5%, +5pp over 5.2, supporting the hypothesis that 'must pick from candidates' was the binding constraint.
Result: B_dev accuracy reached 62.5% (25/40), +12.5pp over trunk baseline (50%) and +5pp over 5.2 (57.5%), meeting the ≥60% target. Judge override fired on 4/40 questions (10%).
Branch: research/skill_20cycles/5.4/mechanism-judge-with-override-extend-5-2-s-evidenc
| # | Node | Score | Delta | Status | Hypothesis | Insight |
|---|------|-------|-------|--------|------------|---------|
| 1 | 5.4 | 62.5% | 0.0 | merged | Mechanism: Judge-with-override — extend 5.2's evidence-gr... | Granting the judge override authority with its own tool b... |
| 2 | 5.2 | 57.5% | -5.0 | done | Mechanism: Evidence-grounded judge with constraint cross-... | Evidence-grounded judge with its own tool budget gives on... |
| 3 | 5 | 55.0% | -7.5 | done | Mechanism: Cross-trajectory ensembling — run N=4 INDEPEND... | **Synthesis (Node 5: Cross-trajectory ensembling)** Inde... |
| 4 | 2 | 52.5% | -10.0 | done | Mechanism: Add a post-hoc adversarial falsification stage... | An adversarial falsifier using the same model and tools m... |
| 5 | 5.3 | 52.5% | -10.0 | done | Mechanism: Fix the judge-output extraction bug in node 5.... | The parser fix was correctly applied but rarely triggered... |
| 6 | 5.1 | 47.5% | -15.0 | done | Mechanism: Heterogeneous role-diversified ensemble — exte... | Diversifying ensemble agents via persona-specific search ... |
| 7 | 3 | 45.0% | -17.5 | pruned | Mechanism: Add a "type-anchored enumeration" retrieval st... | Type-anchored enumeration fails when (1) Wikipedia list/c... |
| 8 | 1 | 42.5% | -20.0 | pruned | Mechanism: Restructure the agent loop around an explicit ... | A prompt-only 'structured belief state' is insufficient t... |
| 9 | 4 | 25.0% | -37.5 | pruned | Mechanism: Decompose-and-intersect orchestration — a plan... | Decomposing BrowseComp questions into independent set-val... |