Optimize terminal-bench-2 agent performance. The current baseline achieves 58.33% on dev (21/36), 69.81% on held-out test (37/53), and 65.17% on all 89 tasks using GPT-5.5. Treat these as distinct metrics: B_dev is data/dev.json, B_test is data/test.json, and the all-89 score is diagnostic only. Never use the all-89 score (65.17%) as the B_test baseline/trunk score.
The agent is a ReAct-style terminal agent (terminus-2) that solves command-line tasks inside Docker containers. Tasks span 16 categories (software-engineering, security, scientific-computing, data-science, games, debugging, etc.) at easy/medium/hard difficulty.
## What you CAN modify (optimization levers)
- agent/templates/system_prompt.txt: the system prompt controlling the
agent's reasoning, planning, and command generation. This is the PRIMARY
lever. It MUST preserve the {instruction} and {terminal_state} placeholders.
- prompts/extra_instruction.txt: appended to each task instruction.
SECONDARY lever (currently empty).
- agent/optimized_terminus.py: the agent subclass. You can add
preprocessing, postprocessing, or override methods.
- agent/terminus_2.py: the base agent implementation. You can modify
the ReAct loop, parsing, retry logic, etc.
- agent/terminus_json_plain_parser.py, agent/terminus_xml_plain_parser.py:
response parsers.
- agent/tmux_session.py: terminal session management.
- Any new files you want to create under agent/ or prompts/.
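Because the system prompt must keep both placeholders, a quick check before each eval can catch an accidental deletion. The following is a minimal sketch; the helper name `missing_placeholders` is hypothetical and not part of the repository.

```python
# Placeholders that the brief says the system prompt MUST preserve.
REQUIRED_PLACEHOLDERS = ("{instruction}", "{terminal_state}")


def missing_placeholders(template_text: str) -> list[str]:
    """Return the required placeholders that do not appear in the template text."""
    return [p for p in REQUIRED_PLACEHOLDERS if p not in template_text]
```

One way to use it is to read agent/templates/system_prompt.txt after every edit and refuse to launch an eval if the returned list is non-empty.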
## What you MUST NOT modify
- run_eval.py: the evaluation harness (read-only)
- data/*.json: task lists and metadata (read-only)
- .env: API keys and environment config (read-only)
- .research_baseline.json: baseline reference (read-only)
## Evaluation
- Dev evaluation (use for experiment iterations):
HARBOR_N_CONCURRENT=8 python3 run_eval.py --data data/dev.json --run-name <descriptive_name> --workers 8
- Test verification (use to validate promising results on held-out set):
HARBOR_N_CONCURRENT=8 python3 run_eval.py --data data/test.json --run-name <descriptive_name>_test --workers 8
- Baseline reference values:
B_dev baseline = 58.33% (21/36)
B_test baseline = 69.81% (37/53)
all-89 baseline = 65.17% (diagnostic only; do not use as B_test baseline)
- Output format: last stdout line is JSON with {"accuracy": float, "correct": int, "total": int, ...}
- Always iterate on dev first. Only run test to verify after a meaningful dev improvement.
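The output format above can be consumed with a few lines of Python. This is a sketch under assumptions: the function name `parse_eval_summary` is hypothetical, and skipping trailing blank lines is a defensive choice rather than documented harness behavior.

```python
import json


def parse_eval_summary(stdout: str) -> dict:
    """Parse the JSON summary from the last non-empty line of run_eval.py stdout."""
    lines = [line for line in stdout.splitlines() if line.strip()]
    if not lines:
        raise ValueError("eval produced no output")
    summary = json.loads(lines[-1])
    # The brief names these three keys; any other keys are passed through untouched.
    for key in ("accuracy", "correct", "total"):
        if key not in summary:
            raise ValueError(f"summary is missing {key!r}")
    return summary
```

Comparing `correct` and `total` directly avoids relying on whether `accuracy` is reported as a fraction or a percentage, which the brief does not specify.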
## Docker / evaluation concurrency discipline
- Do NOT overlap full dev/test evaluations. At most one `python3 run_eval.py --data data/dev.json`
or `python3 run_eval.py --data data/test.json` process may run at a time across all sub-agents.
- Use tiny smoke tests first.
- Use HARBOR_N_CONCURRENT=8 for full dev/test runs. Do not use 16+ unless explicitly instructed.
- After any interrupted or failed Harbor run, clean up its Docker compose containers/networks before
launching another full eval. Docker's default bridge address pools are limited and overlapping
compose jobs can fail with "could not find an available, non-overlapping IPv4 address pool".
- Prefer one sub-agent experiment at a time for any idea that may run the dev split. Do not dispatch
multiple eval-running ideas via RunSubagentParallel.
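One possible way to enforce the one-full-eval-at-a-time rule across sub-agents on the same host is an advisory file lock. The sketch below is an assumption, not something the harness provides: `exclusive_eval_lock` and the lock path are hypothetical, and `fcntl.flock` is available on Unix-like systems only.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def exclusive_eval_lock(lock_path: str):
    """Hold an exclusive, non-blocking advisory lock while one full eval runs.

    Raises BlockingIOError immediately if another holder already has the lock.
    """
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        # LOCK_NB makes the call fail fast instead of queueing a second eval.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        try:
            yield
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)
```

A sub-agent would wrap its full dev or test launch in `with exclusive_eval_lock(...)` and treat `BlockingIOError` as a signal to wait or to run only a smoke test. The lock is advisory, so it only helps if every eval launcher uses it.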
## Evaluation discipline / early stopping
- Terminal-bench evaluations can be slow because a few long-tail tasks may run until timeout.
- While an eval is running, periodically inspect results/<run-name>/result.json (or the matching
worktree results directory) to monitor n_completed_trials, n_running_trials, n_pending_trials,
current reward counts, and errors.
- Estimate an optimistic upper bound before waiting for long-tail tasks:
max_possible_correct = current_correct + n_running_trials + n_pending_trials
max_possible_accuracy = max_possible_correct / n_total_trials
- If max_possible_accuracy cannot beat the current trunk/best dev score, stop waiting for the
remaining long-tail tasks. Terminate the eval/harbor process, clean up its Docker compose
containers/networks, record the partial evidence, and prune the idea.
- Only wait for long-tail tasks when the candidate can still plausibly beat the current trunk/best
score or when the partial results are needed to diagnose a promising failure mode.
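The upper-bound rule above can be written as a small helper. This is a sketch: the function names are hypothetical, and the caller is assumed to have already extracted the counts from result.json, whose exact layout is not described here. The stop decision compares task counts rather than rounded percentages so that a tie with the trunk is detected exactly.

```python
def max_possible_accuracy(current_correct: int, n_running: int,
                          n_pending: int, n_total: int) -> float:
    """Optimistic bound: assume every running and pending trial ends up correct."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return (current_correct + n_running + n_pending) / n_total


def should_stop_early(current_correct: int, n_running: int,
                      n_pending: int, best_correct: int) -> bool:
    """True when even the optimistic bound cannot beat the best score on this split."""
    max_possible_correct = current_correct + n_running + n_pending
    # A tie does not beat the trunk, so "<=" means waiting cannot help.
    return max_possible_correct <= best_correct
```

For example, against the B_dev baseline of 21/36, a run with 15 correct, 3 running, and 2 pending can reach at most 20 correct, so `should_stop_early(15, 3, 2, 21)` returns `True`.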
## Case-first failure analysis
- Before proposing or running a full-dev experiment, inspect concrete cases and error causes.
Read representative per-task artifacts from prior runs: metrics.json, eval_details.json,
result.json, trial.log, verifier output, and agent trajectory/debug logs when available.
- Each experiment proposal must name the exact failed task(s) or exception class it targets,
the observed root cause, and the mechanism by which the proposed change should fix it.
- Do not repeat broad prompt rewrites or generic verification rules unless the case analysis
shows a specific failure trace they address. Prefer narrow fixes tied to observed evidence.
## Smoke-test protocol
- Run smoke tests before spending a full dev run. For each candidate, first run a small
targeted smoke eval on 2-5 tasks with HARBOR_N_CONCURRENT=2 or 3.
- A smoke set should include at least one task the change is expected to fix and one regression
guard task that previously passed. Record pass/fail, exception type, and the key log snippet.
- Only launch full B_dev after the smoke result supports the hypothesis or after explaining why
a smoke test cannot exercise the targeted failure mode. Only launch B_test after meaningful
B_dev improvement or for fixes known to affect test-only infrastructure failures.
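The smoke-set requirements above (2-5 tasks, at least one expected fix, at least one regression guard, and a recorded outcome per task) can be checked mechanically. The sketch below is illustrative only: `SmokeResult`, `smoke_set_problems`, and the task names in the usage note are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SmokeResult:
    """What the protocol asks to record for each smoke task."""
    task: str
    passed: bool
    exception_type: Optional[str]  # None when the trial raised no exception
    log_snippet: str


def smoke_set_problems(target_tasks: list[str], guard_tasks: list[str]) -> list[str]:
    """Return the reasons a smoke set violates the protocol (empty list if valid)."""
    problems = []
    if not target_tasks:
        problems.append("no task the change is expected to fix")
    if not guard_tasks:
        problems.append("no regression guard task")
    n_tasks = len(set(target_tasks) | set(guard_tasks))
    if not 2 <= n_tasks <= 5:
        problems.append(f"smoke set has {n_tasks} tasks, expected 2-5")
    return problems
```

For instance, `smoke_set_problems(["task-a"], ["task-b"])` returns an empty list, while a set with no guard task is rejected before any containers are started.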
## Hints
- The system prompt is ~100 lines of JSON-structured ReAct instructions.
Consider: better reasoning strategies, error recovery, task decomposition,
domain-specific heuristics, output formatting, etc.
- The agent interacts via tmux keystrokes; be careful with command timing.
- Look at failed tasks to understand failure modes (wrong approach, parsing
errors, timeouts, etc.); results are saved in results/<run-name>/.
- You can also modify the agent code itself (terminus_2.py, parsers, etc.)
to improve robustness, not just the prompt.
Insight:
**Global Research Insights**
Two complementary strategies emerge for stacking gains without regression risk on a divergent dev/test split:
**1. Dormant infra-robustness patches (highest confidence).** Wrap known failure points (e.g., Azure ContentPolicyViolation in `_query_llm`) with sanitize-then-retry logic whose activation conditions don't occur on dev. Net effect: zero dev risk, +1.89pp on B_test by recovering ~4 crashed tasks. Mine test-side error traces (content filters, retry exhaustion, truncation boundaries) for more such patches; target *triggers* (e.g., long-tail observation bytes) not symptoms.
**2. Mechanism-grounded execution tips (moderate confidence).** Prompt additions that fix concrete execution bottlenecks transfer; cognitive checklists don't. Background-execution for >30s commands gave +5.55pp dev with no regressions; an output-validation menu cost −5.55pp by consuming turns on non-bottleneck behavior.
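The dormant-patch idea in strategy 1 can be sketched as a wrapper that behaves exactly like the original call unless the targeted error fires. Everything named here is an assumption for illustration: `ContentPolicyError` stands in for whatever exception the provider client actually raises, and `query` and `sanitize` are hypothetical callables, not the real `_query_llm` signature.

```python
from typing import Callable


class ContentPolicyError(Exception):
    """Hypothetical stand-in for the provider's content-filter exception."""


def query_with_sanitize_retry(query: Callable[[str], str], prompt: str,
                              sanitize: Callable[[str], str],
                              max_retries: int = 1) -> str:
    """Call query(prompt); only on a content-policy error, sanitize and retry."""
    attempt_prompt = prompt
    for attempt in range(max_retries + 1):
        try:
            # When no error is raised, this is identical to calling query(prompt).
            return query(attempt_prompt)
        except ContentPolicyError:
            if attempt == max_retries:
                raise  # retries exhausted: surface the original failure
            attempt_prompt = sanitize(attempt_prompt)
    raise AssertionError("unreachable")
```

The wrapper is dormant in the sense used above: on inputs that never trigger the exception it makes one call with the unmodified prompt, so it cannot change behavior on tasks that already run cleanly.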
**Decision rules:**
- Single-task-magnitude dev deltas (~2.78pp) are noise; require mechanism + multi-category plausibility.
- Prefer execution-mechanics fixes (timeouts, retries, backgrounding) over reasoning menus.
- Prioritize changes whose activation is provably absent on dev — they stack gains monotonically.
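The noise threshold in the first rule follows from the split sizes: one task is worth 100/36 ≈ 2.78pp on dev and 100/53 ≈ 1.89pp on test. A small sketch of that arithmetic, with hypothetical helper names:

```python
def per_task_delta_pp(n_tasks: int) -> float:
    """Percentage points contributed by a single task on a split of n_tasks."""
    if n_tasks <= 0:
        raise ValueError("n_tasks must be positive")
    return 100.0 / n_tasks


def is_single_task_noise(baseline_correct: int, candidate_correct: int) -> bool:
    """True when the candidate differs from the baseline by at most one task."""
    return abs(candidate_correct - baseline_correct) <= 1


print(round(per_task_delta_pp(36), 2))  # 2.78 (dev, 36 tasks)
print(round(per_task_delta_pp(53), 2))  # 1.89 (test, 53 tasks)
```

Working in task counts rather than percentages makes it harder to mistake a one-task swing for a real effect.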
**Next:** Promote background-execution tip to trunk; continue trace-mining for dormant test-only patches.