Metadata-Version: 2.4
Name: evo-autoresearch-novelty-bench
Version: 0.1.1
Summary: Benchmark for measuring novelty in autonomous research-agent proposals, built on Prime Intellect's autonomous-speedrunning archive.
Author: Evo HQ
License: Apache-2.0
Project-URL: Homepage, https://github.com/evo-hq/autoresearch-novelty-bench
Project-URL: Source, https://github.com/evo-hq/autoresearch-novelty-bench
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: pandas>=2.2
Requires-Dist: pyarrow>=15.0
Requires-Dist: pydantic>=2.6
Requires-Dist: numpy>=1.26
Requires-Dist: huggingface_hub>=0.24
Provides-Extra: openai
Requires-Dist: openai>=1.40; extra == "openai"
Provides-Extra: bge
Requires-Dist: sentence-transformers>=2.7; extra == "bge"
Provides-Extra: dev
Requires-Dist: pytest>=8.0; extra == "dev"
Requires-Dist: ruff>=0.4; extra == "dev"
Requires-Dist: openai>=1.40; extra == "dev"
Requires-Dist: sentence-transformers>=2.7; extra == "dev"

# autoresearch-novelty-bench

A benchmark for testing whether an autonomous AI research agent proposes
**novel, mechanism-distinct hypotheses** that **anticipate breakthroughs**
later found by other researchers.

**By [Evo](https://evo-hq.com).** Built on Prime Intellect's
[autonomous-speedrunning archive](https://github.com/PrimeIntellect-ai/experiments-autonomous-speedrunning)
— two AI agents (Claude Code and Codex) competing on `modded-nanogpt`'s
optimization speedrun.

---

## What is this

You are given a snapshot in time: "here's where two agents got on the
modded-nanogpt speedrun by 2026-05-12 11:17." Your proposer agent writes
N candidate ideas for how to go faster. The benchmark grades each candidate:

- **Rediscovery?** It matches something already tried before this moment.
- **Anticipation?** It matches something that other researchers proved out
  *later* (you predicted a winning thread).
- **Novel?** Mechanistically new — no prior or future match.
- **Invalid?** Malformed or missing required structure.

Plus a diversity bonus for keeping the N candidates mechanistically
distinct, and a validity term for how many of the N parse cleanly.

---

## How the dataset was built

The upstream artifact is Prime Intellect's raw archive: 10,380 training
runs, 646 idea writeups, 337,648 validation checkpoints, across 8
scratchpads (2 agents × 4 waves). The build pipeline turns that into a
clean evaluation dataset.

**1. Parse everything.** Walk each scratchpad, normalize per-wave schemas,
parse training logs into per-checkpoint validation trajectories. Backfill
timestamps from log headers (`### START_ISO=`) — recovers real launch times
for 99.5% of runs.

**2. Link runs to ideas.** Five-tier matcher (heuristic prefix, abbreviation
index, THREAD walk, LLM matcher, dedup) links 63% of runs to a formal idea
writeup.

**3. Recover variant.py for every run** — four cascade passes:

| pass | trick | recovered |
|---|---|---|
| extract from log | scripts dump source to stdout before a `====` delimiter | 1,120 |
| dynamo warnings | pytorch's recompile logs include the variant.py path | 1,981 |
| sbatch stubs | each `<run_id>.sh` carries `--variant ".../<file>.py"` | 471 |
| name-stem fuzzy | `<mechanism>-s0` → `train_gpt_simple_<mechanism>.py` | 1,630 |

Net: **54% → 95.6% of runs have variant.py linkage.**

**4. Build the experiments catalog.** One row per training run (10,380),
with timestamp, verdict (improved/neutral/regressed/inconclusive), best
step count, linked variant + idea.

**5. Write a description for every experiment.** Three sources, in priority:
parent idea writeup (~6,580 rows), variant docstring, or — for the ~3,800
runs without either — an LLM (gpt-4o-mini, ~$0.26 total) decodes the run_id
shorthand + log preamble into a 1-2 sentence mechanism summary.

**6. Embed every description.** Two backends shipped: OpenAI
`text-embedding-3-large` (3072-dim, $0.51 total) and BGE-large (1024-dim,
free local fallback).

**7. Build 40 snapshots.** 5 time-anchored evaluation positions per scope
(at the 10/30/50/70/90 percentile of the scope's run arc). Each snapshot
freezes: priors visible by time T, future improvements still ahead,
rejected ideas the agent had killed by T, cross-agent visible work,
current best recipe.

**8. Calibrate the judge.** Generate 78 labeled cases (sampled from priors,
futures, paraphrases, cross-scopes, malformed stubs) → sweep cosine
threshold → tune to 0.75 (87% accuracy).
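
The threshold sweep in step 8 can be pictured roughly as follows. This is a minimal sketch, not the code behind `novelty_bench.calibration.evaluate`: the function name `sweep_threshold` is hypothetical, the toy similarities and labels are invented for illustration, and it assumes the calibration rule is simply "cosine at or above the threshold counts as a match".

```python
import numpy as np


def sweep_threshold(similarities, is_match, thresholds):
    """Return (best_threshold, best_accuracy) for a 'cosine >= t is a match' rule.

    Hypothetical helper: ties are resolved in favour of the first threshold tried.
    """
    sims = np.asarray(similarities, dtype=float)
    labels = np.asarray(is_match, dtype=bool)
    best_t, best_acc = None, -1.0
    for t in thresholds:
        # Accuracy = fraction of labeled cases where the rule agrees with the label.
        acc = float(np.mean((sims >= t) == labels))
        if acc > best_acc:
            best_t, best_acc = float(t), acc
    return best_t, best_acc


# Toy calibration cases (invented numbers, not the 78 real labeled cases).
toy_sims = [0.91, 0.82, 0.78, 0.74, 0.60, 0.41]
toy_labels = [True, True, True, False, False, False]
best_t, best_acc = sweep_threshold(
    toy_sims, toy_labels, [0.60, 0.65, 0.70, 0.75, 0.80, 0.85, 0.90]
)
print(best_t, best_acc)
```

On real calibration data the best accuracy will generally be below 1.0; the toy cases are separable only to keep the example short.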

Reproduce from scratch:

```bash
git clone https://github.com/PrimeIntellect-ai/experiments-autonomous-speedrunning <PI>
python -m novelty_bench.indexer                       --pi-repo <PI>
python -m novelty_bench.indexer.extract_embedded_sources --pi-repo <PI>
python -m novelty_bench.indexer.link_via_dynamo_warnings --pi-repo <PI>
python -m novelty_bench.indexer.link_via_sbatch_stubs    --pi-repo <PI>
python -m novelty_bench.indexer.link_via_name_stem       --pi-repo <PI>
python -m novelty_bench.indexer.experiments_llm_augment  --pi-repo <PI>
python -m novelty_bench.embeddings --table experiments --backend openai
python -m novelty_bench.embeddings --table experiments --backend bge
python -m novelty_bench.calibration.generate
python -m novelty_bench.calibration.evaluate --sweep
```

---

## Usage

```bash
pip install evo-autoresearch-novelty-bench
```

The parquet tables auto-download from
[`evo-hq/autoresearch-novelty-bench`](https://huggingface.co/datasets/evo-hq/autoresearch-novelty-bench)
on first use.

```python
import novelty_bench as nb

# Browse the dataset
nb.load_experiments(agent="codex", wave="v3", verdict="improved")
nb.load_snapshots(split="dev")     # 16 dev snapshots for tuning
nb.load_snapshots(split="test")    # 24 held-out test snapshots
nb.load_ideas()                    # 646 formal writeups

# Score one of your proposer's outputs
snap = nb.load_snapshots()
snap = snap[snap["snapshot_id"] == "snap_codex_v3_k0922"].iloc[0]
result = nb.score(snap, ["candidate_1.md", "candidate_2.md", "candidate_3.md"])
print(result.explain())
```

Render a working directory for your proposer to operate in (via the
scaffold repo):

```bash
git clone https://github.com/evo-hq/autoresearch-novelty-bench-scaffold
cd autoresearch-novelty-bench-scaffold
python build.py --snapshot snap_codex_v3_k0922 --out /tmp/workspace
# Your proposer agent reads /tmp/workspace/AGENTS.md and writes
# N candidates into /tmp/workspace/scratchpad/ideas/{slug}.md
```

The scaffold is **near-blank-slate**: the proposer sees the goal, the
wave's gating rule, the field-wide best step count at the snapshot's
wall-clock, and the current-best `variant.py` they're building on. Nothing
else — no peer ideas, no prior writeups, no future hints. They use their
own tools (web search, paper retrieval) to gather context. The context is
kept this minimal because PI's archive doesn't preserve per-file creation
dates, so mirroring the agent's broader scratchpad state at a past moment
would necessarily leak future work.

---

## Scoring

```
set_score = sum(per_candidate_scores)
          + 0.5 × diversity_bonus          # mean pairwise cosine distance between N candidates
          + 0.1 × validity_term            # fraction of N that passed the structural check
```

For each candidate, the judge checks outcomes in this priority order and assigns the first that applies:

| order | outcome | score | when |
|---|---|---|---|
| 1 | **invalid** | −1.0 | missing `## Proposal` section, no title, or body < 50 chars |
| 2 | **novel_validated** | +impact × confidence | matches a future-improved experiment. impact ∈ {1.0 frontier_idea, 0.6 improved_idea, 0.5 frontier_experiment, 0.4 improved_experiment} |
| 3 | **rediscovery** | −0.5 × rejection_mult × confidence | matches a prior. rejection_mult: none 1.0, failed 1.4, family_ruled_out 1.6, audit_noncompliant 1.6, existence_killed 2.0 |
| 4 | **novel_unvalidated** | +0.3 × confidence | no future match, no prior match |

Future-first priority: copying a prior that ended up on a winning thread
counts as anticipation, not rediscovery.
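
The arithmetic in the formula and table above can be sketched as follows. This illustrates the published weights only and is not the package's implementation: the names `candidate_score`, `diversity_bonus`, and `set_score` are hypothetical, cosine distance is assumed to mean `1 - cosine similarity`, and the diversity term is assumed to be computed over whatever candidate embeddings are passed in.

```python
from itertools import combinations

import numpy as np

# Weights copied from the scoring table above.
IMPACT = {
    "frontier_idea": 1.0,
    "improved_idea": 0.6,
    "frontier_experiment": 0.5,
    "improved_experiment": 0.4,
}
REJECTION_MULT = {
    "none": 1.0,
    "failed": 1.4,
    "family_ruled_out": 1.6,
    "audit_noncompliant": 1.6,
    "existence_killed": 2.0,
}


def candidate_score(outcome, confidence=1.0, impact_kind=None, rejection="none"):
    """Score one candidate from its judged outcome (hypothetical helper)."""
    if outcome == "invalid":
        return -1.0
    if outcome == "novel_validated":
        return IMPACT[impact_kind] * confidence
    if outcome == "rediscovery":
        return -0.5 * REJECTION_MULT[rejection] * confidence
    if outcome == "novel_unvalidated":
        return 0.3 * confidence
    raise ValueError(f"unknown outcome: {outcome!r}")


def diversity_bonus(embeddings):
    """Mean pairwise cosine distance, assuming distance = 1 - cosine similarity."""
    vecs = np.asarray(embeddings, dtype=float)
    if len(vecs) < 2:
        return 0.0
    unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    dists = [1.0 - float(unit[i] @ unit[j])
             for i, j in combinations(range(len(unit)), 2)]
    return float(np.mean(dists))


def set_score(per_candidate_scores, embeddings, n_valid):
    """Combine per-candidate scores with the diversity and validity terms."""
    validity_term = n_valid / len(per_candidate_scores)
    return (sum(per_candidate_scores)
            + 0.5 * diversity_bonus(embeddings)
            + 0.1 * validity_term)


# Two valid candidates with orthogonal toy embeddings (invented values).
scores = [
    candidate_score("novel_validated", confidence=0.9, impact_kind="frontier_idea"),
    candidate_score("novel_unvalidated", confidence=1.0),
]
total = set_score(scores, [[1.0, 0.0], [0.0, 1.0]], n_valid=2)
print(round(total, 3))
```

Because rediscovering an `existence_killed` idea costs up to 1.0 while an unvalidated novel idea earns at most 0.3, the weights penalize retreading rejected ground more heavily than they reward untested novelty.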

**Default judge backend: `llm-hybrid`.** Cosine retrieval picks the top-20
nearest priors and top-20 nearest futures by `text-embedding-3-large`, then
**gpt-5-mini** (`reasoning_effort=medium`) classifies the candidate with
structured JSON output (classification + matched_id + confidence +
reasoning). The reasoning step catches paraphrases that cosine misses
(e.g. cosine 0.73 → LLM "novel_validated, confidence 0.95"). Final score
is multiplied by the LLM's confidence so weak matches are softened.
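
The retrieval stage of `llm-hybrid` amounts to a top-k nearest-neighbour lookup by cosine similarity. Here is a minimal NumPy sketch of that idea, assuming the prior (or future) embeddings are stacked as rows of a matrix; `top_k_by_cosine` is a hypothetical name, and the package may do this differently.

```python
import numpy as np


def top_k_by_cosine(query, matrix, k=20):
    """Return (indices, similarities) of the k rows of `matrix` closest to `query`.

    Hypothetical helper: rows are ranked by cosine similarity, highest first.
    """
    q = np.asarray(query, dtype=float)
    m = np.asarray(matrix, dtype=float)
    sims = (m @ q) / (np.linalg.norm(m, axis=1) * np.linalg.norm(q))
    order = np.argsort(-sims)[:k]
    return order, sims[order]


# Toy 2-d embeddings (invented); the real vectors are 3072-dimensional.
priors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
idx, sims = top_k_by_cosine([1.0, 0.0], priors, k=2)
print(idx.tolist(), sims.round(3).tolist())
```

Only the shortlisted neighbours need to reach the LLM, which is how a cosine 0.73 paraphrase can still be recognized as a match.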

Cost: ~$0.005 per N=5 candidate set → ~$0.12 per 24-snapshot benchmark.

Opt in to the deterministic backend with `--judge-backend cosine` for
zero-LLM, pinned-leaderboard runs (cosine ≥ 0.75 threshold, ~$0.0003/set).

---

## Limits

What we **can't** recreate from the upstream archive: the agent's
conversation with their orchestrator, their accumulated context window,
real-time search queries, decision rationale (we see *what* ran, not
*why*), PI operator interventions, cluster state, and the agent's training
cutoff. See the design notes for the full enumeration.

## License

Apache 2.0. Raw `ideas/*.md` and `variants/*.py` files belong to Prime
Intellect; this package ships only derived metadata.
