Metadata-Version: 2.4
Name: surprisal-search
Version: 0.1.0
Summary: MCTS with Bayesian surprise for open-ended scientific discovery
Project-URL: Homepage, https://github.com/jbarnes850/surprisal
Project-URL: Repository, https://github.com/jbarnes850/surprisal
Author-email: Jarrod Barnes <jbarnes850@gmail.com>
License-Expression: MIT
License-File: LICENSE
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.12
Requires-Dist: click>=8.0
Requires-Dist: numpy>=1.24
Requires-Dist: scipy>=1.11
Description-Content-Type: text/markdown

# Surprisal

MCTS with Bayesian surprise for open-ended scientific discovery.

Surprisal is inspired by AllenAI's [AutoDiscovery](https://github.com/allenai/autodiscovery) and the Surprisal-Guided Selection paper cited below. It explores a research domain by generating literature-grounded hypotheses, running bounded experiments in a sandbox with real tools and network access, and ranking branches by how much the evidence changes the model's beliefs.

## Quick start

```bash
curl -fsSL https://raw.githubusercontent.com/jbarnes850/surprisal/main/install.sh | bash

uv run surprisal init \
  --domain "AI for scientific discovery" \
  --seed "LLM self-evaluation accuracy drops as task compositional depth increases"

uv run surprisal explore --budget 10 --concurrency 1
uv run surprisal status --tree
uv run surprisal export --top 5 --format md
```

The default backend (`auto`) runs experiments directly on your host with no Docker dependency. Progress streams through generator, runner, review, and belief phases.

If you switch to `backend = "docker"` for sandboxed execution, Surprisal will build `surprisal-cpu:latest` on first run and prompt for a `claude setup-token` if your CLI auth is subscription-backed.

Codex-based analysis and review stages run from per-experiment workspaces under `/tmp/.../experiments/node_*`; because those workspaces are not git repositories, the CLI invocation explicitly skips git-repo enforcement there.

## What it does

Each expansion runs a per-node FSM:

1. `experiment_generator`: Claude searches recent literature and proposes one hypothesis plus one executable plan.
2. `experiment_runner`: a sandbox backend executes the plan with Python, Bash, local files, public network access, Hugging Face resources, and optional W&B logging.
3. `experiment_analyst`: Codex or Claude reviews the execution for fidelity and validity.
4. `experiment_reviewer`: Codex or Claude decides whether the evidence is usable.
5. `experiment_reviser`: if needed, the plan is revised and retried within configured bounds.
6. `hypothesis_generator`: Claude formalizes the post-experiment hypothesis record.
7. `belief_elicitation`: Claude samples prior and posterior binary judgments and Surprisal computes Bayesian surprise.
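The shape of this per-node loop can be sketched as a small state walk. The phase names and the `run_node` helper below are hypothetical illustrations of the sequence described above; the real FSM lives in `src/surprisal/fsm_runner.py`.

```python
from enum import Enum, auto
from typing import Callable

class Phase(Enum):
    # Hypothetical phase names mirroring the seven agent roles above.
    GENERATE = auto()        # experiment_generator
    RUN = auto()             # experiment_runner
    ANALYZE = auto()         # experiment_analyst
    REVIEW = auto()          # experiment_reviewer
    REVISE = auto()          # experiment_reviser
    FORMALIZE = auto()       # hypothesis_generator
    ELICIT_BELIEFS = auto()  # belief_elicitation

def run_node(review_ok: Callable[[], bool], max_revisions: int = 1) -> list[Phase]:
    """Trace one node through the FSM, with a bounded revise-and-retry loop.

    review_ok: callable that returns True once the reviewer accepts the evidence.
    """
    trace = [Phase.GENERATE, Phase.RUN, Phase.ANALYZE, Phase.REVIEW]
    revisions = 0
    while not review_ok() and revisions < max_revisions:
        trace += [Phase.REVISE, Phase.RUN, Phase.ANALYZE, Phase.REVIEW]
        revisions += 1
    trace += [Phase.FORMALIZE, Phase.ELICIT_BELIEFS]
    return trace
```

On acceptance the node proceeds straight to hypothesis formalization; on rejection it retries at most `agents.revision_attempts` times before giving up.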

The deterministic MCTS layer never calls LLMs directly. It only consumes node state and reward signals.
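Because the tree policy is deterministic, its selection rule is just arithmetic over visit counts and rewards. A minimal sketch of standard UCT scoring with virtual loss, assuming the conventional formulation (the actual policy is in `src/surprisal/mcts.py` and may differ in detail):

```python
import math

def uct_score(total_reward: float, visits: int, parent_visits: int,
              c_explore: float = 1.414, virtual_loss: int = 0) -> float:
    """UCT value for one child: exploitation mean plus an exploration bonus.

    virtual_loss temporarily inflates the visit count during parallel
    selection so concurrent workers spread across different branches.
    """
    n = visits + virtual_loss
    if n == 0:
        return math.inf  # unvisited children are always tried first
    exploitation = total_reward / n
    exploration = c_explore * math.sqrt(math.log(parent_visits) / n)
    return exploitation + exploration
```

Raising `mcts.c_explore` biases selection toward less-visited branches; the `mcts.virtual_loss` setting feeds the `virtual_loss` term while a worker holds a node.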

## Runtime model

- Claude is required for research-facing roles: generator, hypothesis formalization, and belief elicitation.
- If Codex is available, it handles analysis, review, and revision roles.
- If Codex is not available, Claude handles all roles.
- Agent sessions persist per branch in `sessions.json`: Claude research sessions, code-analysis sessions, and runner sessions are tracked separately and resumed automatically across nodes on the same branch.
- Belief elicitation forks from the persisted research session instead of mutating it, so prior and posterior samples stay independent while still inheriting branch context.
- Experiment execution uses the configured sandbox backend:
  - `auto` (default): host-native runner, no Docker required, GPU autodetection
  - `docker`: Docker-based sandbox for isolated execution (requires Docker + `claude setup-token`)
  - `hf_jobs`: one-shot Hugging Face Jobs execution path for remote batch runs

## Commands

| Command | Purpose | Machine-readable output |
| --- | --- | --- |
| `surprisal init` | Create or reuse an exploration for a domain | `--json` |
| `surprisal explore` | Run exploration on the latest or a specific exploration | `--json` |
| `surprisal status` | Show exploration summary and optional tree | `--json` |
| `surprisal export` | Export results as markdown, CSV, JSON, or JSONL training data | `--format json` or `--json` |
| `surprisal resume` | Alias for `explore` against the latest or a specific exploration | `--json` |
| `surprisal prune` | Mark low-value branches as pruned | `--json` |
| `surprisal config` | Show, set, or reset config | `--json` |

`resume` resumes an exploration, not a per-agent conversational session.

## Architecture

Three layers:

1. `src/surprisal/mcts.py`
   Deterministic tree policy, UCT scoring, progressive widening, and backpropagation.
2. `src/surprisal/db.py`, `src/surprisal/exploration.py`, `src/surprisal/workspace.py`
   SQLite WAL persistence plus per-branch workspaces.
3. `src/surprisal/orchestrator.py`, `src/surprisal/fsm_runner.py`
   Async worker orchestration and the multi-agent experiment FSM.

Key files:

- `src/surprisal/fsm_runner.py`: per-node live FSM
- `src/surprisal/orchestrator.py`: worker pool, selection, branching, and dedup scheduling
- `src/surprisal/bayesian.py`: Beta posterior updates and belief-shift scoring
- `src/surprisal/prompts/`: prompt contracts for generator, runner, analyst, reviewer, reviser, and belief stages

## Configuration

Exploration state defaults to `~/.surprisal`.

Config is resolved from the first matching location:

- `${SURPRISAL_HOME}/config.toml` when `SURPRISAL_HOME` is set
- `~/.surprisal/config.toml` when that file exists
- otherwise `${XDG_CONFIG_HOME:-~/.config}/surprisal/config.toml`
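The lookup order above can be expressed as a short resolver. This is a sketch of the documented precedence, not the actual loader:

```python
import os
from pathlib import Path

def config_path() -> Path:
    """Resolve config.toml following the documented precedence:
    SURPRISAL_HOME, then ~/.surprisal, then the XDG config directory."""
    if "SURPRISAL_HOME" in os.environ:
        return Path(os.environ["SURPRISAL_HOME"]) / "config.toml"
    home_cfg = Path.home() / ".surprisal" / "config.toml"
    if home_cfg.exists():
        return home_cfg
    xdg = os.environ.get("XDG_CONFIG_HOME", str(Path.home() / ".config"))
    return Path(xdg) / "surprisal" / "config.toml"
```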

Show the active config:

```bash
uv run surprisal config --show
```

Live config knobs:

| Setting | Default | Description |
| --- | --- | --- |
| `general.default_budget` | `100` | Default exploration budget |
| `general.default_concurrency` | `2` | Default worker count |
| `mcts.c_explore` | `1.414` | UCT exploration constant |
| `mcts.k_progressive` | `1.0` | Progressive widening coefficient |
| `mcts.alpha_progressive` | `0.5` | Progressive widening exponent |
| `mcts.max_depth` | `30` | Maximum tree depth |
| `mcts.belief_samples` | `10` | Samples per prior and posterior belief phase (set higher for publication-grade runs) |
| `mcts.virtual_loss` | `2` | Virtual loss applied during parallel selection |
| `mcts.dedup_interval` | `50` | Run deduplication every N completed expansions |
| `agents.claude_model` | `opus` | Claude model for research roles |
| `agents.codex_model` | `gpt-5.4` | Codex model for analysis, review, and revision roles |
| `agents.max_turns` | `20` | Max Claude turns per invocation |
| `agents.code_attempts` | `6` | Total runner attempts before failure |
| `agents.revision_attempts` | `1` | Total plan revisions after rejection |
| `agents.generator_timeout` | `180` | Generator timeout in seconds |
| `sandbox.backend` | `auto` | `auto` (host-native, recommended), `docker` (sandboxed), or `hf_jobs` (remote) |
| `sandbox.image` | `auto` | Docker sandbox image tag (only used with `backend = "docker"`) |
| `sandbox.gpu` | `true` | Enable GPU passthrough for the Docker sandbox |
| `sandbox.memory_limit` | `16g` | Docker sandbox memory limit |
| `sandbox.cpu_limit` | `4` | Docker sandbox CPU limit |
| `sandbox.timeout` | `1800` | Sandbox timeout in seconds |
| `sandbox.network` | `true` | Allow public network access in the sandbox |
| `sandbox.hf_flavor` | `a10g-small` | HF Jobs hardware flavor |
| `sandbox.hf_timeout` | `2h` | HF Jobs timeout |
| `belief.provider` | `claude` | Belief elicitation provider: `claude` (Likert sampling) or `openrouter` (logprob-based) |
| `belief.model` | `""` | OpenRouter model ID for belief elicitation (e.g., `minimax/minimax-m2.5`) |
| `belief.samples` | `30` | Samples per prior and posterior belief phase |
| `belief.kl_scale` | `5.0` | KL divergence scaling factor for Bayesian surprise |
| `belief.evidence_weight` | `2.0` | Evidence weight for posterior Beta fitting |
| `credentials.wandb_api_key` | `""` | Optional W&B API key |
| `credentials.hf_token` | `""` | Optional Hugging Face token |
| `credentials.claude_oauth_token` | `""` | Cached Claude OAuth token for Docker runner (auto-prompted on first run) |
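The progressive-widening knobs (`mcts.k_progressive`, `mcts.alpha_progressive`) interact in the standard way: a node's allowed child count grows sublinearly with its visit count. A sketch under that standard formulation, which may not match the shipped code exactly:

```python
import math

def allowed_children(visits: int, k: float = 1.0, alpha: float = 0.5) -> int:
    """Progressive widening: a node with n visits may hold at most
    ceil(k * n^alpha) children, so branching expands only as a node
    accumulates evidence. Defaults mirror the config table above."""
    return max(1, math.ceil(k * visits ** alpha))
```

With the defaults, a node needs roughly 100 visits before it can hold 10 children, which keeps early exploration narrow and deep.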

## Belief calibration

Surprisal computes Bayesian surprise by comparing prior and posterior belief distributions. Two providers are available:

- **Claude** (default): Samples Likert-scale judgments (`definitely_true` through `definitely_false`) via concurrent Claude calls. Higher fidelity but more API calls.
- **OpenRouter**: Single-call logprob-based estimation. Faster and cheaper. Requires an OpenRouter API key.
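Whichever provider supplies the samples, the surprise score reduces to a KL divergence between the prior and posterior Beta fits. A sketch using the closed-form Beta-Beta KL (SciPy is already a dependency); the `1 - exp(-kl / kl_scale)` squashing is an assumption for illustration, and the shipped scoring lives in `src/surprisal/bayesian.py`:

```python
import math
from scipy.special import betaln, psi  # psi is the digamma function

def beta_kl(a1: float, b1: float, a2: float, b2: float) -> float:
    """KL(Beta(a1, b1) || Beta(a2, b2)) in nats, closed form."""
    return (betaln(a2, b2) - betaln(a1, b1)
            + (a1 - a2) * psi(a1)
            + (b1 - b2) * psi(b1)
            + (a2 - a1 + b2 - b1) * psi(a1 + b1))

def surprise(prior: tuple[float, float], posterior: tuple[float, float],
             kl_scale: float = 5.0) -> float:
    """Bayesian surprise: KL from prior to posterior, scaled by
    belief.kl_scale and squashed into [0, 1). Illustrative only."""
    kl = beta_kl(*posterior, *prior)
    return 1.0 - math.exp(-kl / kl_scale)
```

An experiment that leaves beliefs unchanged scores zero; one that shifts a symmetric prior toward strong agreement scores high.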

To use OpenRouter belief elicitation:

```bash
cp .env.example .env
# Add your OpenRouter API key to .env

uv run surprisal config --set belief.provider openrouter
uv run surprisal config --set belief.model minimax/minimax-m2.5
```

Prior beliefs are clamped to [0.1, 0.9] to prevent degenerate Beta distributions from overconfident models. A calibration warning is logged when clamping shifts the prior mean by more than 0.05.
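The clamping step described above can be sketched as follows. The helper name and per-sample clamping are hypothetical; only the [0.1, 0.9] bounds and the 0.05 warning threshold come from the text:

```python
import logging

def clamp_prior(samples: list[float], lo: float = 0.1, hi: float = 0.9,
                warn_shift: float = 0.05) -> list[float]:
    """Clamp prior probabilities into [lo, hi] so an overconfident model
    cannot collapse the Beta fit at 0 or 1. Hypothetical helper: warns
    when clamping moves the prior mean by more than warn_shift."""
    clamped = [min(max(p, lo), hi) for p in samples]
    before = sum(samples) / len(samples)
    after = sum(clamped) / len(clamped)
    if abs(after - before) > warn_shift:
        logging.warning("prior clamping shifted mean by %.3f", after - before)
    return clamped
```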

## Literature grounding

The generator prefers alphaxiv MCP when available and falls back to the HuggingFace Papers API otherwise.

One-time alphaxiv setup:

```bash
claude mcp add --transport http alphaxiv https://api.alphaxiv.org/mcp/v1
```

Each hypothesis stores the papers that motivated it.

## Validation

Run the test suite:

```bash
uv run pytest tests/ -q --tb=short
```

## References

- Agarwal et al., [AutoDiscovery: Open-ended Scientific Discovery via Bayesian Surprise](https://openreview.net/forum?id=kJqTkj2HhF)
- Shi and Evans, [Surprising combinations of research contents and contexts are related to impact](https://www.nature.com/articles/s41467-023-36741-4)
- Barnes et al., [Surprisal-Guided Selection](https://arxiv.org/abs/2602.07670)

## License

MIT
