Metadata-Version: 2.4
Name: tooltree-mcts
Version: 0.1.0
Summary: Python implementation of ToolTree (ICLR 2026): MCTS-based tool planning for LLM agents with dual-feedback evaluation and bidirectional pruning
Author: ToolTree Team
License-Expression: MIT
License-File: LICENSE
Keywords: agents,llm,mcts,tool-planning,tree-search
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Requires-Dist: httpx
Requires-Dist: pydantic>=2.0
Provides-Extra: all
Requires-Dist: tooltree[anthropic]; extra == 'all'
Requires-Dist: tooltree[dev]; extra == 'all'
Requires-Dist: tooltree[gemini]; extra == 'all'
Requires-Dist: tooltree[langgraph]; extra == 'all'
Requires-Dist: tooltree[nest]; extra == 'all'
Requires-Dist: tooltree[openai]; extra == 'all'
Provides-Extra: anthropic
Requires-Dist: anthropic>=0.40; extra == 'anthropic'
Provides-Extra: dev
Requires-Dist: mypy; extra == 'dev'
Requires-Dist: pytest; extra == 'dev'
Requires-Dist: pytest-asyncio; extra == 'dev'
Requires-Dist: ruff; extra == 'dev'
Provides-Extra: gemini
Requires-Dist: google-genai; extra == 'gemini'
Provides-Extra: langgraph
Requires-Dist: langgraph>=0.2; extra == 'langgraph'
Provides-Extra: mcp
Requires-Dist: mcp>=1.0; extra == 'mcp'
Provides-Extra: nest
Requires-Dist: nest-asyncio; extra == 'nest'
Provides-Extra: openai
Requires-Dist: openai>=1.0; extra == 'openai'
Description-Content-Type: text/markdown

# ToolTree

> **This is an open-source Python implementation of the paper:**
> [ToolTree: Efficient LLM Agent Tool Planning via Dual-Feedback Monte Carlo Tree Search and Bidirectional Pruning](https://arxiv.org/abs/2603.12740)
> Shuo Yang, Soyeon Caren Han, Yihao Ding, Shuhe Wang, Eduard Hoy — ICLR 2026
>
> All credit for the algorithm, equations, and core ideas belongs to the original authors.
> This repo provides a pip-installable implementation for practical use.

---

MCTS-based tool planning for LLM agents. Instead of picking tools greedily, ToolTree runs a Monte Carlo Tree Search over possible tool sequences, scores each path with dual LLM feedback, and returns the optimal multi-step plan.

---

## Why ToolTree

When an LLM agent needs to answer a question using tools, it typically tries one tool at a time and hopes for the best. This works for simple tasks but fails on multi-hop reasoning — where the output of tool A determines which tool to call next, and a wrong first choice wastes all subsequent calls.

ToolTree treats tool selection as a **search problem**:

- Explores multiple tool sequences in parallel
- Prunes bad branches *before* executing them (saves API calls)
- Prunes branches *after* seeing bad outputs (stops wasting rollouts)
- Caches duplicate tool calls across the tree
- Stops early when the search converges

---

## Install

```bash
# Core library (no LLM provider)
pip install tooltree

# With OpenAI (GPT-4o, GPT-5.1, etc.)
pip install tooltree[openai]

# With Google Gemini
pip install tooltree[gemini]

# With Anthropic Claude
pip install tooltree[anthropic]

# Everything
pip install tooltree[all]
```

Requires Python 3.10+.

---

## Quick Start

```python
from tooltree import ToolTreePlanner, Tool

# 1. Define your tools
def search_web(query: str) -> str:
    # your implementation
    return f"Results for {query}"

search_tool = Tool(
    name="search_web",
    description="Search the web for factual information.",
    parameters={
        "query": {"type": "string", "description": "Search query"},
    },
    execute=search_web,
)

# 2. Create the planner
planner = ToolTreePlanner(
    tools=[search_tool],
    llm="openai:gpt-4o-mini",   # or "gemini:gemini-2.0-flash", "anthropic:claude-haiku-4-5-20251001"
)

# 3. Solve
result = planner.solve_sync("What is the boiling point of ethanol in Celsius?")

print(result.answer)
print(result.tool_trace)   # list of {tool, arguments, output}
```

Set your API key before running:

```bash
# OpenAI
export OPENAI_API_KEY="sk-..."

# Gemini
export GEMINI_API_KEY="..."

# Anthropic
export ANTHROPIC_API_KEY="sk-ant-..."
```

---

## Multi-hop Example

The real value of ToolTree shows on tasks that require chaining tools — where you can't know which tool to call second until you see the output of the first.

```python
from tooltree import ToolTreePlanner, Tool, Config

tools = [
    Tool(
        name="search_in_file",
        description="Search for a regex pattern in a file. Returns matching lines with line numbers.",
        parameters={
            "path":    {"type": "string", "description": "File path"},
            "pattern": {"type": "string", "description": "Regex pattern"},
        },
        execute=search_in_file,
    ),
    Tool(
        name="read_file_lines",
        description="Read a range of lines from a file. Use after search_in_file.",
        parameters={
            "path":       {"type": "string", "description": "File path"},
            "start_line": {"type": "string", "description": "First line (1-indexed)"},
            "end_line":   {"type": "string", "description": "Last line (inclusive)"},
        },
        execute=read_file_lines,
    ),
]

config = Config(
    max_rollouts=20,
    max_depth=6,
    tau_pre=0.3,
    tau_post=0.4,
)

planner = ToolTreePlanner(tools=tools, llm="openai:gpt-4o-mini", config=config)

result = planner.solve_sync(
    "What is the default value of max_rollouts in the Config class?"
)
# MCTS discovers: search_in_file -> read_file_lines (2-hop chain)
```

See [`examples/integration_test.py`](examples/integration_test.py) for a full runnable version with 4 test queries.

---

## Configuration

```python
from tooltree import Config

config = Config(
    max_rollouts=60,           # Max MCTS iterations (default: 60)
    max_depth=15,              # Max tool chain length (default: 15)
    tau_pre=0.3,               # Pre-pruning threshold — candidates below this are dropped before execution
    tau_post=0.4,              # Post-pruning threshold — branches below this stop expanding
    top_k=20,                  # Keep top-K candidates after pre-pruning
    lambda_exploration=1.4,    # UCT exploration constant (paper Eq. 1)
    early_stop_patience=10,    # Stop after N rollouts with no Q improvement
    early_stop_delta=1e-3,     # Minimum Q improvement to count as progress
    tool_timeout=30.0,         # Max seconds per tool call
    max_concurrent_llm_calls=5 # Parallel LLM calls during expand phase
)
```

| Parameter | Default | What it controls |
|---|---|---|
| `max_rollouts` | 60 | How many MCTS iterations to run. More = better quality, higher cost. |
| `max_depth` | 15 | Maximum number of tool calls in a single path. |
| `tau_pre` | 0.3 | Pre-execution pruning threshold. Raise to prune more aggressively. |
| `tau_post` | 0.4 | Post-execution pruning threshold. Raise to abandon bad branches faster. |
| `top_k` | 20 | Max candidates kept after pre-pruning. |
| `lambda_exploration` | 1.4 | UCT exploration constant. Higher = more exploration of untested paths. |
| `early_stop_patience` | 10 | Rollouts without improvement before stopping early. |
| `tool_timeout` | 30.0 | Seconds before a tool call is killed. |
| `max_concurrent_llm_calls` | 5 | Semaphore limit for parallel LLM calls (rate limit protection). |

---

## LLM Providers

| Provider | Spec prefix | Extra | API key env var |
|---|---|---|---|
| OpenAI | `openai:` | `tooltree[openai]` | `OPENAI_API_KEY` |
| Google Gemini | `gemini:` | `tooltree[gemini]` | `GEMINI_API_KEY` or `GOOGLE_API_KEY` |
| Anthropic Claude | `anthropic:` | `tooltree[anthropic]` | `ANTHROPIC_API_KEY` |

Pass as a string to `ToolTreePlanner`:

```python
planner = ToolTreePlanner(tools=tools, llm="openai:gpt-4o-mini")
planner = ToolTreePlanner(tools=tools, llm="gemini:gemini-2.0-flash")
planner = ToolTreePlanner(tools=tools, llm="anthropic:claude-haiku-4-5-20251001")
```

Or pass a provider instance directly:

```python
from tooltree.llm import create_llm

llm = create_llm("openai:gpt-4o-mini", temperature=0.2, max_tokens=512)
planner = ToolTreePlanner(tools=tools, llm=llm)
```

Use a separate, stronger model as the judge (evaluates tool outputs while a cheaper model drives search):

```python
planner = ToolTreePlanner(
    tools=tools,
    llm="openai:gpt-4o-mini",      # drives search (more calls)
    judge_llm="openai:gpt-4o",     # evaluates quality (fewer calls)
)
```

### Custom / Local LLMs

Any class that extends `LLMProvider` works:

```python
from tooltree.llm.base import LLMProvider

class MyProvider(LLMProvider):
    async def complete(self, messages: list[dict], **kwargs) -> str:
        # call your local model / API
        ...

planner = ToolTreePlanner(tools=tools, llm=MyProvider())
```

---

## API Reference

### `ToolTreePlanner`

```python
ToolTreePlanner(
    tools: Sequence[Tool],
    llm: str | LLMProvider = "gemini:gemini-2.0-flash",
    config: Config | None = None,
    judge_llm: str | LLMProvider | None = None,
)
```

| Method | Description |
|---|---|
| `solve(query) -> Awaitable[SearchResult]` | Async search |
| `solve_sync(query) -> SearchResult` | Sync wrapper — works in scripts, Jupyter, and nested event loops |

### `SearchResult`

```python
result.answer          # str — final answer synthesized from best trajectory
result.tool_trace      # list[dict] — [{tool, arguments, output}, ...]
result.total_rollouts  # int — iterations run
result.best_q          # float — best Q-value (0.0 to 1.0)
result.tree_stats      # dict — {total_nodes, max_depth_reached, early_stopped}
```

### `Tool`

```python
Tool(
    name: str,                          # e.g. "search_web"
    description: str,                   # Shown to the LLM — be specific
    parameters: dict[str, Any],         # JSON Schema-style: {param: {type, description}}
    execute: Callable,                  # Sync or async function
)
```

---

## How It Works

ToolTree runs MCTS over the space of possible tool call sequences:

1. **Select** — Traverse the tree via UCT score until a leaf node
2. **Expand** — Ask the LLM to generate candidate tool calls; pre-evaluate each (score before executing); drop candidates below `tau_pre`; keep top-K
3. **Execute** — Run the chosen tool with timeout; cache the result
4. **Post-evaluate** — Score the actual output; if below `tau_post`, mark the branch as non-expandable
5. **Backpropagate** — Update Q-values up to root
6. **Repeat** until `max_rollouts` or early stop

After search: extract the highest-Q path, ask the LLM to synthesize a final answer from the trajectory.

**Dual pruning** means bad tool calls are rejected at two checkpoints — before they waste API calls, and after they waste search budget. This keeps rollout cost low on multi-hop tasks where the naive approach would try many dead-end paths.

---

## Running Tests

```bash
# Install dev dependencies
pip install -e ".[dev]"

# Run all unit tests (no API key needed — uses MockLLM)
pytest tests/ -v

# Run integration test (requires API key)
TOOLTREE_MODEL=openai:gpt-4o-mini python -u examples/integration_test.py
```

---

## Project Structure

```
src/tooltree/
├── planner.py              # ToolTreePlanner — main entry point
├── config.py               # Config dataclass
├── core/
│   ├── tree.py             # Action, SearchState, MCTSNode
│   ├── mcts.py             # MCTSPlanner — full search loop
│   ├── pruning.py          # pre_prune(), post_prune()
│   └── cache.py            # ResultCache — deduplicates tool calls
├── evaluation/
│   ├── pre_evaluator.py    # Scores tool calls before execution
│   ├── post_evaluator.py   # Scores tool outputs after execution
│   └── prompts.py          # LLM judge prompts (paper Appendix B)
├── llm/
│   ├── base.py             # LLMProvider ABC, create_llm() factory
│   ├── openai_provider.py
│   ├── gemini_provider.py
│   └── anthropic_provider.py
└── tools/
    ├── schema.py           # Tool dataclass
    ├── registry.py         # ToolRegistry
    └── executor.py         # ToolExecutor with timeout
```

---

## License

MIT
