Metadata-Version: 2.4
Name: openrlm
Version: 0.1.0
Summary: Recursive LLM agent harness and CLI with a persistent IPython REPL
Author-email: Shankar <orthogonal.eigenvector@gmail.com>
License-Expression: MIT
Project-URL: Homepage, https://github.com/mailshanx/openrlm
Project-URL: Repository, https://github.com/mailshanx/openrlm
Project-URL: Issues, https://github.com/mailshanx/openrlm/issues
Keywords: llm,agent,cli,repl,sandbox,recursive
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Environment :: Console
Classifier: Operating System :: OS Independent
Classifier: Topic :: Software Development :: Libraries
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.12
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: aiodocker>=0.23.0
Requires-Dist: openrouter>=0.6.0
Requires-Dist: python-dotenv>=1.2.1
Requires-Dist: aiohttp>=3.9.0
Requires-Dist: httpx[http2]>=0.28.1
Requires-Dist: anthropic>=0.83.0
Provides-Extra: contrib
Requires-Dist: parallel-web>=0.3.4; extra == "contrib"
Dynamic: license-file

# openrlm

A recursive language model (RLM) agent harness with persistent IPython REPL environments. Usable as a CLI, embedded in an existing harness, or as a library.

Each agent gets a stateful IPython environment where it can persist variables, define functions, and run computations across multiple turns. Agents can programmatically spawn sub-agents, each with their own isolated REPL, to arbitrary depth.

## Why

**Why RLM?** The [Recursive Language Model](https://arxiv.org/abs/2512.24601) paper from MIT shows that recursive decomposition significantly improves performance on long-context and complex reasoning tasks. An agent that can spawn sub-agents to handle sub-problems — each with their own scratch space — outperforms flat agent loops.

**Why this implementation?** The original RLM implementation treats the user prompt as a variable the LLM greps and chunks. In practice, you want agents to operate on files and data in an application-specific context with custom tools. This implementation provides:

- **Custom host functions.** Define tools (search, APIs, domain-specific operations) that execute on the host but appear as plain async Python functions inside the agent's REPL. Serialization is invisible to the LLM.
- **Persistent REPL state.** Agents persist data to an IPython environment they can access across turns — variables, imports, and function definitions all survive between tool calls. Some ARC-AGI implementations demonstrated superior performance with this pattern, but lacked recursive sub-agents.
- **Cheap sub-agent spawning.** Sub-agents are forked processes. The fork server pre-imports expensive packages (numpy, pandas, etc.), then calls `gc.freeze()` before forking. Children inherit all imported modules via copy-on-write pages, and `gc.freeze()` prevents the garbage collector from scanning those objects — which would dirty the pages and force real memory copies. The OS only allocates memory for new data each sub-agent creates. A single machine can support hundreds to thousands of concurrent sub-agents.
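
The copy-on-write trick in the last bullet is a standard CPython pattern; here is a minimal sketch, assuming a hypothetical `run_sandbox_loop()` for the per-agent REPL (illustrative only, not openrlm's actual fork server):

```python
import gc
import os

# Expensive imports happen once, in the parent process.
import numpy    # noqa: F401
import pandas   # noqa: F401

gc.disable()
gc.freeze()     # move every existing object into a permanent, never-collected generation

def spawn_sandbox() -> int:
    pid = os.fork()
    if pid == 0:
        # Child: shares all pre-imported module memory via copy-on-write pages.
        gc.enable()
        run_sandbox_loop()   # hypothetical: serve one agent's IPython REPL
        os._exit(0)
    return pid               # parent: track the child's PID for lifecycle management
```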

## Architecture

openrlm has two layers: a **core** that handles single-message execution, and a **harness** that adds multi-turn state management on top.

### Core: one-shot execution

The core takes a single message and runs a complete LLM↔REPL loop: the LLM emits a `python` tool call, the core executes it in a persistent IPython sandbox, returns the output, and repeats until the LLM responds with text. Everything — computation, file I/O, web requests, sub-agent orchestration — is Python code the LLM writes and runs through that single tool. Host functions you register appear as plain `await fn(...)` calls inside the sandbox. Sub-agent functions (`create_agent`, `run_agent`, `await_result`) work the same way — the LLM doesn’t know these are remote calls.

`AgentRuntime` owns the infrastructure: the fork server (process lifecycle), the host function server (HTTP bridge for custom tools), and sub-agent routing. The LLM client is pluggable — you provide any implementation of the `LLMClient` protocol. It routes sub-agent calls through flat lookup tables so agents at any nesting depth resolve to the correct session.

```python
import asyncio
from openrlm import build_runtime

async def main():
    runtime = build_runtime(model="openai/gpt-5.2")

    async with runtime:
        session = await runtime.create_session("my-session")
        # One message in, one result out. The core handles the full LLM loop.
        result = await session.run_single("Compute the first 20 prime numbers")
        print(result)
        await runtime.close_session("my-session")

asyncio.run(main())
```

The fork server pre-imports expensive packages once, calls `gc.freeze()`, then forks a child process for each sandbox. Children share pre-imported module memory via OS-level copy-on-write. Host functions registered on the caller side are injected as async stubs into each sandbox — the code inside calls `await my_function(...)` and it transparently round-trips to the host via HTTP.
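
Conceptually, an injected stub looks something like this (a sketch; the endpoint URL, payload shape, and injection mechanism are assumptions for illustration, not openrlm's actual protocol):

```python
import aiohttp

HOST_BRIDGE_URL = "http://127.0.0.1:8765/call"   # assumed host-function bridge endpoint

def make_stub(fn_name: str):
    async def stub(**kwargs):
        # Forward the call and its keyword arguments to the host, return the result.
        async with aiohttp.ClientSession() as http:
            async with http.post(HOST_BRIDGE_URL, json={"name": fn_name, "kwargs": kwargs}) as resp:
                payload = await resp.json()
                return payload["result"]
    stub.__name__ = fn_name
    return stub

# Once injected into the sandbox namespace, agent code simply writes:
#   result = await my_search(query="...")
my_search = make_stub("my_search")
```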

**Two execution modes** (both use the same TCP-based protocol):

- **Local (default):** Fork server runs as a subprocess. No Docker required. The workspace directory defaults to cwd — files agents create appear on your filesystem.
- **Docker:** Fork server runs inside a container for isolation. Host directories are exposed via bind mounts. Use `--image` to enable.

### Built-in Agent Harness: multi-turn state management

For multi-turn conversations, call `run_single()` repeatedly on the same session. The harness manages what accumulates between turns:

- **Message history.** Each `run_single()` appends the user message and final assistant response. REPL state (variables, imports, computed results) also persists.
- **Message compression.** All messages from previous turns are preserved, but tool outputs are truncated to 20 lines / 1 KB (see the sketch after this list). The current turn retains full tool call detail. The complete uncompressed history is available inside the REPL as `_conversation_history`.
- **Cancellation.** Cancelling a turn (via `asyncio.CancelledError` or Ctrl-C) rolls back message history to the last consistent checkpoint. Sub-agent tasks are cancelled transitively.
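
A hypothetical sketch of that truncation rule (the limits mirror the description above; the marker text is illustrative):

```python
# Clip a prior-turn tool output to at most 20 lines and 1 KB, noting how much was dropped.
MAX_LINES, MAX_BYTES = 20, 1024

def compress_tool_output(text: str) -> str:
    clipped = "\n".join(text.splitlines()[:MAX_LINES])
    clipped = clipped.encode()[:MAX_BYTES].decode(errors="ignore")
    if clipped == text:
        return text
    return clipped + f"\n... [truncated {len(text) - len(clipped)} characters]"
```

In a multi-turn session, REPL state and (compressed) history carry across turns: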

```python
async with runtime:
    session = await runtime.create_session("analysis")
    # Turn 1: agent loads data, stores DataFrame in a REPL variable
    response = await session.run_single("Load data.csv and show me the column names")
    print(response)  # "The file has columns: date, product, price, volume ..."

    # Turn 2: agent reuses the loaded DataFrame — no re-reading needed
    response = await session.run_single("What's the correlation between price and volume?")
    print(response)  # "The Pearson correlation is 0.73 ..."

    # Turn 3: agent builds on all prior computed state
    response = await session.run_single("Plot the top 5 outliers and save to outliers.png")
    print(response)  # "Saved outliers.png with 5 data points highlighted ..."
    await runtime.close_session("analysis")
```

The caller provides user messages and consumes response strings. Everything else — message accumulation, compression, tool execution, history sync — happens inside the Session. Each Session is independent; multiple Sessions can run concurrently on the same Runtime.
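
Two independent sessions running concurrently on one runtime (a sketch using the API shown above; the file names are illustrative):

```python
import asyncio

async with runtime:
    sales = await runtime.create_session("sales")
    inventory = await runtime.create_session("inventory")

    # Each session has its own REPL and message history; they share only the Runtime.
    sales_summary, inventory_summary = await asyncio.gather(
        sales.run_single("Summarize sales.csv"),
        inventory.run_single("Summarize inventory.csv"),
    )
```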

#### Custom Harness and Agent Implementations

If you need to manage message history yourself — injecting context between turns, forking conversations, external history storage — use `session.run_turn(messages, user_message)` instead of `run_single`. You construct the message list starting with `session.system_message`, pass it to each turn, and freely modify it between turns. The engine borrows the list during a turn and returns it enriched. `run_single` is a convenience wrapper that uses an engine-internal list.

```python
async with runtime:
    session = await runtime.create_session("analysis")
    messages = [session.system_message]

    result = await session.run_turn(messages, "Load data.csv")
    print(result)

    # Inject context between turns
    messages.append({"role": "user", "content": "(Note: focus on Q4 data)"})
    messages.append({"role": "assistant", "content": "Understood."})

    result = await session.run_turn(messages, "Summarize revenue trends")
    print(result)
```

## Installation

```bash
pip install openrlm
# or
uv pip install openrlm
```

To use the bundled internet search/extract tools:

```bash
pip install openrlm[contrib]
```

To use Docker mode, build the sandbox image:
```bash
openrlm --build-image
```

This builds `openrlm:sandbox`, using `sandbox-deps.txt` from the current directory (if present) for additional sandbox dependencies. To customize:

```bash
# Custom tag
openrlm --build-image my-image:latest

# Custom dependencies file
openrlm --build-image --sandbox-deps my-deps.txt
```

## Quickstart

### CLI

```bash
# Single message (local mode, default)
openrlm "compute the first 20 prime numbers"

# Interactive session
openrlm

# With a specific model (routed through OpenRouter)
openrlm --model anthropic/claude-sonnet-4-5 "explain main.py"

# With custom tools
openrlm --functions ./my-tools "use my_search to find X"

# With bundled contrib tools (requires PARALLEL_API_KEY)
openrlm --functions ./contrib "search for recent advances in fusion energy"

# Docker mode
openrlm --image openrlm:sandbox "analyze data"

# JSON output for programmatic use
openrlm --json "compute pi to 50 digits" | jq .result

# With conversation context from a prior session
openrlm --context history.json "continue the analysis"
```

### Library
`build_runtime()` is the main entry point for programmatic use. It handles LLM client selection, API key resolution, and host function loading — the same wiring the CLI does internally. Its keyword arguments correspond to the CLI flags:

```python
import asyncio
from openrlm import build_runtime

def my_handler(event):
    """Minimal event callback; see the Event Streaming section for richer handling."""
    print(event)

async def main():
    runtime = build_runtime(
        provider="anthropic",
        model="claude-sonnet-4-5",
        functions=["./my-tools"],
    )

    async with runtime:
        session = await runtime.create_session("s1", on_event=my_handler)
        result = await session.run_single("analyze this dataset")
        print(result)

        # Same session, same REPL state — variables from turn 1 persist
        result = await session.run_single("now visualize the outliers")
        print(result)

        await runtime.close_session("s1")

asyncio.run(main())
```

When you need more control — a custom `LLMClient` implementation, programmatic host function registration, or non-default `AgentConfig` settings — construct the `AgentRuntime` directly:

```python
from openrlm import (
    AgentRuntime, AgentConfig, HostFunctionRegistry,
    AnthropicClient, default_api_key_resolver,
)

registry = HostFunctionRegistry()
registry.register("my_tool", my_async_function)

resolver = default_api_key_resolver()
config = AgentConfig(
    model="claude-sonnet-4-5-20250514",
    get_api_key=lambda: resolver("anthropic"),
    max_tool_rounds=100,
    max_sub_agent_depth=5,
)

runtime = AgentRuntime(config, registry, llm_client=AnthropicClient())
```

This is what `build_runtime` does internally. See the [LLM Client](#llm-client) section for implementing custom providers.

### Custom Host Functions
Define tools that execute on the host but appear as regular async functions inside the agent's REPL.

#### Library usage
For functions loaded from module files, `build_runtime` handles registration:

```python
runtime = build_runtime(functions=["my_tools.py", "./more-tools/"])
```

For programmatic registration (e.g., closures that capture application state), create the registry directly:

```python
import json
from openrlm import AgentRuntime, AgentConfig, HostFunctionRegistry

async def my_database_query(sql: str, limit: int = 100) -> str:
    """Execute a SQL query against the application database.
    Use this function to make DB queries, like result = await my_database_query(sql="SELECT * FROM users", limit=10)
    Returns results as a JSON string."""
    # `db` is assumed to be an async database handle captured from the surrounding application scope
    results = await db.execute(sql, limit=limit)
    return json.dumps(results)

registry = HostFunctionRegistry()
registry.register("my_database_query", my_database_query)

# The registry is passed to the runtime, which injects the functions into every agent's REPL
runtime = AgentRuntime(config, registry, llm_client=client)
```

Inside the agent's REPL, the function becomes callable as:

```python
result = await my_database_query(sql="SELECT * FROM users", limit=10)
```

The function's type hints and docstring are picked up automatically — Pydantic builds a JSON schema from the signature for the system prompt, and the docstring becomes the description the LLM sees. No separate schema definitions needed.
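
Roughly how such a schema can be derived from a signature (a general-technique sketch, not openrlm's internal code):

```python
import inspect
from pydantic import create_model

def function_schema(fn) -> dict:
    """Describe fn's parameters as a JSON schema, plus its docstring."""
    fields = {}
    for name, param in inspect.signature(fn).parameters.items():
        default = ... if param.default is inspect.Parameter.empty else param.default
        fields[name] = (param.annotation, default)   # (type, default); `...` marks a required field
    args_model = create_model(f"{fn.__name__}_args", **fields)
    return {
        "name": fn.__name__,
        "description": inspect.getdoc(fn) or "",
        "parameters": args_model.model_json_schema(),
    }

# function_schema(my_database_query) yields "sql" as a required string and
# "limit" as an integer defaulting to 100, ready to embed in the system prompt.
```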

#### CLI usage (`--functions`)

When using the CLI, you don't create a registry yourself — the CLI creates one and needs a way to discover your functions. You provide a Python file (or directory of files) that exports a `register(registry)` function. The CLI calls it, passing its own `HostFunctionRegistry` instance:

```python
# my_tools.py
import json

async def my_database_query(sql: str, limit: int = 100) -> str:
    """Execute a SQL query against the application database.

    Returns results as a JSON string."""
    results = await db.execute(sql, limit=limit)
    return json.dumps(results)

def register(registry):
    """Called by the CLI with its HostFunctionRegistry. Register your functions here."""
    registry.register("my_database_query", my_database_query)
```

Then:

```bash
openrlm --functions my_tools.py "show me the top 10 users"
```

For a directory of tool files, each `.py` file with a `register()` function is loaded automatically (files starting with `_` are skipped):

```bash
openrlm --functions ./my-tools/ "analyze the data"
```
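
Conceptually, the directory loading amounts to something like this (a sketch of the rule described above, not the CLI's actual loader):

```python
import importlib.util
from pathlib import Path

def load_tool_directory(path: str, registry) -> None:
    for file in sorted(Path(path).glob("*.py")):
        if file.name.startswith("_"):           # underscore-prefixed files are skipped
            continue
        spec = importlib.util.spec_from_file_location(file.stem, file)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        if hasattr(module, "register"):         # only modules exporting register() contribute tools
            module.register(registry)
```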

You can also use a dotted module name for installed packages:

```bash
openrlm --functions my_package.tools "do something"
```

### Event Streaming

Monitor agent activity with event callbacks. Events from sub-agents at any depth flow through the same callback, distinguished by `agent_id`:

```python
from openrlm import build_runtime, EventCallback
from openrlm.events import RoundStart, ToolExecEnd, TurnEnd

def on_event(event):
    match event:
        case RoundStart(agent_id=aid, round_num=n):
            print(f"[{aid}] Round {n}")
        case ToolExecEnd(agent_id=aid, elapsed_seconds=t):
            print(f"[{aid}] Tool execution: {t:.1f}s")
        case TurnEnd(agent_id=aid, rounds=r, prompt_tokens=pt, completion_tokens=ct):
            print(f"[{aid}] Done in {r} rounds, {pt}+{ct} tokens")

async with runtime:
    session = await runtime.create_session("s1", on_event=on_event)
    await session.run_single("analyze this dataset")
```

The `on_event` parameter accepts any `EventCallback` (`Callable[[AgentEvent], None]`). For multiple consumers or async I/O, use `EventBus`:

```python
from openrlm import EventBus

bus = EventBus()
bus.add_listener(tui.update_panel)       # sync: immediate UI update
bus.add_listener(metrics.record_event)   # sync: bookkeeping

# Async consumers get an independent stream
stream = bus.stream(maxsize=256)

session = await runtime.create_session("s1", on_event=bus.callback)

# Consume asynchronously in a background task
async def push_events():
    async for event in stream:
        await websocket.send(serialize(event))
asyncio.create_task(push_events())

result = await session.run_single("analyze data")
bus.close()  # terminates async iteration
```

Each listener and stream is independent — a slow or failing consumer does not affect the engine or other consumers.

## Sub-agents

Agents can spawn sub-agents programmatically from within the REPL:

```python
# Create a sub-agent with specific instructions
agent_id = await create_agent(instructions="You are a citation specialist")

# Start a task (non-blocking — runs in the background)
task_id = await run_agent(agent_id=agent_id, task="Research citation percentiles for federal courts")

# Do other work while sub-agent runs...

# Collect the result
result = await await_result(task_id)
```

Sub-agents can themselves spawn sub-agents, enabling recursive decomposition. Each sub-agent has:
- Its own isolated IPython namespace
- Its own conversation history with the LLM
- Access to the same host functions and shared workspace directory
- A per-agent lock that serializes concurrent tasks on the same sub-agent

**Persistent sub-agents.** A sub-agent created with `create_agent` persists across multiple `run_agent` calls. Each task appends a new user message and runs a full agent turn, so the sub-agent sees its full prior conversation and retains all REPL state (variables, imports, computed data) from previous tasks. This makes sub-agents useful as persistent specialists:

```python
analyst = await create_agent(instructions="You are a data analyst")

t1 = await run_agent(agent_id=analyst, task="Load sales.csv and compute monthly totals")
await await_result(t1)

# The analyst still has the loaded data and computed totals in its REPL
t2 = await run_agent(agent_id=analyst, task="Now find the month-over-month growth rate")
growth = await await_result(t2)
```

The maximum recursion depth is configurable (default: 10 levels).

## Configuration

### `AgentConfig`

`build_runtime()` constructs an `AgentConfig` internally from its keyword arguments. Direct `AgentConfig` construction is only needed when building the `AgentRuntime` manually.

| Parameter | Default | Description |
|---|---|---|
| `model` | `"openai/gpt-5.2"` | Model identifier |
| `sandbox_image` | `None` | Docker image tag; `None` for local mode |
| `code_timeout` | `3600.0` | Code execution timeout in seconds |
| `max_tool_rounds` | `50` | Max LLM-tool iterations per turn |
| `max_sub_agent_depth` | `10` | Max recursive sub-agent depth |
| `output_limit_lines` | `2000` | Truncate tool output beyond this many lines |
| `output_limit_bytes` | `50000` | Truncate tool output beyond this many bytes |
| `temperature` | `None` | LLM sampling temperature |
| `system_prompt` | `None` | Override the default system prompt (format string with `{functions_json}`, `{workspace_path}`, `{spool_path}` placeholders) |
| `get_api_key` | `None` | `Callable[[], Awaitable[str]]` that returns an API key; required when using `AgentRuntime` |
| `sandbox_binds` | `{}` | Host-to-container directory mounts (Docker mode) |
| `task_preview_chars` | `12000` | Max characters of a sub-agent task shown in system prompt previews |
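
For example, `system_prompt` is a plain format string that keeps the placeholders listed above (the prompt wording here is purely illustrative):

```python
from openrlm import AgentConfig, default_api_key_resolver

resolver = default_api_key_resolver()

CUSTOM_PROMPT = (
    "You are a data-analysis agent.\n"
    "Host functions available to you:\n{functions_json}\n"
    "Your workspace directory is {workspace_path}; the spool directory is {spool_path}.\n"
)

config = AgentConfig(
    model="openai/gpt-5.2",
    get_api_key=lambda: resolver("openrouter"),
    system_prompt=CUSTOM_PROMPT,
)
```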

### API Key Resolution

`AgentConfig.get_api_key` is caller-provided. The bundled `default_api_key_resolver()` checks these sources in order:

1. **Auth file** (`~/.openrlm/auth.json`, override with `OPENRLM_AUTH_FILE`): a JSON object mapping provider names to keys (see the example below).
2. **`ANTHROPIC_OAUTH_TOKEN`** (Anthropic only, legacy compatibility).
3. **Provider-specific environment variable:**

   | Provider | Environment Variable |
   |---|---|
   | `openrouter` | `OPENROUTER_API_KEY` |
   | `anthropic` | `ANTHROPIC_API_KEY` |
   | `openai` | `OPENAI_API_KEY` |
   | `google` | `GEMINI_API_KEY` |
   | `groq` | `GROQ_API_KEY` |
   | `xai` | `XAI_API_KEY` |
   | `mistral` | `MISTRAL_API_KEY` |
   | `openai-codex` | `OPENAI_CODEX_TOKEN` |
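
An auth file for source 1 above might look like this (the key values shown are placeholders):

```json
{
  "openrouter": "your-openrouter-key",
  "anthropic": "your-anthropic-key"
}
```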

A `.env` file in the current directory is loaded automatically.

For the bundled contrib tools (`internet_search`, `internet_extract`), set `PARALLEL_API_KEY`.

### CLI Flags

```
openrlm [message] [options]

positional:
  message                 User message (omit for interactive session)

options:
  --model MODEL           Model identifier (default: openai/gpt-5.2)
  --provider PROVIDER     LLM provider (default: openrouter)
  --image IMAGE           Docker image tag for sandbox (omit for local mode)
  --timeout SECONDS       Code execution timeout (default: 3600)
  --max-rounds N          Max tool loop iterations (default: 50)
  --functions PATH        Directory, .py file, or dotted module name (comma-separated)
  --workspace DIR         Working directory shared with agents (default: cwd)
  --context FILE          JSON file with conversation history to prepend
  --json                  Output result as JSON object
  --verbose               Enable debug logging
  --env-file PATH         Path to .env file (default: .env)
  --log-file PATH         Log file path (default: ~/Downloads/openrlm.log)
  --build-image [TAG]     Build Docker sandbox image and exit (default: openrlm:sandbox)
  --sandbox-deps FILE     Dependencies file for --build-image (default: sandbox-deps.txt)
  --reasoning-effort E    Reasoning effort for Codex models: none, minimal, low, medium, high, xhigh (default: medium)
  --text-verbosity V      Text verbosity for Codex models: low, medium, high (default: medium)
```

#### `--context FILE`

Prepends conversation history after the system prompt. The file must contain a JSON array of messages:

```json
[
  {"role": "user", "content": "I'm analyzing sales data"},
  {"role": "assistant", "content": "I see the file has 10k rows with date, product, and revenue columns."}
]
```

Only `"user"` and `"assistant"` roles are allowed. This is useful for context bridging from an outer harness — pass a filtered conversation history so the agent understands what's been discussed.

#### `--json`

Wraps the result in a JSON object for programmatic consumption. Only valid with a message argument (not interactive mode).

```json
// Success
{"result": "The answer is 42", "error": null}

// Failure
{"result": null, "error": "No API key for provider 'openrouter'. Set the OPENROUTER_API_KEY environment variable or add it to ~/.openrlm/auth.json."}
```

Exactly one of `result` or `error` is non-null.
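
A minimal way to consume this output from Python:

```python
import json
import subprocess

proc = subprocess.run(
    ["openrlm", "--json", "compute pi to 50 digits"],
    capture_output=True,
    text=True,
)
payload = json.loads(proc.stdout)
if payload["error"] is not None:
    raise RuntimeError(payload["error"])
print(payload["result"])
```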

#### Interactive mode

- **Ctrl-C** cancels the active turn and returns to the prompt.
- **Double Ctrl-C** exits the session.

## Bundled Tools

The `contrib/` directory includes two pre-built host functions that use the [Parallel API](https://www.parallel.ai/):

- **`internet_search`** — Search the web and return relevant excerpts with source URLs.
- **`internet_extract`** — Fetch a web page or PDF and return its content as markdown.

Both require `PARALLEL_API_KEY` in the environment and the `contrib` extra (`pip install openrlm[contrib]`).

```bash
openrlm --functions ./contrib "search for recent papers on transformer efficiency"
```

## LLM Client

openrlm ships with two bundled client implementations:
- **OpenRouterClient** — routes to models from OpenAI, Anthropic, Google, and others through a single API.
- **AnthropicClient** — calls the Anthropic API directly.

`build_runtime(provider="anthropic")` or `build_runtime(provider="openrouter")` selects the appropriate client automatically. Manual client construction is only needed for custom `LLMClient` implementations.

To implement a custom provider:

```python
import asyncio

from openrlm import AgentRuntime, AgentConfig, HostFunctionRegistry
from openrlm import (
    LLMClient, CompletionResponse, CompletionChoice, CompletionMessage,
    TokenUsage, default_api_key_resolver,
)

class MyCustomClient:
    """Example: implement LLMClient for a provider not built in."""

    async def complete(self, messages, *, api_key, **kwargs) -> CompletionResponse:
        # Call your provider's API, then translate the response:
        return CompletionResponse(
            model="my-model",
            choices=[CompletionChoice(
                message=CompletionMessage(content="...", tool_calls=None),
                finish_reason="stop",
            )],
            usage=TokenUsage(prompt_tokens=0, completion_tokens=0),
        )

    async def close(self) -> None:
        pass  # Release any resources

async def main():
    client = MyCustomClient()
    resolver = default_api_key_resolver()
    config = AgentConfig(get_api_key=lambda: resolver("my-provider"))
    runtime = AgentRuntime(config, HostFunctionRegistry(), llm_client=client)

    async with runtime:
        session = await runtime.create_session("s1")
        result = await session.run_single("What is 2 + 2?")
        print(result)
        await runtime.close_session("s1")
    # Runtime closes the LLM client on exit.

asyncio.run(main())
```

## Development

```bash
# Clone and install
git clone https://github.com/mailshanx/openrlm
cd openrlm
uv sync

# Run tests (requires Docker for full suite)
uv run python tests/test_e2e.py

# Build sandbox image (for Docker mode tests)
openrlm --build-image
```

## License

MIT
