Metadata-Version: 2.4
Name: interactive-arc
Version: 0.1.2
Summary: Interactive-ARC benchmark for evaluating LLMs on ARC tasks
License-Expression: MIT
License-File: LICENSE
Requires-Python: >=3.12
Requires-Dist: anthropic>=0.40
Requires-Dist: boto3>=1.35
Requires-Dist: click>=8.1
Requires-Dist: hypothesis>=6.0
Requires-Dist: jinja2>=3.1
Requires-Dist: matplotlib>=3.8
Requires-Dist: openai>=1.0
Requires-Dist: pydantic>=2.0
Requires-Dist: python-dotenv>=1.0
Requires-Dist: tenacity>=8.0
Description-Content-Type: text/markdown

# Interactive-ARC

An interactive benchmark for evaluating LLM abstract reasoning on [ARC-AGI](https://arcprize.org/) tasks.

Instead of producing output grids directly, models construct solutions step by step using tool calls. Every action is recorded, producing interpretable reasoning traces that reveal *how* a model solves a task, not just *whether* it does.

## Features

- **Interactive evaluation**: models build solutions incrementally using grid-editing tools
- **Full action traces**: every tool call, grid state, and token count is recorded
- **1,120 public tasks**: ships with ARC-AGI-1 (800) and ARC-AGI-2 (320)
- **Multiple providers**: supports Anthropic, Amazon Bedrock, and any OpenAI-compatible endpoint (including vLLM)
- **Concurrent execution**: evaluates tasks in parallel with configurable concurrency
- **Checkpointing**: interrupted runs resume from where they left off

## Installation

```bash
pip install interactive-arc
```

Requires Python 3.12+.

## Quick Start

### With a cloud provider

```bash
# Anthropic
export ANTHROPIC_API_KEY=your-key
interactive-arc run --provider anthropic --model claude-sonnet-4-20250514

# Amazon Bedrock (uses default AWS credentials)
interactive-arc run --provider bedrock --model anthropic.claude-sonnet-4-20250514-v1:0
```

### With a local model (vLLM, Ollama, etc.)

```bash
interactive-arc run \
    --provider openai \
    --base-url http://localhost:8000/v1 \
    --model Qwen/Qwen3.6-27B \
    --dataset arc-agi-1 \
    --split training \
    --sample 50 --seed 42
```

### Inspect a single task

```bash
interactive-arc task --task-id 08ed6ac7 --provider anthropic --model claude-sonnet-4-20250514
```

## CLI Reference

```
interactive-arc run [OPTIONS]
```

| Option | Default | Description |
|--------|---------|-------------|
| `--dataset` | `arc-agi-1` | Dataset (`arc-agi-1` or `arc-agi-2`) |
| `--split` | `training` | Split (`training` or `evaluation`) |
| `--provider` | `bedrock` | LLM provider (`anthropic`, `bedrock`, `openai`) |
| `--model` | | Model identifier |
| `--base-url` | | Base URL for OpenAI-compatible endpoints |
| `--renderer` | `text` | Grid format sent to model (`text`, `json`, `markdown`) |
| `--sample` | all | Number of tasks to sample |
| `--seed` | | Random seed for reproducible sampling |
| `--output` | | Path for summary statistics JSON |
| `--traces` | `./traces` | Directory for full trace files |
| `--max-attempts` | `2` | Submission attempts per task (1-10) |
| `--enabled-tools` | all | Comma-separated subset of tools to enable |
| `--grid-feedback` | `both` | Grid state shown after actions (`both`, `output`, `none`) |
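
The interplay of `--sample` and `--seed` can be pictured with a short sketch. It is an illustration only, not the package's own code: `sample_task_ids` is a hypothetical helper, and drawing from a private `random.Random(seed)` is an assumption about how reproducible sampling is commonly done.

```python
import random


def sample_task_ids(task_ids, sample=None, seed=None):
    """Hypothetical helper: pick `sample` task IDs reproducibly."""
    ids = sorted(task_ids)  # sort so the result does not depend on input order
    if sample is None or sample >= len(ids):
        return ids  # mirrors the documented default of using all tasks
    rng = random.Random(seed)  # private generator; global random state is untouched
    return rng.sample(ids, sample)


ids = [f"task{i:03d}" for i in range(100)]
first = sample_task_ids(ids, sample=5, seed=42)
second = sample_task_ids(reversed(ids), sample=5, seed=42)
print(first == second)  # same seed and same task set give the same subset
```

Under these assumptions, two runs with the same seed evaluate the same subset, so their results can be compared task by task.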

## Tools

Models interact with the grid through these tools:

| Tool | Description |
|------|-------------|
| `set_cell(x, y, color)` | Set a single cell |
| `set_width(width)` | Resize grid width |
| `set_height(height)` | Resize grid height |
| `flood_fill(x, y, color)` | Fill connected region |
| `copy_input()` | Copy test input to output grid |
| `copy_region(x, y, w, h)` | Copy a rectangular region to clipboard |
| `paste_region(x, y)` | Paste clipboard at position |
| `undo()` | Undo last operation |
| `reset()` | Reset grid to initial state |
| `submit(explanation)` | Submit current grid as answer |
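
As a rough picture of what `flood_fill(x, y, color)` does, the sketch below recolors the region connected to a starting cell. It is not the package's implementation: it assumes 4-connectivity (no diagonals), that `x` is the column and `y` is the row, and that the grid is a list of rows of integer colors.

```python
from collections import deque


def flood_fill(grid, x, y, color):
    """Recolor the 4-connected region containing (x, y), in place."""
    height, width = len(grid), len(grid[0])
    target = grid[y][x]  # assumption: x indexes columns, y indexes rows
    if target == color:
        return grid  # nothing to change; also avoids an endless loop
    queue = deque([(x, y)])
    grid[y][x] = color
    while queue:
        cx, cy = queue.popleft()
        # Visit the four orthogonal neighbours of the current cell.
        for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
            if 0 <= nx < width and 0 <= ny < height and grid[ny][nx] == target:
                grid[ny][nx] = color
                queue.append((nx, ny))
    return grid


grid = [
    [0, 0, 1],
    [0, 1, 0],
    [1, 0, 0],
]
flood_fill(grid, 0, 0, 5)
# grid is now [[5, 5, 1], [5, 1, 0], [1, 0, 0]]
```

The zeros in the bottom-right corner stay untouched because they touch the filled region only diagonally. With 8-connectivity they would be filled as well.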

## Python API

```python
from interactive_arc.environment.loader import TaskLoader
from interactive_arc.environment.tools import ToolExecutor
from interactive_arc.agent.loop import AgentLoop
from interactive_arc.agent.providers.anthropic import AnthropicLLM
from interactive_arc.agent.renderers.text_renderer import TextRenderer

# Load a task
loader = TaskLoader("arc-agi-1", "training")
task = loader.load_task("08ed6ac7")

# Create an agent and solve
llm = AnthropicLLM(model="claude-sonnet-4-20250514")
loop = AgentLoop(task=task, llm=llm, renderer=TextRenderer())
result = loop.run()

print(f"Solved: {result.success}")
print(f"Actions: {result.total_tool_calls}")
```

## Output

Each run produces:

- **Summary JSON**: success rate, action efficiency, token usage, cost estimates
- **Trace files**: one JSON per task with the full interaction history (every tool call, grid state, LLM response, and timestamps)

## Architecture

The codebase follows a three-layer architecture with strict one-directional dependencies:

1. **Environment**: grid state machine, tool execution, task loading
2. **Agent**: multi-turn LLM interaction loop, provider adapters, grid renderers
3. **Runner**: concurrent orchestration, checkpointing, metrics, CLI

Components are swappable via Protocol classes. Adding a new LLM provider or grid renderer requires implementing a single interface with no changes to other layers.
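
A minimal sketch of the Protocol-based design, using grid renderers as the example. All names here (`GridRenderer`, `PlainTextRenderer`, `JsonRenderer`, `build_prompt`) are hypothetical and chosen for illustration; the package's actual interfaces may look different.

```python
import json
from typing import Protocol

Grid = list[list[int]]


class GridRenderer(Protocol):
    """Hypothetical interface; the package's real Protocol may differ."""

    def render(self, grid: Grid) -> str: ...


class PlainTextRenderer:
    """Renders each row as space-separated color values."""

    def render(self, grid: Grid) -> str:
        return "\n".join(" ".join(str(cell) for cell in row) for row in grid)


class JsonRenderer:
    """Renders the grid as a JSON array of arrays."""

    def render(self, grid: Grid) -> str:
        return json.dumps(grid)


def build_prompt(grid: Grid, renderer: GridRenderer) -> str:
    # The caller depends only on the Protocol, never on a concrete class.
    return "Current grid:\n" + renderer.render(grid)


text_prompt = build_prompt([[0, 1], [2, 3]], PlainTextRenderer())
json_prompt = build_prompt([[0, 1], [2, 3]], JsonRenderer())
```

Neither renderer inherits from `GridRenderer`. Structural typing means any class with a matching `render` method satisfies the interface, which is what lets a new component slot in without edits elsewhere.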

## Development

```bash
git clone https://github.com/interactive-arc/interactive-arc.git
cd interactive-arc
uv sync --dev
uv run pytest tests/
uv run ruff check src/ tests/
```

## License

MIT
