Metadata-Version: 2.3
Name: autoanything
Version: 0.1.0
Summary: Automaxxing for AI agent swarms
Requires-Dist: click>=8.0
Requires-Dist: fastapi>=0.115.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: requests>=2.32.0
Requires-Dist: matplotlib>=3.10
Requires-Dist: rich>=13.0
Requires-Dist: uvicorn>=0.34.0
Requires-Dist: httpx>=0.27 ; extra == 'dev'
Requires-Dist: pytest>=7.0 ; extra == 'dev'
Requires-Dist: kernels>=0.11.7 ; extra == 'gpu'
Requires-Dist: numpy>=2.2.6 ; extra == 'gpu'
Requires-Dist: pandas>=2.3.3 ; extra == 'gpu'
Requires-Dist: pyarrow>=21.0.0 ; extra == 'gpu'
Requires-Dist: rustbpe>=0.1.0 ; extra == 'gpu'
Requires-Dist: tiktoken>=0.11.0 ; extra == 'gpu'
Requires-Dist: torch==2.9.1 ; extra == 'gpu'
Requires-Dist: anthropic>=0.45 ; extra == 'llm'
Requires-Dist: openai>=1.60 ; extra == 'llm'
Requires-Python: >=3.10
Provides-Extra: dev
Provides-Extra: gpu
Provides-Extra: llm
Description-Content-Type: text/markdown

# AutoAnything

This project started from Andrej Karpathy's [autoresearch](https://github.com/karpathy/autoresearch) — a single AI agent in a loop, optimizing a GPT training script against validation bits-per-byte on one GPU. The agent would modify `train.py`, run training for five minutes, check the score, and keep the change if it improved. Simple evolutionary search powered by an LLM instead of random mutations.

The insight wasn't the ML part. It was the loop: propose a change, score it against an objective function, keep it if it's better, throw it away if it's not. That loop works for anything with a number you can measure. The code changes are the mutations, the scoring function is the fitness landscape, and the LLM is a mutation operator that actually understands what it's changing.
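
The loop is small enough to sketch in a few lines. The following illustrates the idea only and is not AutoAnything's implementation: `ratchet`, `propose`, `sphere`, and `perturb` are hypothetical names, the objective is a toy function, and a random perturbation stands in for the LLM.

```python
import random
from typing import Callable, List, Tuple

State = List[float]


def ratchet(
    state: State,
    score: Callable[[State], float],
    propose: Callable[[State], State],
    iterations: int,
) -> Tuple[State, float]:
    """Keep a proposal only if it strictly lowers the score (minimization)."""
    best = score(state)
    for _ in range(iterations):
        candidate = propose(state)
        candidate_score = score(candidate)
        if candidate_score < best:  # improvement: keep it
            state, best = candidate, candidate_score
        # otherwise the candidate is simply thrown away
    return state, best


def sphere(x: State) -> float:
    """Toy objective: sum of squares, minimized at the origin."""
    return sum(v * v for v in x)


def perturb(x: State) -> State:
    """Stand-in mutation operator; an LLM would make an informed edit instead."""
    return [v + random.gauss(0, 0.5) for v in x]


if __name__ == "__main__":
    random.seed(0)
    start = [3.0, -2.0, 1.0]
    final, final_score = ratchet(start, sphere, perturb, iterations=200)
    print(f"start={sphere(start):.3f} final={final_score:.3f}")
```

Because a candidate is accepted only when it beats the incumbent, the best score can never get worse, which is the ratchet behaviour shown in the charts.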

**AutoAnything** generalizes that loop. The mutable state can be any file. The scoring function can be any program that outputs a number. And agents can be anything that can `git push` — Claude Code, Codex, Cursor, a human with vim, a shell script. You define the scoring function and a direction. AutoAnything is just the plumbing.

| GPT Training (val BPB) | Rastrigin Function (10-D) |
|:---:|:---:|
| ![GPT training optimization](images/progress.png) | ![Rastrigin function minimization](images/test_progress_rastrigin.png) |
| **Traveling Salesman (20 cities)** | **Rectangle Packing (12 rects)** |
| ![TSP route optimization](images/test_progress_tsp.png) | ![Rectangle packing optimization](images/test_progress_packing.png) |

*Each chart shows the same pattern: agents propose changes (grey dots), the evaluator keeps only improvements (green dots), and the best score ratchets monotonically in one direction.*

## Install

> **Note:** The repository and Python package are called `autoanything`. The CLI command it installs is `maxx`.

```bash
uv tool install autoanything
```

For development (editable install):

```bash
git clone https://github.com/kousun12/autoanything
cd autoanything
uv sync
```

## Quick start

Try an example problem in one command:

```bash
maxx try fib              # built-in demo agent, generates progress chart
maxx try rastrigin --claude  # use Claude as the agent
```

Or create your own problem:

```bash
# Create a new problem
maxx init my-problem --direction minimize
cd my-problem

# Edit the scaffolded files
#   problem.yaml       — describe the problem
#   state/             — set up the initial mutable state (any files)
#   scoring/score.py   — implement your score() function

# Check everything is wired up
maxx validate

# Run scoring once as a sanity check
maxx score

# Run a local optimization loop — single machine, one agent
maxx run -a "claude -p 'read agent_instructions.md and improve the solution'"
```

The `run` command handles everything: it runs your agent, scores the result, keeps improvements, updates the leaderboard, and loops. The scoring directory is hidden from the agent during execution. This is the simplest way to use AutoAnything.

To scale up to multiple agents submitting concurrently, use the evaluator:

```bash
maxx evaluate --baseline-only   # establish baseline
maxx evaluate                   # start evaluation loop (watches for proposal branches)
```

## How it works

```mermaid
flowchart TD
    read["Read problem, leaderboard, context"] --> edit["Modify state/ files"]
    edit --> push["Push proposal branch"]
    push --> eval["Evaluator picks up branch"]
    eval --> score["Run score.py"]
    score --> check{"Improved?"}
    check -- Yes --> merge["Merge to main + update leaderboard"]
    check -- No --> discard["Discard branch"]
    merge --> read
    discard --> read
```

**Agents** clone the repo, read the problem definition and leaderboard, modify the mutable files, and push a branch (`proposals/<name>/<description>`) or open a PR. They never see the scoring code.

**The evaluator** watches for new branches or PRs, scores them one at a time (serial queue), and either merges (if improved) or discards/closes. The scoring code, test data, and history DB are all private (gitignored).

## Problem structure

A problem is a self-contained directory (typically its own git repo):

```
my-problem/
├── problem.yaml            # Problem definition + framework config
├── agent_instructions.md   # Protocol for agents (generated by init)
├── state/                  # Mutable files — agents can create, modify, or delete
│   └── ...                 # Any files; the scoring function decides how to read them
├── context/                # Read-only background for agents
├── scoring/                # GITIGNORED — private scoring code
│   └── score.py            # Implement score() → dict
├── leaderboard.md          # Auto-updated by the evaluator
└── .autoanything/          # GITIGNORED — local evaluator state
    └── history.db          # SQLite evaluation history
```

```mermaid
flowchart LR
    subgraph visible ["Agents See"]
        A["problem.yaml"]
        B["state/"]
        C["context/"]
        D["leaderboard.md"]
    end
    subgraph hidden ["Agents Never See"]
        E["scoring/score.py"]
        F[".autoanything/history.db"]
    end
    B -->|"proposals"| E
    E -->|"results"| D
```

The `scoring/` directory is never committed — it exists only on the evaluation machine. Agents see the metric name and direction (from `problem.yaml`) and other agents' scores (from `leaderboard.md`), but never the scoring implementation.

## CLI reference

| Command | Description |
|---------|-------------|
| `maxx try <problem>` | Try an example problem (demo agent, generates chart) |
| `maxx try <problem> --claude` | Try an example with Claude as the agent |
| `maxx init <name>` | Scaffold a new problem directory |
| `maxx validate` | Check that the problem directory is well-formed |
| `maxx score` | Run `scoring/score.py` once and print the result |
| `maxx run -a "<cmd>"` | Run the local optimization loop with an agent command |
| `maxx evaluate` | Start the polling evaluator (watches for proposal branches) |
| `maxx evaluate --baseline-only` | Establish baseline score and exit |
| `maxx serve` | Start the webhook server (receives PR events) |
| `maxx history` | Print evaluation history from the DB |
| `maxx leaderboard` | Regenerate `leaderboard.md` from history |
| `maxx plot` | Generate a progress chart from evaluation history |

All commands operate on the current directory by default (overridable with `--dir`).

### Evaluator modes

**Local loop** — single machine, one agent, fully automated:

```bash
maxx run -a "./my_agent.sh"                           # run until stopped
maxx run -a "python optimize.py" -n 50                # limit to 50 iterations
maxx run -a "claude -p 'improve the solution'" -n 10  # use any command as the agent
```

**Polling** — watches for proposal branches matching `proposals/*`:

```bash
maxx evaluate --baseline-only   # establish baseline
maxx evaluate                   # start evaluation loop
maxx evaluate --push            # push leaderboard updates to origin
```

**Webhook** — receives GitHub PR events via HTTP:

```bash
maxx evaluate --baseline-only   # establish baseline first
maxx serve --push               # start webhook server

# Configure the GitHub webhook:
#   URL: https://<your-domain>/webhook
#   Content type: application/json
#   Secret: (set matching WEBHOOK_SECRET env var on the server)
#   Events: Pull requests only
```
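
This README does not show how `maxx serve` checks the webhook secret. For orientation, GitHub signs each delivery with HMAC-SHA256 over the raw request body and sends the result in the `X-Hub-Signature-256` header as `sha256=<hex digest>`. A minimal sketch of that check, with `verify_github_signature` as a hypothetical helper name:

```python
import hashlib
import hmac


def verify_github_signature(secret: str, body: bytes, signature_header: str) -> bool:
    """Return True if the header matches the HMAC-SHA256 of the raw body."""
    digest = hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()
    expected = f"sha256={digest}"
    # Compare as bytes in constant time to avoid leaking timing information.
    return hmac.compare_digest(
        expected.encode("utf-8"), signature_header.encode("utf-8")
    )
```

The comparison must run against the exact bytes received, before any JSON parsing or re-serialization, or the digest will not match.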

### Progress charts

```bash
maxx plot                         # chart from .autoanything/history.db
maxx plot --db path/to/history.db  # chart from a specific database
maxx plot -o chart.png            # save to a specific path
```

## Running agents

Point any AI agent at the problem repo and have it read `agent_instructions.md` for the protocol, for example with a prompt like:

```
Read agent_instructions.md and start optimizing. Check the leaderboard first.
```

Agents create branches like `proposals/agent-1/higher-lr` and push them, or open PRs targeting main. The evaluator picks them up automatically.
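
A scripted agent might build the branch name and the git commands as follows. This is a sketch under assumptions: the slug rule, the helper names `proposal_branch` and `push_commands`, the remote name `origin`, and staging only `state/` are illustrative choices, not something the framework prescribes.

```python
import re


def proposal_branch(agent: str, description: str) -> str:
    """Build a branch name of the form proposals/<name>/<description>."""

    def slug(text: str) -> str:
        # Lowercase, and collapse anything that is not a-z or 0-9 into "-".
        return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")

    return f"proposals/{slug(agent)}/{slug(description)}"


def push_commands(branch: str, message: str) -> list[list[str]]:
    """Return the git invocations for one proposal (not executed here)."""
    return [
        ["git", "checkout", "-b", branch],
        ["git", "add", "state"],
        ["git", "commit", "-m", message],
        ["git", "push", "origin", branch],
    ]


if __name__ == "__main__":
    branch = proposal_branch("agent-1", "Higher LR")
    for cmd in push_commands(branch, "Try a higher learning rate"):
        print(" ".join(cmd))
```

Each command list can be passed to `subprocess.run(cmd, check=True)` to actually perform the push.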

### Agent environment variables

When using `maxx run`, the framework sets these environment variables before each agent invocation:

| Variable | Description | Example |
|----------|-------------|---------|
| `AUTOANYTHING_ITERATION` | Current iteration number (1-indexed) | `3` |
| `AUTOANYTHING_SCORE` | Current best score | `169.743` |
| `AUTOANYTHING_DIRECTION` | Optimization direction | `minimize` |
| `AUTOANYTHING_METRIC` | Name of the score metric | `score` |
| `AUTOANYTHING_PROBLEM` | Problem name from `problem.yaml` | `rastrigin` |

### Writing a custom agent

An agent can be any command — a shell script, a Python script, a call to an AI tool. The agent runs in the problem directory, modifies files in `state/`, and exits. The framework handles branching, scoring, and merging.

A minimal shell script agent:

```bash
#!/bin/bash
# agent.sh — read the current score, tweak state/solution.py, commit
echo "Iteration $AUTOANYTHING_ITERATION, current best: $AUTOANYTHING_SCORE"

python3 -c "
import random
# Read current state, make a random perturbation
exec(open('state/solution.py').read())
x = [v + random.gauss(0, 0.5) for v in x]
with open('state/solution.py', 'w') as f:
    f.write(f'x = {x}\n')
"

git add state/solution.py
git commit -m "Perturbation attempt $AUTOANYTHING_ITERATION"
```

```bash
maxx run -a "./agent.sh" -n 20
```

For AI-powered agents, the command can be anything that reads the problem and modifies state:

```bash
maxx run -a "claude -p 'read agent_instructions.md and improve the solution'" -n 10
```

## Example problems

The [`examples/`](examples/) directory contains five reference problems showing the structure:

| Problem | Description | Starting → Optimum | Requirements |
|---------|-------------|-------------------|-------------|
| `rastrigin` | Minimize 10-D Rastrigin function | ~169.7 → 0.0 | None |
| `tsp` | Shortest tour of 20 cities | ~1914 → ~680 | None |
| `packing` | Pack 12 rectangles into smallest box | 13250 → ~6975 | None |
| `fib` | Optimize Fibonacci for speed | ~1.0s → ~0.000001s | None |
| `gpt` | Optimize GPT training (val_bpb) | ~1.15 → ? | NVIDIA GPU |

The first four score instantly or near-instantly and need no GPU.

For runnable problems with evaluator support and simulated test runs, see [derby-examples](https://github.com/kousun12/derby-examples). See [`examples/README.md`](examples/README.md) for details on each problem's structure.

## Creating your own problem

The fastest way:

```bash
maxx init my-problem --direction minimize
cd my-problem
```

This scaffolds the full directory structure, initializes a git repo, and sets up `.gitignore` to exclude `scoring/` and `.autoanything/`. Then:

1. **Edit `problem.yaml`** — describe the problem.
2. **Edit files in `state/`** — set up the initial mutable state. You can rename, add, or remove files here; the scoring function decides what to read.
3. **Edit `scoring/score.py`** — implement your `score()` function. It must return a dict with at least the primary metric key (default: `"score"`).
4. **Run `maxx validate`** — check everything is wired up.
5. **Run `maxx score`** — run scoring once as a sanity check.
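
As an illustration of step 3, here is a sketch of a `scoring/score.py` for a Rastrigin-style problem. It assumes that `score()` takes no arguments, that it runs with the problem directory as the working directory, and that the state lives in `state/solution.py` as a single `x = [...]` assignment; `parse_x` and `rastrigin` are helper names chosen for this example.

```python
import ast
import math
from pathlib import Path


def rastrigin(x: list[float]) -> float:
    """Standard Rastrigin function; the global minimum is 0.0 at the origin."""
    return 10 * len(x) + sum(v * v - 10 * math.cos(2 * math.pi * v) for v in x)


def parse_x(source: str) -> list[float]:
    """Extract the list assigned to `x` without executing the agent's file."""
    for node in ast.parse(source).body:
        if isinstance(node, ast.Assign) and any(
            isinstance(target, ast.Name) and target.id == "x"
            for target in node.targets
        ):
            return [float(v) for v in ast.literal_eval(node.value)]
    raise ValueError("no top-level assignment to x found")


def score() -> dict:
    """Return a dict containing the primary metric key, "score"."""
    x = parse_x(Path("state/solution.py").read_text())
    return {"score": rastrigin(x)}
```

Parsing the file with `ast` rather than importing or `exec`-ing it is a deliberate choice in this sketch: the state is written by agents, so the scorer treats it as data rather than as code to run.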

A minimal `problem.yaml`:

```yaml
name: my-problem
description: Minimize the cost function.
score:
  direction: minimize
```

For a full walkthrough with a complete runnable example, see [docs/create-problem.md](docs/create-problem.md). For guidance on writing scoring functions (including LLM-as-judge patterns), see [docs/scoring.md](docs/scoring.md).

## Design principles

### Minimum time to optimization

The hardest part of any optimization problem isn't the search — it's defining what "better" means. AutoAnything is designed so the time between "I have a problem" and "agents are working on it" is as short as possible. `init` scaffolds the structure. You fill in three things: what the problem is, what the starting state looks like, and how to score it. Then `run` handles everything else — branching, scoring, merging improvements, updating the leaderboard, looping. No infrastructure to set up, no agents to configure, no evaluation pipeline to build.

The goal is that your time goes to the only part that requires human judgment: thinking carefully about the scoring function and what values it encodes. Once that's right, the system runs without oversight. Agents propose, the evaluator decides, and the score ratchets forward.

### Blind scoring

Agents never see the scoring code. This is the single most important design decision.

If an optimizer can see the evaluation function, it will overfit to it: exploiting quirks in the metric, hardcoding known-good outputs, gaming the test set. This is the same reason students don't get to see the exam ahead of time.

The separation is structural, not conventional. The scoring code is never committed to the problem repo. It exists only on the evaluation machine. Agents know *what* metric they're optimizing and *what scores others have achieved*, but they have zero information about *how* the score is computed. They push a branch, and a number comes back.

### Serial evaluation, parallel proposals

Evaluation is serial: one proposal is scored at a time. This can look like a bottleneck, but it is what keeps every comparison valid.

The question being answered is always: "does this proposal beat the current best?" Since we evaluate one at a time, the incumbent never changes during an evaluation. The comparison is always clean. No race conditions, no stale baselines, no wasted work.

Proposal *generation* is massively parallel. Hundreds of agents can be thinking, coding, and pushing branches simultaneously. The funnel narrows to a single thread at evaluation time.
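
The serial comparison can be sketched as a queue drained by a single consumer. This is not the evaluator's code: `is_improvement` and `evaluate_serially` are hypothetical names, `maximize` is assumed to be the counterpart of `minimize`, and requiring a strict improvement is also an assumption.

```python
from collections import deque
from typing import Callable, Deque, List, Tuple


def is_improvement(candidate: float, incumbent: float, direction: str) -> bool:
    """Strictly better than the incumbent in the configured direction."""
    if direction == "minimize":
        return candidate < incumbent
    if direction == "maximize":
        return candidate > incumbent
    raise ValueError(f"unknown direction: {direction!r}")


def evaluate_serially(
    queue: Deque[str],
    incumbent: float,
    direction: str,
    score_fn: Callable[[str], float],
) -> Tuple[float, List[str]]:
    """Score proposals one at a time; return the final best and merged names."""
    merged: List[str] = []
    while queue:
        proposal = queue.popleft()
        candidate = score_fn(proposal)
        # The incumbent cannot change while a proposal is being scored,
        # so the comparison is always against the true current best.
        if is_improvement(candidate, incumbent, direction):
            incumbent = candidate
            merged.append(proposal)
    return incumbent, merged


if __name__ == "__main__":
    scores = {"a": 5.0, "b": 7.0, "c": 3.0}
    best, kept = evaluate_serially(deque(scores), 6.0, "minimize", scores.__getitem__)
    print(best, kept)
```

Note that the outcome depends on arrival order: a proposal that would have beaten the original baseline is still discarded if an earlier proposal has already set a better score.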

### Git as the protocol

Submissions are git branches or pull requests. Anything that can `git push` can be an agent — no SDK, no registration, no custom API. Every proposal is a commit with a diff and a message. The existing ecosystem (GitHub PRs, Actions, webhooks) just works.

### Only forward, only better

When a proposal doesn't improve the score, it's discarded forever. No second chances, no combining near-misses. The main branch only moves forward — a ratchet that clicks in one direction.

This works because the search space is effectively unbounded, so evaluation time is usually better spent on a new idea than on re-scoring a failed one. Little is lost, either: agents can see the leaderboard, so if an idea was close, an agent can read about it and try a refined version.

## What you could optimize

Anything with a scoring function:

- A prompt template (scored by LLM-as-judge accuracy)
- A web app's Lighthouse performance score
- A compiler optimization pass (scored by benchmark runtime)
- A trading strategy (scored by backtested Sharpe ratio)
- A game AI (scored by win rate against a baseline)
- An ML training script (scored by validation loss)

But the more interesting frontier is **things that don't have a natural number yet**. Now that LLMs can act as judges, you can define a rubric across multiple dimensions — clarity, originality, tone, argument strength — have an LLM score each one, apply hidden weights, and collapse it into a single number. The agents never see the rubric or the weights. They just push a branch and get back a score.

This means you can optimize subjective artifacts the same way:

- An essay (scored across argument structure, evidence quality, readability, originality)
- A short story (scored on narrative tension, character voice, prose style)
- A product landing page (scored on persuasiveness, clarity, emotional resonance)
- An API design (scored on consistency, discoverability, naming conventions)

The weights encode values the agents can't see. Weight originality at 3x and the swarm converges on bold writing. Change the weights and the same agents produce something entirely different — without changing any agent instructions. The values live in the scoring function, not in the agents.
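
Collapsing a rubric into one number can be as simple as a weighted average. In the sketch below the per-dimension scores would come from an LLM judge, a call that is omitted here; the dimension names, the 0 to 10 scale, the weight values other than the 3x on originality, and the `collapse` helper are all illustrative assumptions.

```python
from typing import Mapping

# Hidden weights: these live in the private scoring code, never in the repo.
WEIGHTS = {
    "clarity": 1.0,
    "originality": 3.0,
    "tone": 1.0,
    "argument_strength": 2.0,
}


def collapse(dimension_scores: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """Weighted average of per-dimension scores, on the same scale as the inputs."""
    missing = set(weights) - set(dimension_scores)
    if missing:
        raise ValueError(f"judge did not score: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(weights[k] * dimension_scores[k] for k in weights) / total_weight


if __name__ == "__main__":
    # Stand-in for the judge's output on one essay.
    judged = {"clarity": 8.0, "originality": 4.0, "tone": 7.0, "argument_strength": 6.0}
    print({"score": collapse(judged, WEIGHTS)})
```

Changing only `WEIGHTS` changes which submissions win, which is how the values stay in the scoring function rather than in the agents.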

### The Goodhart warning

"When a measure becomes a target, it ceases to be a good measure."

The quality of the scoring function is the ceiling on the quality of the results. A bad metric optimized ruthlessly produces paperclips — a system that scores well but misses the point. Whatever number you pick, agents will exploit every degree of freedom it leaves open.

This is a feature, not a bug. It forces you to think hard about what "better" means before you start. And if your metric is good, relentless optimization is exactly what you want.

## Docs

| Document | Description |
|----------|-------------|
| [Getting started](docs/getting-started.md) | Install, try a demo, create your first problem |
| [Create a problem](docs/create-problem.md) | Step-by-step walkthrough with a runnable example |
| [Scoring](docs/scoring.md) | Writing scoring functions, LLM-as-judge patterns |
| [Agent protocol](docs/agent-protocol.md) | How agents participate in a problem |
| [Design](docs/autoanything.md) | Philosophy and principles behind the framework |

## License

MIT
