Metadata-Version: 2.4
Name: marcus-mini
Version: 0.1.1.post2
Summary: Board-mediated multi-agent coordination in ~500 lines of Python
Author-email: Larry Gray <lwgray@gmail.com>
License: MIT
Project-URL: Homepage, https://github.com/lwgray/marcus-mini
Project-URL: Repository, https://github.com/lwgray/marcus-mini
Project-URL: Issues, https://github.com/lwgray/marcus-mini/issues
Project-URL: Changelog, https://github.com/lwgray/marcus-mini/releases
Keywords: multi-agent,ai,coordination,llm,claude,kanban,automation
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: anthropic>=0.40.0
Requires-Dist: click>=8.0.0
Provides-Extra: dev
Requires-Dist: pytest>=8.0; extra == "dev"
Requires-Dist: pytest-asyncio>=0.23; extra == "dev"
Requires-Dist: mypy>=1.8; extra == "dev"
Requires-Dist: black>=24.0; extra == "dev"
Requires-Dist: isort>=5.12; extra == "dev"
Requires-Dist: build>=1.0; extra == "dev"
Requires-Dist: twine>=5.0; extra == "dev"
Dynamic: license-file

<p align="center">
  <img src="assets/logo.png" alt="marcus-mini" width="220">
</p>

<p align="center">
  <strong>Board-mediated multi-agent coordination in ~500 lines of Python.</strong>
</p>

<p align="center">
  <em>Multiple AI agents work in parallel on a shared task board.<br>
  They never talk to each other — they coordinate exclusively through SQLite.</em>
</p>

<p align="center">
  <a href="https://pypi.org/project/marcus-mini/"><img alt="PyPI" src="https://img.shields.io/pypi/v/marcus-mini?color=blue"></a>
  <a href="https://pypi.org/project/marcus-mini/"><img alt="Python" src="https://img.shields.io/pypi/pyversions/marcus-mini"></a>
  <a href="https://github.com/lwgray/marcus-mini/actions/workflows/ci.yml"><img alt="CI" src="https://github.com/lwgray/marcus-mini/actions/workflows/ci.yml/badge.svg"></a>
  <a href="LICENSE"><img alt="License" src="https://img.shields.io/badge/license-MIT-yellow"></a>
</p>

<p align="center">
  <a href="#quickstart">Quickstart</a> ·
  <a href="#why-a-board">Why a board?</a> ·
  <a href="#commands">Commands</a> ·
  <a href="#how-it-works">How it works</a>
</p>

---

## What it looks like

<p align="center">
  <img src="assets/demo.gif" alt="marcus-mini demo" width="700">
</p>

```bash
$ mini build "a snake game in Python"

  Decomposing goal → 8 tasks
  Spawning 3 agents (DAG recommends 3) in tmux session 'marcus-snake-game-1146'
    Spawned agent-1 (pane 0)
    Spawned agent-2 (pane 1)
    Spawned agent-3 (pane 2)

  ✓ 3 agents running. mini watch to follow along.
```

`mini dag` shows the decomposed work as it's running:

```text
  ╭────────────────────╮
  │ project structure  │
  ╰────────────────────╯
            │
  ┌─────────┴─────────────┐
  ▼                       ▼
╭────────────────────╮  ╭────────────────────╮
│  game state model  │  │   render engine    │
╰────────────────────╯  ╰────────────────────╯
            │                       │
            └────────┬──────────────┘
                     ▼
          ╭────────────────────╮
          │   collision logic  │
          ╰────────────────────╯
                     │
                     ▼
          ╭────────────────────╮
          │     game loop      │
          ╰────────────────────╯
```

`mini bench` measures coordination overhead after the run completes:

```text
Bench — snake-game-1146
─────────────────────────────────────
  Wall time         8m 34s
  Agent work        18m 12s
  Utilization       70.8%
  Coordination tax  29.2%
  Tasks             8 (8 done)
```

---

## The idea

Most multi-agent frameworks let agents talk to each other directly. `marcus-mini`
does the opposite: agents are **blind to each other** and coordinate only through
a shared board.

Three invariants hold in every run:

1. **Agents self-select work.** The board assigns nothing — agents pull the next
   available task whose dependencies are all complete.
2. **Agents make all implementation decisions.** The board says *what* to build,
   never *how*.
3. **Agents communicate only through the board.** Artifacts and decisions logged
   to a task become context for downstream agents. No direct messages, no shared
   memory outside the board.

This mirrors how distributed teams actually work: a shared backlog, async
handoffs, no mandatory standups.

---

## Why a board?

Board mediation solves a specific coordination problem: how do you get multiple
agents to work in parallel without them stepping on each other or needing to
constantly sync?

The alternatives all have problems. If agents talk to each other directly, you
get an N² communication explosion — every agent has to know about every other
agent, and conversations multiply as you scale. If they share memory, you get
race conditions and need locks. If a central orchestrator hands out work, that
orchestrator becomes the bottleneck and single point of failure, plus it has to
be smart enough to know what each agent is capable of.

```text
Direct agent-to-agent              Board-mediated
(N² edges)                         (N edges)

   A1 ─────── A2                       A1
   │ ╲       ╱ │                       │
   │   ╲   ╱   │                       ▼
   │     ╳     │                  ┌─────────┐
   │   ╱   ╲   │            A2 ──▶│  board  │◀── A4
   │ ╱       ╲ │                  └─────────┘
   A4 ─────── A3                       ▲
                                       │
                                       A3

  N=4  →   6 edges                N=4  →  4 edges
  N=10 →  45 edges                N=10 → 10 edges
  N=30 → 435 edges                N=30 → 30 edges
```
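The edge counts in the diagram follow directly from graph arithmetic: direct agent-to-agent communication forms a complete graph, while board mediation is a star. A two-line sketch:

```python
def edges_direct(n: int) -> int:
    """Complete graph: every agent pair is an edge."""
    return n * (n - 1) // 2

def edges_board(n: int) -> int:
    """Star topology: each agent connects only to the board."""
    return n

for n in (4, 10, 30):
    print(n, edges_direct(n), edges_board(n))
```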

A board flips the model: the environment holds the state, and agents are
stateless workers that pull from it. This gives you a few things for free:

- **Atomic work claiming.** A SQL transaction (`BEGIN EXCLUSIVE` + dependency
  check) guarantees two agents can't grab the same task. No lock manager, no
  consensus protocol.
- **Async handoffs.** Agent A finishes a task and logs an artifact. Agent B,
  hours later, picks up a dependent task and reads that artifact as context.
  Neither needs to be online at the same time.
- **Agent blindness as a feature.** Because agents don't know about each other,
  you can add, remove, or crash them freely. The board doesn't care. This is
  why mini can spawn 3 or 30 agents with the same code.
- **Auditability.** Every decision and artifact is persisted. You can replay
  what happened, measure coordination tax, debug why an agent went off the
  rails.

It's the same pattern distributed teams already use — a shared backlog instead
of everyone DM'ing each other. The constraints are similar: unreliable
participants, variable work duration, no guarantee anyone's online right now.

### Real-world performance

We benchmarked `marcus-mini` head-to-head against AutoGen and LangGraph using
each framework's native coordination pattern (AutoGen `SelectorGroupChat` for
agent-to-agent chat, LangGraph supervisor-worker for central LLM orchestration).
Same task DAG, same model (`claude-sonnet-4-6`), same agent count.

<p align="center">
  <img src="experiments/topologies/results/summary_crewai_langgraph.png" alt="Topology benchmark" width="700">
</p>

| Size | Tasks | mini | AutoGen GroupChat | LangGraph supervisor |
|------|-------|------|-------------------|----------------------|
| 1 | 9 | **4.4 min** | 12.5 min (4/9 done) | 16.9 min |
| 2 | 17 | **9.3 min** | 53.9 min | 37.9 min |
| 3 | 27 | **14.7 min** | 105.6 min | 98.9 min |

The gap grows with project size: ~3× at 9 tasks, ~7× at 27 tasks. AutoGen at
size 1 completed only 4 of 9 tasks before its agents tangled in their own
conversation. CrewAI results are pending: we hit a validation error on
multi-level dependency graphs and are confirming the idiomatic pattern with
their team.

Full methodology, runners, and reproduction instructions:
[`experiments/topologies/`](experiments/topologies/).

---

## Quickstart

**Requirements:** Python 3.11+, [Claude Code CLI](https://claude.ai/code), tmux

```bash
pip install marcus-mini
export MINI_API_KEY=sk-ant-...   # your Anthropic key — for decomposition only
mini build "a snake game in Python"
```

> **Why `MINI_API_KEY` and not `ANTHROPIC_API_KEY`?**
> Claude Code agents inherit your shell environment. If `ANTHROPIC_API_KEY` is set,
> agents use it for every API call — even if you have a Claude subscription — and
> you get charged. `MINI_API_KEY` is only read by the decomposer (one call per
> `mini build`). Agents run under your subscription key, not this one.

`mini build` will:

1. Call Claude to decompose the goal into a parallel task DAG
2. Persist all tasks to a local SQLite board
3. Spawn N agents in a tmux session (N = DAG width)
4. Each agent loops: claim task → implement → log artifacts → mark done

Watch the board live:

```bash
mini watch          # live-refreshing kanban
mini status         # per-agent activity
mini bench          # coordination metrics after completion
```

---

## Install

```bash
pip install marcus-mini
```

Or from source:

```bash
git clone https://github.com/lwgray/marcus-mini
cd marcus-mini
pip install -e .
```

---

## Commands

### Build & monitor

| Command | Description |
| --- | --- |
| `mini build "goal"` | Decompose goal, spawn agents |
| `mini watch` | Live kanban board (auto-exits on completion) |
| `mini board` | Static kanban snapshot |
| `mini status` | Per-agent activity with staleness warnings |
| `mini dag` | ASCII dependency graph |
| `mini tasks` | Task list with dependency info |
| `mini progress` | Completion percentage |
| `mini time` | Project elapsed time |
| `mini logs` | Tail agent log files |
| `mini wait` | Block until all tasks reach DONE/FAILED (scripting-friendly) |

### Measure

| Command | Description |
| --- | --- |
| `mini bench` | Wall time, utilization, coordination tax |

### Manage projects

| Command | Description |
| --- | --- |
| `mini projects` | Running projects (`--all` to see completed too) |
| `mini load` | Total live agents across all running projects |
| `mini open` | Print output directory (`cd $(mini open)`) |
| `mini stop` | Kill the current project's agents (DB + tmux) |
| `mini stop --all` | Stop every running project (type STOP to confirm) |
| `mini purge` | Wipe every project from the DB (type DELETE to confirm) |
| `mini config` | View/set API key env var |

### Common flags

```bash
mini build "goal" --agents 4          # override agent count
mini build "goal" --output-dir ~/myproject
mini watch --interval 5               # refresh every 5s
mini bench --project my-project-1200
mini wait --interval 30 --timeout 5400  # 90-min ceiling
```

---

## How it works

```text
mini build "snake game"
        │
        ▼
  ┌─────────────┐
  │  decomposer │  Claude call → flat task list + dependency DAG
  └──────┬──────┘
         │
         ▼
  ┌─────────────┐
  │    board    │  SQLite (WAL mode) — the shared environment
  └──────┬──────┘
         │  MCP server exposes 5 tools to each agent
         │
    ┌────┴────┐
    │         │
  agent-1   agent-2   ...   agent-N     (tmux panes)
    │         │
    └─────────┘
    read/write the same board, never each other
```

### Board MCP tools (what agents see)

| Tool | Purpose |
| --- | --- |
| `request_next_task` | Claim next task whose deps are all DONE |
| `log_artifact` | Store output (API spec, schema, file path) |
| `log_decision` | Record an architectural choice |
| `get_task_context` | Read artifacts from dependency tasks |
| `report_done` | Mark task complete, unblock dependents |
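The loop each agent runs over these tools can be sketched in memory. This is an illustrative stand-in, not mini's actual implementation: the functions mirror the tool names above (with `log_decision` omitted for brevity), and the three-task DAG is made up.

```python
# Toy board: task name -> deps, status, artifacts logged by that task.
tasks = {
    "structure": {"deps": [],            "status": "TODO", "artifacts": []},
    "state":     {"deps": ["structure"], "status": "TODO", "artifacts": []},
    "loop":      {"deps": ["state"],     "status": "TODO", "artifacts": []},
}

def request_next_task():
    # Claim the first TODO task whose dependencies are all DONE.
    for name, t in tasks.items():
        if t["status"] == "TODO" and all(tasks[d]["status"] == "DONE" for d in t["deps"]):
            t["status"] = "DOING"
            return name
    return None

def get_task_context(name):
    # Artifacts logged by dependency tasks become this task's context.
    return [a for d in tasks[name]["deps"] for a in tasks[d]["artifacts"]]

def log_artifact(name, artifact):
    tasks[name]["artifacts"].append(artifact)

def report_done(name):
    tasks[name]["status"] = "DONE"   # unblocks dependents

completed = []
while (task := request_next_task()) is not None:
    context = get_task_context(task)   # read upstream artifacts
    log_artifact(task, f"{task}.py")   # stand-in for real implementation work
    report_done(task)
    completed.append(task)
```

Running this drains the DAG in dependency order: each completion unblocks the next claimable task, with no agent-to-agent communication anywhere.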

### Task assignment (atomic SQL)

A task is claimable when `status = 'TODO'` and every dependency has
`status = 'DONE'`. The claim runs inside `BEGIN EXCLUSIVE` with a
`json_each()` dependency check — no race conditions, no external lock manager.
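A self-contained sketch of that claim transaction, using an illustrative schema (mini's real column names may differ). The connection uses `isolation_level=None` so the explicit `BEGIN EXCLUSIVE` is not shadowed by Python's implicit transaction handling:

```python
import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("""
    CREATE TABLE tasks (
        id     TEXT PRIMARY KEY,
        status TEXT NOT NULL,   -- TODO / DOING / DONE
        deps   TEXT NOT NULL    -- JSON array of task ids
    )
""")
conn.executemany("INSERT INTO tasks VALUES (?, ?, ?)", [
    ("setup", "DONE", "[]"),
    ("model", "TODO", '["setup"]'),
    ("loop",  "TODO", '["model"]'),
])

def claim_next(conn):
    # BEGIN EXCLUSIVE takes the write lock up front, so two agents
    # running this concurrently can never both see the same TODO task.
    conn.execute("BEGIN EXCLUSIVE")
    row = conn.execute("""
        SELECT id FROM tasks t
        WHERE t.status = 'TODO'
          AND NOT EXISTS (
              SELECT 1 FROM json_each(t.deps) d
              WHERE (SELECT status FROM tasks WHERE id = d.value) != 'DONE'
          )
        LIMIT 1
    """).fetchone()
    if row is None:
        conn.execute("ROLLBACK")
        return None
    conn.execute("UPDATE tasks SET status = 'DOING' WHERE id = ?", (row[0],))
    conn.execute("COMMIT")
    return row[0]
```

The first call claims `model` (its only dependency, `setup`, is DONE); a second call returns `None` because `loop` still waits on `model`.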

### Coordination tax

`mini bench` measures the gap between theoretical and actual parallelism:

```text
agent utilization  =  total task work  /  (n_agents × wall_time)
coordination tax   =  1 − utilization
```

A tax of 0% means every agent was always working. Typical software projects
land at 30–60% because the critical path forces some tasks to run serially.
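Plugging in the numbers from the bench output shown earlier (the snake-game run with 3 agents):

```python
# Values from the sample `mini bench` output above.
wall_time  = 8 * 60 + 34    # 8m 34s  -> 514 s
agent_work = 18 * 60 + 12   # 18m 12s -> 1092 s
n_agents   = 3

utilization = agent_work / (n_agents * wall_time)   # 1092 / 1542
coordination_tax = 1 - utilization
print(f"{utilization:.1%} utilized, {coordination_tax:.1%} tax")
# prints: 70.8% utilized, 29.2% tax
```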

---

## Project structure

```text
marcus-mini/
├── marcus_mini/
│   ├── board.py          # SQLite board + async API
│   ├── board_server.py   # MCP server (5 tools)
│   ├── cli.py            # mini CLI (Click)
│   ├── decomposer.py     # Claude → task DAG
│   ├── models.py         # Task dataclass
│   ├── monitor.py        # tmux monitor pane
│   └── spawn.py          # tmux agent spawner
├── prompts/
│   └── agent_prompt.md   # agent loop instructions
├── tests/
└── pyproject.toml
```

---

## Design discipline: the Mini Red Line

`marcus-mini` is intentionally small. Every proposed feature has to pass one test:

> **"Does coordination break without this?"**
> Yes → allowed. No → it doesn't belong in mini.

What's ruled out, even if it would be useful:

- Observability beyond `mini status` (no dashboards, metrics pipelines)
- Resilience infrastructure (no retry, circuit breakers, fallbacks)
- Rich configuration (one flat JSON file, period)
- External integrations (no Slack, GitHub, webhooks)
- Agent capability management (no skills, tools, specializations)
- Scheduling (no cron, recurring tasks)

What stays, even if it adds lines of code:

- Correctness fixes (stall detection, accurate liveness checks)
- Coordination primitives (task claiming, dependency resolution, spawning)
- Measurement (`mini bench`, timing, coordination tax)

If a feature would be at home in [Marcus](https://github.com/lwgray/marcus),
it doesn't belong in mini.

---

## The evolution: Marcus

`marcus-mini` proves the concept on one stack: SQLite, Claude, tmux. Three
moving parts, ~500 lines of coordination, intentionally bounded so the
primitives stay readable.

[**Marcus**](https://github.com/lwgray/marcus) takes the same board-mediated
primitives and scales them into a production platform: contract-first
decomposition (agents agree on APIs/schemas before any code is written),
a full observability pipeline, provider abstraction across LLMs and agent
runtimes, resilience infrastructure with lease-based recovery, and multi-team
isolation. If you find yourself fighting mini's intentional limits, that's
the signal to graduate.

<details>
<summary>Full capability comparison</summary>

| Capability | mini | Marcus |
| --- | --- | --- |
| **Contract-first decomposition** — agents agree on APIs/schemas before any code is written, so coordination works in domains without code (legal, scientific, design) | flat task DAG only | full contract synthesis pre-fork |
| **Observability** — structured audit logs, event pub/sub, per-tool duration tracking, agent lifecycle events | `mini status` + `mini bench` | full telemetry pipeline |
| **Resilience** — agent retry, circuit breakers, error taxonomy, automatic stall recovery, lease-based work claiming | none (fails loud) | error framework + lease/recovery layer |
| **Provider abstraction** — board protocol independent of which LLM or which agent runtime | Claude + Claude Code only | multi-provider board protocol |
| **Domain extensibility** — coordinate non-software work via contract templates | software builds only | research, ops, content, analysis |
| **Multi-team / multi-project** — concurrent boards, RBAC, shared artifact registry | single user, single board | team and tenant isolation |
| **Configuration** — environments, profiles, per-agent tuning | one flat JSON | hierarchical config + per-environment overrides |
| **Observability dashboard** — real-time network graph, swim lanes, conversation logs, timeline playback, artifact preview ([Cato](https://github.com/lwgray/cato)) | none | live + historical run analysis |
| **Experiment runner** — automated multi-agent scaling pipelines, MLflow metric tracking, live WebSocket terminals ([Posidonius](https://github.com/lwgray/posidonius)) | none | controlled scaling experiments |
| **Project grading** — nine-dimension quality audit, runtime smoke tests, agent authorship cohesion, persisted scores (Epictetus Claude skill) | none | structured quality reports |
| **Scheduling** — recurring tasks, cron, triggered runs | one build at a time | full scheduler |

</details>

---

## Known limitations

- **Claude-only.** The decomposer uses the Anthropic SDK directly, and agents
  are Claude Code processes. There is no provider abstraction. This is a
  deliberate scope choice for a research instrument.
- **Single user, single key.** No multi-tenancy, no RBAC, no team workflows.
- **Local only.** SQLite + tmux on one machine. No remote agents, no cluster.
- **Best-effort failure handling.** If an agent dies or the API quota runs out,
  mini fails loud — there is no auto-recovery. By design.

---

## Contributing

See [CONTRIBUTING.md](CONTRIBUTING.md). Bug reports and PRs welcome — but new
features must pass the Red Line test above.

---

## License

[MIT](LICENSE)
