Metadata-Version: 2.4
Name: workhorse-agent
Version: 0.1.0
Summary: Fail-soft runner for YAML-defined agent workflows — drives the Claude CLI through a workflow graph unattended for days.
Project-URL: Homepage, https://github.com/GabrielCpp/vigilant-octo
Project-URL: Repository, https://github.com/GabrielCpp/vigilant-octo
Project-URL: Issues, https://github.com/GabrielCpp/vigilant-octo/issues
Author: Gabriel Côté
License-Expression: MIT
License-File: LICENSE
Keywords: agent,automation,claude,llm,orchestration,workflow
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Software Development :: Build Tools
Classifier: Topic :: Utilities
Requires-Python: >=3.12
Requires-Dist: jinja2>=3.1
Requires-Dist: pydantic>=2.0
Requires-Dist: pyyaml>=6.0
Description-Content-Type: text/markdown

# local-worker

A Dockerized agent controller that runs YAML-defined workflows using the Claude CLI. Each workflow is a graph of `agent`, `script`, and `branch` nodes that ends at a `terminal` or `fail` node. The controller walks the graph, renders Jinja2 prompts, invokes Claude or shell scripts, extracts JSON outputs, and writes run artifacts.

## Intent

The local-worker exists to run long, multi-step agent workflows **unattended** —
the design target is a single run that survives for a week without a human
babysitting it. That goal drives the two defining properties of this tool:

- **Resilience is the default, not a mode.** A single flaky node (an empty
  Claude response, a rate limit, a spending cap, an unparseable output) must
  never crash the whole run. The runner retries transient failures, reframes the
  prompt, and finally defaults a node's outputs so the graph advances to its
  `next` rather than aborting. See [docs/GUARDRAILS.md](docs/GUARDRAILS.md) for the full
  recovery ladder and its tuning knobs.
- **Reproducibility and isolation.** The agent works against its own clones
  inside the container (never a host working tree), all state lives in persistent
  named volumes, and every step is recorded as a run artifact. A run can be
  resumed from its checkpoint after a crash or reboot.

It is repository-agnostic: the same image runs any workflow against any repo a
workflow's `setup.sh` chooses to clone.

## Prerequisites

- Docker Desktop (or Docker Engine + Compose plugin)
- A logged-in Claude **subscription** on the host (`~/.claude/.credentials.json`
  present — i.e. you have run `claude` and authenticated). This is the default
  auth path and matches what your interactive Claude CLI uses.

No Python, `uv`, or Claude CLI installation is required on the host — everything runs inside the container.

## Authentication

By default the worker uses your **Claude subscription**. At startup
`entrypoint.sh` seeds `~/.claude/.credentials.json` from the host (mounted
read-only) into the persistent `claude-state` volume **once**; the CLI then
refreshes/rotates the token in-volume across runs and reboots. A minimal
`~/.claude.json` onboarding stub is written so headless runs don't prompt.

Alternatives:

- **Long-lived OAuth token** — run `claude setup-token` on the host and export
  `CLAUDE_CODE_OAUTH_TOKEN` before `run.sh` (or put it in a `.env` beside
  `compose.yaml`). This skips the credentials-file seed.
- **Bedrock** — uncomment the `CLAUDE_CODE_USE_BEDROCK`/`AWS_PROFILE` env and the
  `~/.aws` mount in `compose.yaml`.

To re-seed credentials after re-authenticating on the host, clear the
`claude-state` volume (`docker volume rm local-worker_claude-state`).

## Quick start

```bash
# From this directory
./run.sh ../workflows/hello-world
```

`run.sh` resolves the workflow path to absolute, validates that `workflow.yaml` exists, and launches the container via `compose.yaml`. Calling it with no arguments prints the available workflows.

## Running any workflow

```bash
./run.sh <path-to-workflow-dir> [docker compose flags]

# Examples
./run.sh ../workflows/story-coder
./run.sh ../workflows/refactor
./run.sh ../workflows/delphi-ci

# Force a full image rebuild
./run.sh ../workflows/hello-world --build

# Workflows installed into a target repo by install.py
./run.sh /path/to/repo/.agents/workflows/story-coder
```

The workflow directory must contain a `workflow.yaml` file. Any `prompts/` and `scripts/` subdirectories are mounted alongside it and are accessible from within the container.

## Environment variables

| Variable | Default | Description |
|---|---|---|
| `WORKFLOW_DIR` | _(required, set by `run.sh`)_ | Absolute path to the workflow directory |
| `CLAUDE_CODE_OAUTH_TOKEN` | _(unset)_ | Optional long-lived OAuth token (`claude setup-token`); skips the credentials-file seed |
| `AGENT_RUNS_DIR` | `/runs` | Where to write run artifacts (set to the persistent `runs` volume by `compose.yaml`) |
| `AGENT_CLI` | `claude` | Which agent CLI drives the run: `claude`, `codex`, or `copilot`. Overridden by `--cli`. See [Choosing the agent CLI backend](#choosing-the-agent-cli-backend) |
| `AGENT_MODEL` | _(unset)_ | Run-level default model for every node that does not set its own `model:` (a node's own `model:` still wins). Interpreted by the active backend |
| `CODEX_PROFILE` | _(unset)_ | Run-level default codex config profile (e.g. `openrouter`, `local`). A node that names its own profile wins. Codex only |
| `AWS_PROFILE` | `default` | AWS profile — only when using the Bedrock alternative |

## Choosing the agent CLI backend

The controller drives one agent CLI per run, behind a backend facade
(`workhorse/runner/backends.py`). Selection is **per-run**, not per-node:

```bash
./run.sh ../workflows/story-coder            # claude (default)
AGENT_CLI=codex   ./run.sh ../workflows/story-coder
AGENT_CLI=copilot ./run.sh ../workflows/story-coder
# Direct controller invocation also accepts --cli {claude,codex,copilot}
```

| Backend | CLI | Default model | In-place compaction |
|---|---|---|---|
| `claude` | `claude -p` (stream-json) | `sonnet` | yes (`/compact`) |
| `codex` | `codex exec --json` | CLI/profile default | no — ladder reframes on overflow |
| `copilot` | `copilot -p --output-format json` | CLI default | no — ladder reframes on overflow |

### Node model selection

A node's optional `model:` field is interpreted by the active backend. When unset,
the backend's own default applies (so workflows need not hard-code a Claude alias):

```yaml
nodes:
  - id: lead_review
    type: agent
    model: opus           # claude: alias; codex: a config profile (see below)
```

### Codex config profiles (`<profile>@<model-slug>`)

For the `codex` backend, `model:` selects a [codex config profile](https://github.com/openai/codex)
(from `~/.codex/config.toml`) — which bundles provider, auth and a pinned model —
plus an optional model override, written as `<profile>[@<model-slug>]`. `@` is the
delimiter because `/` and `:` already appear inside model slugs:

| `model:` value | Resulting codex flags |
|---|---|
| `local` | `--profile local` (the profile pins the model) |
| `openrouter@deepseek/deepseek-chat-v3.1` | `--profile openrouter -m deepseek/deepseek-chat-v3.1` |
| `openrouter@` | `--profile openrouter` |
| `@gpt-5.5` | `-m gpt-5.5` (no profile; falls back to `CODEX_PROFILE`) |
| _(unset)_ | `CODEX_PROFILE` if set, else codex's own default |

`CODEX_PROFILE` is the run-level default; a node's own `<profile>@…` always wins.
This lets a single workflow assign a different model tier to each node: for
example, a lead node on `openrouter@anthropic/claude-sonnet-4.5` and bookkeeping
nodes on `local` (a local Qwen server), the same way Claude nodes tier across
`opus`/`sonnet`/`haiku`.

```yaml
nodes:
  - id: lead_review
    type: agent
    model: openrouter@anthropic/claude-sonnet-4.5
  - id: record
    type: agent
    model: local          # the local profile's pinned model
```

> Profiles live in `~/.codex/config.toml`. Each names a `model_provider`
> (`base_url` + `env_key`) and a model; codex 0.128+ requires `wire_api = "responses"`.

## Mounts and volumes

| Source | Target | Type | Purpose |
|---|---|---|---|
| `~/.claude/.credentials.json` | `/mnt/claude-credentials.json` | bind, read-only | Subscription auth — seeded into `claude-state` once at startup |
| `~/.claude/settings.json` | `/mnt/claude-settings.json` | bind, read-only | Optional host Claude config (commented out by default) |
| `$WORKFLOW_DIR` | `/workflow` | bind | Workflow definition (yaml, prompts, scripts) |
| `workspace` volume | `/workspace` | named volume | **Agent working tree** — repo clones, branches, and commits; persists across reboots |
| `claude-state` volume | `/claude-state` | named volume | Claude sessions + seeded credentials + onboarding stub; persists across reboots |
| `runs` volume | `/runs` | named volume | Run artifacts; persists across reboots |

### Persistence across reboots

All three named volumes (`workspace`, `claude-state`, `runs`) persist across
container restarts and host reboots, so the agent's work is never lost when the
container stops:

- **`workspace`** holds the cloned repo and the agent's committed branch (e.g.
  `hrnet-research/auto`). Even if a push out of the container fails, committed
  work survives here. (A workflow's `setup.sh` typically `reset --hard`s the base
  branch on re-run, so commit work to a side branch — as the workflows do.)
- **`claude-state`** keeps Claude session history and the refreshed auth token,
  isolated from your host installation. (Note: each node runs with a *clean
  context* — see "Sessions" under Development — so this is not one growing
  cross-node conversation.)
- **`runs`** keeps all run artifacts.

## Resuming and run identity

The controller is **auto-resume-in-place** by default. Each `(workflow, run-id)`
pair maps to one stable run dir (`<workflow>-<run-id>`, run-id defaults to
`default`). On start the controller looks for a checkpoint there:

- **No checkpoint** → start fresh from the `start` node in that dir.
- **Checkpoint present** → resume from the checkpointed node, restoring the saved
  context. A node that finished but whose cursor never advanced (the process was
  killed between the two) is fast-forwarded past rather than re-run, so side
  effects like git commits aren't duplicated.

This is what lets an unattended run survive a crash or reboot: relaunching the
same workflow continues where it left off. To start over, delete the run dir (or
the `runs` volume). To keep independent runs of the same workflow side by side,
pass distinct run ids.

Controller flags (passed to `workhorse`; `--resume-*` are manual overrides
of the auto behavior above):

| Flag | Purpose |
|---|---|
| `--run-id <id>` | Name the stable run dir (`<workflow>-<id>`); default `default` |
| `--resume-run <path-or-name>` | Resume a specific run dir from its checkpoint |
| `--resume-latest` | Resume the most recent unfinished run under `--runs-dir` |
| `--params '<json>'` / `--params-file <path>` | Override workflow `vars` on a fresh start |

"Survives reboot" therefore covers both the *work products* (commits, sessions,
artifacts) **and** graph position — an interrupted graph auto-resumes mid-run.

## Run artifacts

Each workflow execution writes a timestamped directory:

```
runs/
└── <workflow-name>-<timestamp>-<id>/
    ├── run.json                  # start/end time, terminal state
    ├── context.json              # final context snapshot
    ├── <step-id>/
    │   ├── prompt.md             # rendered Jinja2 prompt sent to Claude
    │   ├── output.json           # extracted JSON outputs
    │   └── context_after.json    # context state after this step
    └── <branch-id>/
        └── branch.json           # { path, value, next }
```

`compose.yaml` sets `AGENT_RUNS_DIR=/runs` so artifacts are written to the
persistent `runs` named volume (they survive reboots and don't pollute the
host working tree). To pull them out, copy from the volume — e.g. from the
assembler repo: `make research-artifacts`.

## Repository isolation

The local-worker is repository-agnostic. **Never add repo-specific bind mounts to `compose.yaml`** — the agent must work against its own checkout of the target repository, not a host working tree.

If a workflow needs to operate on source code (read, edit, build, test), include a `setup.sh` script in the workflow directory. The script runs as the first node and clones the required repositories into the container at a known path (e.g. `/workspace/<repo>`). This ensures:

- The agent always works from a clean, versioned state
- No host working tree is mutated by accident
- The workflow is reproducible on any machine

See `workflows/case-dev/scripts/setup.sh` for an example.

## Resetting state

```bash
# Wipe Claude session history + seeded credentials (re-seed auth on next run)
docker volume rm local-worker_claude-state

# Wipe all run artifacts in the volume
docker volume rm local-worker_runs

# Wipe the agent's working tree (clones/commits) — only if you want a clean clone
docker volume rm local-worker_workspace

# Wipe everything
docker compose down -v
```

## Writing a workflow

A workflow is a directory with this layout:

```
my-workflow/
├── workflow.yaml       # Graph definition
├── prompts/            # Jinja2 .md templates
│   └── step.md
└── scripts/            # Shell or Python scripts (must output JSON to stdout)
    └── check.sh
```

**`workflow.yaml` schema:**

```yaml
name: my-workflow
vars:
  my_var: "default value"   # Initial context variables

start: first_node

nodes:
  - id: first_node
    type: agent              # agent | script | branch | terminal | fail
    prompt: prompts/step.md
    args:
      key: "{{ my_var }}"   # Jinja2 — rendered against context before sending
    outputs:
      - key: result          # Extract this key from the agent's JSON response
        default: {status: ok} # Optional: emitted if the node exhausts all retries
                              # (see "Unattended resilience" below). Unset → null.
    next: check_result

  - id: check_result
    type: branch
    path: result.status      # Dot-path into context
    cases:
      ok: done
      error: done
    default: done

  - id: done
    type: terminal
```

**Branch operators** — in addition to `cases` (equality map), you can use `conditions` for numeric comparisons:

```yaml
  - id: decide
    type: branch
    path: result.count
    conditions:
      - op: ">="
        value: "10"
        next: bulk_path
    default: single_path
```

Supported operators: `==`, `!=`, `<`, `>`, `<=`, `>=`.
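
To make the evaluation order concrete, here is a minimal sketch of how a branch node could be resolved: look up the dot-path, try `cases`, then `conditions`, then fall back to `default`. It is not the code in `runner/branch.py`; `lookup` and `evaluate_branch` are hypothetical names, and coercing both sides of a condition to `float` is an assumption made because the example above quotes `value` as a string.

```python
import operator
from typing import Any

OPS = {
    "==": operator.eq, "!=": operator.ne,
    "<": operator.lt, ">": operator.gt,
    "<=": operator.le, ">=": operator.ge,
}


def lookup(context: dict[str, Any], path: str) -> Any:
    """Resolve a dot-path such as "result.status"; None if any part is missing."""
    value: Any = context
    for part in path.split("."):
        if not isinstance(value, dict) or part not in value:
            return None
        value = value[part]
    return value


def evaluate_branch(node: dict[str, Any], context: dict[str, Any]) -> str:
    """Return the id of the next node for a branch node (sketch)."""
    value = lookup(context, node["path"])
    cases = node.get("cases", {})
    if isinstance(value, (str, int, float, bool)) and value in cases:
        return cases[value]
    for cond in node.get("conditions", []):
        try:
            # Assumption: conditions compare numerically after float coercion.
            if OPS[cond["op"]](float(value), float(cond["value"])):
                return cond["next"]
        except (TypeError, ValueError):
            continue  # non-numeric or missing value: this condition cannot match
    return node["default"]
```

With a context of `{"result": {"count": 12}}`, the `decide` node above would route to `bulk_path`; a missing or non-numeric `result.count` falls through to `single_path`.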

**Agent prompts** must instruct the agent to output JSON containing the declared output keys:

````markdown
Do the thing.

Output JSON only:

```json
{"result": {"status": "ok", "count": 5}}
```
````

**Scripts** receive Jinja2-rendered args as positional arguments and must print JSON to stdout:

```bash
#!/bin/bash
echo '{"result": {"status": "ok"}}'
```
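
Since `scripts/` may hold Python as well as shell scripts, an equivalent Python script could look like this. It is a minimal sketch: the file name (say, `scripts/check.py`) and the echoed `args` field are illustrative, not something the controller requires.

```python
#!/usr/bin/env python3
import json
import sys


def build_output(argv: list[str]) -> dict:
    """Build the JSON payload; the positional args are the rendered node args."""
    return {"result": {"status": "ok", "args": argv}}


if __name__ == "__main__":
    # Print exactly one JSON document on stdout; send diagnostics to stderr.
    print(json.dumps(build_output(sys.argv[1:])))
```

Using `json.dumps` instead of hand-written strings avoids quoting mistakes when an argument contains quotes or backslashes.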

### Unattended resilience (output `default`)

Because runs are meant to survive a week without supervision, the controller
will, as a last resort, **default an agent node's outputs and advance to `next`**
rather than crash when Claude can't be coaxed into a usable answer (after
transient retries and prompt reframing — see [docs/GUARDRAILS.md](docs/GUARDRAILS.md)).

The runner is generic and doesn't know what your outputs mean, so **you** declare
the safe fallback per output via `default`:

```yaml
    outputs:
      - key: decision
        default: continue          # branch-safe value if this node never answers
      - key: review
        default: {status: auto_approved}
      - key: notes                 # no default → emitted as null
```

Choose defaults that keep the graph moving sensibly (e.g. a branch `path` that
lands on a safe route). An output with no `default` is emitted as `null`. To
disable defaulting entirely and hard-fail instead, set
`AGENT_USE_DEFAULT_OUTPUTS=false`.

## Development

This section is for working on the **controller itself** (the Python that runs
workflows), not on individual workflows.

### Project layout

```
local-worker/
├── workhorse/                 # The workhorse Python package (entrypoint: workhorse:main)
│   ├── main.py                # CLI + the graph walk loop: checkpoint → run node → advance
│   ├── templates.py           # Jinja2 rendering (resilient: missing vars render empty, not raise)
│   ├── artifacts.py           # ArtifactWriter: run dir, checkpoints, per-step artifacts
│   ├── graph/
│   │   ├── nodes.py           # Pydantic node models (AgentNode/ScriptNode/BranchNode/TerminalNode) + Graph
│   │   ├── loader.py          # Parse + validate workflow.yaml into a Graph
│   │   └── context.py         # WorkflowContext: the key→value bag + dot-path lookup for branches
│   └── runner/
│       ├── agent.py           # Invoke Claude CLI; the retry → reframe → default resilience ladder
│       ├── script.py          # Run a ScriptNode, capture JSON stdout
│       └── branch.py          # Evaluate a BranchNode (cases / numeric conditions / default)
├── tests/                     # Standalone test files (see below)
├── compose.yaml               # Service, env, mounts, named volumes
├── Dockerfile                 # Ubuntu + uv + Claude CLI + the controller package
├── entrypoint.sh              # Auth seeding, perms, exec `workhorse`
├── run.sh                     # Host launcher: resolve workflow dir, `docker compose up`
├── pyproject.toml / uv.lock   # Python deps (jinja2, pyyaml, pydantic); managed with uv
├── README.md                  # This file (usage + development)
├── CLAUDE.md                  # Agent entry point; imports README.md + docs/
└── docs/
    └── GUARDRAILS.md          # The resilience/error-recovery design and env-var reference
```

### How the controller works (the loop)

`main.run()` is a single loop over graph nodes. For each node it:

1. **Checkpoints** the current node id + context (`ArtifactWriter.write_checkpoint`) so a crash here is resumable.
2. **Dispatches** by node type to a runner: `runner/agent.py`, `runner/script.py`, or `runner/branch.py`.
3. **Merges** the node's outputs into the `WorkflowContext`.
4. **Writes** a per-step artifact and advances `current_id` to `node.next` (or the branch target).

A `terminal`/`fail` node ends the loop. The resilience for `agent` nodes lives
entirely in `runner/agent.py::run_agent` — see [docs/GUARDRAILS.md](docs/GUARDRAILS.md).

### Sessions (per-node clean context)

**Each node runs as a fresh prompt with a clean Claude context.** The controller
does *not* chain one node's conversation into the next — node N does not inherit
node N‑1's messages. Concretely, `run_agent` drops any persisted `.session_id`
before a node's first attempt, and a reframed attempt also starts fresh.

The persisted session is `--resume`d in exactly one situation: **continuing the
same node that was interrupted.** When the controller resumes from a checkpoint
and re-enters a node that was killed mid-run (not fast-forwarded), it calls
`run_agent(..., resume_session=True)` for that one node so Claude picks up where
it left off; every node the run then advances to starts clean again.

**Context overflow → compact & continue.** If a node exhausts the model's
context window mid-run (the headless CLI returns instead of auto-compacting),
`run_agent` runs `/compact` on that node's session and retries the *same* prompt
on it, preserving the node's progress (bounded by `AGENT_MAX_COMPACT_ATTEMPTS`;
falls back to a fresh-session reframe if `/compact` can't help). Verified against
Claude Code 2.1.x. See the recovery ladder in [docs/GUARDRAILS.md](docs/GUARDRAILS.md).

> Not yet implemented: a configurable *per-node turn limit* (`--max-turns`) that
> proactively compacts before the window is exhausted. Today compaction is
> reactive — triggered when an overflow is detected.

### Running tests

Tests live in `tests/` and are **dependency-free**: each file runs standalone
(`python tests/test_x.py` prints PASS/FAIL and exits non-zero on failure) and is
also pytest-compatible. There is no pytest in the venv by default; run them with
the project's Python:

```bash
# One file
.venv/bin/python tests/test_agent_recovery.py

# All of them
for t in tests/test_*.py; do .venv/bin/python "$t"; done
```

If a `.venv` isn't present, create one with `uv sync`, or run the tests through uv directly (`uv run python tests/...`).

**Where to put tests.** Add a `tests/test_<area>.py`, mirroring the existing
style: an `if __name__ == "__main__"` runner that iterates `test_*` functions, and
unit tests that patch the CLI boundary (`_run_claude_cli` / `_invoke_claude`) and
sleeping so nothing hits the network or waits in real time. Group by concern:
`test_agent_cap.py` (cap/transient handling), `test_agent_recovery.py` (reframe →
default ladder), `test_branch_guardrail.py`, `test_resume_auto.py`,
`test_idempotency.py`, `test_templates_resilient.py`.
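
A standalone test file in this style might be shaped roughly as follows. It is a self-contained sketch: `retry` is a hypothetical stand-in for the code under test (a real test would import from `workhorse` and patch `_run_claude_cli` / `_invoke_claude` instead), and `run_all` is one possible way to write the runner.

```python
import sys
import time
from unittest import mock


def retry(call, attempts=3, delay=5.0):
    """Hypothetical code under test: retry a call, sleeping between attempts."""
    for i in range(attempts):
        try:
            return call()
        except RuntimeError:
            if i == attempts - 1:
                raise
            time.sleep(delay)


def test_retry_recovers_without_sleeping():
    # The fake CLI fails once, then succeeds; sleep is patched so nothing waits.
    fake_cli = mock.Mock(side_effect=[RuntimeError("rate limit"), "ok"])
    with mock.patch("time.sleep") as fake_sleep:
        assert retry(fake_cli) == "ok"
    fake_sleep.assert_called_once_with(5.0)


def run_all(namespace) -> int:
    """Run every test_* callable in `namespace`; return the failure count."""
    failures = 0
    for name, fn in sorted(namespace.items()):
        if name.startswith("test_") and callable(fn):
            try:
                fn()
                print(f"PASS {name}")
            except Exception as exc:
                failures += 1
                print(f"FAIL {name}: {exc!r}")
    return failures


if __name__ == "__main__":
    if run_all(dict(globals())):
        sys.exit(1)
```

Because the tests are plain `test_*` functions, the same file also works under pytest when it is available.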

### Where docs go

- **Tool/usage + development docs** → this `README.md` (root).
- **Design notes** (resilience/error recovery, and any future deep-dives) →
  `docs/`, e.g. `docs/GUARDRAILS.md`. Put new long-form design docs here rather
  than at the root.
- **`CLAUDE.md`** (root) is the agent entry point and stays at the root so Claude
  Code auto-loads it; it `@`-imports `README.md` and `docs/GUARDRAILS.md`.
- **Per-workflow docs** → inside that workflow's own directory (under
  `../workflows/<name>/`), not here. The controller is workflow-agnostic; keep
  workflow-specific knowledge with the workflow.

Keep these docs current when you change behavior — they are the contract for
operators running week-long jobs, and `CLAUDE.md` imports them, so updating them
keeps agent context accurate too.

### Conventions

- **Python 3.12**, `from __future__ import annotations` at the top of each module.
- **Pydantic** models for anything parsed from YAML (see `graph/nodes.py`); add a
  new node type by extending the discriminated `Node` union and handling it in
  `main.run()` plus a `runner/`.
- **Fail soft for unattended runs.** New failure paths in agent handling should
  slot into the existing retry → reframe → default ladder rather than raising, so
  one bad node can't end a week-long run. Reserve hard raises for genuinely
  unrecoverable, deterministic errors.
- **Comments explain *why*.** Match the existing density — the tricky invariants
  (checkpoint/fast-forward idempotency, cap-vs-transient classification) are
  documented inline; keep them that way.

### Editing the container

The image bundles the Claude CLI and the controller package. After changing
`Dockerfile`, `pyproject.toml`, or anything that affects the image, rebuild:

```bash
./run.sh ../workflows/hello-world --build
```

Pure controller `.py` edits also require a rebuild before the next run picks them
up, since `workhorse/` is `COPY`d into the image (it is not bind-mounted).
