Metadata-Version: 2.4
Name: capus
Version: 2.1.0
Summary: Capus: persona-driven LLM agent testing for macOS and web apps, served as a local MCP daemon
Project-URL: Homepage, https://github.com/DanielBirk04/capus
Author: Daniel Birk
License-Expression: Apache-2.0
License-File: LICENSE
Keywords: llm-agents,macos,mcp,personas,playwright,testing,ui-testing
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: MacOS X
Classifier: Intended Audience :: Developers
Classifier: Operating System :: MacOS
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Software Development :: Testing
Requires-Python: >=3.11
Requires-Dist: jinja2>=3.1
Requires-Dist: mcp>=1.10
Requires-Dist: pillow>=10.0
Requires-Dist: pyobjc-framework-applicationservices>=10.3
Requires-Dist: pyobjc-framework-cocoa>=10.3
Requires-Dist: pyobjc-framework-quartz>=10.3
Requires-Dist: pyobjc-framework-vision>=10.3
Requires-Dist: pyyaml>=6.0
Provides-Extra: browser
Requires-Dist: playwright>=1.45; extra == 'browser'
Provides-Extra: load
Requires-Dist: httpx>=0.27; extra == 'load'
Requires-Dist: psutil>=5.9; extra == 'load'
Provides-Extra: runner
Requires-Dist: anthropic>=0.40; extra == 'runner'
Requires-Dist: openai>=1.109; extra == 'runner'
Provides-Extra: vision
Requires-Dist: einops; extra == 'vision'
Requires-Dist: huggingface-hub>=0.23; extra == 'vision'
Requires-Dist: timm; extra == 'vision'
Requires-Dist: torch>=2.3; extra == 'vision'
Requires-Dist: transformers<4.50,>=4.45; extra == 'vision'
Requires-Dist: ultralytics>=8.2; extra == 'vision'
Provides-Extra: whole-system
Requires-Dist: httpx>=0.27; extra == 'whole-system'
Requires-Dist: psycopg[binary]>=3.1; extra == 'whole-system'
Requires-Dist: sqlalchemy>=2.0; extra == 'whole-system'
Description-Content-Type: text/markdown

# Capus

Persona-driven LLM agent testing for macOS apps **and web apps**, served as
a **local MCP daemon**. LLM agents role-play sampled human personas
("silicon sampling") and exercise your app through the GUI — screenshots in,
real synthesized input out. The daemon is deliberately dumb (no LLM calls,
ever); the intelligence is an MCP client: Claude Code/Codex driving it
interactively, **or the built-in runner** started from the dashboard or
`capus agents` (Anthropic API key or a Claude subscription via the
`claude` CLI).

Two target drivers behind one identical tool surface:

- **macOS** (`app_path` = .app/executable/.py): window screenshots parsed by
  **OmniParser v2**, CGEvent mouse/keyboard. Needs the machine free during
  runs (one shared screen + pointer).
- **Browser** (`app_path` = http(s):// or file:// URL): one isolated headless
  Chromium context per persona session — truly parallel, zero desktop
  contention, no macOS permissions. The persona still *sees* a screenshot
  (numbered Set-of-Mark boxes), but element geometry comes from the DOM
  (visible, in-viewport, interactive elements with accessible names; shadow
  DOM and same-origin iframes traversed), input goes through CDP at real
  coordinates (move, then click — the same trusted path as a human mouse),
  and oracles get browser-grade signals: uncaught JS exceptions, console
  errors, failed/4xx+ requests, native dialogs, renderer crashes. OmniParser
  remains the automatic fallback for canvas-rendered UIs.
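
The driver choice described above follows from the shape of `app_path` alone. A minimal sketch of that dispatch, assuming the URL scheme is the only signal (the function name `pick_driver` and its return values are hypothetical, not the daemon's API):

```python
from urllib.parse import urlparse

# Schemes the list above names for browser targets.
BROWSER_SCHEMES = {"http", "https", "file"}


def pick_driver(app_path: str) -> str:
    """Return "browser" for http(s):// or file:// URLs, otherwise "macos"."""
    scheme = urlparse(app_path).scheme.lower()
    return "browser" if scheme in BROWSER_SCHEMES else "macos"


print(pick_driver("https://staging.example.com"))  # a web target
print(pick_driver("/Applications/Notes.app"))      # a native target
```

Anything without one of the three URL schemes (a `.app` bundle, an executable, a `.py` file) falls through to the macOS driver in this sketch.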

Output is two deliverables from one findings database:

- **`report.html` / `report.md`** — for the developer: findings with
  screenshots and repro steps, business-rule coverage matrix (personas ×
  rules), persona journey filmstrips.
- **`feedback.json`** — for coding agents: stable finding IDs, expected vs
  observed, deterministic repro traces, a status queue (`open`/`fixed`/
  `wontfix`) plus MCP tools (`findings_query`, `trace_replay`,
  `finding_update`) to fix and verify autonomously.
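
A coding agent consuming `feedback.json` would typically start from the open queue. The sketch below shows one way to do that; the schema is not documented here, so the top-level `findings` key and the per-finding `id` and `status` fields are assumptions for illustration only.

```python
import json
from pathlib import Path


def load_feedback(path: str) -> dict:
    """Read a feedback.json file from disk."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def open_findings(feedback: dict) -> list[dict]:
    """Return findings whose status is "open" (field names are assumed)."""
    findings = feedback.get("findings", [])  # hypothetical top-level key
    return [f for f in findings if f.get("status") == "open"]


# In-memory stand-in for a real file, using the assumed field names.
sample = {
    "findings": [
        {"id": "F-1", "status": "open"},
        {"id": "F-2", "status": "fixed"},
        {"id": "F-3", "status": "wontfix"},
    ]
}
print([f["id"] for f in open_findings(sample)])
```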

## Architecture

```
the MIND (an MCP client — choose one per run)
  Claude Code / Codex            capus runner (capusd/runner.py)
  orchestrator + tester          spawned by the dashboard ▶ button or
  subagents (parallel)           `capus agents`; Anthropic API or claude CLI
        │ MCP over streamable HTTP (127.0.0.1:7777/mcp)
        ▼
capusd  (the BODY: dumb on purpose — zero LLM calls)
  session manager + work queue │ persona sampler (correlated human sampling)
  humanize (motor simulation: typing cadence+typos, Fitts's-law mouse paths,
  click scatter, think pauses — seeded per session) │ Driver protocol
    ├─ macOS driver (Quartz windows, screenshots, CGEvent input)
    └─ browser driver (Playwright Chromium, DOM perception, CDP input)
  vision (OmniParser v2 + Apple-Vision OCR) │ oracles (crash/hang/log/no-op
  + JS/console/network/dialog for browser) │ SQLite store + artifacts │
  report generator │ dashboard (live streaming + run/persona control)
```

One daemon serves many parallel clients. The split is deliberate: the
client decides WHAT a persona does (cognition), the daemon decides HOW it
physically unfolds (motor execution) — so even a fast model produces
sessions that look and pace like a human when watched live.

## Human realism

Personas are sampled, not invented: demographics from an overridable
population spec, behavioral traits drawn from a correlated-but-noisy latent
model calibrated against real data — OECD PIAAC tech-skill bands (~60% of
adults are level 1 or below: the "average user" fails multi-step tasks),
CDC disability prevalence, CHI'18 typing statistics (52±25 WPM, ~6% of
keystrokes are corrections). Counter-stereotypical draws are guaranteed by
design (the 79-year-old engineer happens; so does the low-tech 19-year-old).
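
To make "correlated-but-noisy" concrete, here is a toy sketch, not the calibrated model: a trait leans on a demographic through a latent mean, and independent noise is wide enough that stereotype-breaking draws still occur. The coefficients and the name `sample_tech_skill` are made up for illustration, and in this sketch such draws are merely likely rather than guaranteed.

```python
import random


def sample_tech_skill(age: int, rng: random.Random) -> float:
    """Draw a standardized tech-skill score that leans on age but stays noisy."""
    # Hypothetical latent mean: drifts down with age, centred near age 50.
    latent_mean = (50 - age) / 30.0
    # Noise as wide as the whole age effect, so counter-stereotypical draws occur.
    return latent_mean + rng.gauss(0.0, 1.0)


rng = random.Random(7)  # a fixed seed makes the sampled panel reproducible
seniors = [sample_tech_skill(79, rng) for _ in range(1000)]
share_above_average = sum(s > 0 for s in seniors) / len(seniors)
print(f"79-year-olds with above-average skill: {share_above_average:.0%}")
```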

Each persona compiles into:

- a **behavior contract** (`task_claim` returns it) — first-person system
  prompt: identity, reading style, patience, blame attribution, quirks, and
  hard anti-"superuser" rules (satisfice, knowledge limits, giving up is
  valid data) — plus a 3-line `persona_reminder` to re-inject every few
  turns against persona drift;
- a **motor profile** the daemon enforces mechanically: typing WPM with
  corrected typo bursts, Fitts's-law mouse movement (curved Bezier paths,
  bell velocity, terminal overshoot), click scatter inside targets,
  hesitation pauses. Seeded per session — reproducible. `pacing: "fast"`
  switches it all off for CI throughput.
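
The motor profile's mouse model can be illustrated with the standard Shannon form of Fitts's law plus a quadratic Bezier curve. This is a sketch under assumptions: the constants `a` and `b` are illustrative placeholders and the function names are hypothetical; the daemon's actual parameters and curve order are not specified here.

```python
import math

Point = tuple[float, float]


def fitts_movement_time(distance_px: float, target_width_px: float,
                        a: float = 0.1, b: float = 0.15) -> float:
    """Seconds to reach a target: a + b * log2(D / W + 1)."""
    if distance_px < 0 or target_width_px <= 0:
        raise ValueError("distance must be >= 0 and target width > 0")
    index_of_difficulty = math.log2(distance_px / target_width_px + 1.0)
    return a + b * index_of_difficulty


def bezier_point(start: Point, control: Point, end: Point, t: float) -> Point:
    """Point on a quadratic Bezier curve for t in [0, 1]."""
    u = 1.0 - t
    x = u * u * start[0] + 2 * u * t * control[0] + t * t * end[0]
    y = u * u * start[1] + 2 * u * t * control[1] + t * t * end[1]
    return (x, y)


# A far, small target takes longer to hit than a near, large one.
print(fitts_movement_time(800, 20), fitts_movement_time(100, 80))
```

Sampling `bezier_point` at increasing `t` over the duration returned by `fitts_movement_time` yields a curved path whose length and timing depend on target distance and size.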

**macOS targets**: perception AND targeting are pure screenshot vision
(OmniParser); actuation is real synthesized mouse/keyboard events
(move-then-click, like a human) at the vision-derived coordinates —
serialized through a global input mutex, so on one attended machine sessions
interleave. Emergency aborts: `touch /tmp/capus.stop`, hold
`cmd+option+ctrl+escape`, or slam the pointer into a screen corner. For
unattended, truly parallel runs, a VM/container isolation tier is on the
roadmap.

**Browser targets**: each session gets its own incognito Chromium context
(own storage, own pointer) — no input mutex, no contention with you or with
other sessions; the isolation problem the macOS tier needs VMs for simply
doesn't exist. Headless by default; pass `headless: false` to `run_create`
to watch.

## Install

One line:

```bash
curl -fsSL https://raw.githubusercontent.com/DanielBirk04/capus/main/scripts/install.sh | sh
```

That installs [uv](https://docs.astral.sh/uv/) if needed, installs capus as
an isolated tool (with web-target support out of the box), and runs the
guided **`capus setup`** wizard: environment checks, headless Chromium,
optional native-app permissions/models, one-click Claude Code wiring (MCP +
plugin), then starts the daemon and opens the dashboard. Already have uv?

```bash
uv tool install 'capus[browser]' && capus setup
```

The dashboard's **Setup page** (`http://127.0.0.1:7777/#/setup`) mirrors the
same checklist with live re-checks and action buttons — first run lands
there automatically.

Web (URL) targets need **no macOS permissions at all**. Native macOS apps
are the advanced path: Screen Recording + Accessibility for the app hosting
the daemon (your terminal — restart it after granting) and the OmniParser
vision extra (`pip install 'capus[vision]'` + `capus models download`,
~1.5 GB). Without the vision extra the daemon still works for browser
targets (DOM perception) and in **OCR-only degraded mode** for macOS targets.

### Development install

```bash
uv venv --python 3.12 .venv
uv pip install --python .venv/bin/python -e '.[browser]'  # web targets
uv pip install --python .venv/bin/python -e '.[vision]'   # + OmniParser deps (heavy)
.venv/bin/playwright install chromium                      # browser binary
.venv/bin/capus models download                            # OmniParser v2 weights (~1.5 GB)
.venv/bin/capus doctor                                      # permissions check
```

## Run

```bash
capus serve --open        # daemon + dashboard (--open pops the browser)
# register in Claude Code (capus setup / the Setup page do this for you):
claude mcp add --transport http capus http://127.0.0.1:7777/mcp
```

**Control dashboard** — open `http://127.0.0.1:7777/` in a browser while the
daemon runs. It streams the live screen each agent is driving, lists every
run with its findings, and lets you drill into a run (persona cards,
findings with screenshots and the exact pages where each problem appears) or
into a session's full reasoning trace — every step's screenshot, the action,
and the agent's intent ("why, in persona voice"), plus the persona's exit
verdict and per-session model/cost. Every run and session view has
**Copy Markdown** / **Download .md** so the whole thing is documented and
portable.

The dashboard is also the **control room**: **⚡ Autorun** is the one-click
autopilot — scope the job in a short form (thoroughness, lens, author-a-spec vs
existing pack, focus, and whether to fix afterward) and a headless coding agent
runs the whole `/capus:auto` lifecycle in your repo (author a spec, run the
chosen test modes via the built-in runner, judge, and — if you let it — fix and
re-test in a loop until no new meaningful findings remain), streaming
every run and finding into the dashboard live. **▶ New run** configures and
launches a single run (target, goals, persona count/seed or hand-picked library
personas, pacing, model, parallel workers) and can start the built-in agent
runner with one click; **Personas** manages the persona library: sample
new ones, edit names/backstories (a first-person, interview-style backstory
conditions the agent best), and preview each persona's compiled behavior
contract. Runs created
from chat (Claude Code) appear the same way and can be started/stopped from
either side. POST endpoints honor `CAPUS_DASHBOARD_TOKEN` (Bearer) when set.

**Credentials** manages a local vault of accounts the personas may need
(staging logins etc.): attach credential sets to a run in the New-Run form
and each persona receives them with its assignment and signs in naturally.
Secret values (keys containing password/secret/token/pin/key/otp) are
masked out of every recorded trace, repro step and report — agents type
them, records show `{{secret:…}}` placeholders, and `trace_replay` resolves
them back at replay time. Use test accounts, not production ones.
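
The key-name rule above can be sketched as a substring check. This is an illustration rather than the daemon's code: case-insensitive matching and the helper name `is_secret_key` are assumptions.

```python
# Markers taken from the rule above: a field is secret if its key contains one.
SECRET_MARKERS = ("password", "secret", "token", "pin", "key", "otp")


def is_secret_key(field_name: str) -> bool:
    """True when the field name contains any secret marker (case-insensitive)."""
    name = field_name.lower()
    return any(marker in name for marker in SECRET_MARKERS)


fields = ["username", "Password", "api_key", "shipping_address"]
print({name: is_secret_key(name) for name in fields})
```

Under plain substring matching, `shipping_address` counts as secret because it contains `pin`. If the vault matches the same way, prefer field names that avoid accidental hits.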

Sessions have **no fixed step limit** — a persona keeps going as long as it
makes progress (the daemon stops a session only after 25 consecutive
zero-change actions, or at the 10000-step runaway cap). Patience is still
personal: low-patience personas give up early because they *choose* to.
Parallel workers are automatically capped at the number of queued sessions.

**⇪ Share with coding agent** (run view) hands findings to the agent that
will fix them: pick the findings, point at your project folder, optionally
add instructions, then choose where it lands —
**VS Code** (recommended: the agent spawns in the project folder and starts
working immediately, connected to the capus MCP so it can `trace_replay`
fixes and mark findings `fixed`; VS Code opens there so you can watch and
resume the chat from the Claude Code panel — status also streams into the
run page's Handoffs panel), **Claude desktop app** (a new chat via
`claude://` deep link, pre-filled with the brief; plain chat, not
folder-bound), or **⧉ Copy brief** (paste into any coding agent).

Model/effort for agents and handoffs work exactly like a normal chat:
the default inherits your own Claude settings (model AND effort), or pick
any current model (Fable 5, Opus, Sonnet, Haiku, 1M-context variants) and
an effort level (low → max) per run/handoff.

A note on cost: the runner's `auto` backend prefers the **claude CLI, which
runs on your Claude subscription — no API key is billed**. The `$` figures
shown for such sessions (marked `sub`/`≈`) are the nominal API-equivalent
the CLI reports, not a charge. The pay-per-token Anthropic API backend is
strictly opt-in.

While watching a live session you can **moderate it like a usability test**:
the 🎙 Steer box delivers your instruction with the agent's next look at the
screen ("now try to export a PDF", "you may give up now") — the persona
acknowledges it in voice and follows it, and the note is recorded in the
trace (🎙 operator). **■ Stop** ends one session (overview cards and the
live view) without touching the rest of the run; the run-level Stop halts
the runner and all its sessions.

**Headless agents without Claude Code** — `capus agents --run-id <id>
[--model haiku|sonnet|opus] [--workers N] [--backend auto|api|openai-api|claude-cli|codex-cli]`
plays all queued sessions of a run. The runner is an ordinary MCP client:
the daemon stays dumb even when the dashboard's Start button spawns it.

Install the skills/agents as a plugin — `capus setup` (or the dashboard's
Setup page) does this with one click; manually from a checkout:

```bash
claude plugin marketplace add ./client/claude
claude plugin install capus@capus-marketplace
```

Claude Code workflow (skills in `client/claude/capus`) — Capus is fully
drivable from chat; **`/capus:help` is the front desk** (the command map +
troubleshooting, and it routes you to the right command). For a hands-off run,
**`/capus:auto` is the autopilot** — it asks a few multiple-choice scoping
questions, then autonomously runs the whole loop below (spec → recon → fixtures
→ diverse personas → both test modes, looping back for deeper passes), shows you
the findings, asks what to fix, then loops test → fix → retest until no new
meaningful findings remain. To drive it step by step instead, the core
loop is:

1. `/capus:setup` — extracts business rules from your PRD into
   `capus/rules.yaml`, samples personas, writes their narrative cards.
   (Prefer `/capus:spec` for realistic-workflow testing; `/capus:recon` grounds
   a spec against the live app; `/capus:personas` authors a custom panel.)
2. `/capus:run` — spawns parallel tester subagents; each claims a
   persona-session and plays it against the app.
3. `/capus:report` — judge pass + generates `report.html`, `report.md`,
   `feedback.json`.
4. `/capus:fix` — run inside the app's repo: works open findings, verifies
   fixes with `trace_replay`, marks them fixed.

The control-room commands cover everything else the dashboard does, from chat:
`/capus:status` (list/inspect runs, sessions and findings; start/stop the
built-in runner; steer or stop a live session), `/capus:doctor` (permissions,
vision models, browser, daemon health), and `/capus:credentials` (the
login/secret vault). **Chat and the dashboard are two windows onto one daemon
store** — anything you do from chat is persisted immediately and visible in the
dashboard (runs, traces, screenshots, verdicts, findings) days or weeks later;
nothing is chat-only.

Codex: see `client/codex/AGENTS.md`.

## Try it on the demo apps

`examples/invoice_mini/app.py` is a tiny Cocoa app with planted bugs (a dead
Export button, a missing volume discount, a wrong confirmation message,
silent input validation). Its PRD is `examples/invoice_mini/PRD.md`, and the
expected extracted rules are in `examples/rules.example.yaml`. A full
verification pass (setup → run with 3 personas → report) should find at least
the dead control and the discount rule violation.

`examples/invoice_web/index.html` is the same app (and the same 4 planted
bugs) as a self-contained web page. Point `run_create` at
`file:///…/examples/invoice_web/index.html` to exercise the browser driver
end to end, headless, with no permissions.
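
The absolute `file://` URL depends on where your checkout lives. One way to build it is with the standard library, as sketched below; the helper name `file_url` is just for illustration.

```python
from pathlib import Path


def file_url(path: str) -> str:
    """Turn a relative or absolute filesystem path into a file:// URL."""
    return Path(path).resolve().as_uri()


# Run from the repository root to get the URL for the demo page.
print(file_url("examples/invoice_web/index.html"))
```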

## Security

Capus runs entirely on your machine. The daemon and dashboard bind to
`127.0.0.1` and assume a **localhost trust boundary**: the read-only
dashboard API (runs, sessions, live screenshots) is open to local processes.
Mutating routes (start/stop runs, edit personas/credentials) **and the
credentials vault** require `CAPUS_DASHBOARD_TOKEN` (a `Bearer` token) when
it is set — set it before exposing the dashboard beyond localhost (e.g.
behind an authenticating reverse proxy or tunnel).
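
For scripted access, a client would attach the token as a standard `Authorization: Bearer` header. The sketch below only builds the request and sends nothing; the route in the example is a made-up placeholder, since the dashboard's endpoint paths are not listed here.

```python
import os
import urllib.request


def build_post(url: str, body: bytes = b"{}") -> urllib.request.Request:
    """Build a JSON POST, adding the Bearer token when the env var is set."""
    headers = {"Content-Type": "application/json"}
    token = os.environ.get("CAPUS_DASHBOARD_TOKEN")
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


# Hypothetical route, for illustration only.
request = build_post("http://127.0.0.1:7777/api/hypothetical-route")
print(request.get_method(), request.full_url)
```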

Test credentials live in a local SQLite vault. Secret-ish field values (keys
containing `password/secret/token/pin/key/otp`) are masked out of every
recorded trace, repro step and report — agents type them, records only show
`{{secret:…}}` placeholders. Use dedicated test accounts, never production
credentials.

## Notes

- macOS only (Quartz, Apple Vision OCR, Screen Capture). Apple Silicon
  recommended for OmniParser on MPS.
- OmniParser v2 icon-detector weights are AGPL-3.0 (caption model MIT) —
  fine locally; re-check before commercial distribution.
- Data dir: `~/.capus` (override with `CAPUS_DATA_DIR` or `--data-dir`).
