Metadata-Version: 2.4
Name: brooder
Version: 0.1.0
Summary: Snapshot testing for AI agents — catch behavior regressions before they ship.
Project-URL: Homepage, https://brooder.dev
Project-URL: Repository, https://github.com/agentbrooder/brooder
Project-URL: Issues, https://github.com/agentbrooder/brooder/issues
Author: Brooder
License: Apache-2.0
License-File: LICENSE
License-File: NOTICE
Keywords: agents,ai,ci,evals,llm,regression,snapshot,testing
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Testing
Classifier: Typing :: Typed
Requires-Python: >=3.10
Requires-Dist: pydantic>=2.5
Requires-Dist: pyyaml>=6.0
Requires-Dist: rich>=13.0
Requires-Dist: typer>=0.12
Provides-Extra: claude-agent
Requires-Dist: claude-agent-sdk>=0.1; extra == 'claude-agent'
Provides-Extra: dev
Requires-Dist: mypy>=1.11; extra == 'dev'
Requires-Dist: pre-commit>=3.7; extra == 'dev'
Requires-Dist: pytest-cov>=4.1; extra == 'dev'
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: ruff>=0.6; extra == 'dev'
Requires-Dist: types-pyyaml>=6.0; extra == 'dev'
Provides-Extra: docs
Requires-Dist: mkdocs-material>=9.5; extra == 'docs'
Requires-Dist: mkdocstrings[python]>=0.25; extra == 'docs'
Provides-Extra: langchain
Requires-Dist: langchain-core>=0.3; extra == 'langchain'
Provides-Extra: openai-agents
Requires-Dist: openai-agents>=0.1; extra == 'openai-agents'
Provides-Extra: otel
Requires-Dist: opentelemetry-exporter-otlp-proto-http>=1.20; extra == 'otel'
Requires-Dist: opentelemetry-sdk>=1.20; extra == 'otel'
Description-Content-Type: text/markdown

<p align="center">
  <img src="assets/banner.svg" alt="Brooder — snapshot testing for AI agents" width="760">
</p>

<p align="center">
  <a href="https://github.com/agentbrooder/brooder/actions/workflows/ci.yml"><img src="https://github.com/agentbrooder/brooder/actions/workflows/ci.yml/badge.svg" alt="CI"></a>
  <a href="https://pypi.org/project/brooder/"><img src="https://img.shields.io/pypi/v/brooder?color=3b82f6" alt="PyPI"></a>
  <a href="https://pypi.org/project/brooder/"><img src="https://img.shields.io/pypi/pyversions/brooder" alt="Python versions"></a>
  <a href="LICENSE"><img src="https://img.shields.io/badge/license-Apache--2.0-blue" alt="License: Apache-2.0"></a>
  <a href="https://github.com/astral-sh/ruff"><img src="https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json" alt="Ruff"></a>
</p>

**Snapshot testing for AI agents. Catch behavior regressions before they ship.**

Your AI agent is one model upgrade away from silently breaking. You bump the model, tweak a
prompt, or change a tool — and the agent starts behaving differently. You find out from a customer.

Brooder is the safety net. Wrap your agent once, and Brooder records its real runs as **golden
baselines**. Every time you change the model, a prompt, or a tool, it re-runs and shows you a
**behavioral diff** — what changed, what broke — and fails your CI if it regressed.

No eval datasets to hand-write. One command. It's `jest --updateSnapshot`, but for agents.

```bash
pip install brooder
```

<p align="center">
  <img src="assets/demo.svg" alt="brooder migrate catching a dropped tool call and a flipped answer" width="760">
</p>

> Status: early alpha, built in public. Apache-2.0.

---

## 60-second demo (no API keys needed)

The included example agent simulates a model upgrade with an env var, so you can see Brooder catch
a real regression completely offline.

```bash
git clone https://github.com/agentbrooder/brooder && cd brooder
pip install -e .

# The signature move: what breaks if I migrate from one model to another?
brooder migrate --from gpt-4o --to gpt-5-new examples/regressing_agent.py
```

Output (abridged):

```
──────────────────────── Model Migration Report ────────────────────────
 1 of 3 cases change behavior when migrating gpt-4o → gpt-5-new.

 support-agent · e1ded4070eee · REGRESSED · stability 40
   path diverged at step 0: was TOOL create_ticket(order=12345), now dropped
   - trajectory[0]  {'name': 'create_ticket', 'args': {'order': '12345'}}
   ~ output
       before: I've started your refund.
       after:  Refunds are not supported.
```

The "new model" silently stopped creating the refund ticket **and** flipped its answer. That would
have shipped to production unnoticed. Brooder caught it — and exited non-zero, so CI would block it.

---

## The normal workflow

```bash
brooder record examples/regressing_agent.py     # capture golden baselines from real runs
brooder run    examples/regressing_agent.py     # re-run after a change, diff vs baseline
brooder diff                                    # see exactly what changed
brooder approve                                 # accept the new behavior as the baseline
```

`brooder run` exits non-zero when behavior regressed — drop it into CI and it gates your PRs.
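
If your CI is driven by a Python script rather than a shell step, the exit code can be checked directly. This is a minimal sketch that assumes only what is stated above (`brooder run` exits non-zero on a regression); the `passes_gate` and `main` helpers are hypothetical and not part of Brooder.

```python
import subprocess
import sys


def passes_gate(command: list[str]) -> bool:
    """Run a command and report whether it exited with code 0."""
    completed = subprocess.run(command, check=False)
    return completed.returncode == 0


def main() -> int:
    # Script path taken from the example above; swap in your own.
    ok = passes_gate(["brooder", "run", "examples/regressing_agent.py"])
    return 0 if ok else 1
```

Calling `sys.exit(main())` from the CI entry point propagates the result to the pipeline.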

---

## Instrument your own agent

Add one decorator. Log tool calls with one function. That's the whole SDK.

```python
import brooder

def search_kb(query):
    brooder.tool_call("search_kb", {"query": query}, result="...")
    return "..."

@brooder.record("support-agent")
def agent(question: str) -> str:
    docs = search_kb(question)
    return answer_from(docs)

# call it over your real inputs; brooder records/replays automatically
```

Then run it through the CLI. Baselines are plain JSON committed to your repo, so diffs show up in
code review like any other change.

---

## Auto-capture (no manual `tool_call`)

Wrap your LLM client and Brooder records the model's tool-call decisions automatically:

```python
import brooder
import openai

client = brooder.instrument(openai.OpenAI())
# now every client.chat.completions.create(...) call is captured while recording
```

Supported providers: **OpenAI**, **Azure OpenAI**, **Anthropic**, **AWS Bedrock**, and
**Google (Gemini / Vertex)**. The provider is auto-detected; override it with
`brooder.instrument(client, provider="bedrock")`. Model *names* are intentionally not diffed, so
switching models isn't itself a change — only the model's *behavior* (which tools it calls, with
what arguments) is.

**Async works too.** `@brooder.record` and `instrument(...)` handle `async def` agents and async
clients (`AsyncOpenAI`, `AsyncAzureOpenAI`, `AsyncAnthropic`, and Google's `generate_content_async`)
with no extra setup. The recording context follows your `await`s and propagates into child tasks:

```python
client = brooder.instrument(openai.AsyncOpenAI())

@brooder.record("support-agent")
async def agent(question: str) -> str:
    await client.chat.completions.create(model="gpt-4o", messages=[...])
    ...
```

(Async AWS Bedrock via aioboto3 isn't covered yet — the sync boto3 client is.)

## Capture from agent frameworks (OpenTelemetry)

Building on an agent framework? If it emits OpenTelemetry GenAI spans — **LangGraph, CrewAI,
AutoGen**, and anything else on the convention — add one span processor and Brooder ingests the
whole trajectory, no manual `tool_call`:

```python
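from opentelemetry import trace
from brooder.integrations.otel import BrooderSpanProcessor

# Requires an SDK TracerProvider to be configured already; the default
# proxy provider has no add_span_processor().
trace.get_tracer_provider().add_span_processor(BrooderSpanProcessor(agent="support-agent"))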
```

It maps inference spans → turns, `execute_tool` spans → tool calls, and the agent-root span's
input/output → the case identity and final answer. It also drops straight into the OTel pipelines
you already run (Datadog / Arize / Honeycomb).
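
The mapping described above can be pictured with plain dictionaries. This is a rough sketch, not Brooder's implementation: the span dict shape and the `spans_to_steps` helper are hypothetical, and the attribute names (`gen_ai.operation.name`, `gen_ai.tool.name`) are assumed from the OpenTelemetry GenAI semantic conventions.

```python
from typing import Any

# Operation names treated as model inference in this sketch (an assumption).
INFERENCE_OPS = {"chat", "text_completion", "generate_content"}


def spans_to_steps(spans: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Turn span-like dicts into an ordered list of turns and tool calls."""
    steps: list[dict[str, Any]] = []
    for span in spans:
        attrs = span.get("attributes", {})
        op = attrs.get("gen_ai.operation.name")
        if op == "execute_tool":
            steps.append({"kind": "tool", "name": attrs.get("gen_ai.tool.name")})
        elif op in INFERENCE_OPS:
            steps.append({"kind": "turn"})
        # Any other span (HTTP calls, agent root, ...) is skipped here.
    return steps
```

Spans that carry neither kind of operation are ignored, so unrelated instrumentation in the same pipeline would not show up as trajectory steps in this sketch.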

Building directly on the **Claude Agent SDK**? Register Brooder's hooks and it records the tool
trajectory automatically:

```python
import brooder
from claude_agent_sdk import ClaudeSDKClient, ClaudeAgentOptions, ResultMessage
from brooder.integrations import claude_agent

options = ClaudeAgentOptions(hooks=brooder.claude_agent_hooks(agent="support-agent"))

async def run(prompt: str) -> None:
    # `async with` must live inside a coroutine; drive it with asyncio.run(run(...)).
    async with ClaudeSDKClient(options=options) as client:
        await client.query(prompt)
        async for msg in client.receive_response():
            if isinstance(msg, ResultMessage):
                claude_agent.record_output(msg.session_id, msg.result)  # optional: capture the answer

`UserPromptSubmit` opens a run (the prompt is the case identity), `PostToolUse` becomes a tool step,
and `Stop` finalizes it.

On the **OpenAI Agents SDK**? Its tracing is on by default — install Brooder's trace processor once
and every run is captured (no OpenAI API key required for capture):

```python
import brooder.integrations.openai_agents as bd_agents

bd_agents.install(agent="support-agent")   # then run your agents as usual
```

It maps generation/response spans → turns and function spans → tool calls, and it records handoffs
and triggered guardrails in the trajectory too, so both tool-selection *and* control-flow
regressions get diffed.

Using **LangChain or LangGraph**? Attach one callback handler — no OpenTelemetry setup required:

```python
import brooder.integrations.langchain as bd_lc

handler = bd_lc.callback_handler(agent="support-agent")
graph.invoke({"messages": [...]}, config={"callbacks": [handler]})
```

The root chain start opens a run (its input is the case identity), model calls become turns, and
tool calls become tool steps — one handler covers both LangChain and LangGraph.

## It tests agents (the whole trajectory), not single LLM calls

`@brooder.record` wraps your **entire agent** — every step of its plan → act → observe loop.
The baseline is the full **trajectory**: every tool call across every turn, in order, plus the
final output. So Brooder catches agent-level regressions, not just token changes in one model
response.

```bash
# A multi-step agent that silently stops verifying before answering on the newer model:
brooder migrate --from gpt-4o --to gpt-5-new examples/loop_agent.py
# -> REGRESSED: trajectory[1] "verify" removed
```

That dropped `verify` step happened *inside the loop* — the kind of thing an LLM-output eval
would never see.
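
To make the idea of a trajectory diff concrete, here is a minimal sketch that finds the first step where two ordered lists of tool calls diverge. The `ToolCall` shape, the `first_divergence` helper, and the `plan` / `answer` step names are hypothetical; only `verify` comes from the example above, and Brooder's real diff is likely richer than this.

```python
from typing import Any, Optional

ToolCall = dict[str, Any]  # hypothetical shape: {"name": ..., "args": {...}}


def first_divergence(baseline: list[ToolCall], current: list[ToolCall]) -> Optional[int]:
    """Return the index of the first step that differs, or None if identical."""
    for i, (old, new) in enumerate(zip(baseline, current)):
        if old != new:
            return i
    if len(baseline) != len(current):
        # One trajectory is a strict prefix of the other.
        return min(len(baseline), len(current))
    return None


baseline = [
    {"name": "plan", "args": {}},
    {"name": "verify", "args": {}},
    {"name": "answer", "args": {}},
]
current = [baseline[0], baseline[2]]  # the "verify" step was dropped
```

Here `first_divergence(baseline, current)` points at index 1, the position where `verify` used to be, even though the first and last steps are unchanged.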

## Why not just use observability / eval tools?

| Tool type | Examples | What it does | The gap Brooder fills |
| --- | --- | --- | --- |
| Observability | Langfuse, Laminar, Phoenix | Trace/monitor **after** it runs | Doesn't gate **before** you ship |
| Eval frameworks | DeepEval, Braintrust, Ragas | Score against **hand-written** datasets | Requires eval authoring nobody maintains |
| **Brooder** | — | **Record real runs → behavioral diff on every change → CI gate** | **Zero eval-writing, catches model-migration regressions** |

---

## Gate your PRs (GitHub Action)

Drop Brooder into CI and it re-runs your agent on every pull request, posts the behavioral diff as
a comment, and fails the check when behavior regresses. Copy
[examples/github-action.yml](examples/github-action.yml) to `.github/workflows/brooder.yml`:

```yaml
on:
  pull_request:

permissions:
  contents: read
  pull-requests: write        # so it can comment the diff

jobs:
  agent-snapshot:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: agentbrooder/brooder@v1
        with:
          script: tests/agent_snapshot.py
```

The comment is upserted (updated in place, not spammed) and uses the same format as the
`--format markdown` output described in the next section.

## Machine-readable output (`--json` / OTLP)

`run`, `ci`, and `diff` take `--format table|json|markdown` (`--json` is a shortcut for
`--format json`). Exit codes are the same in every format, so you can gate *and* parse:

```bash
brooder run agent.py --json | jq '.summary'
# { "total": 3, "passed": 2, "regressed": 1, "flaky": 0, "regressions": 1, "mean_stability": 80 }
```
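
The same summary can be read from Python instead of `jq`. This is a small sketch that assumes the report has a top-level `summary` object with the fields shown above; the `regressed_count` helper is hypothetical, and the rest of the report is omitted from the sample.

```python
import json


def regressed_count(report_json: str) -> int:
    """Return the number of regressed cases from a JSON report string."""
    report = json.loads(report_json)
    return int(report["summary"]["regressed"])


# Sample built from the summary shown above.
sample = json.dumps({
    "summary": {
        "total": 3, "passed": 2, "regressed": 1, "flaky": 0,
        "regressions": 1, "mean_stability": 80,
    }
})
```

In practice the string would come from the captured stdout of `brooder run agent.py --json` rather than a literal.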

For dashboards, point Brooder at any OTLP endpoint and each run emits a snapshot of gauges
(`brooder.cases.*`, `brooder.stability.mean`) — **one exporter** that reaches Datadog, Grafana,
Honeycomb, and CloudWatch:

```bash
pip install 'brooder[otel]'
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318/v1/metrics   # or metrics.otlp_endpoint in brooder.yaml
brooder ci agent.py
```

---

## What it checks

- **Structural diff**: the sequence of tool calls, their arguments, and the final output.
- **Semantic diff**: a pluggable judge (`judge: exact | llm`) so equivalent wording isn't a regression.
- **Flakiness**: `brooder run --runs N` runs each case N times (for example, `--runs 3`) and flags non-determinism (`FLAKY`).

Each case gets a verdict — `PASS` / `REGRESSED` / `NEW` / `FLAKY` — and a stability score.
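
One way to picture how the four verdicts relate is the sketch below. It is an illustration under stated assumptions, not Brooder's logic: the `verdict` helper is hypothetical, it compares results by plain equality (the `exact` judge), and it assumes `FLAKY` takes precedence over `NEW` and `REGRESSED`, which the text above does not specify.

```python
from typing import Callable, Hashable, Optional


def verdict(
    baseline: Optional[Hashable],
    run_case: Callable[[], Hashable],
    runs: int = 3,
) -> str:
    """Classify a case by running it `runs` times and comparing to the baseline."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    results = [run_case() for _ in range(runs)]
    if len(set(results)) > 1:
        return "FLAKY"  # the runs disagree with each other
    if baseline is None:
        return "NEW"  # nothing recorded yet to compare against
    return "PASS" if results[0] == baseline else "REGRESSED"
```

A case whose repeated runs disagree is reported as `FLAKY` before any comparison with the baseline, since a single run of a non-deterministic agent cannot be trusted either way.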

---

## Roadmap

See **[ROADMAP.md](ROADMAP.md)** for what's shipped and what's planned.

## Contributing

See [CONTRIBUTING.md](CONTRIBUTING.md). Issues and PRs welcome — this is being built in public.

## License

[Apache-2.0](LICENSE).
