Metadata-Version: 2.4
Name: claude-agent-cassette
Version: 0.3.0
Summary: Record & replay the claude-agent-sdk wire for deterministic, offline tests.
Project-URL: Homepage, https://github.com/oneryalcin/claude-agent-cassette
Author: Mehmet Öner Yalçın
License: MIT
License-File: LICENSE
Keywords: cassette,claude,claude-agent-sdk,replay,testing,vcr
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Software Development :: Testing
Requires-Python: >=3.10
Requires-Dist: claude-agent-sdk<0.3,>=0.2.82
Requires-Dist: typing-extensions>=4.0
Provides-Extra: dev
Requires-Dist: pytest-asyncio>=0.23; extra == 'dev'
Requires-Dist: pytest>=8; extra == 'dev'
Description-Content-Type: text/markdown

# Claude Agent Cassette

Record & replay the [`claude-agent-sdk`](https://github.com/anthropics/claude-agent-sdk-python)
wire for **deterministic, offline tests** — no API key, no subprocess, no mocks.

## Why

Apps built on `claude-agent-sdk` read a stream of typed messages (assistant turns,
tool results, task notifications, control-protocol frames) and drive logic off
them. The nasty bugs live at that **stream → your-handler seam**: the SDK emits a
slightly different *shape* than you expected, and your handler quietly does the
wrong thing.

Mocked tests can't catch this — you build the mock, so you only test your
understanding of your own mock. A cassette records the **real** wire once and
replays it through the SDK's **real** parser, so:

- a shape change in the SDK turns your test red instead of shipping to prod;
- tests run with no API cost, no network, no `claude` subprocess;
- the replayed frames go through the genuine `message_parser`, not a stand-in.

```
  PRODUCTION:   real CLI ──raw frames──► SDK parser ──► your code
                                              ▲
  REPLAY:       ReplayTransport ──raw frames──┘   (same parser, same code)
```

## Install

```bash
pip install claude-agent-cassette   # (or: uv add claude-agent-cassette)
```

## Replay (the common case — offline, no key)

```python
from claude_agent_cassette import replay, load_frames

async def test_my_handler():
    async with replay(load_frames("tests/cassettes/happy_path.jsonl")) as client:
        kinds = []
        async for m in client.receive_messages():
            kinds.append(type(m).__name__)
            if kinds[-1] == "ResultMessage":
                break  # stream stays open after the result; break like the real wire
        assert "ResultMessage" in kinds
        # ...or feed client.receive_messages() into your own dispatcher and
        #    assert on what it produces.
```

A frames file is JSONL of raw inbound stream-json frames — the exact dicts the
CLI emits. `replay()` injects them into a real `ClaudeSDKClient` and answers the
SDK's `initialize` control handshake for you. (Vocabulary: a **frame** is a raw
wire dict; a **message** is the typed object the SDK parses it into; a **tape**
is a full duplex recording.)
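
To make the file format concrete, here is a minimal stdlib-only sketch of writing and reading a JSONL frames file. The helper names `write_frames` and `read_frames` are hypothetical (the library's own loader is `load_frames`), and the two sample frames are abridged, hand-written illustrations rather than recorded output.

```python
import json
import tempfile
from pathlib import Path

RawFrame = dict  # a raw wire dict, stored as one JSON object per line


def write_frames(frames: list, path: Path) -> None:
    """Write one JSON object per line (hypothetical helper)."""
    with path.open("w", encoding="utf-8") as fh:
        for frame in frames:
            fh.write(json.dumps(frame, ensure_ascii=False) + "\n")


def read_frames(path: Path) -> list:
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


# Illustrative, abridged frames; real recorded frames carry more fields.
SAMPLE_FRAMES = [
    {"type": "assistant", "message": {"content": [{"type": "text", "text": "Hello!"}]}},
    {"type": "result", "subtype": "success"},
]

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        cassette = Path(tmp) / "happy_path.jsonl"
        write_frames(SAMPLE_FRAMES, cassette)
        print(read_frames(cassette) == SAMPLE_FRAMES)
```

Because each line is an independent JSON object, a cassette diffs cleanly in version control and can be inspected with ordinary line-oriented tools.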

## Record (capture a real session)

`record()` works with **both** SDK entry points — the one-shot `query()`
and the interactive `ClaudeSDKClient` (it patches both transport-construction
sites the SDK uses):

```python
import asyncio

from claude_agent_cassette import record, save_tape

# one-shot query()
from claude_agent_sdk import query


async def main() -> None:
    with record() as tape:                  # tees the full duplex wire
        async for _ in query(prompt="...", options=...):
            pass
    save_tape(tape, "session.jsonl")


asyncio.run(main())
```

```python
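# interactive ClaudeSDKClient
import asyncio

from claude_agent_cassette import record, save_tape
from claude_agent_sdk import ClaudeAgentOptions, ClaudeSDKClient


async def main() -> None:
    with record() as tape:
        async with ClaudeSDKClient(options=ClaudeAgentOptions()) as client:
            await client.query("...")
            async for m in client.receive_messages():
                if type(m).__name__ == "ResultMessage":
                    break  # the stream stays open after the result
    save_tape(tape, "session.jsonl")


asyncio.run(main())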
```

`record()` captures **both directions, including the control plane**
(`control_request`/`control_response`, `mcp_message`, `hook_callback`, the
handshake), so one recording can feed both conversation replay and
control-protocol replay. Derive the conversation-only frames with
`conversation_frames(tape)`.
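
As a rough illustration of what "deriving conversation-only frames" could mean, the sketch below filters a toy tape. The entry layout (`direction` and `frame` keys) and the helper names are assumptions made for this example; the library's actual `TapeEntry` shape may differ, and `conversation_frames(tape)` remains the supported way to do this.

```python
# Frame types that belong to the control plane (named in the text above).
CONTROL_TYPES = {"control_request", "control_response"}


def inbound_only(entries: list) -> list:
    """Keep the frames that travelled from the CLI to the SDK (assumed layout)."""
    return [e["frame"] for e in entries if e["direction"] == "inbound"]


def conversation_only(entries: list) -> list:
    """Inbound frames minus control-plane traffic."""
    return [f for f in inbound_only(entries) if f.get("type") not in CONTROL_TYPES]


# A hand-written toy tape; field names are illustrative assumptions.
SAMPLE_TAPE = [
    {"direction": "outbound",
     "frame": {"type": "control_request", "request_id": "req_1",
               "request": {"subtype": "initialize"}}},
    {"direction": "inbound",
     "frame": {"type": "control_response",
               "response": {"subtype": "success", "request_id": "req_1"}}},
    {"direction": "inbound",
     "frame": {"type": "assistant", "message": {"content": []}}},
    {"direction": "inbound",
     "frame": {"type": "result", "subtype": "success"}},
]

if __name__ == "__main__":
    print([f["type"] for f in conversation_only(SAMPLE_TAPE)])
```

The point of keeping one duplex recording is visible here: the control frames are still in the tape for control-protocol replay, while the filtered view serves plain conversation replay.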

## Drift detection (gate SDK bumps)

Re-parse a cassette's message frames through the **installed** SDK's own
`message_parser`. A frame that no longer parses — or whose content blocks the
parser silently drops — is flagged. Because it reuses the SDK's own parser, there
is no schema to maintain: the judge is the thing being judged.

```bash
claude-agent-cassette drift tests/cassettes/      # *.jsonl files, or dirs of them
```

```text
drift: 5 cassette(s) vs claude-agent-sdk 0.2.87

  ok    happy_path.jsonl
  DRIFT stop_midtask.jsonl — 1 frame(s):
          frame[3] assistant: content_dropped — 1 of 2 content block(s) dropped on parse
  ok    notification.jsonl

5 checked, 1 drifted (1 frame) — re-record the drifted cassettes.
```

- Exits **non-zero on drift** — use it to gate an SDK-bump PR in CI.
- **Fails closed**: if no cassette files are found it exits non-zero (a mispointed
  path can't pass as a false green); pass `--allow-empty` to override.
- **Two cassette layouts** in a directory: *flat* (top-level `*.jsonl`) or *nested*
  (`<name>/input.jsonl`, where each cassette is a dir holding the recording plus
  sidecars). Nested is auto-detected; only `input.jsonl` is checked, so sibling
  `expected.jsonl` / `meta.json` are ignored, and a drift row is named by the
  cassette dir. Use `--input-name FILE` for a different recording filename. A dir
  mixing both layouts is rejected (it can't silently check only half).
- Three drift signals: `parse_error` (the parser rejected the frame), `unrecognized_type`
  (the message type is gone), `content_dropped` (a content block silently vanished).
- **Scope**: catches *parse-level* drift (rejected/skipped frames) + dropped content
  blocks. It does **not** catch additive *field-level* drift (a still-parsing frame
  that gained a field) — see [ROADMAP.md](ROADMAP.md).
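
The flat/nested/mixed rules above can be sketched with `pathlib`. This is an illustration of the described behaviour under my reading of it, not the CLI's implementation; `find_recordings` and `MixedLayoutError` are hypothetical names.

```python
import tempfile
from pathlib import Path


class MixedLayoutError(ValueError):
    """Raised when a directory holds both flat and nested cassettes."""


def find_recordings(root: Path, input_name: str = "input.jsonl") -> dict:
    """Map cassette name -> recording path for a flat or nested directory."""
    flat = sorted(p for p in root.glob("*.jsonl") if p.is_file())
    nested = sorted(p for p in root.glob(f"*/{input_name}") if p.is_file())
    if flat and nested:
        # Refuse rather than silently checking only half of the cassettes.
        raise MixedLayoutError(f"{root} mixes flat and nested cassettes")
    if nested:
        # A nested cassette is named by its directory; sidecars are ignored.
        return {p.parent.name: p for p in nested}
    return {p.name: p for p in flat}


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "stop_midtask").mkdir()
        (root / "stop_midtask" / "input.jsonl").write_text("{}\n", encoding="utf-8")
        (root / "stop_midtask" / "expected.jsonl").write_text("{}\n", encoding="utf-8")
        print(sorted(find_recordings(root)))
```

An empty result from a helper like this is the case the CLI treats as a failure unless `--allow-empty` is passed.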

In Python: `parse_drift(frames)` / `check_drift(tape)` → `list[DriftFinding]`.
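
To show how the three signals differ, here is a toy model of the check. It uses a deliberately tiny stand-in parser, not the SDK's `message_parser`, and the `Finding` class, the known-type sets, and the frame shapes are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional

KNOWN_TYPES = {"assistant", "result"}   # message types the toy parser accepts
KNOWN_BLOCKS = {"text", "tool_use"}     # content blocks the toy parser keeps


@dataclass
class Finding:
    index: int
    kind: str    # "parse_error" | "unrecognized_type" | "content_dropped"
    detail: str


def toy_parse(frame: dict) -> Optional[list]:
    """Return the kept content blocks, or None for an unknown message type."""
    kind = frame["type"]                      # a missing key is a parse error
    if kind not in KNOWN_TYPES:
        return None
    if kind != "assistant":
        return []
    blocks = frame["message"]["content"]      # a missing key is a parse error
    return [b for b in blocks if b.get("type") in KNOWN_BLOCKS]


def find_drift(frames: list) -> list:
    """Re-parse every frame and report the ones that no longer round-trip."""
    findings = []
    for index, frame in enumerate(frames):
        try:
            kept = toy_parse(frame)
        except (KeyError, TypeError, AttributeError) as exc:
            findings.append(Finding(index, "parse_error", repr(exc)))
            continue
        if kept is None:
            findings.append(Finding(index, "unrecognized_type", str(frame["type"])))
            continue
        raw = frame["message"]["content"] if frame["type"] == "assistant" else []
        dropped = len(raw) - len(kept)
        if dropped:
            detail = f"{dropped} of {len(raw)} content block(s) dropped on parse"
            findings.append(Finding(index, "content_dropped", detail))
    return findings
```

Comparing the block count before and after parsing is what surfaces the silent case: the frame still parses, so only the count reveals that something vanished.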

## Control-protocol replay (the duplex wire)

`replay_tape(tape, mode=...)` replays a full duplex recording, including the control
plane, through a real `ClaudeSDKClient`. Break at the terminal `ResultMessage` (the
stream stays open after it, like the real wire):

```python
from claude_agent_cassette import replay_tape, load_tape

async def test_permission_flow():
    async with replay_tape(load_tape("session.jsonl"), mode="stub") as client:
        async for m in client.receive_messages():
            if type(m).__name__ == "ResultMessage":
                break
```

- **`mode="inert"`** (default) — conversation + **Direction-A** control replay: the
  `initialize` / `mcp_status` handshakes are answered from the recording; inbound
  **Direction-B** requests (`can_use_tool` / `hook_callback` / `mcp_message`) are
  dropped, so your registered callbacks stay inert.
- **`mode="stub"`** — also replay **Direction-B**: the recorded requests are delivered
  to the SDK and answered from the tape by stubs that **replace** your `can_use_tool` /
  hooks / SDK MCP servers (for `mcp_message`, a real in-process MCP server is
  synthesized from the recorded `initialize` / `tools/list` / `tools/call` traffic).
  Deterministic and inert — it certifies the recorded *wire*, not your policy.
- **`mode="verify"`** — the recorded Direction-B requests are delivered to **your real**
  `can_use_tool` / `hooks` / SDK MCP servers (nothing is replaced), and on exit each
  live decision is diffed against the recorded one — matched by `request_id`, at the
  wire. This certifies your *policy* still produces the recorded decisions: a changed
  decision or tool result, a callback that now raises (or no longer does), or an
  unanswered exchange is divergence.
- **Fail-closed end-to-end.** In `"stub"` and `"verify"` modes, any divergence from the
  tape — a live request with no recorded match, an exhausted or error decision, hook ids
  the SDK didn't reproduce, a live decision that differs from the recording, or recorded
  exchanges left unreplayed — raises `CassetteMismatchError` when the `async with` exits.
  (The SDK swallows callback exceptions into error responses, so the divergence is
  collected and surfaced on exit, not inside the callback.) A Direction-B subtype with
  no replay support (one a future SDK adds) raises up front — use `mode="inert"`.
- **Recording** a Direction-B tape needs the control decisions preserved. `scrub_tape(tape,
  replacements)` blanks PII *values* while keeping decisions intact; `lint_tape(tape)`
  checks whether a tape is still replayable (run it after scrubbing). See
  [`examples/record_permission_session.py`](examples/record_permission_session.py).
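
The "collect divergence, raise on exit" behaviour described for `"verify"` mode can be modelled in a few lines. This is a conceptual sketch only: `DecisionLedger` and `MismatchError` are hypothetical names (the library raises `CassetteMismatchError`), and the decision dicts are illustrative.

```python
class MismatchError(AssertionError):
    """Hypothetical stand-in for a replay-divergence error."""


class DecisionLedger:
    """Collect live decisions and compare them with recorded ones on exit."""

    def __init__(self, recorded: dict) -> None:
        self.recorded = recorded   # request_id -> recorded decision
        self.live = {}             # request_id -> decision produced this run

    def answer(self, request_id: str, decision: dict) -> None:
        # Only record here; raising inside a callback would be swallowed.
        self.live[request_id] = decision

    def divergences(self) -> list:
        problems = []
        for request_id, decision in self.live.items():
            if request_id not in self.recorded:
                problems.append(f"{request_id}: live request with no recorded match")
            elif decision != self.recorded[request_id]:
                problems.append(f"{request_id}: decision differs from recording")
        for request_id in self.recorded:
            if request_id not in self.live:
                problems.append(f"{request_id}: recorded exchange left unreplayed")
        return problems

    def __enter__(self) -> "DecisionLedger":
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        if exc_type is None:
            problems = self.divergences()
            if problems:
                raise MismatchError("; ".join(problems))
        return False  # never suppress an exception from the body
```

Deferring the check to `__exit__` mirrors the reason given above: an exception raised inside a callback would be converted into an error response, so the mismatch has to be reported once the session ends.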

## Examples

[`examples/`](examples/) has a runnable, no-key demo:

```bash
python examples/replay_cassette.py
# AssistantMessage:
# ResultMessage: Hello! How can I help?
```

It replays the saved [`examples/cassettes/hello_world.jsonl`](examples/cassettes/hello_world.jsonl)
through a real `ClaudeSDKClient`. (That cassette is a small, illustrative
hand-written sample with realistic wire shapes; real cassettes are *recorded* —
see above.)

The three recorder scripts each capture one Direction-B subtype as a
decision-preserving, scrubbed fixture (they spend a small API call to re-record;
the committed fixtures in [`examples/cassettes/`](examples/cassettes/) replay
offline):

- [`record_permission_session.py`](examples/record_permission_session.py) — `can_use_tool`
  (allow, allow + `updatedInput` redirect, deny)
- [`record_hooks_session.py`](examples/record_hooks_session.py) — `hook_callback` (PreToolUse)
- [`record_mcp_session.py`](examples/record_mcp_session.py) — `mcp_message`
  (in-process MCP calculator; one normal + one `is_error` tool result)

## API

| Name | Purpose |
| --- | --- |
| `record()` | CM that wraps the SDK's transport to capture a session's full duplex wire as a tape |
| `replay(frames, options=None)` | async CM → a connected `ClaudeSDKClient` replaying raw frames |
| `replay_tape(tape, options=None, mode=...)` | async CM → replay a full duplex tape incl. the control plane; `ReplayMode = "inert" \| "stub" \| "verify"` |
| `save_tape(tape, path)` / `load_tape(path)` | tape I/O (JSONL) |
| `load_frames(path)` | load a frames file for `replay()` |
| `inbound_frames(tape)` / `conversation_frames(tape)` | derive frame views from a tape (all inbound / conversation-only) |
| `direction_b_exchanges(tape)` → `{subtype: [ControlExchange]}` | inspect the recorded Direction-B decisions (what was allowed/denied/answered) |
| `scrub_tape(tape, replacements)` | decision-preserving PII scrub for sharing a recording |
| `lint_tape(tape)` | lint a tape for Direction-B replayability (run after scrubbing) |
| `check_drift(tape)` / `parse_drift(frames)` → `list[DriftFinding]` | drift findings vs the installed SDK |
| `ReplayTransport(frames)` / `.from_tape(tape, keep_subtypes=None)` | the transport under `replay`/`replay_tape`, for wiring a client by hand |
| `RecordingTransport(inner, tape)` | passive MITM tee, both directions |
| `CassetteMismatchError` | replay diverged from the recording (always fail-closed) |
| `TapeEntry` / `Frame` | the tape entry and raw-frame types |
| `claude-agent-cassette drift <path…>` | CLI drift gate (non-zero on drift / empty) |

## How it works (the non-obvious bits)

- **Replay rides the public `Transport` ABC** (`ClaudeSDKClient(transport=...)`,
  stable since SDK 0.0.22), so it does not depend on SDK internals.
- **The initialize handshake**: `connect()` writes a `control_request` with a
  fresh `request_id` and blocks until it sees a `control_response` echoing it. So
  `ReplayTransport` reads that id off `write()` and synthesises the response —
  otherwise replay hangs.
- **Record patches two sites**: `ClaudeSDKClient` does a call-time import of the
  transport from its source module, while one-shot `query()` uses the name bound
  in `_internal.client`. Patching only one silently misses the other.
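
The handshake trick from the list above can be shown with a toy transport. This is a simplified model, not `ReplayTransport`: the class and function names are hypothetical, and the control-frame field names reflect my understanding of the wire rather than a confirmed schema.

```python
import asyncio
import json
import uuid
from collections import deque


class ToyReplayTransport:
    """Toy stand-in: replays canned frames and answers `initialize` itself."""

    def __init__(self, frames: list) -> None:
        self._pending = deque()        # synthesized control responses
        self._frames = deque(frames)   # recorded conversation frames

    async def write(self, data: str) -> None:
        message = json.loads(data)
        request = message.get("request", {})
        if message.get("type") == "control_request" and request.get("subtype") == "initialize":
            # Echo the fresh request_id back, as the waiting client expects.
            self._pending.append({
                "type": "control_response",
                "response": {"subtype": "success",
                             "request_id": message["request_id"],
                             "response": {}},
            })

    async def read(self):
        while self._pending:
            yield self._pending.popleft()
        while self._frames:
            yield self._frames.popleft()


async def toy_connect(transport: ToyReplayTransport) -> str:
    """Send `initialize` and wait for a response echoing the same id."""
    request_id = f"req_{uuid.uuid4().hex[:8]}"
    await transport.write(json.dumps({
        "type": "control_request",
        "request_id": request_id,
        "request": {"subtype": "initialize"},
    }))
    async for frame in transport.read():
        if frame.get("type") == "control_response" and \
                frame["response"]["request_id"] == request_id:
            return request_id
    raise RuntimeError("handshake never answered")  # a real client would hang


if __name__ == "__main__":
    print(asyncio.run(toy_connect(ToyReplayTransport([{"type": "result"}]))))
```

Because the id is generated per connection, a recorded response cannot simply be replayed verbatim; the transport has to read the id from the outgoing request first.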

## Compatibility

Replay uses only the public `Transport` API. **Record and drift reach into
`claude_agent_sdk._internal`** (the subprocess transport, control-protocol shape,
and `message_parser`), so they are version-sensitive. This release targets
`claude-agent-sdk` 0.2.x (`>=0.2.82,<0.3`). Pin your SDK and re-verify on bumps. (Drift being
version-sensitive is the point: it tells you *when* a bump broke a cassette.)
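
One way to make the pin explicit in a test suite is a small guard that compares the installed SDK version with the range this release declares (`>=0.2.82,<0.3`). The helper names are hypothetical, and the parser is a sketch that assumes plain dotted release numbers with no pre-release suffixes.

```python
from importlib import metadata
from typing import Optional

SUPPORTED_MIN = (0, 2, 82)   # inclusive lower bound, from Requires-Dist
SUPPORTED_MAX = (0, 3)       # exclusive upper bound


def parse_version(text: str) -> tuple:
    """Parse a plain dotted release such as '0.2.87' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def in_supported_range(text: str) -> bool:
    """True when the version falls inside the declared SDK range."""
    return SUPPORTED_MIN <= parse_version(text) < SUPPORTED_MAX


def installed_sdk_version() -> Optional[str]:
    """Return the installed claude-agent-sdk version, or None if absent."""
    try:
        return metadata.version("claude-agent-sdk")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_sdk_version()
    if version is None:
        print("claude-agent-sdk is not installed")
    else:
        print(version, "supported" if in_supported_range(version) else "outside range")
```

A guard like this only reports that the SDK moved; the drift command is what tells you whether a given cassette actually broke.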

## Roadmap

See [ROADMAP.md](ROADMAP.md). Shipped: conversation replay, recording,
**Direction-A control replay** (`ReplayTransport.from_tape`), **drift detection**,
**Direction-B replay for all three subtypes** (`can_use_tool` / `hook_callback` /
`mcp_message`, in both `mode="stub"` and `mode="verify"`), and a
**decision-preserving scrub** (`scrub_tape`). Next up: `interrupt` lockstep, a pytest
plugin with record-on-miss, and field-level drift.

## License

MIT.
