Metadata-Version: 2.4
Name: mimic-recording
Version: 1.0.0
Summary: The pytest for AI agents. Record, replay, assert, and diff agent behavior.
Author-email: Mimic <team@mimic.dev>
License: MIT
Project-URL: Homepage, https://github.com/mimic-ai/mimic
Project-URL: Documentation, https://docs.mimic.dev
Project-URL: Repository, https://github.com/mimic-ai/mimic
Project-URL: Issues, https://github.com/mimic-ai/mimic/issues
Keywords: ai,agents,testing,llm,evals,replay,observability
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Testing
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pydantic>=2.0
Requires-Dist: click>=8.0
Requires-Dist: rich>=13.0
Requires-Dist: pyyaml>=6.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0; extra == "dev"
Requires-Dist: ruff>=0.1.0; extra == "dev"
Requires-Dist: mypy>=1.0; extra == "dev"
Provides-Extra: anthropic
Requires-Dist: anthropic>=0.18.0; extra == "anthropic"
Provides-Extra: openai
Requires-Dist: openai>=1.0.0; extra == "openai"
Dynamic: license-file

# Mimic

> The pytest for AI agents. Record, replay, assert, and diff agent behavior.

[![PyPI](https://img.shields.io/pypi/v/mimic-ai)](https://pypi.org/project/mimic-ai/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)
[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)

Mimic is an open-source library that lets you **record** an AI agent's behavior, **replay** it deterministically, **assert** properties about it, and **diff** runs across versions. It's the missing testing layer for the agent era.

```python
from openai import OpenAI

from mimic import Mimic, assert_that, replay
from mimic.integrations.openai import tracked_completion

mimic = Mimic()
client = OpenAI()

@mimic.record("customer-support-agent", model="gpt-4o")
def answer(question: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": question}],
    )
    tracked_completion(resp)  # auto-captures tokens + cost
    return resp.choices[0].message.content
```

## Verified performance

| Scenario | Record mode | Replay mode | Savings |
|---|---|---|---|
| 5-test multi-step agent suite | 360 ms | 50 ms | **7× faster** |
| 1000 CI runs of the same suite | ~$2 in LLM cost | **$0** | **100%** |

Run `mimic benchmark --runs 1000` on your own recordings to see your numbers.

## Why Mimic?

Every team building AI agents hits the same wall:

- **"I changed a prompt. Did I break anything?"** — You don't know.
- **"I switched from GPT-4 to Claude. Is it 2x more expensive?"** — You don't know.
- **"Did this agent ever call `delete_file` in production?"** — You don't know.
- **"Why did the agent fail on Tuesday at 3pm?"** — You don't know.

Mimic turns those unknowns into testable, replayable, diffable artifacts. Think **Sentry recordings + pytest assertions + git blame**, purpose-built for LLM agents.

## Features

- ✅ **Record** any callable — sync or async, LLM calls, tool use, multi-step agents
- ✅ **Replay** runs offline with **zero API cost**, byte-for-byte deterministic
- ✅ **Assert** behavioral properties: cost, latency, tool usage, output content
- ✅ **Diff** two runs to see exactly what changed
- ✅ **Auto-track LLM costs** for OpenAI, Anthropic, Gemini (zero-config)
- ✅ **Multi-step agents** with per-step recording, cost, and metadata
- ✅ **Privacy mode** (`capture_args=False`, `capture_return=False`)
- ✅ **Storage-agnostic** — filesystem by default, pluggable for S3/Postgres
- ✅ **Zero LLM vendor lock-in** — works with any model
- ✅ **Beautiful CLI** — `mimic run / list / show / diff / report / benchmark`
- ✅ **CI-ready** — GitHub Actions template + pre-commit hook included

## Install

```bash
pip install mimic-ai
```

Or with optional integrations:

```bash
pip install mimic-ai[openai]
pip install mimic-ai[anthropic]
```

## Quick start

```bash
mkdir my-agent && cd my-agent
mimic init
```

This creates a project skeleton:

```
my-agent/
├── mimic.yaml           # Project config
├── tests/
│   └── test_agent.py    # Your recorded tests
└── .mimic/              # Recorded runs (gitignored by default)
```

Edit `tests/test_agent.py`:

```python
from mimic import Mimic, assert_that, replay

mimic = Mimic()

@mimic.record("my-agent", model="gpt-4o")
def answer(question: str) -> str:
    # ... your LLM call here ...
    return "..."

def test_agent():
    answer("hello")
    recorded = replay("my-agent")
    assert_that(recorded).finished_without_errors()
    assert_that(recorded).cost_less_than(usd=0.05)
    assert_that(recorded).did_not_call_tool("delete_database")
```

Run it:

```bash
mimic run tests/                     # records + runs (incurs LLM cost)
MIMIC_MODE=replay mimic run tests/   # replays only (free, deterministic)
```

## Multi-step agents

For ReAct, multi-agent, or any agent with multiple LLM/tool calls, record each step:

```python
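# `llm` and `web_search` are placeholders for your own LLM client and search helper;
# `mimic` and `tracked_completion` are set up as in the first example.
@mimic.record("research-agent")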
async def research(question: str) -> str:
    # Step 1: plan
    with mimic.step("plan", model="gpt-4o-mini") as s:
        resp = await llm.complete(model="gpt-4o-mini", messages=[...])
        tracked_completion(resp)
        s.metadata["plan_steps"] = 3

    # Step 2: search
    with mimic.step("search") as s:
        results = await web_search(question)
        s.metadata["result_count"] = len(results)

    # Step 3: synthesize
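    with mimic.step("synthesize", model="gpt-4o") as s:
        resp = await llm.complete(model="gpt-4o", messages=[...])
        tracked_completion(resp)
        # Assumes an OpenAI-style response object, as in the first example.
        summary = resp.choices[0].message.content

    return summary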
```

## Assertions

The available assertions are listed below. Each returns `self`, so calls can be chained:

```python
assert_that(run).finished_without_errors()
assert_that(run).had_error()                   # inverse
assert_that(run).cost_less_than(usd=0.05)
assert_that(run).completed_under(ms=2000)
assert_that(run).output_contains("substring")
assert_that(run).output_matches(r"regex")
assert_that(run).output_equals(value)
assert_that(run).called_tool("search")
assert_that(run).did_not_call_tool("delete_database")
assert_that(run).called_tools(["search", "synthesize"])
assert_that(run).had_exactly(3)
assert_that(run).had_at_least(2)
assert_that(run).used_model("gpt-4o")
```
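
To illustrate why returning `self` matters, here is a minimal sketch of the fluent pattern. It is not Mimic's implementation: `FakeRun`, `RunAssertions`, and `check_that` are hypothetical stand-ins, and only three method names are borrowed from the list above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FakeRun:
    """Hypothetical stand-in for a recorded run; not Mimic's real type."""
    output: str
    cost_usd: float
    tools: List[str] = field(default_factory=list)
    error: Optional[str] = None


class RunAssertions:
    """Each check raises AssertionError on failure and returns self on success."""

    def __init__(self, run: FakeRun) -> None:
        self.run = run

    def _check(self, ok: bool, message: str) -> "RunAssertions":
        if not ok:
            raise AssertionError(message)
        return self

    def finished_without_errors(self) -> "RunAssertions":
        return self._check(self.run.error is None, f"run failed: {self.run.error}")

    def cost_less_than(self, usd: float) -> "RunAssertions":
        return self._check(self.run.cost_usd < usd, f"cost {self.run.cost_usd} >= {usd}")

    def did_not_call_tool(self, name: str) -> "RunAssertions":
        return self._check(name not in self.run.tools, f"unexpected tool call: {name}")


def check_that(run: FakeRun) -> RunAssertions:
    """Hypothetical entry point, deliberately not named assert_that."""
    return RunAssertions(run)


run = FakeRun(output="refund issued", cost_usd=0.01, tools=["search"])
checked = (
    check_that(run)
    .finished_without_errors()
    .cost_less_than(usd=0.05)
    .did_not_call_tool("delete_database")
)
```

Because every check returns the same object, the first failing check raises and stops the chain, which keeps a test's intent on a single expression.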

## How it works

Mimic sits **outside** your agent code, watching the inputs and outputs of any function you decorate. The first time the function runs, Mimic records the full execution into a content-addressable store. Subsequent runs in replay mode (`MIMIC_MODE=replay`) use the stored record instead of calling the LLM, making them fast, free, and deterministic.

For multi-step agents, Mimic records each step separately, so you can replay just the broken step without re-running the whole agent.
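
The record-then-replay idea can be sketched in a few lines. This is a conceptual illustration only, not Mimic's code: the `record_or_replay` decorator, the `content_key` helper, and the on-disk layout are all assumptions made for the example, and the real library selects replay explicitly rather than automatically.

```python
import functools
import hashlib
import json
import tempfile
from pathlib import Path


def content_key(name: str, args: tuple, kwargs: dict) -> str:
    """Hash the function name and its inputs into a stable key."""
    payload = json.dumps([name, args, kwargs], sort_keys=True, default=repr)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def record_or_replay(name: str, store: Path):
    """Hypothetical decorator: run once, then serve the stored result."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            path = store / f"{content_key(name, args, kwargs)}.json"
            if path.exists():
                # Replay: return the stored output without calling fn.
                return json.loads(path.read_text(encoding="utf-8"))["output"]
            # Record: call fn and persist its output under the content key.
            output = fn(*args, **kwargs)
            store.mkdir(parents=True, exist_ok=True)
            path.write_text(json.dumps({"name": name, "output": output}), encoding="utf-8")
            return output
        return wrapper
    return decorator


calls = []
store_dir = Path(tempfile.mkdtemp())


@record_or_replay("demo-agent", store_dir)
def answer(question: str) -> str:
    calls.append(question)  # stands in for a paid LLM call
    return f"echo: {question}"


first = answer("hello")   # recorded
second = answer("hello")  # replayed from disk
```

Keying the record on a hash of the inputs means identical calls map to the same file, which is what lets a replay skip the underlying call entirely.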

## CI integration

Drop the included `.github/workflows/test.yml` into your repo. It runs your test suite in replay mode (no LLM cost) and validates that no cost was incurred.

Re-recording is a separate job, triggered manually via `workflow_dispatch` or on a schedule.
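
As an extra safeguard, a test suite could refuse to start in CI unless replay mode is set. The sketch below is an assumption, not part of the included workflow: `require_replay_mode` is a hypothetical helper, and it relies only on the `MIMIC_MODE=replay` setting shown earlier and on the `CI=true` variable that GitHub Actions sets by default.

```python
from typing import Mapping


def require_replay_mode(env: Mapping[str, str]) -> None:
    """Raise if a CI job is about to run outside replay mode."""
    in_ci = env.get("CI", "").lower() == "true"
    if in_ci and env.get("MIMIC_MODE") != "replay":
        raise RuntimeError("CI must run with MIMIC_MODE=replay to avoid LLM cost")


# Local run: no CI variable, so the guard does nothing.
require_replay_mode({})
# CI run in replay mode passes.
require_replay_mode({"CI": "true", "MIMIC_MODE": "replay"})
```

Calling `require_replay_mode(os.environ)` from a `conftest.py` would be one place to wire this in. It checks the mode up front, which is a simpler check than validating after the fact that no cost was incurred.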

## The recording format

Mimic recordings are plain JSON conforming to a documented schema — see [`RECORDING_FORMAT.md`](RECORDING_FORMAT.md). The format is **vendor-neutral**: you can build readers, web UIs, or analysis tools without depending on the Mimic library.
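
Since recordings are plain JSON, a reader needs only the standard library. The following sketch lists the top-level keys of each file rather than assuming any field names; the `top_level_keys` helper and the `*.json` file layout are assumptions, so consult `RECORDING_FORMAT.md` for the actual schema.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def top_level_keys(root: Path) -> Dict[str, List[str]]:
    """Map each JSON file under root to its sorted top-level keys."""
    summary: Dict[str, List[str]] = {}
    for path in sorted(root.rglob("*.json")):
        data = json.loads(path.read_text(encoding="utf-8"))
        if isinstance(data, dict):
            summary[path.relative_to(root).as_posix()] = sorted(data)
    return summary


# Demo with a made-up recording; these field names are placeholders,
# not the documented schema.
demo_root = Path(tempfile.mkdtemp())
(demo_root / "run-1.json").write_text(
    json.dumps({"name": "my-agent", "output": "hi"}), encoding="utf-8"
)
summary = top_level_keys(demo_root)
```

Pointing `top_level_keys` at a project's recordings directory (for example `Path(".mimic")`, if that is where your recordings live) gives a quick view of what each file contains before writing a fuller reader.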

## The $100M thesis

Mimic sits at the intersection of three exploding markets:

1. **AI agent development** — 10M+ developers will build agents by 2027.
2. **AI observability** — already a $2B+ market, dominated by closed vendors (LangSmith, Helicone, Langfuse).
3. **AI safety & compliance** — every enterprise deploying agents needs guardrails, audit trails, and replay.

The land-and-expand model is proven (Sentry, Supabase, GitLab, Vercel, PostHog): open source core → community growth → enterprise tier with self-hosted, SSO, audit logs, and SOC2.

See [`BUSINESS_PLAN.md`](BUSINESS_PLAN.md) for the full strategy.

## Roadmap

- [x] v0.1 — Record/replay/assert core
- [x] v0.2 — Async + multi-step + OpenAI/Anthropic cost tracking
- [ ] v0.3 — Web UI for browsing recorded runs
- [ ] v0.4 — TypeScript SDK
- [ ] v0.5 — Auto-generated regression tests from production traces
- [ ] v0.6 — Multi-agent parent/child traces
- [ ] v1.0 — Enterprise self-hosted edition

## Contributing

We love contributions. See [`CONTRIBUTING.md`](CONTRIBUTING.md).

## License

MIT — see [`LICENSE`](LICENSE).
