Metadata-Version: 2.4
Name: agentproof
Version: 0.1.0
Summary: pytest-based behavioral testing framework for AI agents
Project-URL: Homepage, https://sarakdahal.com
Project-URL: Documentation, https://github.com/praxiumlabs/agentproof#readme
Project-URL: Repository, https://github.com/praxiumlabs/agentproof
Project-URL: Issues, https://github.com/praxiumlabs/agentproof/issues
Author-email: Sarak Dahal <dahal9sarak@gmail.com>
License-Expression: MIT
License-File: LICENSE
Keywords: agents,ai,behavioral-testing,llm,observability,pytest,testing
Classifier: Development Status :: 4 - Beta
Classifier: Framework :: Pytest
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Testing
Classifier: Typing :: Typed
Requires-Python: >=3.10
Requires-Dist: pydantic>=2.0
Provides-Extra: all
Requires-Dist: crewai>=0.1; extra == 'all'
Requires-Dist: langchain-core>=0.1; extra == 'all'
Requires-Dist: openai>=1.0; extra == 'all'
Requires-Dist: opentelemetry-api>=1.0; extra == 'all'
Requires-Dist: opentelemetry-sdk>=1.0; extra == 'all'
Provides-Extra: crewai
Requires-Dist: crewai>=0.1; extra == 'crewai'
Provides-Extra: dev
Requires-Dist: mypy>=1.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.0; extra == 'dev'
Requires-Dist: pytest>=7.0; extra == 'dev'
Requires-Dist: ruff>=0.1; extra == 'dev'
Provides-Extra: langchain
Requires-Dist: langchain-core>=0.1; extra == 'langchain'
Provides-Extra: openai
Requires-Dist: openai>=1.0; extra == 'openai'
Provides-Extra: otel
Requires-Dist: opentelemetry-api>=1.0; extra == 'otel'
Requires-Dist: opentelemetry-sdk>=1.0; extra == 'otel'
Description-Content-Type: text/markdown

# AgentProof

**pytest-based behavioral testing for AI agents.**

[![PyPI version](https://img.shields.io/pypi/v/agentproof.svg)](https://pypi.org/project/agentproof/)
[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](https://opensource.org/licenses/MIT)
[![Tests](https://img.shields.io/github/actions/workflow/status/praxiumlabs/agentproof/ci.yml?label=tests)](https://github.com/praxiumlabs/agentproof/actions)

```bash
pip install agentproof
```

**One line. No config. Works with any agent framework.**

---

## The Problem

You can observe your agents. You can trace them. You can log every token.

**But can you *test* them?**

89% of teams have agent observability. Only 52% have agent evaluation. Zero have behavioral testing in CI.

Your agent calls the wrong tool? Ships to production. Costs $40 on a $0.50 task? Ships to production. Hallucinates the answer? Ships. To. Production.

## The Solution

AgentProof brings **pytest-style behavioral testing** to AI agents. Test *what your agent does*, not just what it outputs.

```python
# test_booking_agent.py
from agentproof import assert_tool_called, assert_tool_order, assert_max_cost

def test_booking_agent_searches_before_booking():
    run = make_booking_run()  # placeholder for however you execute your agent and capture its run

    # Did it use the right tools?
    assert_tool_called(run, "search_flights")
    assert_tool_called(run, "book_flight")

    # Did it use them in the right order?
    assert_tool_order(run, ["search_flights", "compare_prices", "book_flight"])

    # Did it stay within budget?
    assert_max_cost(run, max_usd=0.50)
```

```
$ pytest
========================= test session starts =========================
test_booking_agent.py::test_booking_agent_searches_before_booking PASSED
test_booking_agent.py::test_cost_stays_under_budget PASSED
test_booking_agent.py::test_no_hallucination PASSED
========================= 3 passed in 0.04s ============================
```

## Quickstart

### 1. Install

```bash
pip install agentproof
```

### 2. Record an agent run

```python
import agentproof

@agentproof.record
def my_agent(prompt: str) -> str:
    # Your agent code here
    agentproof.add_tool_call("search", arguments={"query": prompt})
    agentproof.add_llm_call("gpt-4o", prompt_tokens=500, completion_tokens=200)
    return "The answer is 42"

result = my_agent("What is the meaning of life?")
run = my_agent.last_run
```
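
For intuition, a recording decorator of this kind can be sketched in plain Python. This is not AgentProof's implementation: `Run`, `record`, and `add_tool_call` below are hypothetical stand-ins that only mirror the names used above, and the real library may track state quite differently.

```python
import functools
from dataclasses import dataclass, field


@dataclass
class Run:
    """Hypothetical container for what one agent invocation did."""
    tool_calls: list = field(default_factory=list)
    output: object = None


_active_run = None  # the run currently being recorded, if any


def add_tool_call(name, arguments=None):
    """Append a tool call to the run that is currently recording."""
    if _active_run is None:
        raise RuntimeError("add_tool_call used outside a recorded function")
    _active_run.tool_calls.append({"name": name, "arguments": arguments or {}})


def record(func):
    """Wrap func so each call stores a fresh Run on wrapper.last_run."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        global _active_run
        previous, _active_run = _active_run, Run()
        try:
            _active_run.output = func(*args, **kwargs)
            wrapper.last_run = _active_run
            return _active_run.output
        finally:
            _active_run = previous  # restore any outer recording

    wrapper.last_run = None
    return wrapper


@record
def demo_agent(prompt):
    add_tool_call("search", {"query": prompt})
    return "The answer is 42"


demo_agent("What is the meaning of life?")
print(demo_agent.last_run.tool_calls)
```

The point of the pattern is that `last_run` only exists after the function has been called, which is why tests need to invoke the agent (or load a saved trace) before asserting.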

### 3. Write tests

```python
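from agentproof import (
    assert_tool_called,
    assert_max_cost,
    assert_max_steps,
    assert_no_hallucination,
)

# `my_agent` is the recorded function from step 2; import it from wherever you defined it.

def test_my_agent():
    my_agent("What is the meaning of life?")  # run the agent so last_run is populated
    run = my_agent.last_run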

    assert_tool_called(run, "search")
    assert_max_cost(run, 0.10)
    assert_max_steps(run, 5)
    assert_no_hallucination(run, ground_truth="42")
```

## Core Assertions

| Assertion | What it tests |
|---|---|
| `assert_tool_called(run, "search")` | Tool was used (optionally N times) |
| `assert_tool_order(run, ["a", "b", "c"])` | Tools called in correct sequence |
| `assert_max_cost(run, 0.50)` | Total cost under budget (200+ models) |
| `assert_max_steps(run, 10)` | Agent didn't spin out |
| `assert_no_hallucination(run, truth)` | Output grounded in source (TF-IDF, no API) |
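
The table does not say whether "correct sequence" means the tools must be adjacent or may have other calls between them. As a rough sketch of the more lenient reading (an ordered subsequence), here is a standalone check. `tools_in_order` is a hypothetical helper for illustration, not part of AgentProof:

```python
def tools_in_order(called, expected):
    """Return True if `expected` appears in `called` as an ordered subsequence."""
    remaining = iter(called)
    # `name in remaining` consumes the iterator up to the match, so each
    # expected tool has to be found after the previous one.
    return all(name in remaining for name in expected)


called = ["search_flights", "compare_prices", "check_weather", "book_flight"]
print(tools_in_order(called, ["search_flights", "book_flight"]))  # True
print(tools_in_order(called, ["book_flight", "search_flights"]))  # False
```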

## Framework Adapters

Works with any agent framework. Install the adapter you need:

```bash
pip install agentproof[langchain]
pip install agentproof[crewai]
pip install agentproof[openai]
pip install agentproof[otel]      # Any OTEL-instrumented framework
```

```python
# LangChain
from agentproof.adapters.langchain import from_langchain_run
run = from_langchain_run(agent_executor.invoke({"input": "..."}))

# CrewAI
from agentproof.adapters.crewai import from_crewai_result
run = from_crewai_result(crew.kickoff())

# OpenAI Agents SDK (Runner.run is a coroutine; run_sync is its blocking counterpart)
from agentproof.adapters.openai_agents import from_openai_response
run = from_openai_response(Runner.run_sync(agent, "..."))
```

## Replay Testing

Record once, test forever. Capture a real agent run and replay it in CI without making API calls:

```python
from agentproof import save_trace, load_trace, assert_tool_called, assert_max_cost

# Record
save_trace(run, "tests/fixtures/booking_success.jsonl")

# Replay in tests
def test_booking_replay():
    run = load_trace("tests/fixtures/booking_success.jsonl")
    assert_tool_called(run, "book_flight")
    assert_max_cost(run, 0.50)
```
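
To show why replay needs no API calls, here is a minimal JSONL round trip using only the standard library. The event schema and the `save_events` / `load_events` helpers are assumptions made for this sketch; AgentProof's actual trace format may differ.

```python
import json
import tempfile
from pathlib import Path


def save_events(events, path):
    """Write one JSON object per line (the JSONL convention)."""
    with open(path, "w", encoding="utf-8") as f:
        for event in events:
            f.write(json.dumps(event) + "\n")


def load_events(path):
    """Read the events back, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical events, loosely modeled on the calls shown in the Quickstart.
events = [
    {"type": "tool_call", "name": "search_flights"},
    {"type": "llm_call", "model": "gpt-4o", "prompt_tokens": 500, "completion_tokens": 200},
    {"type": "tool_call", "name": "book_flight"},
]

with tempfile.TemporaryDirectory() as tmp:
    trace_path = Path(tmp) / "booking_success.jsonl"
    save_events(events, trace_path)
    replayed = load_events(trace_path)

print(replayed == events)
```

Everything the assertions need (tool names, token counts) lives in the file, so the test is deterministic and runs offline.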

## Snapshot Testing

Like Jest snapshots, but for agent behavior. Detect behavioral regressions automatically:

```python
from agentproof import assert_snapshot

def test_agent_behavior_stable(snapshot_dir):
    my_agent("book a flight to NYC")
    run = my_agent.last_run  # the recorded run, not the string the agent returns
    assert_snapshot(run, snapshot_dir / "booking_agent.json")
    # First run: creates snapshot
    # Subsequent runs: compares against saved snapshot
```

Update snapshots: `pytest --agentproof-update-snapshots`
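
The create-then-compare flow can be sketched in a few lines. `check_snapshot` and the shape of the `behavior` dict are hypothetical and chosen for illustration; they are not AgentProof's API or snapshot format.

```python
import json
import tempfile
from pathlib import Path


def check_snapshot(behavior, path, update=False):
    """Create the snapshot if missing (or if update=True); otherwise compare."""
    path = Path(path)
    if update or not path.exists():
        path.write_text(json.dumps(behavior, indent=2, sort_keys=True))
        return
    saved = json.loads(path.read_text())
    assert behavior == saved, f"behavior changed: expected {saved}, got {behavior}"


with tempfile.TemporaryDirectory() as tmp:
    snap = Path(tmp) / "booking_agent.json"
    check_snapshot({"tools": ["search_flights", "book_flight"]}, snap)  # first run: creates
    check_snapshot({"tools": ["search_flights", "book_flight"]}, snap)  # second run: matches
    try:
        check_snapshot({"tools": ["book_flight"]}, snap)  # search step dropped
        regression_detected = False
    except AssertionError:
        regression_detected = True

print(regression_detected)
```

The `update` flag plays the role that an update option plays on the command line: it overwrites the saved file instead of comparing against it.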

## Bundled Cost Database

Pricing for 200+ LLM models ships with the package. No API calls needed.

```python
from agentproof import get_model_pricing, calculate_run_cost

pricing = get_model_pricing("gpt-4o")
# {"provider": "openai", "prompt": 2.50, "completion": 10.00}

cost = calculate_run_cost(run)
# CostBreakdown(total_cost_usd=0.0325, ...)
```

Models: GPT-4o, Claude 4.5, Gemini 2.5, Llama 4, DeepSeek V3, Mistral Large, and 190+ more.
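
The pricing example above does not state its units. The gpt-4o figures are consistent with USD per 1M tokens, so the worked example below assumes that; `llm_call_cost` is a hypothetical helper, not the library's function.

```python
def llm_call_cost(prompt_tokens, completion_tokens, prompt_price, completion_price):
    """Cost in USD, assuming prices are quoted per 1,000,000 tokens."""
    return (prompt_tokens * prompt_price + completion_tokens * completion_price) / 1_000_000


# The gpt-4o figures from the pricing example, applied to the Quickstart
# call (500 prompt tokens, 200 completion tokens).
cost = llm_call_cost(500, 200, prompt_price=2.50, completion_price=10.00)
print(f"${cost:.5f}")  # $0.00325
```

Under the same assumption, a call ten times larger (5,000 prompt and 2,000 completion tokens) would come to $0.0325.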

## CI/CD Integration

### GitHub Actions

```yaml
- uses: praxiumlabs/agentproof@v1
  with:
    test-path: tests/
```

Or manually:

```yaml
- run: |
    pip install agentproof
    pytest tests/ -v
```

## Comparison

| Feature | AgentProof | DeepEval | Braintrust |
|---|:---:|:---:|:---:|
| pytest native | Yes | Yes | No |
| Behavioral assertions | Yes | No | No |
| Tool sequence testing | Yes | No | No |
| Cost assertions | Yes | No | No |
| No API key needed | Yes | No | No |
| JSONL replay | Yes | No | No |
| Snapshot testing | Yes | No | No |
| Framework adapters | 4 | 2 | 1 |
| Package size | <500KB | ~50MB | Cloud |
| Price | Free | Freemium | Paid |

## Honest Limitations

- **Hallucination detection uses TF-IDF** — it catches obvious fabrication, not subtle inaccuracies. For LLM-as-judge evaluation, use DeepEval or Braintrust.
- **Token counts must be provided** — we don't intercept API calls (by design). Use framework adapters or log tokens manually.
- **No real-time monitoring** — AgentProof is a testing tool, not an observability platform. Use it alongside your existing tracing.
- **Cost DB needs community updates** — model pricing changes. Submit a PR to update `agentproof/data/models.yaml`.
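
To make the first limitation concrete, here is a small lexical-similarity sketch. It uses plain bag-of-words cosine similarity, a simplification of TF-IDF with no IDF weighting, and it is not AgentProof's scoring code. The example sentences are invented.

```python
import math
import re
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between bag-of-words term-count vectors."""
    va = Counter(re.findall(r"\w+", a.lower()))
    vb = Counter(re.findall(r"\w+", b.lower()))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


truth = "The flight to NYC departs at 9:00 and costs 320 dollars"
fabricated = "Your hotel in Paris has a rooftop pool"
subtle = "The flight to NYC departs at 9:00 and costs 230 dollars"

print(round(cosine_similarity(fabricated, truth), 2))  # 0.0
print(round(cosine_similarity(subtle, truth), 2))      # 0.92
```

The fabricated answer shares no terms with the ground truth and scores 0.0, so a threshold catches it easily. The answer with a wrong price differs by a single token and still scores about 0.92, which is the kind of subtle inaccuracy a lexical method tends to miss.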

## Contributing

```bash
git clone https://github.com/praxiumlabs/agentproof
cd agentproof
pip install -e ".[dev]"
pytest
```

**Adding a model to the cost DB:**

Edit `agentproof/data/models.yaml`:
```yaml
new-model-name:
  provider: provider-name
  prompt: 1.50
  completion: 5.00
```

## License

MIT License. See [LICENSE](LICENSE).

---

Built by [Sarak Dahal](https://sarakdahal.com) at [Praxium Labs](https://github.com/praxiumlabs). Star the repo if AgentProof saves your agents.
