Metadata-Version: 2.4
Name: agentfuzz
Version: 0.4.0
Summary: Chaos engineering for AI agents — inject realistic production failures (tool timeouts, malformed responses, cost spirals, prompt injection) and find out what breaks before your users do.
Project-URL: Homepage, https://github.com/SubhashPavan/agentfuzz
Project-URL: Documentation, https://github.com/SubhashPavan/agentfuzz#readme
Project-URL: Repository, https://github.com/SubhashPavan/agentfuzz
Project-URL: Issues, https://github.com/SubhashPavan/agentfuzz/issues
Author: Pavan Subhash Tirumalasetti
License: Apache-2.0
License-File: LICENSE
Keywords: agent-evaluation,agents,ai-agents,autogen,chaos-engineering,crewai,fault-injection,langgraph,llm,prompt-injection,reliability,testing
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Quality Assurance
Classifier: Topic :: Software Development :: Testing
Classifier: Typing :: Typed
Requires-Python: >=3.10
Requires-Dist: jinja2>=3.1
Requires-Dist: pydantic>=2.6
Requires-Dist: rich>=13.7
Requires-Dist: typer>=0.12
Provides-Extra: all
Requires-Dist: autogen-agentchat>=0.4; extra == 'all'
Requires-Dist: crewai>=0.70; extra == 'all'
Requires-Dist: langchain-core>=0.3; extra == 'all'
Requires-Dist: langchain>=1.0; extra == 'all'
Requires-Dist: langgraph>=0.2; extra == 'all'
Provides-Extra: autogen
Requires-Dist: autogen-agentchat>=0.4; extra == 'autogen'
Provides-Extra: crewai
Requires-Dist: crewai>=0.70; extra == 'crewai'
Provides-Extra: dev
Requires-Dist: mypy>=1.11; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.23; extra == 'dev'
Requires-Dist: pytest-cov>=5.0; extra == 'dev'
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: ruff>=0.6; extra == 'dev'
Provides-Extra: langgraph
Requires-Dist: langchain-core>=0.3; extra == 'langgraph'
Requires-Dist: langchain>=1.0; extra == 'langgraph'
Requires-Dist: langgraph>=0.2; extra == 'langgraph'
Description-Content-Type: text/markdown

<div align="center">

# agentfuzz

**Chaos engineering for AI agents.**

Your agent works in the demo. In production it breaks because a tool times out,
an API returns garbage JSON, a user injects a prompt, or it spirals into an
infinite tool-call loop burning $200 in tokens. `agentfuzz` finds those failures
before your users do.

[![PyPI](https://img.shields.io/pypi/v/agentfuzz.svg)](https://pypi.org/project/agentfuzz/)
[![Python](https://img.shields.io/pypi/pyversions/agentfuzz.svg)](https://pypi.org/project/agentfuzz/)
[![License](https://img.shields.io/badge/license-Apache--2.0-blue.svg)](LICENSE)
[![Status](https://img.shields.io/badge/status-alpha-orange.svg)](#status)

</div>

---

## Why this exists

Netflix built Chaos Monkey because cloud apps that passed unit tests still went
down in production — the failures were in *the seams between systems*, not the
systems themselves. AI agents have the same problem, with a worse blast radius:

- A flaky tool returns malformed JSON → your agent hallucinates plausible-looking
  arguments and writes them to your database.
- A user pastes a "translate this" prompt that's actually `IGNORE PREVIOUS
  INSTRUCTIONS` → your support agent emails the customer your system prompt.
- A model upgrade changes how the agent retries a 429 → the agent enters an
  infinite loop and burns through your monthly token budget in 40 minutes.

These failures don't show up in unit tests because unit tests assume the seams
work. `agentfuzz` deliberately breaks the seams.

## What it does

Wrap your agent. Pick a fault profile. Run. Get a report.

```python
from agentfuzz import Harness, faults

harness = Harness(my_agent)  # my_agent: your existing agent object or callable

harness.add(faults.ToolTimeout(rate=0.10))
harness.add(faults.MalformedToolResponse(rate=0.05))
harness.add(faults.PromptInjection.suite("owasp-llm01"))
harness.add(faults.CostSpiral(max_tokens=50_000))
harness.add(faults.LatencyJitter(p99_ms=8000))
harness.add(faults.PartialToolFailure())

report = harness.run(scenarios="tau-bench-airline", iterations=200)
report.html("./report.html")
```

You get:

- **Pass-rate per fault category** — "your agent survives malformed JSON 78% of
  the time but only 12% of timeout cases."
- **Cost-blast radius** — "fault X caused token usage to spike 14×."
- **Tool-call failure modes** — hallucinated arguments, retry storms, infinite
  loops.
- **Prompt-injection survival** — OWASP LLM01 suite results.
- **Replay traces** — the exact transcript that broke your agent, so you can
  fix it.
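
To make the first two metrics concrete, here is a rough sketch of how a per-fault pass rate and a token multiplier could be computed from raw run records. This is not agentfuzz's report schema or implementation: `RunRecord`, `pass_rates`, `token_multipliers`, and the sample numbers are all hypothetical, chosen only to illustrate the arithmetic.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    """One fuzzed run. Hypothetical shape, not agentfuzz's report schema."""

    fault: str    # name of the injected fault
    passed: bool  # did the agent still complete the scenario?
    tokens: int   # total tokens consumed by the run


def pass_rates(records: list[RunRecord]) -> dict[str, float]:
    """Fraction of runs that passed, grouped by fault name."""
    passed: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for record in records:
        total[record.fault] += 1
        passed[record.fault] += int(record.passed)
    return {fault: passed[fault] / total[fault] for fault in total}


def token_multipliers(
    records: list[RunRecord], baseline_tokens: float
) -> dict[str, float]:
    """Mean tokens per fault, expressed as a multiple of a fault-free baseline."""
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    token_sum: dict[str, int] = defaultdict(int)
    count: dict[str, int] = defaultdict(int)
    for record in records:
        token_sum[record.fault] += record.tokens
        count[record.fault] += 1
    return {
        fault: (token_sum[fault] / count[fault]) / baseline_tokens
        for fault in count
    }


# Made-up sample data, for illustration only.
records = [
    RunRecord("MalformedToolResponse", True, 1200),
    RunRecord("MalformedToolResponse", True, 1100),
    RunRecord("MalformedToolResponse", False, 1500),
    RunRecord("MalformedToolResponse", True, 1000),
    RunRecord("ToolTimeout", False, 9000),
    RunRecord("ToolTimeout", True, 3000),
]

print(pass_rates(records))
print(token_multipliers(records, baseline_tokens=1000))
```

The baseline here is assumed to be the mean token count of the same scenarios run with no faults injected; a multiplier well above 1 is what the "cost-blast radius" bullet describes.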

## Install

```bash
pip install agentfuzz                       # core
pip install "agentfuzz[langgraph]"          # + LangGraph adapter
pip install "agentfuzz[crewai]"             # + CrewAI adapter
pip install "agentfuzz[autogen]"            # + AutoGen adapter
pip install "agentfuzz[all]"                # everything
```

## 60-second example

```python
from agentfuzz import Harness, faults
from my_app import build_agent

harness = Harness(build_agent())
harness.add(faults.MalformedToolResponse(rate=0.2))
harness.add(faults.ToolTimeout(rate=0.1))

result = harness.run(iterations=50)
print(result.summary())
# >>> agentfuzz: 32/50 passed (64%)
# >>>   MalformedToolResponse: 8 failures
# >>>     - 5× hallucinated arguments
# >>>     - 3× silent corruption
# >>>   ToolTimeout: 10 failures
# >>>     - 7× retry storm (avg 14 retries)
# >>>     - 3× infinite loop killed at max_tokens
```

## Fault library

| Fault | What it simulates |
|---|---|
| `ToolTimeout` | A downstream API hangs past the agent's patience |
| `MalformedToolResponse` | Garbage JSON, truncated payloads, wrong schema |
| `PartialToolFailure` | Tool returns 200 then errors mid-stream |
| `LatencyJitter` | Realistic p50 / p99 latency distribution |
| `CostSpiral` | Detects runaway token usage above a threshold |
| `PromptInjection` | OWASP LLM01 catalog of injection payloads |
| `PromptParaphrase` | Real users mangle messages — typos, filler, contractions |
| `RateLimitBurst` | Cascading 429s from upstream APIs |
| `SchemaDrift` | Tool API changed shape between dev and prod |
| `AuthExpiry` | 401 / 403 — tests credential-refresh paths |
| `NetworkPartition` | Connection refused / TLS error — distinct from timeout |

More planned — [see the roadmap](docs/roadmap.md).
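
If you are wondering what a `rate` means in practice, the sketch below shows the general idea of rate-based fault injection around a tool call, using only the standard library. It is an illustration of the technique, not how `ToolTimeout` is implemented: `with_timeout_fault`, `lookup_order`, and the seeded `random.Random` are assumptions made for this example.

```python
import random
from typing import Any, Callable


def with_timeout_fault(
    tool: Callable[..., Any], rate: float, rng: random.Random
) -> Callable[..., Any]:
    """Return a proxy that raises TimeoutError on roughly `rate` of calls."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    name = getattr(tool, "__name__", "tool")

    def proxy(*args: Any, **kwargs: Any) -> Any:
        # One draw per call; a seeded rng makes the failure pattern replayable.
        if rng.random() < rate:
            raise TimeoutError(f"injected timeout in {name}")
        return tool(*args, **kwargs)

    return proxy


def lookup_order(order_id: str) -> dict[str, str]:
    """Stand-in for a real downstream tool."""
    return {"order_id": order_id, "status": "shipped"}


flaky_lookup = with_timeout_fault(lookup_order, rate=0.10, rng=random.Random(42))

failures = 0
for _ in range(1000):
    try:
        flaky_lookup("A-1")
    except TimeoutError:
        failures += 1
print(f"{failures} of 1000 calls hit an injected timeout")
```

Passing the random generator in explicitly is a deliberate choice in this sketch: with a fixed seed, the same calls fail on every run, which is what makes a failing transcript reproducible.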

## Supported agent frameworks

- ✅ **LangChain `create_agent` (1.x)** — `agentfuzz[langgraph]`. The modern entry point. Wrap your tools with `wrap_tools()`, point `LangGraphAdapter` at the compiled graph.
- ✅ **LangGraph `create_react_agent` (0.x)** — same adapter; both APIs return a `CompiledStateGraph` we handle uniformly. See [`examples/langgraph_react_agent.py`](examples/langgraph_react_agent.py).
- ✅ **CrewAI** — `agentfuzz[crewai]`. `wrap_tools()` returns proxy `crewai.tools.BaseTool` instances; `CrewAIAdapter(crew)` drives the harness through `crew.kickoff()`. See [`examples/crewai_agent.py`](examples/crewai_agent.py).
- ✅ **AutoGen v0.4+** — `agentfuzz[autogen]`. `wrap_tools()` returns proxy `autogen_core.tools.FunctionTool` instances; `AutoGenAdapter(agent)` drives any agent / team exposing async `run(task=...)`. See [`examples/autogen_agent.py`](examples/autogen_agent.py).
- ✅ **Plain Python callables** — any `Callable[[State], State]`. Simplest way to try the tool.
- 🚧 PydanticAI, OpenAI Swarm, LlamaIndex — coming.

The adapter interface is small: two methods, `is_available()` and `wrap()`. PRs welcome.
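
For a feel of what a two-method adapter might look like, here is a minimal sketch for the plain-callable case. Only the method names `is_available()` and `wrap()` come from this README; the signatures, the `State` alias, the `Adapter` protocol, and `CallableAdapter` are guesses made for illustration, so check the source before writing a real adapter.

```python
from typing import Any, Callable, Protocol, runtime_checkable

State = dict[str, Any]  # assumption: the real State type may differ
AgentFn = Callable[[State], State]


@runtime_checkable
class Adapter(Protocol):
    """Hypothetical shape of an adapter; signatures are assumed."""

    def is_available(self) -> bool: ...

    def wrap(self, agent: Any) -> AgentFn: ...


class CallableAdapter:
    """Sketch of an adapter for plain Python callables."""

    def is_available(self) -> bool:
        # Plain callables need no optional framework dependency.
        return True

    def wrap(self, agent: Any) -> AgentFn:
        if not callable(agent):
            raise TypeError("agent must be callable")

        def run(state: State) -> State:
            # Shallow-copy the input so a run cannot mutate the caller's state.
            return agent(dict(state))

        return run


def echo_agent(state: State) -> State:
    """Toy agent: upper-cases the input."""
    return {**state, "output": state["input"].upper()}


adapter = CallableAdapter()
run = adapter.wrap(echo_agent)
print(run({"input": "refund order 42"}))
```

A framework adapter would presumably have `is_available()` check whether the optional dependency can be imported, and `wrap()` translate between the framework's invocation API and a uniform callable.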

## Status

**Alpha (v0.4).** The API will change. Built and tested on Python 3.10–3.13.
The fault catalog is informed by production multi-agent deployments at
enterprise scale — but every codebase fails in its own special way, so file
issues when you find a fault we should ship.

## Why I'm building this

I've spent the last decade architecting AI systems for enterprises — including
multi-agent platforms running across 2,600+ production sites. The failures
that hurt are almost never the ones the unit tests check for. They're the
quiet, partial, half-degraded ones in the seams.

This is the tool I wish I'd had.

— [Pavan Subhash Tirumalasetti](https://www.linkedin.com/in/pavan-subhash-tirumalasetti)

## License

Apache 2.0. Use it commercially. Cite it in papers. Build a paid product on
top. Just don't claim you wrote it.

## Citing

If you use `agentfuzz` in research or production reports:

```bibtex
@software{agentfuzz,
  author  = {Tirumalasetti, Pavan Subhash},
  title   = {agentfuzz: Chaos engineering for AI agents},
  year    = {2026},
  url     = {https://github.com/SubhashPavan/agentfuzz},
}
```
