Metadata-Version: 2.4
Name: robotframework-agentguard
Version: 0.2.1
Summary: Robot Framework library for testing Agent Skills, Hooks, SubAgents, and MCP Servers — provider-agnostic, BFCL-grade tool-call matching, #42796 behavioral metrics, statistical non-determinism handling.
Project-URL: Repository, https://github.com/manykarim/robotframework-agentguard
Project-URL: Documentation, https://github.com/manykarim/robotframework-agentguard#readme
Author: AgentGuard contributors
License: Apache-2.0
License-File: LICENSE
Keywords: a2a,agent-skills,agentguard,bfcl,llm-eval,mcp,robotframework
Classifier: Development Status :: 3 - Alpha
Classifier: Framework :: Robot Framework :: Library
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Software Development :: Testing
Requires-Python: >=3.12
Requires-Dist: a2a-sdk>=1.0.2
Requires-Dist: anthropic>=0.97.0
Requires-Dist: docker>=7.1.0
Requires-Dist: fastmcp>=3.2.4
Requires-Dist: gitpython>=3.1
Requires-Dist: httpx>=0.28.1
Requires-Dist: inspect-ai>=0.3.213
Requires-Dist: inspect-evals>=0.10.0
Requires-Dist: jsonlines>=4.0
Requires-Dist: jsonschema>=4.23
Requires-Dist: litellm>=1.83.0
Requires-Dist: mcp>=1.27.0
Requires-Dist: numpy>=1.26
Requires-Dist: opentelemetry-api>=1.41.1
Requires-Dist: opentelemetry-sdk>=1.41.1
Requires-Dist: python-dotenv>=1.0.1
Requires-Dist: pyyaml>=6.0.2
Requires-Dist: rich>=13.9
Requires-Dist: robotframework-assertion-engine<5.0,>=4.0
Requires-Dist: robotframework-pythonlibcore>=4.5.0
Requires-Dist: robotframework>=7.4.2
Requires-Dist: scipy>=1.17.1
Provides-Extra: benchmarks
Requires-Dist: datasets>=3.0; extra == 'benchmarks'
Requires-Dist: huggingface-hub>=0.26; extra == 'benchmarks'
Provides-Extra: bridges
Requires-Dist: crewai>=0.95; extra == 'bridges'
Requires-Dist: langgraph>=0.2; extra == 'bridges'
Requires-Dist: openai-agents>=0.0.7; extra == 'bridges'
Provides-Extra: integrations
Requires-Dist: rf-mcp>=0.30; extra == 'integrations'
Description-Content-Type: text/markdown

# robotframework-agentguard

[![License: Apache-2.0](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](LICENSE)
[![CI](https://img.shields.io/github/actions/workflow/status/manykarim/robotframework-agentguard/ci.yml?branch=main)](https://github.com/manykarim/robotframework-agentguard/actions)

> A Robot Framework library for testing **MCP servers, Agent Skills, Hooks, SubAgents, and coding-agent CLIs** — provider-agnostic via LiteLLM, with BFCL-grade tool-call matching and statistical non-determinism handling (N≥10 runs by default).

AgentGuard turns the moving parts of an agent stack into Robot Framework keywords. Connect to an MCP server, grade a `SKILL.md`, drive Claude Code / Codex / Aider, replay an A2A subagent task, run BFCL tool-call comparisons, and gate the result with scipy-backed statistics — all without leaving a `.robot` file.

## What it tests

- **MCP servers** — stdio / SSE / streamable-HTTP / in-memory, full FastMCP client surface
- **Agent Skills** — `SKILL.md` discovery, frontmatter validation, Inspect-AI grading
- **Hooks** — synthesise the 12 Claude Code hook events and assert handler decisions
- **SubAgents** — A2A 1.0 task lifecycle + LangGraph / CrewAI / AutoGen / OpenAI Agents bridges
- **Coding agents** — drive Claude Code, Codex CLI, Aider, OpenCode, Cline, Continue; emit the 12 #42796 behavioural metrics
- **Public benchmarks** — SWE-bench Verified, Aider, HumanEval, MBPP, LiveCodeBench
- **Security** — default-deny skill scanner, redactor, sandboxed execution, AIDefence integration
- **Statistics** — Mann-Whitney U, Cliff's δ, Vargha-Delaney A, bootstrap CIs, pass@k, TARr@N
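
The pass@k metric listed above is conventionally computed with the unbiased estimator from the HumanEval paper. As a minimal illustrative sketch (not AgentGuard's own implementation), it looks like this in Python:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn without replacement from n attempts (c correct) passes."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# 10 attempts, 3 correct: a single draw passes with probability 0.3
print(round(pass_at_k(10, 3, 1), 6))  # 0.3
```

Averaging this estimator over many tasks is what makes N≥10 runs per task worthwhile: with a single run per task, pass@k collapses to a 0/1 outcome.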

## Installation

> **Note:** the package is pending publication to PyPI. Until then, install from the GitHub source.

**From PyPI** (once published):

```bash
pip install robotframework-agentguard
# or
uv add robotframework-agentguard
```

**From source (today):**

```bash
pip install git+https://github.com/manykarim/robotframework-agentguard.git
# or, with uv
uv add git+https://github.com/manykarim/robotframework-agentguard.git
```

Optional extras:

```bash
pip install 'robotframework-agentguard[bridges]'      # LangGraph / CrewAI / AutoGen / OpenAI Agents
pip install 'robotframework-agentguard[benchmarks]'   # SWE-bench, Aider, HumanEval, MBPP, LiveCodeBench
```

Configure the LLM provider with a `.env` file in your project root:

```bash
OPENROUTER_API_KEY=sk-or-...
# optional overrides
AGENTGUARD_DEFAULT_MODEL=openrouter/anthropic/claude-sonnet-4-5
AGENTGUARD_JUDGE_MODEL=openrouter/openai/gpt-4o-mini
```

Verify the install:

```bash
agentguard doctor    # provider, env, MCP reachability
agentguard version
```

## Quickstart

```robot
*** Settings ***
Library    AgentGuard    provider=litellm    model=openrouter/anthropic/claude-sonnet-4-5

*** Test Cases ***
AgentGuard Should Be Loaded
    ${info}=    Get AgentGuard Info
    Log    ${info}
```

The kitchen-sink `Library AgentGuard` exposes every keyword. Prefer narrow imports for bigger suites:

```robot
*** Settings ***
Library    AgentGuard.MCP
Library    AgentGuard.Skill
Library    AgentGuard.Stats
```

## Sub-library imports

| Import line | Purpose |
|---|---|
| `Library AgentGuard` | Kitchen-sink — every keyword reachable |
| `Library AgentGuard.MCP` | Test MCP servers (stdio / SSE / streamable-HTTP / in-memory) |
| `Library AgentGuard.Skill` | Discover, parse, validate, grade Agent Skills |
| `Library AgentGuard.Tool` | BFCL-style tool-call AST + trajectory matching |
| `Library AgentGuard.Stats` | Mann-Whitney U, Cliff's δ, Vargha-Delaney A, bootstrap CIs, pass@k, TARr@N |
| `Library AgentGuard.Judge` | Classification-based LLM-as-Judge with Cohen's κ calibration |
| `Library AgentGuard.Security` | Default-deny skill scanner, redactor, sandbox, AIDefence |
| `Library AgentGuard.Hook` | Claude Code hook lifecycle (12 events × 4 handler types) |
| `Library AgentGuard.SubAgent` | A2A 1.0 task lifecycle, framework bridges |
| `Library AgentGuard.Coding` | Drive Claude Code / Codex / Aider / OpenCode + #42796 metric pack |
| `Library AgentGuard.Benchmark` | SWE-bench Verified, Aider, HumanEval, MBPP, LiveCodeBench |
| `Library AgentGuard.Scenario` | Unified scenario harness — drop-in for `manykarim/rf-mcp` `tests/e2e/` |
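
The `AgentGuard.Tool` row above refers to BFCL-style matching, which compares tool calls structurally rather than as strings. A toy sketch of the idea (illustrative only; names and the call shape are assumptions, not the library's API):

```python
def calls_match(expected: dict, actual: dict) -> bool:
    """Toy structural match: same tool name, and every expected
    argument present in the actual call with an equal value."""
    if expected["name"] != actual["name"]:
        return False
    exp_args = expected.get("args", {})
    act_args = actual.get("args", {})
    return all(act_args.get(k) == v for k, v in exp_args.items())

want = {"name": "get_weather", "args": {"city": "Berlin", "unit": "celsius"}}
got = {"name": "get_weather", "args": {"unit": "celsius", "city": "Berlin"}}
print(calls_match(want, got))  # True: argument order is irrelevant
```

A full BFCL matcher also normalises argument types and checks trajectories (ordered sequences of calls); the point here is only that equality is on the parsed AST, not the serialised text.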

**Keyword documentation** (browsable HTML, generated by Robot Framework `libdoc`):

- **<https://manykarim.github.io/robotframework-agentguard/api/>** — full per-library API reference, served via GitHub Pages
- [`docs/KEYWORDS.md`](docs/KEYWORDS.md) — alphabetical text reference for all 147 keywords (Markdown, viewable on GitHub)
- [`docs/api/`](docs/api/) — the source HTML files (regenerated via `./docs/api/generate.sh`)

## Operator-driven assertions

Every Get-style keyword accepts the standard `(assertion_operator, assertion_expected, message)` parameters from [`robotframework-assertion-engine`](https://github.com/MarketSquare/AssertionEngine), so a single keyword does both retrieval and assertion:

```robot
Tool Hit Rate              ${result}    >=    ${0.7}
Failed Tool Call Count     ${result}    ==    ${0}
Read Edit Ratio            ${session}   >=    ${0.5}
```

Without operator arguments the same keywords just return the value. See [`examples/13_assertion_engine_idiom.robot`](examples/13_assertion_engine_idiom.robot) and [ADR-022](docs/adr/ADR-022-assertion-engine-adoption.md).

## Examples

Runnable Robot suites under [`examples/`](examples/):

| File | Topic |
|---|---|
| `01_mcp_server_basics.robot` | Connect to an MCP server, list and call tools |
| `02_skill_grading.robot` | Grade a `SKILL.md` against an LLM with Cohen's κ calibration |
| `03_hook_block_destructive.robot` | Synthesise hook events, assert blocking decisions |
| `04_subagent_a2a.robot` | A2A subagent task lifecycle + trajectory matching |
| `05_coding_agent_metrics.robot` | Compute the 12 #42796 behavioural metrics from a session |
| `06_bfcl_tool_selection.robot` | BFCL AST equality + trajectory comparison |
| `07_sandbox_run.robot` | Run untrusted code under default-deny Docker sandbox |
| `08_swe_bench.robot` | SWE-bench Verified loader + pass@k gate |
| `09_humaneval_live.robot` | HumanEval live grading |
| `10_rf_mcp_integration.robot` | Drop-in replacement for `manykarim/rf-mcp` e2e patterns |
| `11_agentskills_grading.robot` | Grade `manykarim/robotframework-agentskills` SKILL.md files |
| `12_mcp_scenario_replacement.robot` | YAML-driven scenarios + live LLM driver |
| `13_assertion_engine_idiom.robot` | Side-by-side: operator form vs old Should-pair form |
| `14_facade_imports.robot` | Side-by-side import variants |

Run any example:

```bash
PYTHONPATH=. robot --outputdir _out examples/01_mcp_server_basics.robot
```

## Architecture (12 bounded contexts)

| Context | Purpose |
|---|---|
| **Provider** | LiteLLM-backed `LLMProviderAdapter` + thin vendor adapters |
| **MCP** | FastMCP client wrapper for stdio / SSE / streamable-http / in-memory |
| **Skills** | `SKILL.md` discovery, frontmatter validation, Inspect-AI grading |
| **Hooks** | Synthesise the 12 Claude Code hook events; assert handler decisions |
| **SubAgents** | A2A 1.0 task lifecycle + delegation-chain assertions |
| **CodingAgent** | Drive Claude Code, Codex CLI, Aider, OpenCode, Cline, Continue; normalise JSONL |
| **Statistics** | scipy-backed Mann-Whitney, Cliff's δ, bootstrap CI, pass@k, TARr@N |
| **Judge** | Classification-based LLM-as-Judge with calibration gating (Cohen's κ ≥ 0.7) |
| **Security** | Default-deny skill scanner, redactor, sandbox policy, AIDefence integration |
| **Telemetry** | OTel spans + Robot Framework listener embedding scorecards in `log.html` |
| **BehavioralMetrics** | The 12 calculators from `anthropic/claude-code#42796` |
| **ToolCallCorrectness** | BFCL AST/trajectory matcher used by MCP, Skills, SubAgents |

Aggregates, value objects, and ACLs are in [`docs/ddd/`](docs/ddd/).
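
The Judge context gates on Cohen's κ ≥ 0.7 between judge and reference labels. κ is the standard chance-corrected agreement statistic, κ = (p_o − p_e) / (1 − p_e); a small illustrative sketch (not AgentGuard's code):

```python
from collections import Counter

def cohens_kappa(a: list[str], b: list[str]) -> float:
    """Chance-corrected agreement between two label sequences."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n          # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[lab] * cb[lab] for lab in ca) / (n * n)  # chance agreement
    return (p_o - p_e) / (1 - p_e)

human = ["pass", "pass", "pass", "pass", "fail", "fail", "fail", "fail"]
judge = ["pass", "pass", "pass", "fail", "fail", "fail", "fail", "fail"]
kappa = cohens_kappa(human, judge)
print(kappa >= 0.7)  # True: this judge would clear the calibration gate
```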

## Performance

Hard ceilings — CI fails if any benchmark exceeds its budget by more than 20%.

| Surface | Budget |
|---|---|
| MCP in-memory roundtrip (p50 / p95) | ≤ 5 / 10 ms |
| MCP stdio roundtrip (p50) | ≤ 50 ms |
| BFCL AST match (mean per call) | ≤ 1 ms |
| `mannwhitneyu` n=30/30 | ≤ 5 ms |
| `bootstrap` n=30 / 1000 resamples | ≤ 100 ms |
| Library import + Suite Setup (cold) | ≤ 2 s |
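
The budgeted statistical calls are thin wrappers over scipy, and the effect sizes follow directly from the U statistic: Vargha-Delaney A = U / (m·n) and Cliff's δ = 2A − 1. A sketch under those standard formulas (synthetic data; not AgentGuard's exact code):

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
baseline = rng.normal(0.70, 0.05, 30)   # e.g. tool hit rates over N=30 runs
candidate = rng.normal(0.78, 0.05, 30)

# U statistic for the first sample; two-sided p-value.
u, p = mannwhitneyu(candidate, baseline, alternative="two-sided")
a12 = u / (len(candidate) * len(baseline))  # Vargha-Delaney A
delta = 2 * a12 - 1                          # Cliff's delta
print(f"p={p:.4g}  A12={a12:.3f}  delta={delta:.3f}")
```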

Run the suite locally:

```bash
uv run pytest benchmarks/ --benchmark-only
```

Full budget table and cost model: [`docs/performance/`](docs/performance/).

## Documentation

- Plan — [`docs/PLAN.md`](docs/PLAN.md)
- Keyword reference — **<https://manykarim.github.io/robotframework-agentguard/api/>** (GitHub Pages) · [`docs/KEYWORDS.md`](docs/KEYWORDS.md) (Markdown)
- Architecture Decision Records — [`docs/adr/`](docs/adr/)
- Domain model — [`docs/ddd/`](docs/ddd/)
- Performance budgets — [`docs/performance/`](docs/performance/)
- Research dossier — [`docs/research/research.md`](docs/research/research.md)
- Contributing — [`CONTRIBUTING.md`](CONTRIBUTING.md)
- Security — [`SECURITY.md`](SECURITY.md)

## License

Apache-2.0 — see [LICENSE](LICENSE).
