Metadata-Version: 2.4
Name: mcpaisuite-evalmcp
Version: 1.0.3
Summary: Evaluation suite for MCP AI agents — golden datasets, LLM-as-judge, security benchmarks
License: AGPL-3.0-or-later
Project-URL: Homepage, https://github.com/gashel01/evalmcp
Project-URL: Repository, https://github.com/gashel01/evalmcp
Project-URL: Issues, https://github.com/gashel01/evalmcp/issues
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: click>=8.0
Requires-Dist: mcp>=1.0
Provides-Extra: api
Requires-Dist: fastapi>=0.100; extra == "api"
Requires-Dist: uvicorn>=0.23; extra == "api"
Provides-Extra: llm
Requires-Dist: litellm>=1.0; extra == "llm"
Provides-Extra: dev
Requires-Dist: pytest>=8.0; extra == "dev"
Requires-Dist: pytest-asyncio>=0.23; extra == "dev"
Requires-Dist: pytest-cov>=5.0; extra == "dev"
Requires-Dist: httpx>=0.27; extra == "dev"
Dynamic: license-file

# evalmcp

> Evaluation suite for MCP AI agents -- golden datasets, LLM-as-judge, security benchmarks

Part of the [MCP AI Suite](https://mcpaisuite.dev).

## Features

- **Built-in benchmark suites** -- 6 golden datasets: memory, security, reasoning, tool-use, HumanEval-style code, and MMLU-style knowledge
- **Pluggable judges** -- exact match, contains, or LLM-as-judge for semantic evaluation
- **Library, CLI & MCP server** -- use it from Python, the `evalmcp` CLI, the `evalmcp-server` MCP server (3 tools: `list_suites`, `run_suite`, `evaluate`), or the `evalmcp-api` FastAPI service
- **Regression detection** -- compare successive runs to catch quality drops above a threshold
- **Standard metrics** -- accuracy, precision, recall, F1, per-tag breakdowns
- **HTML dashboard export** -- visual report of evaluation results
- **JSON and CSV export** -- machine-readable result output
- **Model comparison** -- side-by-side evaluation of different models or configurations
- **Persistent store** -- track evaluation runs over time for trend analysis

## Installation

```bash
pip install mcpaisuite-evalmcp          # runtime deps: click + mcp
pip install "mcpaisuite-evalmcp[api]"   # adds the FastAPI server (evalmcp-api)
```

## Quick Start

```python
from evalmcp import EvalPipeline, EvalSuite, EvalCase

suite = EvalSuite(name="my_tests", cases=[
    EvalCase(input="What is 2+2?", expected_output="4", tool="run_task", tags=["math"]),
    EvalCase(input="Capital of France?", expected_output="Paris", tool="run_task", tags=["geography"]),
])

pipeline = EvalPipeline(judge="contains")
results = await pipeline.run_suite(suite)
summary = pipeline.summary(results)
print(f"Pass rate: {summary['pass_rate']:.0%}")
```

## CLI

```bash
# List available benchmark suites:
evalmcp list

# Run a benchmark suite:
evalmcp run memory_basic --judge contains

# CI mode with regression detection:
evalmcp run security --judge exact --ci --threshold 0.1

# Export HTML dashboard:
evalmcp run reasoning --html report.html
```

## Configuration

EvalPipeline is configured programmatically via constructor parameters.

| Parameter | Description |
|---|---|
| `kernel_pipeline` | Optional MCP kernel pipeline for generating outputs |
| `judge` | Judge strategy: `"exact"`, `"contains"`, `"llm"`, or a `BaseJudge` instance |
| `llm_fn` | Async callable `(prompt) -> str`, required when `judge="llm"` |
| `store` | Optional `EvalStore` for persistence and regression detection |

## API Reference

### EvalPipeline

Orchestrates evaluation of MCP agent outputs against expected results.

```python
pipeline = EvalPipeline(judge="llm", llm_fn=my_llm, store=EvalStore())
results = await pipeline.run_suite(suite) -> list[EvalResult]
result = await pipeline.run_case(case) -> EvalResult
summary = pipeline.summary(results) -> dict
metrics = pipeline.metrics(results) -> dict
regression = pipeline.detect_regression(suite_name, threshold=0.1) -> dict
```

### Utility Functions

```python
from evalmcp import export_json, export_csv, compute_metrics, generate_dashboard

export_json(results, summary, "results.json")
export_csv(results, "results.csv")
metrics = compute_metrics(results)
generate_dashboard(results, summary, metrics, output_path="report.html")
```

## Architecture

EvalPipeline iterates over EvalCases in a suite, optionally generating outputs via a kernel_pipeline, then scoring each result through a pluggable judge (ExactMatchJudge, ContainsJudge, or LLMJudge). Results are aggregated into summary statistics with per-tag breakdowns. EvalStore persists run history to enable regression detection by comparing the two most recent runs.

## Testing

```bash
pip install -e ".[dev]"
pytest tests/ -v
```

## License

AGPL-3.0 — see [LICENSE](LICENSE).

Open source for individuals and open-source projects. For commercial use in closed-source products, a commercial license is available — contact [gaeldev@gmail.com](mailto:gaeldev@gmail.com).
