Metadata-Version: 2.4
Name: acp-evals
Version: 1.0.0
Summary: Production-grade evaluation framework for Agent Communication Protocol (ACP) agents
Project-URL: Homepage, https://github.com/jbarnes850/acp-evals
Project-URL: Documentation, https://github.com/jbarnes850/acp-evals/tree/main/docs
Project-URL: Repository, https://github.com/jbarnes850/acp-evals
Project-URL: Issues, https://github.com/jbarnes850/acp-evals/issues
Author-email: Jarrod Barnes <jbarnes850@gmail.com>
License: Apache-2.0
License-File: LICENSE
Keywords: acp,agents,ai,evaluation,llm,performance,quality,reliability,testing
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Testing
Classifier: Topic :: System :: Benchmark
Requires-Python: >=3.11
Requires-Dist: acp-sdk>=0.1.0
Requires-Dist: asyncio>=3.4.3
Requires-Dist: click>=8.0.0
Requires-Dist: httpx>=0.24.0
Requires-Dist: numpy>=1.24.0
Requires-Dist: opentelemetry-api>=1.20.0
Requires-Dist: opentelemetry-exporter-otlp-proto-http>=1.20.0
Requires-Dist: opentelemetry-sdk>=1.20.0
Requires-Dist: pandas>=2.0.0
Requires-Dist: pydantic>=2.0.0
Requires-Dist: python-dotenv>=1.0.0
Requires-Dist: rich>=13.0.0
Requires-Dist: tiktoken>=0.5.0
Provides-Extra: all-providers
Requires-Dist: anthropic>=0.21.0; extra == 'all-providers'
Requires-Dist: openai>=1.0.0; extra == 'all-providers'
Provides-Extra: anthropic
Requires-Dist: anthropic>=0.21.0; extra == 'anthropic'
Provides-Extra: dev
Requires-Dist: pyright>=1.1.0; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.21.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.1.0; extra == 'dev'
Requires-Dist: pytest>=7.4.0; extra == 'dev'
Requires-Dist: ruff>=0.1.0; extra == 'dev'
Provides-Extra: openai
Requires-Dist: openai>=1.0.0; extra == 'openai'
Description-Content-Type: text/markdown

# ACP Evals

[![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue.svg)](https://www.python.org/downloads/)
[![License](https://img.shields.io/badge/license-Apache%202.0-green.svg)](LICENSE)
[![ACP Compatible](https://img.shields.io/badge/ACP-compatible-brightgreen.svg)](https://github.com/i-am-bee/acp)

**ACP Evals is an open framework for evaluating AI agents across accuracy, performance, and reliability dimensions.**

AI agents need thorough testing before deployment. ACP Evals provides production-grade evaluation using an LLM-as-judge methodology, and it integrates with the BeeAI ecosystem and any ACP-compliant agent.

ACP Evals enables you to:
- Measure response accuracy using configurable LLM judges
- Track performance metrics including latency and memory usage
- Validate tool usage patterns and error handling
- Run batch evaluations for comprehensive test coverage
- Generate detailed reports for continuous improvement

## Core Concepts

| **Concept** | **Description** |
|-------------|------------------|
| **Accuracy** | Evaluates response quality against expected outputs using LLM-as-judge methodology. Supports custom rubrics for domain-specific evaluation. |
| **Performance** | Measures latency, memory usage, and token efficiency. Essential for production deployments where speed and resource constraints matter. |
| **Reliability** | Validates tool usage patterns, error handling, and consistency across runs. Critical for agents that interact with external systems. |

## Quick Example

Evaluate agent accuracy with just a few lines:

```python calculate_accuracy.py
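import asyncio

from acp_evals import AccuracyEval


async def main() -> None:
    evaluation = AccuracyEval(
        agent="http://localhost:8001/agents/my-agent",
        rubric="factual",
    )

    # run() is awaited, so it has to execute inside an event loop
    result = await evaluation.run(
        input="What is 10*5 then to the power of 2? do it step by step",
        expected="2500",
        print_results=True,
    )
    # Require a score of at least 0.7
    assert result is not None and result.score >= 0.7


asyncio.run(main())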
```

## Core Features

- **[Comprehensive Evaluation](./examples/comprehensive_eval.py)** - Run all three evaluation dimensions in a single command
- **[Rich TUI Display](./src/acp_evals/cli/display.py)** - Interactive terminal UI with detailed metrics and LLM judge explanations
- **[Batch Testing](./docs/api-reference.md#batch-evaluation)** - Evaluate multiple test cases with parallel execution
- **[Multiple Provider Support](./docs/providers.md)** - Works with OpenAI, Anthropic, Ollama, and more
- **[Export Capabilities](./docs/api-reference.md#result-objects)** - Generate JSON reports for CI/CD integration
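
To make the batch-testing idea concrete, here is a minimal sketch of running several test cases concurrently with a concurrency cap. It does not use the ACP Evals API: `evaluate_case`, `run_batch`, `CANNED_ANSWERS`, and the exact-match scoring are all hypothetical stand-ins (a real run would call an agent and an LLM judge), and the 0.7 threshold simply mirrors the Quick Example.

```python
import asyncio

# Hypothetical canned agent replies; the second one is deliberately wrong.
CANNED_ANSWERS = {"What is 2+2?": "4", "What is 3*3?": "6"}


async def evaluate_case(case: dict, limit: asyncio.Semaphore) -> dict:
    """Stand-in for one evaluation: exact match instead of an LLM judge."""
    async with limit:  # at most `max_concurrency` cases run at once
        await asyncio.sleep(0.01)  # simulates network latency
        answer = CANNED_ANSWERS.get(case["input"], "")
        score = 1.0 if answer == case["expected"] else 0.0
        return {"input": case["input"], "score": score}


async def run_batch(cases: list[dict], max_concurrency: int = 2) -> list[dict]:
    """Evaluate all cases concurrently; results keep the input order."""
    limit = asyncio.Semaphore(max_concurrency)
    return await asyncio.gather(*(evaluate_case(c, limit) for c in cases))


cases = [
    {"input": "What is 2+2?", "expected": "4"},
    {"input": "What is 3*3?", "expected": "9"},
]
results = asyncio.run(run_batch(cases))
pass_rate = sum(r["score"] >= 0.7 for r in results) / len(results)
print(f"pass rate: {pass_rate:.0%}")
```

The semaphore keeps the number of in-flight requests bounded, which matters when the agent or the judge provider enforces rate limits.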

## Installation

```bash
pip install acp-evals
```

## Quickstart

**1. Configure your LLM provider**

```bash
echo "OPENAI_API_KEY=your-key-here" > .env
acp-evals check
```

**2. Run your first evaluation**

```bash
# Test accuracy
acp-evals run accuracy http://localhost:8001/agents/my-agent \
  -i "What is 2+2?" -e "4"
```

**3. Run comprehensive evaluation**

```bash
acp-evals comprehensive http://localhost:8001/agents/my-agent \
  -i "Calculate compound interest" -e "Detailed calculation"
```

## Examples

### Performance Evaluation

```python
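import asyncio

from acp_evals import PerformanceEval


async def main() -> None:
    evaluation = PerformanceEval(
        agent="http://localhost:8001/agents/my-agent",
        num_iterations=5,
        track_memory=True,
    )

    # run() is awaited, so it has to execute inside an event loop
    result = await evaluation.run(
        input_text="What is the capital of France?",
        print_results=True,
    )


asyncio.run(main())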
```
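
For intuition about what a latency and memory measurement over several iterations involves, the following standard-library sketch times repeated calls and records peak heap allocation. It is not how `PerformanceEval` is implemented: `fake_agent` and `measure` are hypothetical names, and `tracemalloc` only sees Python allocations in the measuring process, not memory used by a remote agent.

```python
import asyncio
import statistics
import time
import tracemalloc


async def fake_agent(prompt: str) -> str:
    """Hypothetical stand-in for a real agent call."""
    await asyncio.sleep(0.01)
    return f"echo: {prompt}"


async def measure(agent, prompt: str, num_iterations: int = 5) -> dict:
    """Time repeated calls and record peak Python heap allocation."""
    latencies = []
    tracemalloc.start()
    try:
        for _ in range(num_iterations):
            start = time.perf_counter()
            await agent(prompt)
            latencies.append(time.perf_counter() - start)
        _, peak_bytes = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return {
        "iterations": num_iterations,
        "mean_latency_s": statistics.mean(latencies),
        "max_latency_s": max(latencies),
        "peak_memory_bytes": peak_bytes,
    }


stats = asyncio.run(measure(fake_agent, "What is the capital of France?"))
print(stats)
```

Reporting both the mean and the maximum is a deliberate choice here: a single slow iteration can hide behind a healthy average.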

### Reliability Evaluation

```python
import asyncio

from acp_evals import ReliabilityEval


async def main() -> None:
    evaluation = ReliabilityEval(
        agent="http://localhost:8001/agents/my-agent",
        tool_definitions=["search", "calculator"],
    )

    # run() is awaited, so it has to execute inside an event loop
    result = await evaluation.run(
        input="Search for AAPL price and calculate P/E ratio",
        expected_tools=["search", "calculator"],
        print_results=True,
    )
    assert result.passed


asyncio.run(main())
```
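
One plausible way to think about the `expected_tools` check is as a set comparison between the tools you expected and the tools the agent actually called. The sketch below illustrates that idea only; `check_tool_usage` is a hypothetical helper, and the pass rule (every expected tool was used) is an assumption, not the documented behaviour of `ReliabilityEval`.

```python
def check_tool_usage(expected_tools: list[str], observed_tools: list[str]) -> dict:
    """Compare expected and observed tool names (hypothetical pass rule)."""
    expected = set(expected_tools)
    observed = set(observed_tools)
    missing = sorted(expected - observed)      # expected but never called
    unexpected = sorted(observed - expected)   # called but not expected
    # Fraction of expected tools that were actually used
    coverage = (len(expected) - len(missing)) / len(expected) if expected else 1.0
    return {
        "passed": not missing,  # assumption: extra tools do not fail the check
        "missing": missing,
        "unexpected": unexpected,
        "coverage": coverage,
    }


report = check_tool_usage(
    expected_tools=["search", "calculator"],
    observed_tools=["search"],
)
print(report)
```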

## Agent Formats

ACP Evals accepts an agent in any of the following forms:

- **ACP-compliant agents**: `http://localhost:8001/agents/my-agent`
- **Python functions**: `agent.py:function_name`
- **Python modules**: `mymodule.agent_function`

## CLI Reference

```bash
# Check setup
acp-evals check

# Run evaluations
acp-evals run accuracy <agent> -i <input> -e <expected>
acp-evals run performance <agent> -i <input>
acp-evals run reliability <agent> -i <input> --expected-tools <tool>

# Comprehensive testing
acp-evals comprehensive <agent> -i <input> -e <expected>

# Batch testing
acp-evals run accuracy <agent> --test-file tests.jsonl
```
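
The batch command reads test cases from a JSONL file, one JSON object per line. This README does not show the schema, so the sketch below assumes `input` and `expected` fields that mirror the `-i` and `-e` flags; check the API reference for the actual field names before relying on it. `write_jsonl` and `read_jsonl` are hypothetical helpers.

```python
import json
from pathlib import Path

# Assumed schema: one object per line with "input" and "expected" keys.
test_cases = [
    {"input": "What is 2+2?", "expected": "4"},
    {"input": "What is the capital of France?", "expected": "Paris"},
]


def write_jsonl(path: Path, cases: list[dict]) -> None:
    """Write one JSON object per line."""
    with path.open("w", encoding="utf-8") as f:
        for case in cases:
            f.write(json.dumps(case, ensure_ascii=False) + "\n")


def read_jsonl(path: Path) -> list[dict]:
    """Read the file back, skipping blank lines."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


tests_path = Path("tests.jsonl")
write_jsonl(tests_path, test_cases)
print(f"wrote {len(read_jsonl(tests_path))} test cases to {tests_path}")
```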

## Resources

- **[Documentation](./docs)** - API reference and guides
- **[Examples](./examples)** - Ready-to-run code samples
- **[Issues](https://github.com/jbarnes850/acp-evals/issues)** - Report bugs or request features

## License

Apache 2.0 - see [LICENSE](./LICENSE)

---

Developed by contributors to the BeeAI project, this initiative is part of the [Linux Foundation AI & Data program](https://lfaidata.foundation/projects/). Its development follows open, collaborative, and community-driven practices.