Metadata-Version: 2.4
Name: atp-platform
Version: 2.0.0
Summary: Framework-agnostic platform for testing and evaluating AI agents
Project-URL: Homepage, https://github.com/atp-platform/atp-platform
Project-URL: Repository, https://github.com/atp-platform/atp-platform
Project-URL: Documentation, https://github.com/atp-platform/atp-platform
Project-URL: Changelog, https://github.com/atp-platform/atp-platform/blob/main/CHANGELOG.md
Author: ATP Platform Contributors
License-File: LICENSE
Keywords: ai-agents,evaluation,framework-agnostic,game-theory,testing
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Testing
Requires-Python: >=3.12
Requires-Dist: aiosqlite>=0.22.1
Requires-Dist: atp-adapters>=1.0.0
Requires-Dist: atp-core>=1.0.0
Requires-Dist: authlib>=1.6.9
Requires-Dist: bcrypt>=5.0.0
Requires-Dist: click>=8.3.1
Requires-Dist: greenlet>=3.3.1
Requires-Dist: httpx>=0.27
Requires-Dist: jinja2>=3.1.6
Requires-Dist: pillow>=12.1.1
Requires-Dist: pyjwt>=2.12.0
Requires-Dist: python-dotenv>=1.2.1
Requires-Dist: python-multipart>=0.0.22
Requires-Dist: rich>=13.0
Requires-Dist: sqlalchemy>=2.0.46
Provides-Extra: all
Requires-Dist: anthropic>=0.76.0; extra == 'all'
Requires-Dist: atp-adapters[cloud]; extra == 'all'
Requires-Dist: atp-dashboard>=1.0.0; extra == 'all'
Requires-Dist: atp-dashboard[analytics]; extra == 'all'
Requires-Dist: atp-dashboard[enterprise]; extra == 'all'
Requires-Dist: rich>=13.0; extra == 'all'
Requires-Dist: textual>=0.47.0; extra == 'all'
Provides-Extra: analytics
Requires-Dist: atp-dashboard[analytics]; extra == 'analytics'
Provides-Extra: azure-openai
Requires-Dist: atp-adapters[azure-openai]; extra == 'azure-openai'
Provides-Extra: bedrock
Requires-Dist: atp-adapters[bedrock]; extra == 'bedrock'
Provides-Extra: cloud
Requires-Dist: atp-adapters[cloud]; extra == 'cloud'
Provides-Extra: dashboard
Requires-Dist: atp-dashboard>=1.0.0; extra == 'dashboard'
Provides-Extra: dev
Requires-Dist: pytest-anyio>=0.0.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.0; extra == 'dev'
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: ruff>=0.4; extra == 'dev'
Provides-Extra: enterprise
Requires-Dist: atp-dashboard[enterprise]; extra == 'enterprise'
Provides-Extra: llm
Requires-Dist: anthropic>=0.76.0; extra == 'llm'
Provides-Extra: tui
Requires-Dist: rich>=13.0; extra == 'tui'
Requires-Dist: textual>=0.47.0; extra == 'tui'
Provides-Extra: vertex
Requires-Dist: atp-adapters[vertex]; extra == 'vertex'
Description-Content-Type: text/markdown

# ATP — Agent Test Platform

**The framework-agnostic platform for testing and evaluating AI agents.**

[![Python 3.12+](https://img.shields.io/badge/python-3.12+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Coverage](https://img.shields.io/badge/coverage-80%25+-green.svg)](https://github.com/andrei-shtanakov/atp-platform)

## Why ATP?

- **Framework-agnostic** — test any agent (LangGraph, CrewAI, AutoGen, HTTP endpoint, CLI, container, cloud) through a single unified protocol. No vendor lock-in.
- **Game-theoretic evaluation** — the only platform with built-in multi-agent game evaluation: Prisoner's Dilemma, Public Goods, Auction, Colonel Blotto, Congestion Game, and more. Measure strategic reasoning, cooperation, and equilibrium play.
- **Statistical rigor** — multiple runs per test, 95% confidence intervals, Welch's t-test regression detection, and Elo ratings. Know when a change is real, not noise.
- **Production-ready** — web dashboard, SQLite/PostgreSQL storage, JUnit XML for CI/CD, HTML reports, cost tracking, and security evaluation out of the box.

## Quick Start

```bash
uv add atp-platform
atp quickstart
```

See the [Quick Start Guide](docs/guides/quickstart.md) for a full walkthrough.

## Quick Start (from source)

```bash
git clone https://github.com/andrei-shtanakov/atp-platform.git
cd atp-platform
uv sync --group dev
uv run pytest tests/ -v  # verify installation
```

### Your First Test Suite

Create a test suite file `my_tests.yaml`:

```yaml
test_suite: "my_first_suite"
version: "1.0"
description: "My first ATP test suite"

defaults:
  runs_per_test: 3
  timeout_seconds: 180

agents:
  - name: "my-agent"
    type: "http"
    config:
      endpoint: "http://localhost:8000"

tests:
  - id: "test-001"
    name: "Basic file creation test"
    tags: ["smoke", "basic"]
    task:
      description: "Create a file named output.txt with content 'Hello, ATP!'"
      expected_artifacts: ["output.txt"]
    constraints:
      max_steps: 5
      timeout_seconds: 60
    assertions:
      - type: "artifact_exists"
        config:
          path: "output.txt"
      - type: "llm_eval"
        config:
          criteria: "completeness"
          threshold: 0.8
```
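
The suite above sets `timeout_seconds` twice: once under `defaults` and once under a test's `constraints`. The sketch below shows one plausible way the two levels could combine. Both the helper name `effective_settings` and the rule that per-test values override suite defaults are assumptions made for illustration, not documented ATP behavior.

```python
# The same structure as my_tests.yaml, written as a plain dict so the
# example needs no YAML parser.
suite = {
    "defaults": {"runs_per_test": 3, "timeout_seconds": 180},
    "tests": [
        {"id": "test-001", "constraints": {"max_steps": 5, "timeout_seconds": 60}},
    ],
}


def effective_settings(suite: dict, test: dict) -> dict:
    """Hypothetical merge: start from suite defaults, let the test override them."""
    merged = dict(suite.get("defaults", {}))
    merged.update(test.get("constraints", {}))
    return merged


settings = effective_settings(suite, suite["tests"][0])
print(settings)
```

Under this assumed rule, `test-001` would run with a 60-second timeout while still inheriting `runs_per_test` from the defaults.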

## Features

### Core Platform

✅ **Test Runner** - Full test orchestration with parallel execution
- Single test and suite execution
- Configurable parallelism (`--parallel`)
- Timeout enforcement (soft and hard)
- Progress reporting and fail-fast mode

✅ **Agent Adapters** - Connect to any agent type
- **HTTPAdapter** - REST/SSE endpoints
- **ContainerAdapter** - Docker-based agents
- **CLIAdapter** - Command-line agents
- **LangGraphAdapter** - Native LangGraph integration
- **CrewAIAdapter** - CrewAI framework support
- **AutoGenAdapter** - AutoGen framework support
- **MCPAdapter** - Model Context Protocol (MCP) tools/resources
- **BedrockAdapter** - AWS Bedrock integration
- **VertexAdapter** - Google Vertex AI integration
- **AzureOpenAIAdapter** - Azure OpenAI integration
- **SDKAdapter** - Pull-model adapter for SDK-based benchmark participants

✅ **Evaluators** - Multi-level result assessment
- **ArtifactEvaluator** - File existence, content, schema validation
- **BehaviorEvaluator** - Tool usage, step limits, error checks
- **LLMJudgeEvaluator** - Semantic evaluation via Claude
- **CodeExecEvaluator** - Run generated code (pytest, npm, custom)
- **SecurityEvaluator** - PII detection, secret leaks, code safety, prompt injection
- **FactualityEvaluator** - Claim extraction, citation checking, hallucination detection
- **StyleEvaluator** - Tone analysis, readability, formatting compliance
- **FilesystemEvaluator** - Workspace file existence, content, directory checks
- **PerformanceEvaluator** - Latency, throughput, regression detection
- **CompositeEvaluator** - Boolean logic (AND/OR/NOT) over nested assertions
- **GitCommitEvaluator** - Git commit message and diff analysis
- **GuardrailsEvaluator** - Custom guardrails enforcement
- **ContainerEvaluator** - Isolated code execution via Docker/Podman with resource limits
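
To illustrate the boolean logic that CompositeEvaluator is described as applying, here is a minimal sketch of recursive AND/OR/NOT evaluation over nested assertions. The tree layout (`op`, `children`, `assertion` keys) and the `evaluate` function are hypothetical and are not taken from the ATP codebase.

```python
from typing import Any


def evaluate(node: dict[str, Any], results: dict[str, bool]) -> bool:
    """Recursively evaluate an AND/OR/NOT tree over named assertion outcomes."""
    op = node.get("op")
    if op is None:
        # Leaf node: look up the outcome of a single named assertion.
        return results[node["assertion"]]
    children = [evaluate(child, results) for child in node["children"]]
    if op == "and":
        return all(children)
    if op == "or":
        return any(children)
    if op == "not":
        if len(children) != 1:
            raise ValueError("'not' takes exactly one child")
        return not children[0]
    raise ValueError(f"unknown operator: {op!r}")


# Pass only if the artifact exists AND no secret leak was detected.
tree = {
    "op": "and",
    "children": [
        {"assertion": "artifact_exists"},
        {"op": "not", "children": [{"assertion": "secret_leak"}]},
    ],
}
outcome = evaluate(tree, {"artifact_exists": True, "secret_leak": False})
print(outcome)
```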

✅ **Reporters** - Multiple output formats
- **Console** - Colored terminal output with progress
- **JSON** - Structured results for automation
- **HTML** - Self-contained visual reports with charts
- **JUnit XML** - CI/CD integration (Jenkins, GitHub, GitLab)
- **GameReporter / GameHTMLReporter** - Game-theoretic evaluation results
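
For readers unfamiliar with the JUnit XML format that CI systems consume, the following sketch builds a small report with the standard library. It only shows the general shape of such a file; the `to_junit` helper and the result dictionaries are hypothetical, and the attributes ATP's own reporter emits may differ.

```python
import xml.etree.ElementTree as ET


def to_junit(suite_name: str, results: list[dict]) -> str:
    """Render test results as a minimal JUnit-style <testsuite> document."""
    failures = sum(1 for r in results if not r["passed"])
    suite = ET.Element(
        "testsuite",
        name=suite_name,
        tests=str(len(results)),
        failures=str(failures),
    )
    for r in results:
        case = ET.SubElement(suite, "testcase", name=r["id"], time=f"{r['seconds']:.3f}")
        if not r["passed"]:
            # A <failure> child marks the test case as failed for CI tools.
            failure = ET.SubElement(case, "failure", message=r["message"])
            failure.text = r["message"]
    return ET.tostring(suite, encoding="unicode")


xml_text = to_junit(
    "my_first_suite",
    [
        {"id": "test-001", "passed": True, "seconds": 1.2, "message": ""},
        {"id": "test-002", "passed": False, "seconds": 0.4, "message": "output.txt missing"},
    ],
)
print(xml_text)
```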

### Advanced Features

✅ **Statistical Analysis** - Reliable metrics
- Multiple runs per test
- Mean, std, median, min/max
- 95% confidence intervals (t-distribution)
- Stability assessment
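
As a worked illustration of a 95% confidence interval based on the t-distribution, the sketch below computes one for the scores of five runs using only the standard library. The function name and the sample scores are made up for the example, and the small lookup table of critical values stands in for a proper t-distribution implementation.

```python
import math
import statistics

# Two-sided 95% critical values of Student's t for 1..10 degrees of freedom.
T_CRIT_95 = {
    1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
    6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262, 10: 2.228,
}


def confidence_interval_95(scores: list[float]) -> tuple[float, float]:
    """Return the (low, high) bounds of a 95% CI for the mean score."""
    n = len(scores)
    if not 2 <= n <= 11:
        raise ValueError("this sketch supports 2 to 11 runs")
    mean = statistics.fmean(scores)
    # Standard error of the mean uses the sample standard deviation (n - 1).
    sem = statistics.stdev(scores) / math.sqrt(n)
    margin = T_CRIT_95[n - 1] * sem
    return mean - margin, mean + margin


low, high = confidence_interval_95([0.82, 0.79, 0.85, 0.80, 0.84])
print(f"mean score lies in [{low:.3f}, {high:.3f}] with 95% confidence")
```

With few runs the t critical value is much larger than the normal value of 1.96, which is why the interval widens quickly when `runs_per_test` is small.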

✅ **Baseline & Regression Detection**
- Save baseline results
- Compare runs with Welch's t-test
- Detect regressions (p < 0.05)
- Visual diff in console/JSON
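
The comparison step can be illustrated with a short standard-library sketch of Welch's t-test, which does not assume equal variances between the baseline and the new run. The function name and the sample scores are hypothetical. The sketch stops at the t statistic and the Welch-Satterthwaite degrees of freedom; converting them to a p-value needs a t-distribution CDF, for example via SciPy's `scipy.stats.ttest_ind` with `equal_var=False`.

```python
import math
import statistics


def welch_t(a: list[float], b: list[float]) -> tuple[float, float]:
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    # Per-sample variance of the mean: s^2 / n.
    va = statistics.variance(a) / len(a)
    vb = statistics.variance(b) / len(b)
    t = (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va**2 / (len(a) - 1) + vb**2 / (len(b) - 1))
    return t, df


baseline = [0.82, 0.79, 0.85, 0.80, 0.84]
candidate = [0.70, 0.74, 0.68, 0.73, 0.71]
t_stat, dof = welch_t(baseline, candidate)
print(f"t = {t_stat:.2f}, df = {dof:.2f}")
```

A large positive t here means the baseline scored higher than the candidate, which is the pattern a regression check looks for.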

✅ **CI/CD Integration**
- GitHub Actions workflow
- GitLab CI template
- Azure Pipelines, CircleCI, Jenkins examples
- Exit codes: 0=success, 1=failures, 2=error
- **Deploy pipeline** (`.github/workflows/deploy.yml`) — SSH deploy via `[deploy]` tag in commit message or `workflow_dispatch`
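
A CI script can branch on the exit codes listed above. The sketch below is one way to do that from Python; the helper names are hypothetical, and the child process used in the demonstration is a stand-in that simply exits with code 2 rather than a real `atp` invocation.

```python
import subprocess
import sys

# Meanings taken from the exit-code list above.
EXIT_MEANINGS = {0: "success", 1: "failures", 2: "error"}


def describe_exit(code: int) -> str:
    """Map a process exit code to its documented meaning."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_and_describe(argv: list[str]) -> str:
    """Run a command without raising on non-zero exit and classify the result."""
    completed = subprocess.run(argv, check=False)
    return describe_exit(completed.returncode)


# Stand-in command: a Python child process that exits with code 2.
status = run_and_describe([sys.executable, "-c", "raise SystemExit(2)"])
print(status)
```

Distinguishing `failures` from `error` lets a pipeline treat failing tests differently from a broken run, for instance by retrying only the latter.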

✅ **Web Dashboard**
- FastAPI backend with HTMX + Pico CSS frontend at `/ui/`
- Results storage (SQLite/PostgreSQL)
- Working UI pages: Benchmarks (upload + create), Runs (list + detail page with HTMX auto-refresh), Leaderboard (benchmark filter), Games (registry + tournaments), Suites (upload YAML), Analytics (stats + agent rankings)
- GitHub OAuth login, Device Flow for CLI auth, JWT tokens, RBAC

✅ **Platform API & SDK**
- **Benchmark API** (`/api/v1/benchmarks`, `/api/v1/runs`) - Pull-model benchmark execution with leaderboard
- **Tournament API** (`/api/v1/tournaments`) - Game-theoretic tournament management
- **Auth** - GitHub OAuth (OIDC) + Device Flow for CLI login + JWT tokens
- **RBAC** - Role-based access control with auto-admin for first user
- **Python SDK v2.0.0** (`atp-platform-sdk` on PyPI) - client library for benchmark participants:
  - `AsyncATPClient` plus a synchronous `ATPClient` wrapper
  - `BenchmarkRun` with async and sync iteration, including `submit_sync()`, `status_sync()`, `cancel_sync()`, and `emit_sync()`
  - `next_batch(n)` batch API and `emit()` event streaming
  - Exponential-backoff retry and Device Flow auth
- **Dashboard UI** - HTMX + Pico CSS frontend at `/ui/` (benchmarks, games, runs, leaderboard, suites, analytics)
- **YAML Upload** (`POST /api/suite-definitions/upload`) - upload and validate test suites server-side
- **Rate Limiting** - Per-endpoint HTTP rate limiting via slowapi (configurable via `ATP_RATE_LIMIT_*` env vars)
- **Webhooks** - HTTP POST notifications on run completion/failure with SSRF protection and retry
- **Event Streaming** (`POST /api/v1/runs/{id}/events`) - Append events to running benchmark runs (max 1000/run)

## Project Structure

```
atp-platform/
├── atp/                      # Main package
│   ├── cli/                  # CLI commands (test, validate, baseline, dashboard, game, etc.)
│   ├── core/                 # Config, exceptions, security
│   ├── protocol/             # ATP Request/Response/Event models
│   ├── loader/               # YAML/JSON test parsing
│   ├── runner/               # Test orchestration, sandbox
│   ├── adapters/             # Agent adapters (HTTP, Docker, CLI, LangGraph, CrewAI, AutoGen, MCP, Bedrock, Vertex, Azure OpenAI, SDK/pull-model)
│   ├── evaluators/           # Result evaluation (artifact, behavior, LLM, code, security, factuality, style, performance, git-commit, guardrails, container)
│   ├── scoring/              # Score aggregation
│   ├── statistics/           # Statistical analysis
│   ├── baseline/             # Baseline management, regression detection
│   ├── reporters/            # Output formatting (console, JSON, HTML, JUnit, game)
│   ├── streaming/            # Event streaming support
│   ├── mock_tools/           # Mock tool server for testing
│   ├── performance/          # Profiling, caching, optimization
│   ├── dashboard/            # Web interface (FastAPI)
│   ├── analytics/            # Cost tracking and analytics
│   ├── benchmarks/           # Benchmark suites
│   ├── chaos/                # Chaos testing
│   ├── generator/            # Test suite generation
│   ├── plugins/              # Plugin ecosystem management
│   ├── sdk/                  # Python SDK for programmatic use
│   ├── tracing/              # Agent replay and trace management
│   └── tui/                  # Terminal user interface (optional)
├── packages/                  # Extracted packages (uv workspace members)
│   ├── atp-core/             # Protocol, core, loader, scoring, statistics
│   ├── atp-adapters/         # All agent adapters
│   ├── atp-dashboard/        # Web dashboard + benchmark/tournament API
│   └── atp-sdk/              # Python SDK for benchmark platform participants
├── game-environments/        # Standalone game theory library (Phase 5)
│   └── game_envs/            # Games, strategies, analysis (Nash, exploitability)
├── atp-games/                # ATP plugin for game-theoretic evaluation (Phase 5)
│   └── atp_games/            # GameRunner, evaluators, YAML suites, tournaments
├── docs/                     # Documentation
├── examples/                 # Example test suites and CI configs
│   ├── test_suites/          # Sample test suites
│   ├── games/                # Game-theoretic evaluation examples
│   ├── docker/               # Docker deployment examples
│   └── ci/                   # CI/CD templates
├── tests/                    # Test suite (80%+ coverage)
│   ├── unit/                 # Unit tests
│   ├── integration/          # Integration tests
│   ├── contract/             # Protocol contract tests
│   ├── e2e/                  # End-to-end tests
│   └── fixtures/             # Test fixtures
├── spec/                     # Working directory for specifications (managed by /spec-generator-skill)
│   ├── requirements.md       # Phase 4 feature requirements (REQ-XXX)
│   ├── phase5-requirements.md # Phase 5 game-theoretic requirements
│   ├── design.md             # Phase 4 technical design (DESIGN-XXX)
│   ├── phase5-design.md      # Phase 5 technical design
│   ├── tasks.md              # Phase 4 implementation tasks (TASK-XXX)
│   ├── phase5-tasks.md       # Phase 5 implementation tasks
│   └── WORKFLOW.md           # Task management workflow guide
└── pyproject.toml            # Project configuration
```

## CLI Commands

```bash
# Run tests with CLI adapter
uv run atp test <suite.yaml> --adapter=cli \
  --adapter-config='command=python' \
  --adapter-config='args=["agent.py"]'

# Run tests with HTTP adapter
uv run atp test <suite.yaml> --adapter=http \
  --adapter-config='endpoint=http://localhost:8000'

# Run with multiple iterations and parallel execution
uv run atp test suite.yaml --adapter=cli \
  --adapter-config='command=python' \
  --adapter-config='args=["agent.py"]' \
  --runs=5 --parallel=4

# Filter by tags
uv run atp test suite.yaml --adapter=cli \
  --adapter-config='command=python' \
  --adapter-config='args=["agent.py"]' \
  --tags=smoke,core

# Output formats
uv run atp test suite.yaml --adapter=cli \
  --adapter-config='command=python' \
  --adapter-config='args=["agent.py"]' \
  --output=json --output-file=results.json

uv run atp test suite.yaml --adapter=cli \
  --adapter-config='command=python' \
  --adapter-config='args=["agent.py"]' \
  --output=junit --output-file=results.xml

# Pass environment variables (for API keys)
uv run atp test suite.yaml --adapter=cli \
  --adapter-config='command=python' \
  --adapter-config='args=["agent.py"]' \
  --adapter-config='inherit_environment=true' \
  --adapter-config='allowed_env_vars=["OPENAI_API_KEY","ANTHROPIC_API_KEY"]'

# Validate test definitions
uv run atp validate --suite=suite.yaml

# Baseline management
uv run atp baseline save suite.yaml -o baseline.json --runs=5
uv run atp baseline compare suite.yaml -b baseline.json

# Utilities
uv run atp list-agents          # List available adapters
uv run atp version              # Show version
uv run atp list suite.yaml      # List tests in a suite

# Additional commands
uv run atp init                 # Initialize ATP project
uv run atp generate             # Generate test suites
uv run atp benchmark            # Run benchmarks
uv run atp budget               # Budget management
uv run atp experiment           # Run experiments
uv run atp plugins              # Manage plugins
uv run atp game suite.yaml      # Game-theoretic evaluation
uv run atp catalog              # Browse and run tests from the catalog
uv run atp tui                  # Terminal user interface
uv run atp compare              # Multi-model comparison
uv run atp estimate             # Cost estimation
uv run atp traces               # Trace management
uv run atp replay               # Replay agent traces
uv run atp trend                # Cross-run trend analysis (regression detection)

# Suite sync (push/pull/sync YAML test suites to/from remote server)
uv run atp push suite.yaml --server=https://atp.example.com  # Upload YAML to server
uv run atp pull --server=https://atp.example.com             # Download suites from server
uv run atp sync                                               # Sync local suites with remote
```

## Documentation

### Getting Started

- [Installation Guide](docs/guides/installation.md) - Setup and dependencies
- [Quick Start Guide](docs/guides/quickstart.md) - First test suite
- [Basic Usage](docs/guides/usage.md) - Common workflows
- [Agent Testing Guide](https://github.com/andrei-shtanakov/atp-platform-testing) - Step-by-step test planning, examples, and a runnable lab

### Reference

- [Test Format Reference](docs/reference/test-format.md) - YAML structure specification
- [Adapter Configuration](docs/reference/adapters.md) - Configure agent adapters
- [Configuration Reference](docs/reference/configuration.md) - All config options
- [API Reference](docs/reference/api-reference.md) - Python API
- [Dashboard API Reference](docs/reference/dashboard-api.md) - REST API for comparison, leaderboard, timeline
- [Troubleshooting](docs/reference/troubleshooting.md) - Common issues and solutions

### Architecture

- [Vision & Goals](docs/01-vision.md) - Project vision
- [Requirements](docs/02-requirements.md) - Functional requirements
- [Architecture](docs/03-architecture.md) - System architecture
- [ATP Protocol](docs/04-protocol.md) - Protocol specification
- [Evaluation System](docs/05-evaluators.md) - Metrics and evaluation
- [Integration Guide](docs/06-integration.md) - Agent integration
- [Roadmap](docs/07-roadmap.md) - Project roadmap and milestones
- [CI/CD Integration](docs/ci-cd.md) - CI/CD setup
- [Security](docs/security.md) - Security model
- [Architecture Decision Records](docs/adr/) - Key design decisions

### Game-Theoretic Evaluation

- [game-environments README](game-environments/README.md) - Game library: API, game dev guide, strategies, analysis tools
- [atp-games README](atp-games/README.md) - ATP plugin: quick start, YAML reference, evaluators, tournaments
- [Game rules (human-readable)](docs/games/rules/) - Per-game rule sheets derived from the implementation (currently El Farol in EN/RU)

### Examples

See [examples/](examples/) for:
- [Test Suites](examples/test_suites/) - Sample test definitions
- [Game Examples](examples/games/) - Game-theoretic evaluation ([README](examples/games/README.md), no API keys needed):
  - `basic_usage.py` - Run games, strategies, and tournaments
  - `custom_game.py` - Create a new game from scratch
  - `llm_agent_eval.py` - Evaluate agents on game battery
  - `population_dynamics.py` - Evolutionary simulation
- [CI/CD Templates](examples/ci/) - GitHub Actions, GitLab CI, Jenkins, Azure, CircleCI
- [Demo Agents](examples/) - Ready-to-run example agents:
  - `demo_agent.py` - Simple file operations agent (no API keys needed)
  - `openai_agent.py` - OpenAI-powered agent with tool calling
  - `run_demo.sh` / `run_openai_demo.sh` - Quick start scripts

## Development

### Commands

```bash
# Testing
uv run pytest tests/ -v --cov=atp --cov-report=term-missing  # All tests with coverage
uv run pytest tests/unit -v                                   # Unit tests only
uv run pytest tests/ -v -m "not slow"                        # Fast tests

# Code quality
uv run ruff format .               # Format code
uv run ruff check .                # Lint check
uv run ruff check . --fix          # Auto-fix lint issues
uv run pyrefly check               # Type checking

# Task management
python task.py list                # List all tasks
python task.py next                # Show ready tasks
```

### Code Style

- Python 3.12+
- Type hints required for all code
- Line length: 88 characters
- Use Pydantic for data models
- Docstrings for public APIs
- Test coverage ≥80%

See [CLAUDE.md](CLAUDE.md) for detailed development guidelines.

### macOS launchers

For Mac users who prefer double-clicking over the CLI, see [`scripts/macos/`](scripts/macos/) — double-clickable `.command` files that install dependencies and run bundled game suites (Prisoner's Dilemma, Auction, El Farol).

## Contributing

Contributions are welcome! Please:

1. Fork the repository
2. Create a feature branch
3. Write tests for new functionality
4. Ensure all tests pass and code is formatted
5. Submit a pull request

See [CLAUDE.md](CLAUDE.md) for code style and development workflow.

## License

MIT License - see [LICENSE](LICENSE) for details.

## Support

- **Issues**: [GitHub Issues](https://github.com/andrei-shtanakov/atp-platform/issues)
- **Documentation**: [docs/](docs/)
- **Examples**: [examples/](examples/)

## Phase 5: Game-Theoretic Evaluation

ATP includes a game-theoretic evaluation framework for testing agent strategic reasoning, cooperation, and equilibrium play in multi-agent games.

### Packages

| Package | Description | Docs |
|---|---|---|
| [`game-environments`](game-environments/) | Standalone game theory library (zero ATP dependency) | [README](game-environments/README.md) |
| [`atp-games`](atp-games/) | ATP plugin for game-theoretic evaluation | [README](atp-games/README.md) |
| [`atp-platform-sdk`](packages/atp-sdk/) | Python SDK for benchmark participants | [README](packages/atp-sdk/README.md) |

### Built-in Games

Eight canonical games with known Nash equilibria for rigorous evaluation:

- **Prisoner's Dilemma** -- cooperation vs defection with configurable payoff matrix
- **Stag Hunt** -- trust vs safety, two pure Nash equilibria
- **Battle of the Sexes** -- coordination under conflicting preferences
- **Public Goods Game** -- N-player contribution with multiplier and optional punishment
- **Auction** -- first-price and second-price sealed-bid with private values
- **Colonel Blotto** -- resource allocation across multiple battlefields
- **Congestion Game** -- network routing with latency-dependent costs
- **El Farol Bar** -- bounded rationality and minority game dynamics
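
To make "known Nash equilibria" concrete, the following standalone sketch finds the pure-strategy equilibria of a one-shot Prisoner's Dilemma by checking every action pair for profitable deviations. It does not use the `game_envs` API; the payoff values are the conventional textbook ones (5, 3, 1, 0), not necessarily the library defaults.

```python
from itertools import product

ACTIONS = ("cooperate", "defect")

# (row player payoff, column player payoff) for each action pair.
PAYOFFS = {
    ("cooperate", "cooperate"): (3, 3),
    ("cooperate", "defect"): (0, 5),
    ("defect", "cooperate"): (5, 0),
    ("defect", "defect"): (1, 1),
}


def pure_nash_equilibria(payoffs: dict) -> list[tuple[str, str]]:
    """Return action pairs where neither player gains by deviating alone."""
    equilibria = []
    for a, b in product(ACTIONS, repeat=2):
        row_best = all(payoffs[(a, b)][0] >= payoffs[(alt, b)][0] for alt in ACTIONS)
        col_best = all(payoffs[(a, b)][1] >= payoffs[(a, alt)][1] for alt in ACTIONS)
        if row_best and col_best:
            equilibria.append((a, b))
    return equilibria


equilibria = pure_nash_equilibria(PAYOFFS)
print(equilibria)
```

Mutual defection is the only equilibrium even though mutual cooperation pays both players more, which is what makes the game a useful probe of cooperative behavior.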

### Game-Theoretic Evaluators

- **PayoffEvaluator** -- average payoff, distribution, social welfare, Pareto efficiency
- **ExploitabilityEvaluator** -- best-response gap, empirical strategy extraction
- **CooperationEvaluator** -- cooperation rate, conditional cooperation, reciprocity
- **EquilibriumEvaluator** -- Nash distance, convergence detection, equilibrium classification
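
As a rough illustration of the kind of metrics a cooperation evaluator reports, the sketch below computes a cooperation rate and a simple reciprocity score from two action histories. The metric definitions, function names, and the `"C"`/`"D"` encoding are assumptions for the example and may differ from what CooperationEvaluator actually computes.

```python
def cooperation_rate(actions: list[str]) -> float:
    """Fraction of rounds in which the player cooperated."""
    return sum(a == "C" for a in actions) / len(actions)


def reciprocity(own: list[str], other: list[str]) -> float:
    """Fraction of rounds (from round 2 on) where the player's move
    matches the opponent's move in the previous round."""
    matches = sum(mine == theirs for mine, theirs in zip(own[1:], other[:-1]))
    return matches / (len(own) - 1)


# A tit-for-tat style history: each move copies the opponent's previous one.
player = ["C", "C", "D", "C", "D"]
opponent = ["C", "D", "C", "D", "D"]

rate = cooperation_rate(player)
score = reciprocity(player, opponent)
print(rate, score)
```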

### Quick Start (Games)

```bash
# Run a built-in game suite
uv run atp test --suite=game:prisoners_dilemma.yaml
```

Or run a game programmatically:

```python
import asyncio

from atp_games import BuiltinAdapter, GameRunConfig, GameRunner
from game_envs import AlwaysDefect, PDConfig, PrisonersDilemma, TitForTat


async def main():
    # Prisoner's Dilemma configured with 50 rounds
    game = PrisonersDilemma(PDConfig(num_rounds=50))
    # Wrap two built-in strategies so they can play as agents
    agents = {
        "player_0": BuiltinAdapter(TitForTat()),
        "player_1": BuiltinAdapter(AlwaysDefect()),
    }
    runner = GameRunner()
    # Run 20 episodes with a fixed base seed
    result = await runner.run_game(
        game=game, agents=agents,
        config=GameRunConfig(episodes=20, base_seed=42),
    )
    print(result.average_payoffs)


asyncio.run(main())
```

See [examples/games/](examples/games/) for more examples.

## Status

**Current Status**: GA (General Availability)

All core features implemented:
- ✅ MVP: Protocol, Adapters, Runner, Evaluators, Reporters, CLI
- ✅ Beta: Framework adapters, Statistics, LLM-Judge, Baseline, HTML reports, CI/CD
- ✅ GA: Dashboard, Security hardening, Performance optimization
- ✅ Phase 5: Game-theoretic evaluation (game-environments + atp-games)
- ✅ Platform API & SDK: Benchmark/Tournament REST API, GitHub OAuth, Device Flow, Python SDK (`atp-platform-sdk`)

### Specifications Directory

The `spec/` directory is a **working directory** for current development specifications, managed by the `/spec-generator-skill` Claude skill. It contains:
- `requirements.md` — Feature requirements in Kiro-style format (REQ-XXX)
- `design.md` — Technical design and architecture (DESIGN-XXX)
- `tasks.md` — Implementation tasks with dependencies (TASK-XXX)
- `WORKFLOW.md` — Task management and executor workflow guide

Specifications evolve with the project. See [spec/tasks.md](spec/tasks.md) for current task status.
