ATP Platform Architecture

Agent Test Platform - Framework-Agnostic Testing for AI Agents

Version 1.0

Architecture Overview

High-level view of the ATP Platform and its interactions with external systems.

C4Context
    title ATP Platform - System Context

    Person(user, "Developer", "Creates test suites and reviews results")
    Person(ci, "CI/CD System", "Automated testing pipeline")

    System_Boundary(atp, "ATP Platform") {
        System(cli, "ATP CLI", "Command-line interface for running tests")
        System(dashboard, "Web Dashboard", "UI for viewing results and managing tests")
        System(runner, "Test Runner", "Orchestrates test execution")
        SystemDb(db, "SQLite DB", "Stores results and definitions")
    }

    System_Ext(agent, "AI Agent", "Agent being tested (black box)")
    System_Ext(testsite, "Test Site", "Target system for agent tasks")
    System_Ext(llm, "LLM Provider", "For LLM-as-Judge evaluation")

    Rel(user, cli, "Runs tests", "YAML config")
    Rel(user, dashboard, "Views results", "HTTP")
    Rel(ci, cli, "Automated tests", "CLI")
    Rel(cli, runner, "Executes")
    Rel(runner, agent, "ATP Protocol", "stdin/stdout or HTTP")
    Rel(agent, testsite, "Performs tasks", "HTTP")
    Rel(runner, db, "Stores results")
    Rel(dashboard, db, "Reads data")
    Rel(runner, llm, "LLM evaluation", "API")

Key Principle

Agent = Black Box with Contract

  • ATP doesn't care how the agent is implemented internally
  • Agents communicate via a standardized protocol (ATPRequest/ATPResponse)
  • Supports any framework: LangGraph, CrewAI, AutoGen, custom code
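
To make the black-box contract concrete, here is a minimal sketch of an agent that honors it: one ATPRequest in as JSON, one ATPResponse out as JSON. The function names `run_task` and `serve_once`, the default version string "1.0", and the artifact `type` value "text" are assumptions made for illustration; the platform's real agents may look different.

```python
import json
import sys
from typing import TextIO


def run_task(description: str) -> str:
    """Stand-in for the agent's internals (LangGraph, CrewAI, custom code, ...)."""
    return f"Handled task: {description}"


def serve_once(stdin: TextIO = sys.stdin, stdout: TextIO = sys.stdout) -> None:
    """Read one ATPRequest as JSON and write one ATPResponse as JSON."""
    request = json.loads(stdin.read())
    response = {
        "version": request.get("version", "1.0"),  # assumed default version
        "task_id": request["task_id"],
    }
    try:
        result = run_task(request["task"]["description"])
        response["status"] = "completed"
        response["artifacts"] = [
            # "text" is a hypothetical artifact type used only in this sketch
            {"type": "text", "content": result, "content_type": "text/plain"}
        ]
    except Exception as exc:
        # Any internal failure is reported through the contract, not a crash.
        response["status"] = "failed"
        response["error"] = str(exc)
    stdout.write(json.dumps(response) + "\n")
```

Calling `serve_once()` with no arguments at the bottom of a script would wire it to the real stdin and stdout. Everything inside `run_task` can be swapped out without the platform noticing, which is the point of the contract.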

Data Flow

How test definitions flow through the system to produce evaluation results.

flowchart LR
    subgraph Input
        YAML[("Test Suite<br/>(YAML)")]
    end

    subgraph "ATP Core"
        Loader["Loader<br/>📄"]
        Runner["Runner<br/>⚡"]
        Sandbox["Sandbox<br/>🔒"]
    end

    subgraph Adapters
        CLI["CLI Adapter<br/>stdin/stdout"]
        HTTP["HTTP Adapter<br/>REST API"]
        Docker["Docker Adapter<br/>Container"]
    end

    subgraph "Agent (Black Box)"
        Agent["AI Agent<br/>🤖"]
    end

    subgraph Evaluation
        Evaluators["Evaluators<br/>✓"]
        Scoring["Score<br/>Aggregator"]
    end

    subgraph Output
        Report["Report<br/>📊"]
        DB[("Database")]
    end

    YAML --> Loader
    Loader --> Runner
    Runner --> Sandbox
    Sandbox --> CLI & HTTP & Docker
    CLI & HTTP & Docker --> Agent
    Agent --> |"ATPResponse<br/>+ Events"| Evaluators
    Evaluators --> Scoring
    Scoring --> Report
    Scoring --> DB

    style YAML fill:#E0E7FF,stroke:#4F46E5
    style Agent fill:#DCFCE7,stroke:#10B981
    style Report fill:#FEF3C7,stroke:#F59E0B
    style DB fill:#FCE7F3,stroke:#EC4899

Events Stream (stderr)

  • progress - Task progress updates
  • tool_call - Tool invocations
  • llm_request - LLM API calls
  • reasoning - Agent thoughts
  • error - Error notifications
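
On the agent side, the event types above could be emitted with a small helper like the one sketched below. It writes one JSON object per line (JSONL) using the ATPEvent fields from the protocol. The class name `EventEmitter`, the zero-based sequence counter, the ISO 8601 timestamp format, and the version string "1.0" are assumptions, not confirmed details of the platform.

```python
import json
import sys
from datetime import datetime, timezone
from itertools import count
from typing import TextIO

# Event types listed in this document.
EVENT_TYPES = {"progress", "tool_call", "llm_request", "reasoning", "error"}


class EventEmitter:
    """Writes ATPEvent records as JSONL to a stream (stderr by default)."""

    def __init__(self, task_id: str, stream: TextIO = sys.stderr, version: str = "1.0"):
        self.task_id = task_id
        self.stream = stream
        self.version = version
        self._sequence = count()  # assumed to start at 0

    def emit(self, event_type: str, payload: dict) -> dict:
        if event_type not in EVENT_TYPES:
            raise ValueError(f"unknown event type: {event_type}")
        event = {
            "version": self.version,
            "task_id": self.task_id,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "sequence": next(self._sequence),
            "event_type": event_type,
            "payload": payload,
        }
        self.stream.write(json.dumps(event) + "\n")
        self.stream.flush()  # flush so the runner sees events as they happen
        return event
```

Keeping events on stderr leaves stdout free for the single ATPResponse, so the two channels never need to be untangled.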

Evaluator Types

  • artifact - File existence, schema validation
  • behavior - Tool usage, error handling
  • llm_judge - Quality assessment by LLM
  • code_exec - pytest, npm, lint checks

Protocol Messages

The ATP Protocol defines the contract between the platform and agents.

classDiagram
    class ATPRequest {
        +String version
        +String task_id
        +Task task
        +Constraints constraints
        +Context context
    }
    class Task {
        +String description
        +List expected_artifacts
        +Object metadata
    }
    class Constraints {
        +int max_steps
        +int timeout_seconds
        +List allowed_tools
        +List forbidden_tools
    }
    class Context {
        +Object environment
        +List artifacts
        +String working_directory
    }
    class ATPResponse {
        +String version
        +String task_id
        +Status status
        +List artifacts
        +Metrics metrics
        +String error
    }
    class Artifact {
        +String type
        +String path
        +String content
        +String content_type
    }
    class Metrics {
        +int total_steps
        +float wall_time_seconds
        +int tokens_used
        +float cost
    }
    class ATPEvent {
        +String version
        +String task_id
        +String timestamp
        +int sequence
        +String event_type
        +Object payload
    }
    class Status {
        <<enumeration>>
        completed
        partial
        failed
        timeout
    }

    ATPRequest --> Task
    ATPRequest --> Constraints
    ATPRequest --> Context
    ATPResponse --> Artifact
    ATPResponse --> Metrics
    ATPResponse --> Status
    Context --> Artifact
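
Part of the class diagram could be mirrored in code roughly as follows. This is a sketch using standard-library dataclasses, not the platform's actual models: the default values, the list element types, and the `to_json`/`from_json` helper names are assumptions, and `Context` and `Metrics` are left out to keep it short.

```python
import json
from dataclasses import asdict, dataclass, field
from enum import Enum
from typing import Any, Optional


class Status(str, Enum):
    COMPLETED = "completed"
    PARTIAL = "partial"
    FAILED = "failed"
    TIMEOUT = "timeout"


@dataclass
class Task:
    description: str
    expected_artifacts: list = field(default_factory=list)
    metadata: dict = field(default_factory=dict)


@dataclass
class Constraints:
    max_steps: Optional[int] = None          # None = no limit (assumption)
    timeout_seconds: Optional[int] = None
    allowed_tools: list = field(default_factory=list)
    forbidden_tools: list = field(default_factory=list)


@dataclass
class ATPRequest:
    task_id: str
    task: Task
    version: str = "1.0"                     # assumed version string
    constraints: Constraints = field(default_factory=Constraints)

    def to_json(self) -> str:
        # asdict() recurses into the nested Task and Constraints dataclasses.
        return json.dumps(asdict(self))


@dataclass
class ATPResponse:
    task_id: str
    status: Status
    version: str = "1.0"
    artifacts: list = field(default_factory=list)
    error: Optional[str] = None

    @classmethod
    def from_json(cls, raw: str) -> "ATPResponse":
        data: dict[str, Any] = json.loads(raw)
        return cls(
            task_id=data["task_id"],
            status=Status(data["status"]),   # raises ValueError on unknown status
            version=data.get("version", "1.0"),
            artifacts=data.get("artifacts", []),
            error=data.get("error"),
        )
```

Parsing the status through the enum means a response with an unexpected status value fails at the boundary instead of reaching the evaluators.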

Communication Pattern

# CLI Adapter (stdin/stdout)
echo '{"task_id": "t1", "task": {"description": "Find laptops"}}' | python agent.py

# HTTP Adapter (REST API)
curl -X POST http://agent:8000/execute -H 'Content-Type: application/json' -d '{"task_id": "t1", ...}'

# Events stream on stderr (JSONL)
{"event_type": "progress", "payload": {"message": "Starting search", "percentage": 0}}
{"event_type": "tool_call", "payload": {"tool": "http_get", "status": "completed"}}
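
The stdin/stdout pattern above can be sketched from the runner's side as well. The helper below pipes a request into a command, reads the response from stdout, and collects JSONL events from stderr. The name `run_cli_agent`, the policy of skipping non-JSON stderr lines, and the mapping of a subprocess timeout to the "timeout" status are assumptions; `DEMO_AGENT` is a throwaway inline agent that exists only so the sketch can run on its own.

```python
import json
import subprocess
import sys


def run_cli_agent(command: list, request: dict, timeout: float = 30.0):
    """Send one request over stdin; return (response, events)."""
    try:
        proc = subprocess.run(
            command,
            input=json.dumps(request),
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # Assumed mapping: a hung agent is reported with status "timeout".
        return {"task_id": request.get("task_id"), "status": "timeout"}, []

    response = json.loads(proc.stdout)  # stdout carries exactly one ATPResponse
    events = []
    for line in proc.stderr.splitlines():
        if not line.strip():
            continue
        try:
            events.append(json.loads(line))
        except json.JSONDecodeError:
            continue  # tolerate plain log lines mixed into stderr
    return response, events


# Hypothetical inline agent used only to exercise the helper.
DEMO_AGENT = """
import json, sys
req = json.load(sys.stdin)
print(json.dumps({"event_type": "progress", "payload": {"message": "Starting search", "percentage": 0}}), file=sys.stderr)
print("plain log line, not an event", file=sys.stderr)
print(json.dumps({"task_id": req["task_id"], "status": "completed"}))
"""

response, events = run_cli_agent(
    [sys.executable, "-c", DEMO_AGENT],
    {"task_id": "t1", "task": {"description": "Find laptops"}},
)
```

A real adapter would probably stream stderr while the agent is still running rather than reading it after exit; buffering everything keeps the sketch short.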

Adapter Types

Adapters translate between ATP Protocol and specific agent implementations.

flowchart TB
    subgraph "ATP Runner"
        Runner["Test Runner"]
    end

    subgraph "Built-in Adapters"
        CLI["CLI Adapter<br/>stdin/stdout JSON"]
        HTTP["HTTP Adapter<br/>REST API calls"]
        Docker["Docker Adapter<br/>Container execution"]
    end

    subgraph "Framework Adapters"
        LG["LangGraph<br/>State graphs"]
        CA["CrewAI<br/>Multi-agent crews"]
        AG["AutoGen<br/>Conversational agents"]
    end

    subgraph "Agent Implementations"
        PyAgent["Python Script<br/>🐍"]
        Container["Docker Container<br/>🐳"]
        Service["HTTP Service<br/>🌐"]
        LGAgent["LangGraph Agent<br/>📊"]
        CrewAgent["CrewAI Crew<br/>👥"]
        AGAgent["AutoGen Agent<br/>💬"]
    end

    Runner --> CLI & HTTP & Docker
    Runner --> LG & CA & AG
    CLI --> PyAgent
    Docker --> Container
    HTTP --> Service
    LG --> LGAgent
    CA --> CrewAgent
    AG --> AGAgent

    style Runner fill:#4F46E5,color:#fff
    style CLI fill:#E0E7FF,stroke:#4F46E5
    style HTTP fill:#E0E7FF,stroke:#4F46E5
    style Docker fill:#E0E7FF,stroke:#4F46E5
    style LG fill:#DCFCE7,stroke:#10B981
    style CA fill:#DCFCE7,stroke:#10B981
    style AG fill:#DCFCE7,stroke:#10B981

CLI Adapter Usage

agents:
  - name: "search-agent"
    type: "cli"
    config:
      command: "python"
      args: ["agent.py"]
      environment:
        TEST_SITE_URL: "http://localhost:9876"

Docker Adapter Usage

agents:
  - name: "containerized-agent"
    type: "docker"
    config:
      image: "atp-search-agent:latest"
      network: "host"
      environment:
        TEST_SITE_URL: "http://localhost:9876"
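
One way to picture what the Docker adapter does with this configuration is to translate it into a `docker run` argument list. The function `build_docker_command` is hypothetical, and the real adapter may use a Docker SDK or different flags; `--rm` and `-i` are added here on the assumption that the container is short-lived and receives its request on stdin.

```python
def build_docker_command(config: dict) -> list:
    """Translate an agent's `config` mapping into a `docker run` argv."""
    argv = ["docker", "run", "--rm", "-i"]  # -i keeps stdin open for the request

    network = config.get("network")
    if network:
        argv += ["--network", network]

    # Sort for a deterministic command line regardless of mapping order.
    for key, value in sorted(config.get("environment", {}).items()):
        argv += ["-e", f"{key}={value}"]

    argv.append(config["image"])  # the image name goes last
    return argv


command = build_docker_command(
    {
        "image": "atp-search-agent:latest",
        "network": "host",
        "environment": {"TEST_SITE_URL": "http://localhost:9876"},
    }
)
```

With `network: "host"`, `localhost:9876` inside the container resolves to the host's test site on Linux, which is why the same `TEST_SITE_URL` works for both the CLI and Docker configurations.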

Component Structure

Internal modules of the ATP Platform.

flowchart TB
    subgraph "atp/"
        subgraph "Core Modules"
            protocol["protocol/<br/>Request, Response, Event models"]
            core["core/<br/>Config, exceptions, security"]
        end

        subgraph "Execution"
            loader["loader/<br/>YAML/JSON parsing"]
            runner["runner/<br/>Test orchestration, sandbox"]
            adapters["adapters/<br/>CLI, HTTP, Docker, frameworks"]
        end

        subgraph "Evaluation"
            evaluators["evaluators/<br/>Artifact, behavior, LLM-judge"]
            scoring["scoring/<br/>Score aggregation"]
            baseline["baseline/<br/>Regression detection"]
        end

        subgraph "Output"
            reporters["reporters/<br/>Console, JSON, HTML, JUnit"]
            dashboard["dashboard/<br/>FastAPI web interface"]
        end

        subgraph "Infrastructure"
            streaming["streaming/<br/>Event buffering"]
            statistics["statistics/<br/>Mean, CI, stability"]
            cli_mod["cli/<br/>CLI entry point"]
        end
    end

    cli_mod --> loader
    loader --> runner
    runner --> adapters
    adapters --> protocol
    runner --> evaluators
    evaluators --> scoring
    scoring --> baseline
    scoring --> reporters
    scoring --> dashboard
    runner --> streaming
    baseline --> statistics

    style protocol fill:#E0E7FF,stroke:#4F46E5
    style runner fill:#DCFCE7,stroke:#10B981
    style evaluators fill:#FEF3C7,stroke:#F59E0B
    style dashboard fill:#FCE7F3,stroke:#EC4899

Evaluator Registry

Evaluator    Assertion Types
artifact     artifact_exists, contains, schema, sections
behavior     must_use_tools, max_tool_calls, no_errors, forbidden_tools
llm_judge    llm_eval
code_exec    pytest, npm, custom_command, lint
security     security
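
A registry like this one can be pictured as a mapping from assertion type to a check function. The sketch below implements one artifact assertion and the four behavior assertions over a response and its event list. The decorator-based `REGISTRY`, the `evaluate` function, and the parameter names (`path`, `tools`, `limit`) are assumptions for illustration and may not match the platform's actual schema.

```python
from typing import Callable

# An assertion takes (response, events, params) and returns pass/fail.
Assertion = Callable[[dict, list, dict], bool]
REGISTRY: dict = {}


def register(name: str):
    """Decorator that adds a check function to the registry under `name`."""
    def decorator(fn: Assertion) -> Assertion:
        REGISTRY[name] = fn
        return fn
    return decorator


def _tools_used(events: list) -> list:
    return [
        e.get("payload", {}).get("tool")
        for e in events
        if e.get("event_type") == "tool_call"
    ]


@register("artifact_exists")
def artifact_exists(response, events, params):
    return any(a.get("path") == params["path"] for a in response.get("artifacts", []))


@register("must_use_tools")
def must_use_tools(response, events, params):
    return set(params["tools"]) <= set(_tools_used(events))


@register("max_tool_calls")
def max_tool_calls(response, events, params):
    return len(_tools_used(events)) <= params["limit"]


@register("no_errors")
def no_errors(response, events, params):
    return all(e.get("event_type") != "error" for e in events)


@register("forbidden_tools")
def forbidden_tools(response, events, params):
    return not (set(params["tools"]) & set(_tools_used(events)))


def evaluate(assertions: list, response: dict, events: list) -> list:
    """Run each assertion and return (type, passed) pairs in order."""
    return [
        (a["type"], REGISTRY[a["type"]](response, events, a.get("params", {})))
        for a in assertions
    ]
```

Because behavior assertions read only the event stream and artifact assertions read only the response, neither needs to know anything about how the agent was implemented.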

Example: Web Search Agent Test

Complete flow of testing the search agent against the test site.

sequenceDiagram
    autonumber
    participant User
    participant CLI as ATP CLI
    participant Runner
    participant Adapter as CLI Adapter
    participant Agent as Search Agent
    participant Site as Test Site<br/>(port 9876)
    participant Eval as Evaluators
    participant DB as Dashboard DB

    User->>CLI: atp test web_search.yaml
    CLI->>Runner: Load test suite

    loop For each test
        Runner->>Adapter: Execute test
        Adapter->>Agent: ATPRequest (stdin)
        Agent->>Site: GET /api/products?category=laptop
        Site-->>Agent: JSON product list
        Agent-->>Adapter: ATPEvent (stderr): progress
        Agent-->>Adapter: ATPEvent (stderr): tool_call
        Agent-->>Adapter: ATPResponse (stdout)
        Adapter-->>Runner: Response + Events
        Runner->>Eval: Evaluate results
        Eval-->>Runner: Assertion results
    end

    Runner->>DB: Store results
    Runner-->>CLI: Test report
    CLI-->>User: Summary + Details

    Note over User,DB: Results visible in Dashboard at localhost:8080

Test Site (port 9876)

  • / - Homepage
  • /catalog - Product catalog (HTML)
  • /api/products - Products API (JSON)
  • /api/company - Company info
  • /about, /contact - Static pages

podman run -d -p 9876:9876 atp-test-site

Search Agent

  • Parses task description for parameters
  • Fetches data via HTTP (JSON API or HTML scraping)
  • Emits progress events to stderr
  • Returns structured results as artifacts

podman run -i --network=host atp-search-agent
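
The first two behaviors of the search agent, parsing the task description and fetching from the JSON API, might look roughly like the sketch below. The keyword list in `KNOWN_CATEGORIES`, the function names, and the fallback base URL are assumptions; only the `/api/products?category=laptop` endpoint and the `TEST_SITE_URL` variable come from this document, and the real agent's parsing is likely more capable.

```python
import json
import os
import re
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical category keywords; only "laptop" appears in this document.
KNOWN_CATEGORIES = ("laptop", "phone", "tablet")


def extract_category(description: str) -> Optional[str]:
    """Return the first known category mentioned in the task description."""
    for category in KNOWN_CATEGORIES:
        if re.search(rf"\b{category}s?\b", description, re.IGNORECASE):
            return category
    return None


def build_products_url(base_url: str, category: Optional[str]) -> str:
    """Build the Products API URL, adding a category filter when one is given."""
    query = f"?{urlencode({'category': category})}" if category else ""
    return f"{base_url.rstrip('/')}/api/products{query}"


def fetch_products(description: str, base_url: Optional[str] = None):
    """Fetch products for a task description (requires the test site to be running)."""
    base_url = base_url or os.environ.get("TEST_SITE_URL", "http://localhost:9876")
    url = build_products_url(base_url, extract_category(description))
    with urlopen(url, timeout=10) as resp:
        return json.load(resp)
```

For the task "Find laptops", `extract_category` yields `laptop` and the agent requests `/api/products?category=laptop`, matching step 5 of the sequence diagram.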