Metadata-Version: 2.4
Name: agentqualify
Version: 0.2.4
Summary: Open-source toolkit for evaluating AI agents using AWS Bedrock AgentCore Evaluations
Author: AgentQualify Contributors
License-Expression: MIT-0
Project-URL: Homepage, https://github.com/agentqualify/agentqualify
Project-URL: Repository, https://github.com/agentqualify/agentqualify
Project-URL: Issues, https://github.com/agentqualify/agentqualify/issues
Project-URL: Documentation, https://github.com/agentqualify/agentqualify#readme
Project-URL: Changelog, https://github.com/agentqualify/agentqualify/releases
Keywords: aws,bedrock,agentcore,evaluation,agent,llm,ci-cd,ground-truth
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Software Development :: Testing
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Typing :: Typed
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: boto3>=1.34.0
Requires-Dist: click>=8.1.0
Requires-Dist: pydantic>=2.5.0
Requires-Dist: pyyaml>=6.0.1
Requires-Dist: rich>=13.0.0
Provides-Extra: agentcore-sdk
Requires-Dist: bedrock-agentcore; extra == "agentcore-sdk"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: moto[cloudwatch,s3]>=5.0; extra == "dev"
Requires-Dist: pytest-mock; extra == "dev"
Requires-Dist: build; extra == "dev"
Requires-Dist: twine; extra == "dev"
Requires-Dist: ruff>=0.8.0; extra == "dev"
Requires-Dist: bandit>=1.8.0; extra == "dev"
Requires-Dist: safety>=3.0.0; extra == "dev"
Requires-Dist: pip-audit>=2.7.0; extra == "dev"
Dynamic: license-file

# AgentQualify

Open-source Python toolkit for evaluating AI agents using [AWS Bedrock AgentCore Evaluations](https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/evaluations.html). Run evaluations, compare models, gate CI/CD pipelines, and visualize results in CloudWatch dashboards — all from a single YAML config.

## Why AgentQualify?

### The problem without it

Teams building agents on AgentCore today face a painful integration gap. The Evaluate API exists, but turning it into a repeatable development workflow requires solving every piece yourself:

- **Manual glue code for every project.** You need to write scripts that invoke your agent, wait for CloudWatch to ingest spans, query the right log groups, parse OTEL traces, construct the `evaluationReferenceInputs` payload, call the Evaluate API per evaluator, and aggregate the results. That's hundreds of lines of boilerplate before you've evaluated a single test case.
- **No standard way to define test suites.** Without a shared format, every team invents their own way to store test prompts, expected responses, and assertions. These ad-hoc scripts are fragile, hard to review in PRs, and impossible to share across teams.
- **No CI/CD integration out of the box.** The Evaluate API returns scores, but it doesn't tell your pipeline to pass or fail. You need to build threshold checking, regression detection against a baseline, and exit code logic yourself — and get it right for every agent.
- **Model comparison requires duplicated effort.** Evaluating the same agent across Claude Sonnet, Haiku, and Nova Pro means tripling your invocation and evaluation code, then manually aligning results for comparison.
- **Ground truth is hard to wire up correctly.** The `evaluationReferenceInputs` schema has different scoping rules for `expectedResponse` (trace-level), `assertions` (session-level), and `expectedTrajectory` (session-level). Getting the payload structure wrong means silent evaluation failures or misleading scores.
- **No regression tracking.** Without a baseline stored somewhere and compared automatically, you can't answer "did this prompt change make the agent worse?" — the most common question in agent development.
- **Reporting is an afterthought.** Scores come back as raw JSON. Building CloudWatch dashboards with model comparisons, per-test breakdowns, and trend lines is a separate project entirely.

The result: most teams either skip automated evaluation entirely or build fragile one-off scripts that break on the next agent update. Quality becomes a manual spot-check instead of a continuous signal.

### What AgentQualify gives you

AgentQualify wraps that entire lifecycle into a single `pip install` and one YAML file:

- **Go from zero to CI/CD quality gate in minutes.** Define your test cases, set score thresholds, and run `agentqualify run`. The CLI exits 0 on pass, 1 on failure — plug it into any pipeline.
- **Compare models without changing agent code.** Evaluate the same agent backed by Claude Sonnet, Haiku, Nova Pro, or any Bedrock model side-by-side. Switch models via endpoint qualifiers, payload overrides, or separate runtime ARNs — all config-driven.
- **Catch regressions automatically.** Store baselines in S3 and fail the build when scores drop beyond a configurable threshold. No more "the agent feels worse" — you'll have numbers.
- **Ground truth support built in.** Define expected responses, assertions, and tool trajectories directly in your test suite. AgentQualify handles the complex scoping rules — `expectedResponse` is automatically scoped to the correct trace using trace IDs extracted from CloudWatch spans, while `assertions` and `expectedTrajectory` are scoped to the session level. No manual payload construction needed.
- **Resilient span handling.** AgentQualify fetches spans and events from CloudWatch, deduplicates across log groups, and automatically filters out incomplete spans (missing log events) so a single delayed span doesn't block your entire evaluation.
- **Framework-agnostic.** Works with any agent deployed on AgentCore Runtime regardless of framework — Strands, LangGraph, CrewAI, or custom. If it emits OpenTelemetry traces, AgentQualify can evaluate it.
- **Designed for open source.** MIT-0 licensed, minimal dependencies, no vendor lock-in beyond the AWS APIs you're already using.

## Features

- **YAML-driven** — one config file controls everything: agent, models, evaluators, thresholds, baseline, reporting
- **Model comparison** — evaluate the same agent backed by different Bedrock models side-by-side
- **16 built-in evaluators** — Helpfulness, Correctness, GoalSuccessRate, three Trajectory matchers, ToolSelectionAccuracy, and more
- **Ground truth** — expected responses (trace-level), natural-language assertions (session-level), and expected tool trajectories (session-level) — scoping handled automatically
- **Resilient evaluation** — automatic span deduplication, incomplete span filtering, and detailed error reporting
- **Custom evaluators** — reference your own AgentCore custom evaluator ARNs
- **CI/CD gate** — threshold checks + regression checks vs S3 baseline; exits 0/1
- **CloudWatch dashboard** — auto-generated detailed dashboard with scores, trends, token usage, per-test breakdowns
- **CLI + Python SDK** — use from the command line or import in your own code

## Architecture

```
YAML Config + Test Suite
        │
        ▼
  AgentQualify Core
  ┌──────────────────────────────────────────────────────────┐
  │  Agent Invoker (SigV4 or OAuth)                          │
  │  └─► invoke_agent_runtime() per model variant            │
  │       │                                                  │
  │       ▼                                                  │
  │  Span Collector                                          │
  │  ├── Fetch from aws/spans + runtime log group            │
  │  ├── Deduplicate by (traceId, spanId)                    │
  │  ├── Filter incomplete spans (missing log events)        │
  │  └── Extract trace IDs for ground truth scoping          │
  │       │                                                  │
  │       ▼                                                  │
  │  Evaluation Runner ──► AgentCore Evaluate API (boto3)    │
  │  └── Per-evaluator ground truth scoping:                 │
  │      ├── expectedResponse → trace-level (with traceId)   │
  │      ├── assertions → session-level                      │
  │      └── expectedTrajectory → session-level              │
  │       │                                                  │
  │       ▼                                                  │
  │  Results Aggregator                                      │
  │  ├── CloudWatch Metrics Publisher (put_metric_data)      │
  │  ├── Dashboard Generator (put_dashboard)                 │
  │  ├── CI/CD Gate (threshold + regression)                 │
  │  └── Baseline Manager (S3)                               │
  └──────────────────────────────────────────────────────────┘
        │
        ▼
  Exit 0 (pass) / Exit 1 (fail)
```

## Installation

```bash
pip install agentqualify
```

## Quickstart

**1. Create your config file (`agentqualify.yaml`):**

```yaml
agent:
  runtime_arn: "arn:aws:bedrock-agentcore:us-east-1:123456789012:agent-runtime/my-agent"
  region: "us-east-1"

models:
  strategy: "qualifier"
  variants:
    - name: "claude-sonnet"
      qualifier: "sonnet-endpoint"
    - name: "claude-haiku"
      qualifier: "haiku-endpoint"

evaluators:
  - Builtin.Helpfulness
  - Builtin.Correctness
  - Builtin.GoalSuccessRate

input:
  mode: "test_suite"
  test_suite: "agent_tests.yaml"

thresholds:
  Builtin.Helpfulness: 0.70
  Builtin.Correctness: 0.75

baseline:
  enabled: true
  s3_uri: "s3://my-bucket/agentqualify/baseline.json"
  max_regression: 0.05

reporting:
  cloudwatch:
    enabled: true
    namespace: "AgentQualify"
    dashboard_name: "AgentQualify-MyAgent"
```

**2. Create your test suite (`agent_tests.yaml`):**

```yaml
tests:
  - name: "weather_query"
    prompt: "What's the weather in Seattle?"

  - name: "factual_check"
    prompt: "What is the capital of France?"
    expected_response: "The capital of France is Paris."

  # Multi-turn with per-turn expected responses
  - name: "multi_turn"
    turns:
      - prompt: "What is 15 + 27?"
        expected_response: "15 + 27 = 42"
      - prompt: "What's the weather?"
        expected_response: "The weather is sunny"
    assertions:
      - "Agent used the calculator tool for the math question"
    expected_trajectory: ["calculator", "weather"]
```

Ground truth fields (`expected_response`, `assertions`, `expected_trajectory`) are all optional. Omit them to run evaluations without ground truth — the framework is fully backward compatible.

**3. Run:**

```bash
agentqualify run --config agentqualify.yaml
```

## CLI Reference

```
agentqualify run --config agentqualify.yaml                        # Run evaluations + CI/CD gate
agentqualify run --config agentqualify.yaml -o results.json        # Run and save results to JSON
agentqualify run --config agentqualify.yaml --update-baseline      # Run, check gate, and save baseline to S3 on pass
agentqualify baseline update --config agentqualify.yaml            # Run and save results as baseline (unconditional)
agentqualify baseline show --config agentqualify.yaml              # Print current S3 baseline
agentqualify list-evaluators                                       # List all built-in evaluators
```

## Python SDK

```python
from agentqualify import AgentQualify
import sys

result = AgentQualify("agentqualify.yaml").run()
print(result.summary())
sys.exit(0 if result.passed else 1)
```

```python
# Update baseline programmatically
AgentQualify("agentqualify.yaml").update_baseline()
```

## Config Reference

### `agent`
| Field | Type | Required | Description |
|---|---|---|---|
| `runtime_arn` | string | ✅ | AgentCore runtime ARN |
| `agent_id` | string | | Agent ID for span lookup — derived from `runtime_arn` if omitted |
| `region` | string | | AWS region (default: `us-east-1`) |
| `span_wait_seconds` | int | | Wait after invocation for spans (default: `180`) |
| `auth.type` | `sigv4` \| `oauth` | | Authorization method (default: `sigv4`) |
| `auth.oauth.token_url` | string | when `oauth` | OAuth2 token endpoint URL |
| `auth.oauth.client_id` | string | when `oauth` | OAuth2 client ID (supports `${ENV_VAR}`) |
| `auth.oauth.client_secret` | string | when `oauth` | OAuth2 client secret (supports `${ENV_VAR}`) |
| `auth.oauth.scopes` | list of strings | | OAuth2 scopes to request (optional) |

### Authentication

AgentQualify supports two inbound authorization methods for invoking your agent runtime.

#### IAM SigV4 (default)

Uses standard AWS credentials (environment variables, an IAM role, or an SSO profile) via the boto3 SDK. No auth block is needed in the config — just make sure the calling identity has the `bedrock-agentcore:InvokeAgentRuntime` permission.

```yaml
agent:
  runtime_arn: "arn:aws:bedrock-agentcore:us-east-1:123456789012:agent-runtime/my-agent"
  agent_id: "my-agent-id"
  # auth.type defaults to "sigv4" — no auth block needed
```

#### OAuth (JWT Bearer Token)

When your agent runtime is configured with a [JWT inbound authorizer](https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/inbound-jwt-authorizer.html), the boto3 SDK (SigV4) cannot be used. AgentQualify handles this by automatically fetching an access token using the OAuth2 client credentials grant and attaching it as a Bearer token on each HTTPS invocation.

Tokens are cached in memory and refreshed automatically before they expire, so long-running evaluation suites work without interruption.

```yaml
agent:
  runtime_arn: "arn:aws:bedrock-agentcore:us-east-1:123456789012:agent-runtime/my-agent"
  agent_id: "my-agent-id"
  auth:
    type: "oauth"
    oauth:
      token_url: "https://cognito-idp.us-east-1.amazonaws.com/us-east-1_XXXXX/oauth2/token"
      client_id: "${OAUTH_CLIENT_ID}"
      client_secret: "${OAUTH_CLIENT_SECRET}"
      scopes:
        - "agentcore/invoke"
```

Set the environment variables before running:

```bash
export OAUTH_CLIENT_ID="your-client-id"
export OAUTH_CLIENT_SECRET="your-client-secret"
agentqualify run --config agentqualify.yaml
```

All `auth.oauth` string fields support `${ENV_VAR}` syntax so secrets stay out of your YAML files. You can also inline values directly if preferred (e.g., in a CI/CD secret-injected config).
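For illustration, the token exchange is the standard OAuth2 client credentials flow. The sketch below is not AgentQualify's internal code; the `requests` usage, caching details, and refresh margin are assumptions based on that standard flow:

```python
import os
import time

import requests  # assumption: AgentQualify's actual HTTP client may differ


_cache = {"token": None, "expires_at": 0.0}


def get_bearer_token(token_url: str, scopes: list[str]) -> str:
    """Fetch and cache an OAuth2 client-credentials access token (sketch)."""
    if _cache["token"] and time.time() < _cache["expires_at"] - 60:  # refresh 60s early
        return _cache["token"]
    resp = requests.post(
        token_url,
        data={
            "grant_type": "client_credentials",
            "client_id": os.environ["OAUTH_CLIENT_ID"],
            "client_secret": os.environ["OAUTH_CLIENT_SECRET"],
            "scope": " ".join(scopes),
        },
        timeout=30,
    )
    resp.raise_for_status()
    body = resp.json()
    _cache["token"] = body["access_token"]  # sent as "Authorization: Bearer <token>"
    _cache["expires_at"] = time.time() + body.get("expires_in", 3600)
    return _cache["token"]
```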

> **Note:** An AgentCore Runtime supports either IAM SigV4 or JWT Bearer Token inbound auth, not both simultaneously. Make sure `auth.type` matches how your runtime is configured.

### `models`

The `strategy` field controls how AgentQualify switches between model variants during invocation. Choose based on how your agent is deployed.

#### Strategy: `qualifier` — one deployment, multiple endpoints

**When to use:** Your agent is deployed once on AgentCore Runtime with multiple endpoint qualifiers, each configured to use a different Bedrock model. This is the most common setup for model comparison — one codebase, one deployment, different model backends.

```yaml
models:
  strategy: "qualifier"
  variants:
    - name: "claude-sonnet-4-5"
      qualifier: "sonnet-endpoint"    # endpoint configured with Claude Sonnet 4.5
    - name: "claude-haiku-3-5"
      qualifier: "haiku-endpoint"     # endpoint configured with Claude Haiku 3.5
    - name: "amazon-nova-pro"
      qualifier: "nova-pro-endpoint"  # endpoint configured with Amazon Nova Pro
```

Best for: "I want to find the best quality/cost tradeoff for my agent without changing any code."

See full example: [`examples/strategy_qualifier.yaml`](examples/strategy_qualifier.yaml)

---

#### Strategy: `payload` — model ID passed in the request

**When to use:** Your agent reads the model ID (or other model config like temperature) from the invocation payload and selects the Bedrock model dynamically at runtime. Use this when you've built a model-agnostic agent or want to test different inference parameters without redeploying.

```yaml
models:
  strategy: "payload"
  variants:
    - name: "claude-sonnet-default-temp"
      payload_override:
        model_id: "anthropic.claude-sonnet-4-5-20250929-v1:0"
        temperature: 0.7
    - name: "claude-sonnet-low-temp"
      payload_override:
        model_id: "anthropic.claude-sonnet-4-5-20250929-v1:0"
        temperature: 0.1            # more deterministic
    - name: "nova-pro"
      payload_override:
        model_id: "amazon.nova-pro-v1:0"
        temperature: 0.7
```

The `payload_override` fields are merged into every test case's payload before invocation.
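For intuition, the merge behaves like a dict update where the override wins on key collisions. This is a hypothetical illustration (whether nested fields are deep-merged is an assumption):

```python
# Hypothetical illustration of the payload_override merge; shallow-merge
# semantics are an assumption, not documented behavior.
test_payload = {"prompt": "What is the capital of France?"}
payload_override = {"model_id": "amazon.nova-pro-v1:0", "temperature": 0.7}
invocation_payload = {**test_payload, **payload_override}
# {'prompt': 'What is the capital of France?',
#  'model_id': 'amazon.nova-pro-v1:0', 'temperature': 0.7}
```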

Best for: "My agent accepts `model_id` in the payload" or "I want to compare different temperature settings on the same model."

See full example: [`examples/strategy_payload.yaml`](examples/strategy_payload.yaml)

---

#### Strategy: `separate_runtimes` — independently deployed agents

**When to use:** Each variant is a completely separate AgentCore Runtime with its own ARN. Use this when different variants have different agent code, tool sets, or system prompts that can't be toggled via a qualifier or payload — or when you're comparing independently built agents (e.g., v1 vs v2, LangGraph vs Strands).

```yaml
models:
  strategy: "separate_runtimes"
  variants:
    - name: "agent-v1-production"
      runtime_arn: "arn:aws:bedrock-agentcore:us-east-1:123456789012:agent-runtime/agent-v1"
    - name: "agent-v2-candidate"
      runtime_arn: "arn:aws:bedrock-agentcore:us-east-1:123456789012:agent-runtime/agent-v2"
    - name: "agent-langgraph-experiment"
      runtime_arn: "arn:aws:bedrock-agentcore:us-east-1:123456789012:agent-runtime/agent-langgraph"
```

Best for: "I rewrote my agent with a new tool set and want to compare v1 vs v2 before promoting to production" or "Two teams built separate agents and I want to benchmark them on the same test suite."

See full example: [`examples/strategy_separate_runtimes.yaml`](examples/strategy_separate_runtimes.yaml)

---

#### Strategy decision guide

| Scenario | Strategy |
|---|---|
| Same agent code, different Bedrock models | `qualifier` |
| Same agent code, different inference params (temperature, max_tokens) | `payload` |
| Agent reads model ID from request payload | `payload` |
| Different agent versions (v1 vs v2) | `separate_runtimes` |
| Different agent frameworks (Strands vs LangGraph) | `separate_runtimes` |
| Production vs canary agent comparison | `separate_runtimes` |
| A/B test between two live deployments | `separate_runtimes` |

---

| Field | Type | Description |
|---|---|---|
| `strategy` | `qualifier` \| `payload` \| `separate_runtimes` | How to switch models |
| `variants[].name` | string | Display name for this model variant |
| `variants[].qualifier` | string | Endpoint qualifier (strategy: `qualifier`) |
| `variants[].payload_override` | dict | Extra payload fields merged into every test (strategy: `payload`) |
| `variants[].runtime_arn` | string | Separate runtime ARN (strategy: `separate_runtimes`) |

### `evaluators`
List of evaluator IDs. Use `Builtin.<Name>` for built-in evaluators or a full ARN for custom evaluators.

**Built-in evaluators:**

| ID | Level | Ground truth field | Description |
|---|---|---|---|
| `Builtin.Correctness` | trace | `expectedResponse` | Factual accuracy compared against expected answer (LLM-as-Judge). Works without ground truth too. |
| `Builtin.GoalSuccessRate` | session | `assertions` | Whether agent behavior satisfies natural-language assertions (LLM-as-Judge) |
| `Builtin.TrajectoryExactOrderMatch` | session | `expectedTrajectory` | Actual tool sequence must match expected exactly — same tools, same order, no extras |
| `Builtin.TrajectoryInOrderMatch` | session | `expectedTrajectory` | Expected tools must appear in order, extra tools allowed between them |
| `Builtin.TrajectoryAnyOrderMatch` | session | `expectedTrajectory` | All expected tools must be present, order doesn't matter, extras allowed |
| `Builtin.Helpfulness` | trace | — | How effectively the response helps users progress toward their goals |
| `Builtin.Coherence` | trace | — | Logical consistency of the response |
| `Builtin.Conciseness` | trace | — | Efficiency of information delivery |
| `Builtin.Faithfulness` | trace | — | Consistency with conversation history |
| `Builtin.Harmfulness` | trace | — | Detection of harmful content |
| `Builtin.InstructionFollowing` | trace | — | Adherence to explicit instructions |
| `Builtin.Refusal` | trace | — | Detection of declined requests |
| `Builtin.ResponseRelevance` | trace | — | How well the response addresses the request |
| `Builtin.Stereotyping` | trace | — | Detection of bias and stereotypical content |
| `Builtin.ToolSelectionAccuracy` | tool | — | Whether the appropriate tool was chosen |
| `Builtin.ToolParameterAccuracy` | tool | — | Whether tool parameters are correct |

The first 5 evaluators accept ground truth. The remaining 11 evaluate based on conversation context alone — they never receive ground truth fields and are unaffected by their presence in your test suite.

### `input`
| Field | Type | Description |
|---|---|---|
| `mode` | `test_suite` \| `sessions` \| `both` | Input source |
| `test_suite` | string | Path to test suite YAML file |
| `sessions[].session_id` | string | Existing session ID to evaluate |
| `sessions[].name` | string | Display name for this session |

### Ground truth (test suite)

Ground truth fields are defined per test case in your test suite YAML. All fields are optional — omit them entirely to run evaluations without ground truth (backward compatible).

| Field | Type | Scope | Used by | Description |
|---|---|---|---|---|
| `expected_response` | string | trace | `Builtin.Correctness` | Reference answer to compare against the agent's response (single-turn or last turn) |
| `turns[].expected_response` | string | trace | `Builtin.Correctness` | Per-turn reference answer for multi-turn tests |
| `assertions` | list of strings | session | `Builtin.GoalSuccessRate` | Natural-language conditions the session must satisfy |
| `expected_trajectory` | list of strings | session | `TrajectoryExactOrderMatch`, `TrajectoryInOrderMatch`, `TrajectoryAnyOrderMatch` | Ordered list of expected tool names |

#### How ground truth scoping works

The AgentCore Evaluate API enforces strict scoping rules for ground truth fields:

- **Session-level** (`assertions`, `expectedTrajectory`) — scoped by `sessionId`. These apply to the entire conversation and are sent to session-level evaluators like `GoalSuccessRate` and the trajectory matchers.
- **Trace-level** (`expectedResponse`) — scoped by `traceId`. This applies to a specific turn in the conversation and is sent to `Builtin.Correctness`.

AgentQualify handles this automatically:

1. **Trace ID extraction** — after fetching spans from CloudWatch, the framework extracts trace IDs from root spans (one per turn) in chronological order.
2. **Automatic scoping** — `expectedResponse` is attached to the last trace ID (matching the convention that a single expected response applies to the final agent reply). `assertions` and `expectedTrajectory` are attached to the session ID.
3. **Per-evaluator routing** — ground truth reference inputs are only sent to evaluators that support them. Evaluators like `Builtin.Helpfulness` never receive ground truth fields, so they're unaffected.
4. **Graceful fallback** — if no trace IDs can be extracted (e.g., spans haven't propagated yet), `expectedResponse` is silently omitted and `Builtin.Correctness` runs in its ground-truth-free mode.

Evaluators that receive ground truth fields they don't use will report them in `ignoredReferenceInputFields` — this is informational, not an error.
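For intuition, here is a sketch of the scoped reference inputs as plain Python data. The field nesting below is illustrative only (an assumption, not the Evaluate API's exact schema — see the AWS docs for that):

```python
# Hypothetical shape only; the real payload structure may differ. The point is
# the scoping: session-level fields key off sessionId, expectedResponse off traceId.
reference_inputs = {
    "session": {
        "sessionId": "session-abc123",
        "assertions": ["Agent called check_balance before get_transaction_history"],
        "expectedTrajectory": ["check_balance", "get_transaction_history"],
    },
    "traces": [
        {
            "traceId": "trace-of-final-turn",  # extracted from CloudWatch root spans
            "expectedResponse": "Your checking account balance is $5,420.50 as of today.",
        }
    ],
}
```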

**Single-turn with ground truth:**
```yaml
tests:
  - name: "balance_check"
    prompt: "What is my checking account balance?"
    expected_response: "Your checking account balance is $5,420.50 as of today."
    assertions:
      - "Agent called check_balance before get_transaction_history"
      - "Response did not expose full account numbers"
    expected_trajectory: ["check_balance", "get_transaction_history"]
```

**Multi-turn with per-turn expected responses (using `turns` format):**
```yaml
tests:
  - name: "math_then_weather"
    turns:
      - prompt: "What is 15 + 27?"
        expected_response: "15 + 27 = 42"
      - prompt: "What's the weather?"
        expected_response: "The weather is sunny"
    expected_trajectory: ["calculator", "weather"]
```

**Multi-turn with `prompts` list (legacy format):**
```yaml
tests:
  - name: "booking_flow"
    prompts:
      - "Find flights from SEA to NYC"
      - "Book the cheapest one"
    expected_response: "Booked flight DL420 for $290"  # applies to last turn
    assertions:
      - "Agent called search_flights before book_flight"
    expected_trajectory: ["search_flights", "book_flight"]
```

### Evaluation pipeline

When you run `agentqualify run`, the evaluation phase works as follows:

**1. Span collection** — For each session, AgentQualify queries two CloudWatch log groups:
- `aws/spans` — span metadata (trace structure, timing, attributes)
- `/aws/bedrock-agentcore/runtimes/{agent_id}-DEFAULT` — span events (input/output payloads)

Both record types are required by the Evaluate API, so the framework merges the results from the two log groups.
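For reference, both log groups can be queried with CloudWatch Logs Insights. A minimal sketch is shown below; the query string and `my-agent-id` are placeholders, not the queries AgentQualify actually issues:

```python
import time

import boto3

logs = boto3.client("logs", region_name="us-east-1")
now = int(time.time())

# Placeholder query; "my-agent-id" stands in for your agent_id.
query = logs.start_query(
    logGroupNames=[
        "aws/spans",
        "/aws/bedrock-agentcore/runtimes/my-agent-id-DEFAULT",
    ],
    startTime=now - 3600,
    endTime=now,
    queryString="fields @message | limit 10000",
)
while True:
    results = logs.get_query_results(queryId=query["queryId"])
    if results["status"] in ("Complete", "Failed", "Cancelled"):
        break
    time.sleep(2)
```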

**2. Deduplication** — Spans that appear in both log groups are deduplicated by `(traceId, spanId, record type)` so the same record is never sent twice.
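A minimal sketch of that deduplication (the record-type field name is a hypothetical placeholder):

```python
# Keep the first occurrence of each (traceId, spanId, record type) triple.
# "recordType" is a hypothetical field name used for illustration.
def dedupe(records: list[dict]) -> list[dict]:
    seen: set[tuple] = set()
    unique = []
    for r in records:
        key = (r.get("traceId"), r.get("spanId"), r.get("recordType"))
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique
```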

**3. Incomplete span filtering** — The Evaluate API rejects the entire evaluation if any span with a [supported scope](https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/understanding-input-spans.html) (`strands.telemetry.tracer`, `opentelemetry.instrumentation.langchain`, `openinference.instrumentation.langchain`) is missing its corresponding event record. AgentQualify detects these orphaned spans and removes them, so the remaining complete data can still be evaluated. You'll see a warning in the logs:

```
WARNING Filtered 2 incomplete span(s) missing log events (572 → 570)
```
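Conceptually, the filter keeps a span only if its scope doesn't require events or a matching event record exists. A sketch (the scope names come from the list above; the record field names are assumptions):

```python
# Sketch of incomplete-span filtering. Field names ("scope", "traceId",
# "spanId") are assumptions, not AgentQualify's actual keys.
SUPPORTED_SCOPES = {
    "strands.telemetry.tracer",
    "opentelemetry.instrumentation.langchain",
    "openinference.instrumentation.langchain",
}


def filter_incomplete(spans: list[dict], events: list[dict]) -> list[dict]:
    event_keys = {(e.get("traceId"), e.get("spanId")) for e in events}
    return [
        s for s in spans
        if s.get("scope") not in SUPPORTED_SCOPES
        or (s.get("traceId"), s.get("spanId")) in event_keys
    ]
```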

**4. Trace ID extraction** — Root spans (no `parentSpanId`) are identified and their `traceId` values extracted in chronological order. These are used to scope `expectedResponse` ground truth to the correct trace.
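Sketched in Python (field names follow OTEL conventions but are assumptions here):

```python
spans: list[dict] = []  # the deduplicated, filtered spans from steps 1-3

# Root spans have no parentSpanId; there is one root span per conversation turn.
root_spans = [s for s in spans if not s.get("parentSpanId")]
root_spans.sort(key=lambda s: s.get("startTimeUnixNano", 0))  # field name assumed
trace_ids = [s["traceId"] for s in root_spans]
last_trace_id = trace_ids[-1] if trace_ids else None  # used to scope expectedResponse
```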

**5. Evaluate API calls** — Each evaluator is called separately via `boto3`. Ground truth reference inputs are constructed per-evaluator with correct scoping, and only attached to evaluators that support them.

### `thresholds`
Map of `evaluator_id → minimum_score`. The CI/CD gate fails if any evaluator's score falls below its threshold.
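Conceptually (a sketch with hypothetical result values):

```python
# Threshold gate sketch: any evaluator scoring below its minimum fails the gate.
thresholds = {"Builtin.Helpfulness": 0.70, "Builtin.Correctness": 0.75}
scores = {"Builtin.Helpfulness": 0.82, "Builtin.Correctness": 0.71}  # hypothetical

failures = {e: s for e, s in scores.items() if e in thresholds and s < thresholds[e]}
passed = not failures  # Correctness 0.71 < 0.75, so the gate fails here
```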

### `baseline`
| Field | Type | Description |
|---|---|---|
| `enabled` | bool | Enable regression checks — load baseline from S3 and fail the gate if scores drop more than `max_regression`. Does not affect saving. |
| `s3_uri` | string | S3 URI for baseline JSON (`s3://bucket/path.json`). Required for both regression checks and saving. |
| `max_regression` | float | Max allowed score drop (default: `0.05`) |

`baseline.enabled` only controls whether regression checks run during `agentqualify run`. Saving the baseline is controlled separately via `--update-baseline` or `agentqualify baseline update` — both only require `s3_uri` to be set.
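The regression check itself is, conceptually (a sketch with hypothetical values):

```python
# Regression gate sketch: fail when a score drops more than max_regression
# below the stored baseline. Values are hypothetical.
max_regression = 0.05
baseline = {"Builtin.Helpfulness": 0.85}  # loaded from s3_uri
scores = {"Builtin.Helpfulness": 0.78}    # current run

regressions = {
    e: round(baseline[e] - s, 4)
    for e, s in scores.items()
    if e in baseline and baseline[e] - s > max_regression
}
# 0.85 - 0.78 = 0.07 > 0.05, so this run is flagged as a regression
```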

### `reporting.cloudwatch`
| Field | Type | Description |
|---|---|---|
| `enabled` | bool | Publish metrics and create dashboard |
| `namespace` | string | CloudWatch namespace (default: `AgentQualify`) |
| `dashboard_name` | string | Dashboard name |

## CI/CD Integration

### GitHub Actions

See [`examples/github-actions.yml`](examples/github-actions.yml) for a complete workflow.

Key pattern:
```yaml
- name: Run evaluations
  run: agentqualify run --config agentqualify.yaml
  # Exits 0 on pass, 1 on threshold/regression failure

- name: Update baseline (main branch only)
  if: github.ref == 'refs/heads/main' && success()
  run: agentqualify baseline update --config agentqualify.yaml
```

### AWS CodePipeline

Add a CodeBuild step with:
```bash
pip install agentqualify
agentqualify run --config agentqualify.yaml
```

The non-zero exit code on failure will automatically fail the pipeline stage.

## IAM Permissions

The IAM role running AgentQualify needs:

```json
{
  "Effect": "Allow",
  "Action": [
    "bedrock-agentcore:InvokeAgentRuntime",
    "bedrock-agentcore:Evaluate",
    "logs:StartQuery",
    "logs:GetQueryResults",
    "cloudwatch:PutMetricData",
    "cloudwatch:PutDashboard",
    "s3:GetObject",
    "s3:PutObject"
  ],
  "Resource": "*"
}
```

> **OAuth note:** When using `auth.type: "oauth"`, agent invocation bypasses IAM SigV4 and uses a Bearer token instead, so `bedrock-agentcore:InvokeAgentRuntime` is not needed in your IAM policy. The remaining permissions (Evaluate, CloudWatch, S3, Logs) are still required for evaluations, reporting, and baseline management.

## Development

```bash
git clone https://github.com/agentqualify/agentqualify
cd agentqualify
pip install -e ".[dev]"
pytest
```

Lint and security checks (also run in CI):

```bash
ruff check src/ tests/            # lint
ruff format --check src/ tests/   # format check
bandit -r src/ -c pyproject.toml  # static security analysis
pip-audit                         # dependency vulnerability scan
```

## License

MIT-0 — see [LICENSE](LICENSE).
