Metadata-Version: 2.4
Name: agent-diagnostics
Version: 0.7.0
Summary: Agent Reliability Observatory — a behavioral taxonomy and annotation framework for analyzing why coding agents succeed or fail.
Project-URL: Repository, https://github.com/sjarmak/agent-diagnostics
Author: CodeScaleBench Team
License-Expression: Apache-2.0
License-File: LICENSE
Keywords: agents,annotation,benchmarks,reliability,taxonomy
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Requires-Dist: anthropic<1.0,>=0.30
Requires-Dist: duckdb<2.0,>=1.0
Requires-Dist: jsonschema>=4.0
Requires-Dist: pyarrow<18.0,>=15.0
Requires-Dist: pyyaml<7,>=6.0
Provides-Extra: dev
Requires-Dist: mypy>=1.10; extra == 'dev'
Requires-Dist: pytest-cov>=5.0; extra == 'dev'
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: ruff>=0.4; extra == 'dev'
Description-Content-Type: text/markdown

# Agent Diagnostics

A behavioral taxonomy, annotation framework, and shareable dataset backend for analyzing why coding agents succeed or fail on benchmark tasks.

**11,995 trials. 4 models. 61 benchmarks. 40 failure categories across 11 dimensions.**

## What this does

Coding agents pass benchmarks for the wrong reasons and fail them for the wrong reasons. Pass/fail scores hide reward hacking, flawed tests, and lucky patches. This project extracts structured signals from agent trajectories, classifies failure modes, and provides a queryable dataset backend so you can actually understand what happened.

## Install

```bash
pip install agent-diagnostics
```

## Quick start: label your own traces

Point the tool at a directory of agent trial outputs. Each trial needs a `result.json` (and optionally a `trajectory.json`) in its own directory:

```
my-runs/
  trial-001/
    result.json          # must have: task_id or task_name, reward/score
    trajectory.json      # optional: list of agent steps with tool calls
  trial-002/
    result.json
    agent/
      trajectory.json    # also checks agent/ subdirectory
  ...
```

Then run the pipeline:

```bash
# Step 1: Extract signals from all trial directories into JSONL
agent-diagnostics ingest --runs-dir my-runs/ --output data/signals.jsonl

# Step 2: Classify failure modes (heuristic rules, instant)
agent-diagnostics annotate --signals data/signals.jsonl --output data/heuristic.json

# Step 3: LLM-assisted classification (reads actual trajectories)
agent-diagnostics llm-annotate --signals data/signals.jsonl --output data/llm.json \
    --sample-size 50 --model haiku --backend batch

# Step 4: Query results with SQL
agent-diagnostics query "SELECT model, count(*) as n,
  round(avg(CASE WHEN passed THEN 1.0 ELSE 0.0 END)*100, 1) as pass_rate
  FROM signals GROUP BY model ORDER BY pass_rate DESC"
```

### What `result.json` looks like

The tool reads standard benchmark harness output. At minimum:

```json
{
  "task_name": "django__django-16527",
  "reward": 1.0,
  "agent_info": { "name": "claude-code" },
  "started_at": "2026-01-15T10:00:00Z",
  "finished_at": "2026-01-15T10:04:32Z"
}
```

Works out of the box with SWE-bench, OpenHands, and similar harnesses that write `result.json` per trial.

### What you get back

`signals.jsonl` — one row per trial with 31 structured fields:

```json
{
  "trial_id": "726c23ceb1ce7cf2...",
  "task_id": "django__django-16527",
  "model": "claude-sonnet-4-6",
  "reward": 1.0,
  "passed": true,
  "total_turns": 57,
  "tool_calls_total": 32,
  "search_tool_calls": 8,
  "edit_tool_calls": 12,
  "unique_files_read": 5,
  "unique_files_edited": 2,
  "duration_seconds": 272.0,
  "exception_crashed": false,
  "tool_call_sequence": ["read_file", "search", "edit_file", "..."],
  "..."
}
```

## Included dataset

The repo ships a Parquet export of 11,995 trials in `data/export/` (~1.5 MB):

| Model             | Trials | Pass rate |
| ----------------- | ------ | --------- |
| Claude Haiku 4.5  | 6,443  | 79.1%     |
| Claude Sonnet 4.6 | 4,564  | 73.2%     |
| Claude Opus 4.6   | 677    | 84.5%     |
| Claude Opus 4.5   | 253    | 71.9%     |

Query it immediately after cloning:

```bash
agent-diagnostics query "SELECT model, count(*) as trials FROM signals GROUP BY model"

# Or load directly
python3 -c "import pandas as pd; print(pd.read_parquet('data/export/signals.parquet').describe())"
```

### Pre-built queries

Five analysis queries are in `docs/queries/`:

```bash
agent-diagnostics query "$(cat docs/queries/per_model_outcomes.sql)"
agent-diagnostics query "$(cat docs/queries/benchmark_model_matrix.sql)"
agent-diagnostics query "$(cat docs/queries/annotation_cooccurrence.sql)"
agent-diagnostics query "$(cat docs/queries/tool_sequence_patterns.sql)"
agent-diagnostics query "$(cat docs/queries/eval_subset_export.sql)"
```

### Export your own Parquet

```bash
agent-diagnostics export --format parquet --out data/export/
```

Produces zstd-compressed Parquet with native `list<string>` columns, plus `MANIFEST.json` with schema version, row counts, SHA256 checksums, and source commit.

### Schema introspection

```bash
agent-diagnostics db schema --format markdown
agent-diagnostics db schema --format json
```

## Taxonomy

40 categories across 11 behavioral dimensions (v3):

| Dimension     | Categories | Examples                                                                           |
| ------------- | ---------- | ---------------------------------------------------------------------------------- |
| Retrieval     | 3          | `retrieval_failure`, `query_churn`, `context_window_overflow`                      |
| ToolUse       | 4          | `wrong_tool_selection`, `tool_argument_error`, `tool_misinterpretation`            |
| Reasoning     | 3          | `decomposition_failure`, `incorrect_root_cause`, `overconfident_diagnosis`         |
| Execution     | 5          | `edit_verify_loop_failure`, `syntax_error_loop`, `incomplete_implementation`       |
| Environment   | 4          | `exception_crash`, `rate_limited_run`, `environment_mismatch`                      |
| Faithfulness  | 2          | `task_misunderstanding`, `scope_drift`                                             |
| Metacognition | 5          | `premature_submission`, `excessive_exploration`, `sunk_cost_persistence`           |
| Integrity     | 2          | `test_file_modification`, `reward_hacking`                                         |
| Safety        | 3          | `data_exfiltration_attempt`, `sandbox_escape`, `destructive_operation`             |
| Strategy      | 6          | `success_via_code_nav`, `success_via_semantic_search`, `success_via_decomposition` |
| Observability | 3          | `insufficient_provenance`, `task_ambiguity`, `unreproducible_result`               |

```python
from agent_diagnostics import load_taxonomy, valid_category_names

taxonomy = load_taxonomy()
names = valid_category_names()
```

## Annotation pipeline

### Heuristic annotation

23 rule-based classifiers that fire on signal patterns (e.g., `retrieval_failure` when search calls = 0 and files read = 0):

```bash
agent-diagnostics annotate --signals data/signals.jsonl --output heuristic.json
```

### LLM annotation

Reads actual trajectories and classifies with Claude. Supports `claude-code`, `api`, and `batch` (Message Batches API, 50% cheaper) backends:

```bash
agent-diagnostics llm-annotate --signals data/signals.jsonl --output llm.json \
    --sample-size 50 --model haiku --backend batch
```

### Ensemble (heuristic + classifier)

Two-tier: heuristic rules for structural categories, trained classifier for learned categories:

```bash
agent-diagnostics train --labels llm.json --signals data/signals.jsonl --output model.json
agent-diagnostics ensemble --signals data/signals.jsonl --model model.json --output ensemble.json
```

### Annotation store

All annotation writers can route through a shared `AnnotationStore` that enforces primary key uniqueness, atomic writes, and version consistency:

```bash
agent-diagnostics annotate --signals data/signals.jsonl --output heuristic.json \
    --annotations-out data/annotations.jsonl

agent-diagnostics ensemble --signals data/signals.jsonl --model model.json \
    --output ensemble.json --annotations-out data/annotations.jsonl
```

The store uses PK `(trial_id, category_name, annotator_type, annotator_identity, taxonomy_version)` so multiple annotators (heuristic, LLM, classifier, ensemble, human) can label the same trial without collision.

## Data formats

### signals.jsonl

One JSON object per line. 31 fields per trial including `trial_id` (stable SHA256-based), model, benchmark, reward, pass/fail, tool call counts/sequences, files read/edited, duration, error counts, and patch size.

### annotations.jsonl

Narrow-tall schema — one row per (trial, category, annotator):

| Column               | Description                                              |
| -------------------- | -------------------------------------------------------- |
| `trial_id`           | SHA256-based stable identifier                           |
| `category_name`      | Taxonomy category (e.g., `retrieval_failure`)            |
| `confidence`         | 0.0 to 1.0                                               |
| `evidence`           | Free-text explanation                                    |
| `annotator_type`     | `heuristic`, `llm`, `classifier`, `ensemble`, or `human` |
| `annotator_identity` | e.g., `heuristic:rule-engine`, `llm:haiku-4`             |
| `taxonomy_version`   | e.g., `3.0.0`                                            |
| `annotated_at`       | ISO 8601 timestamp                                       |

## CLI reference

```
agent-diagnostics ingest           Ingest trial directories into signals.jsonl
agent-diagnostics extract          Extract signals from a single trial directory
agent-diagnostics annotate         Heuristic annotation
agent-diagnostics llm-annotate     LLM-assisted annotation
agent-diagnostics train            Train per-category classifiers
agent-diagnostics predict          Predict with trained classifier
agent-diagnostics ensemble         Two-tier ensemble annotation
agent-diagnostics report           Generate Markdown + JSON report
agent-diagnostics validate         Validate annotations against schema
agent-diagnostics query            Run SQL against the dataset (DuckDB)
agent-diagnostics export           Export to Parquet with MANIFEST.json
agent-diagnostics manifest refresh Rewrite manifests.jsonl
agent-diagnostics db schema        Inspect table schemas
```

## Contributing

We welcome contributions of agent trace data, new benchmark integrations, taxonomy refinements, and annotation tooling. If you're building evaluation infrastructure for coding agents, we'd love to talk.

## License

Apache-2.0
