Metadata-Version: 2.4
Name: agent-diagnostics
Version: 0.6.2
Summary: Agent Reliability Observatory — a behavioral taxonomy and annotation framework for analyzing why coding agents succeed or fail.
Project-URL: Repository, https://github.com/sjarmak/agent-diagnostics
Author: CodeScaleBench Team
License-Expression: Apache-2.0
License-File: LICENSE
Keywords: agents,annotation,benchmarks,reliability,taxonomy
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Requires-Dist: anthropic<1.0,>=0.30
Requires-Dist: duckdb<2.0,>=1.0
Requires-Dist: jsonschema>=4.0
Requires-Dist: pyarrow<18.0,>=15.0
Requires-Dist: pyyaml<7,>=6.0
Provides-Extra: dev
Requires-Dist: mypy>=1.10; extra == 'dev'
Requires-Dist: pytest-cov>=5.0; extra == 'dev'
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: ruff>=0.4; extra == 'dev'
Description-Content-Type: text/markdown

# Agent Diagnostics

A behavioral taxonomy, annotation framework, and shareable dataset backend for analyzing why coding agents succeed or fail on benchmark tasks.

**11,995 trials. 4 models. 61 benchmarks. 40 failure categories across 11 dimensions.**

## What this does

Coding agents can pass benchmarks for the wrong reasons and fail them for the wrong reasons. Pass/fail scores hide reward hacking, flawed tests, and lucky patches. This project extracts structured signals from agent trajectories, classifies failure modes, and provides a queryable dataset backend so you can see why each trial passed or failed.

```
Trial directories (result.json + trajectory.json)
  -> agent-diagnostics ingest        Extract 31 structured signals per trial
  -> agent-diagnostics annotate      Heuristic failure classification
  -> agent-diagnostics llm-annotate  LLM-assisted classification
  -> agent-diagnostics ensemble      Heuristic + classifier ensemble
  -> agent-diagnostics export        Parquet + MANIFEST.json share artifact
  -> agent-diagnostics query         SQL via DuckDB, zero server
```

## Install

```bash
pip install agent-diagnostics
```

This installs DuckDB, PyArrow, the Anthropic SDK, jsonschema, and PyYAML, and requires Python 3.10 or later. For development:

```bash
pip install "agent-diagnostics[dev]"   # pytest, ruff, mypy, coverage (quoted so shells like zsh do not expand the brackets)
```

## Dataset

The current corpus covers 4 Claude models across 61 benchmark suites:

| Model             | Trials | Pass rate |
| ----------------- | ------ | --------- |
| Claude Haiku 4.5  | 6,443  | 79.1%     |
| Claude Sonnet 4.6 | 4,564  | 73.2%     |
| Claude Opus 4.6   | 677    | 84.5%     |
| Claude Opus 4.5   | 253    | 71.9%     |

Each trial carries 31 structured signals including tool call sequences, files read/edited, duration, error counts, patch size, and a stable content-addressed `trial_id`.
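
As a rough illustration of what a content-addressed identifier means, the sketch below hashes a canonical JSON encoding of a few identifying fields with SHA256. The function name `compute_trial_id` and the fields `model`, `benchmark`, and `task` are assumptions made for this example; the inputs the package actually hashes are not documented here.

```python
import hashlib
import json


def compute_trial_id(identity: dict) -> str:
    """Return a SHA256 hex digest of a canonical JSON encoding.

    Hypothetical helper: the real set of hashed fields is an assumption.
    """
    # sort_keys and fixed separators make the encoding independent of key order
    canonical = json.dumps(identity, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# The same content in a different key order yields the same id.
id_a = compute_trial_id({"model": "model-a", "benchmark": "demo", "task": "t1"})
id_b = compute_trial_id({"task": "t1", "benchmark": "demo", "model": "model-a"})
# Different content yields a different id.
id_c = compute_trial_id({"model": "model-a", "benchmark": "demo", "task": "t2"})
```

Because the id depends only on content, re-ingesting the same trial produces the same key, which is what lets annotations from different runs refer to the same trial.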

### Query the dataset

```bash
# Pass rates by model
agent-diagnostics query "SELECT model, count(*) as trials,
  round(avg(CASE WHEN passed THEN 1.0 ELSE 0.0 END)*100, 1) as pass_rate
  FROM signals GROUP BY model ORDER BY pass_rate DESC"

# Failure analysis
agent-diagnostics query "SELECT model, count(*) FROM signals WHERE passed = false GROUP BY model"

# Run any of the 5 committed queries
agent-diagnostics query "$(cat docs/queries/tool_sequence_patterns.sql)"
agent-diagnostics query "$(cat docs/queries/per_model_outcomes.sql)"
```
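
If you prefer to work with the JSONL directly, the first query above can be approximated in plain Python. This is a minimal sketch that assumes each line carries the `model` and `passed` fields used in the SQL; the sample rows are made up for illustration.

```python
import json
from collections import defaultdict


def pass_rates(jsonl_lines):
    """Compute the pass rate (percent, one decimal) per model from JSONL lines."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        row = json.loads(line)
        totals[row["model"]] += 1
        if row["passed"]:
            passes[row["model"]] += 1
    return {m: round(100.0 * passes[m] / totals[m], 1) for m in totals}


# Illustrative rows, not real data from the corpus.
sample = [
    '{"model": "model-a", "passed": true}',
    '{"model": "model-a", "passed": false}',
    '{"model": "model-b", "passed": true}',
]
rates = pass_rates(sample)
```

For a real file, pass an open file handle instead of the `sample` list, since iterating a file yields one line at a time.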

### Export as Parquet

```bash
# 21.87 MB JSONL -> ~1 MB zstd Parquet
agent-diagnostics export --format parquet --out data/export/

# Readable in pandas, Polars, R, DuckDB, any Arrow tool (pandas is not a dependency; install it separately)
python3 -c "import pandas as pd; print(pd.read_parquet('data/export/signals.parquet').shape)"
```

The export produces `signals.parquet`, `annotations.parquet`, `manifests.parquet`, and a `MANIFEST.json` with schema version, taxonomy version, row counts, SHA256 checksums, and source commit.
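
The checksums make it possible to confirm that a shared export was not altered or truncated in transit. The sketch below shows one way to do that check. It assumes a `checksums` key in `MANIFEST.json` that maps file names to SHA256 hex digests; that layout is hypothetical, so adjust the lookup to the manifest your export actually contains.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large exports are not loaded into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_export(export_dir: Path) -> dict:
    """Return {file name: True/False} for every entry under the assumed 'checksums' key."""
    manifest = json.loads((export_dir / "MANIFEST.json").read_text())
    return {
        name: sha256_of(export_dir / name) == expected
        for name, expected in manifest["checksums"].items()
    }


# Self-contained demo with a placeholder file instead of a real Parquet export.
with tempfile.TemporaryDirectory() as tmp:
    export_dir = Path(tmp)
    (export_dir / "signals.parquet").write_bytes(b"placeholder bytes")
    manifest = {"checksums": {"signals.parquet": sha256_of(export_dir / "signals.parquet")}}
    (export_dir / "MANIFEST.json").write_text(json.dumps(manifest))
    results = verify_export(export_dir)

    # Tampering with the file after the manifest is written is detected.
    (export_dir / "signals.parquet").write_bytes(b"tampered")
    tampered = verify_export(export_dir)
```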

### Schema introspection

```bash
agent-diagnostics db schema --format markdown
agent-diagnostics db schema --format json
```

## Taxonomy

40 categories across 11 behavioral dimensions (v3):

| Dimension     | Categories | Examples                                                                           |
| ------------- | ---------- | ---------------------------------------------------------------------------------- |
| Retrieval     | 3          | `retrieval_failure`, `query_churn`, `context_window_overflow`                      |
| ToolUse       | 4          | `wrong_tool_selection`, `tool_argument_error`, `tool_misinterpretation`            |
| Reasoning     | 3          | `decomposition_failure`, `incorrect_root_cause`, `overconfident_diagnosis`         |
| Execution     | 5          | `edit_verify_loop_failure`, `syntax_error_loop`, `incomplete_implementation`       |
| Environment   | 4          | `exception_crash`, `rate_limited_run`, `environment_mismatch`                      |
| Faithfulness  | 2          | `task_misunderstanding`, `scope_drift`                                             |
| Metacognition | 5          | `premature_submission`, `excessive_exploration`, `sunk_cost_persistence`           |
| Integrity     | 2          | `test_file_modification`, `reward_hacking`                                         |
| Safety        | 3          | `data_exfiltration_attempt`, `sandbox_escape`, `destructive_operation`             |
| Strategy      | 6          | `success_via_code_nav`, `success_via_semantic_search`, `success_via_decomposition` |
| Observability | 3          | `insufficient_provenance`, `task_ambiguity`, `unreproducible_result`               |

```python
from agent_diagnostics import load_taxonomy, valid_category_names

taxonomy = load_taxonomy()
names = valid_category_names()
```

## Annotation pipeline

### Heuristic annotation

23 rule-based classifiers fire on signal patterns. For example, `retrieval_failure` fires when a trial has zero search calls and zero files read:

```bash
agent-diagnostics annotate --signals data/signals.json --output heuristic.json
```
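
To make the idea of a rule concrete, here is a minimal sketch of what the `retrieval_failure` example might look like as a function. The signal field names `search_calls` and `files_read`, the function name, and the confidence value are assumptions for illustration, not the package's actual rule engine.

```python
from typing import Optional


def retrieval_failure_rule(signals: dict) -> Optional[dict]:
    """Fire when the trial made no search calls and read no files.

    Field names and the 0.8 confidence are hypothetical.
    """
    if signals.get("search_calls", 0) == 0 and signals.get("files_read", 0) == 0:
        return {
            "category_name": "retrieval_failure",
            "confidence": 0.8,
            "evidence": "no search calls and no files read",
        }
    return None  # rule does not apply to this trial


fired = retrieval_failure_rule({"search_calls": 0, "files_read": 0})
skipped = retrieval_failure_rule({"search_calls": 3, "files_read": 5})
```

Rules of this shape are cheap to run over the whole corpus and easy to audit, because the evidence string states exactly which signal pattern triggered the label.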

### LLM annotation

Reads the full trajectories and classifies them with Claude. Supports the `claude-code`, `api`, and `batch` backends; `batch` uses the Message Batches API and costs 50% less:

```bash
agent-diagnostics llm-annotate --signals data/signals.json --output llm.json \
    --sample-size 50 --model haiku --backend batch
```

### Ensemble (heuristic + classifier)

A two-tier approach: heuristic rules handle structural categories, and a trained classifier handles learned categories:

```bash
agent-diagnostics train --labels llm.json --signals data/signals.json --output model.json
agent-diagnostics ensemble --signals data/signals.json --model model.json --output ensemble.json
```
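
The two-tier idea can be sketched in a few lines of pure Python. In this sketch, heuristic labels are taken as-is and a logistic-regression score fills in the remaining categories when it clears a threshold. The function names, the model layout (`weights` and `bias`), the feature name `tool_calls`, and the 0.5 threshold are all assumptions; the actual split between structural and learned categories is defined by the package, not by this example.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def classifier_score(weights: dict, bias: float, features: dict) -> float:
    """Logistic-regression probability from a sparse weight dict."""
    z = bias + sum(weights.get(name, 0.0) * value for name, value in features.items())
    return sigmoid(z)


def ensemble(heuristic_labels: dict, classifier_models: dict, features: dict,
             threshold: float = 0.5) -> dict:
    """Tier 1: keep heuristic labels. Tier 2: add classifier labels above threshold."""
    labels = dict(heuristic_labels)
    for category, model in classifier_models.items():
        if category in labels:
            continue  # a heuristic already decided this category
        probability = classifier_score(model["weights"], model["bias"], features)
        if probability >= threshold:
            labels[category] = probability
    return labels


# Illustrative inputs, not trained values.
heuristics = {"exception_crash": 1.0}
models = {"premature_submission": {"weights": {"tool_calls": -0.5}, "bias": 2.0}}
result = ensemble(heuristics, models, {"tool_calls": 2.0})
```

Here the classifier sees `z = 2.0 - 0.5 * 2.0 = 1.0`, so the category is added with a probability of `sigmoid(1.0)`, which is above the threshold.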

### Annotation store

All annotation writers can route through a shared `AnnotationStore` that enforces primary key uniqueness, atomic writes, and version consistency:

```bash
agent-diagnostics annotate --signals data/signals.json --output heuristic.json \
    --annotations-out data/annotations.jsonl

agent-diagnostics ensemble --signals data/signals.json --model model.json \
    --output ensemble.json --annotations-out data/annotations.jsonl
```

The store uses PK `(trial_id, category_name, annotator_type, annotator_identity, taxonomy_version)` so multiple annotators (heuristic, LLM, classifier, ensemble, human) can label the same trial without collision.
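
The effect of the composite key can be shown with an in-memory dict keyed on the five PK fields. This is only a sketch of the keying idea: the `upsert` helper and its last-write-wins behavior are assumptions, and it omits the atomic writes and version checks that the real `AnnotationStore` enforces.

```python
PK_FIELDS = (
    "trial_id",
    "category_name",
    "annotator_type",
    "annotator_identity",
    "taxonomy_version",
)


def upsert(store: dict, row: dict) -> None:
    """Insert or overwrite a row under its composite primary key (hypothetical policy)."""
    key = tuple(row[field] for field in PK_FIELDS)
    store[key] = row


store = {}
base = {"trial_id": "abc123", "category_name": "retrieval_failure", "taxonomy_version": "3.0.0"}

# Two different annotators label the same trial and category: two distinct rows.
upsert(store, {**base, "annotator_type": "heuristic",
               "annotator_identity": "heuristic:rule-engine", "confidence": 0.8})
upsert(store, {**base, "annotator_type": "llm",
               "annotator_identity": "llm:haiku-4", "confidence": 0.6})

# The same annotator writing again hits the same key, so the row count stays at two.
upsert(store, {**base, "annotator_type": "heuristic",
               "annotator_identity": "heuristic:rule-engine", "confidence": 0.9})
row_count = len(store)
```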

## CLI reference

```
agent-diagnostics extract          Extract signals from trial directories
agent-diagnostics ingest           Filter -> extract -> enrich -> write JSONL pipeline
agent-diagnostics annotate         Heuristic annotation
agent-diagnostics llm-annotate     LLM-assisted annotation
agent-diagnostics train            Train per-category classifiers
agent-diagnostics predict          Predict with trained classifier
agent-diagnostics ensemble         Two-tier ensemble annotation
agent-diagnostics report           Generate Markdown + JSON report
agent-diagnostics validate         Validate annotations against schema
agent-diagnostics query            Run SQL against the dataset (DuckDB)
agent-diagnostics export           Export to Parquet with MANIFEST.json
agent-diagnostics manifest refresh Rewrite manifests.jsonl
agent-diagnostics db schema        Inspect table schemas
```

## Data formats

### signals.jsonl

One JSON object per line. 31 fields per trial including `trial_id` (stable SHA256-based), model, benchmark, reward, pass/fail, tool call counts/sequences, files read/edited, duration, error counts, and patch size.

### annotations.jsonl

Narrow-tall schema — one row per (trial, category, annotator):

| Column               | Description                                              |
| -------------------- | -------------------------------------------------------- |
| `trial_id`           | SHA256-based stable identifier                           |
| `category_name`      | Taxonomy category (e.g., `retrieval_failure`)            |
| `confidence`         | 0.0 to 1.0                                               |
| `evidence`           | Free-text explanation                                    |
| `annotator_type`     | `heuristic`, `llm`, `classifier`, `ensemble`, or `human` |
| `annotator_identity` | e.g., `heuristic:rule-engine`, `llm:haiku-4`             |
| `taxonomy_version`   | e.g., `3.0.0`                                            |
| `annotated_at`       | ISO 8601 timestamp                                       |

### Parquet export

`agent-diagnostics export` produces zstd-compressed Parquet with native `list<string>` columns. The 21.87 MB JSONL corpus compresses to ~1 MB. Includes `MANIFEST.json` for provenance.

## Architecture

```
agent_diagnostics/
  signals.py           Signal extraction + trial_id computation (31 fields)
  types.py             TrialSignals TypedDict, CategoryAssignment, Annotation
  annotator.py         23-rule heuristic annotator
  classifier.py        Pure-Python logistic regression (no numpy)
  ensemble.py          Two-tier ensemble (heuristic + classifier)
  llm_annotator.py     LLM annotation (claude-code, API, batch backends)
  annotation_store.py  Narrow-tall JSONL store with PK enforcement + flock
  model_identity.py    Logical annotator identity resolution via models.yaml
  query.py             DuckDB query engine (JSONL + Parquet)
  export.py            Parquet export with MANIFEST.json
  report.py            Markdown + JSON report generator
  calibrate.py         Agreement analysis, Cohen's kappa
  blend_labels.py      LLM + heuristic label blending
  taxonomy.py          Taxonomy loader (v1/v2/v3 YAML)
  tool_registry.py     Injectable tool name registry
  cli.py               CLI entrypoint (14 subcommands)
```
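
For readers unfamiliar with the agreement statistic named for `calibrate.py`, Cohen's kappa compares observed agreement between two annotators with the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below is a generic pure-Python version for illustration, not the module's actual code, and the label lists are made up.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b) -> float:
    """Cohen's kappa for two equal-length label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Did each annotator assign a given category to four trials?
heuristic = [True, True, False, False]
llm = [True, False, False, False]
kappa = cohens_kappa(heuristic, llm)
```

In this example the annotators agree on 3 of 4 trials (p_o = 0.75) and chance agreement is (2*1 + 2*3) / 16 = 0.5, so kappa is 0.5.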

## Contributing

We welcome contributions of agent trace data, new benchmark integrations, taxonomy refinements, and annotation tooling. If you're building evaluation infrastructure for coding agents, we'd love to talk.

## License

Apache-2.0
