Metadata-Version: 2.4
Name: agent-diagnostics
Version: 0.5.0
Summary: Agent Reliability Observatory — a behavioral taxonomy and annotation framework for analyzing why coding agents succeed or fail.
Project-URL: Repository, https://github.com/sourcegraph/agent-observatory
Author: CodeScaleBench Team
License-Expression: Apache-2.0
Keywords: agents,annotation,benchmarks,reliability,taxonomy
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Requires-Dist: pyyaml<7,>=6.0
Provides-Extra: dev
Requires-Dist: pytest-cov>=5.0; extra == 'dev'
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: ruff>=0.4; extra == 'dev'
Provides-Extra: llm
Requires-Dist: anthropic>=0.30; extra == 'llm'
Provides-Extra: validation
Requires-Dist: jsonschema>=4.0; extra == 'validation'
Description-Content-Type: text/markdown

# Agent Reliability Observatory

A behavioral taxonomy and annotation framework for analyzing why coding agents succeed or fail on benchmark tasks.

## Install

```bash
pip install agent-diagnostics
```

## Quick Start

```python
from agent_diagnostics import (
    load_taxonomy,
    valid_category_names,
    validate_annotation_categories,
)

# Load the 23-category behavioral taxonomy
taxonomy = load_taxonomy()
print(f"{len(taxonomy['categories'])} categories")

# Get valid category names
names = valid_category_names()
print(names)

# Validate an annotation against the taxonomy's category names

annotation = {
    "categories": [
        {"name": "retrieval_failure", "confidence": 0.9},
    ]
}
validate_annotation_categories(annotation)  # raises ValueError if invalid
```

## Taxonomy

Each taxonomy category is assigned one of three polarities:

| Polarity | Count | Purpose                                         |
| -------- | ----- | ----------------------------------------------- |
| failure  | 16    | Explains why the agent failed or underperformed |
| success  | 5     | Explains which strategy led to success          |
| neutral  | 2-3   | Contextual factors that affect interpretation   |
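
The per-polarity counts in the table can be checked with a short tally. The sketch below is self-contained and runs on a hypothetical three-entry stand-in rather than the bundled taxonomy; it assumes each category is a mapping with `name` and `polarity` keys, as in the v1 layout, and every category name other than `retrieval_failure` is made up for illustration.

```python
from collections import Counter

# Hypothetical stand-in for the dict returned by load_taxonomy().
# Only "retrieval_failure" appears in this README; the other names
# and all polarity assignments here are illustrative assumptions.
taxonomy = {
    "categories": [
        {"name": "retrieval_failure", "polarity": "failure"},
        {"name": "example_success_strategy", "polarity": "success"},
        {"name": "example_context_factor", "polarity": "neutral"},
    ]
}


def count_by_polarity(taxonomy):
    """Tally categories per polarity value."""
    return Counter(category["polarity"] for category in taxonomy["categories"])


counts = count_by_polarity(taxonomy)
for polarity in ("failure", "success", "neutral"):
    print(f"{polarity}: {counts[polarity]}")
```

If the real taxonomy follows the same shape, passing the result of `load_taxonomy()` to `count_by_polarity` should reproduce the counts above.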

## Taxonomy Versions

- **v1** (flat): A single flat list of categories, each with `name`, `description`, `polarity`, `detection_hints`, and `examples`
- **v2** (hierarchical): Categories grouped by dimension (Retrieval, Execution, etc.)

```python
from agent_diagnostics.taxonomy import load_taxonomy, _package_data_path

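# Load v2 (hierarchical dimensions).
# Note: _package_data_path is a private helper (leading underscore),
# so it may change between releases.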
v2 = load_taxonomy(_package_data_path("taxonomy_v2.yaml"))
```
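
Code written against the flat v1 layout may need the v2 hierarchy flattened first. The following is a minimal sketch under a loud assumption: it supposes v2 has a top-level `dimensions` list whose entries carry a `name` and a nested `categories` list. That layout is hypothetical and not confirmed by this README, so check the actual `taxonomy_v2.yaml` before relying on it.

```python
def flatten_dimensions(v2):
    """Flatten a dimension-grouped taxonomy into one list of categories.

    Assumes (hypothetically) the shape:
        {"dimensions": [{"name": ..., "categories": [{...}, ...]}, ...]}
    Each returned category is a copy tagged with its parent dimension name.
    """
    flat = []
    for dimension in v2.get("dimensions", []):
        for category in dimension.get("categories", []):
            flat.append({**category, "dimension": dimension["name"]})
    return flat


# Illustrative data only; the real dimension contents may differ.
example_v2 = {
    "dimensions": [
        {
            "name": "Retrieval",
            "categories": [{"name": "retrieval_failure", "polarity": "failure"}],
        },
        {"name": "Execution", "categories": []},
    ]
}

for category in flatten_dimensions(example_v2):
    print(category["dimension"], category["name"])
```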

## Annotation Schema

The package includes a JSON Schema for machine-readable annotations:

```python
from agent_diagnostics.taxonomy import get_schema_path

schema_path = get_schema_path()
```
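
One way to use the schema is to validate an annotation with the `jsonschema` library, which the `validation` extra installs. The helper below is a sketch, not part of the package: `schema_errors` is a hypothetical name, and it assumes the file at the schema path is plain JSON.

```python
import json
from pathlib import Path


def schema_errors(annotation, schema_path):
    """Return validation error messages; an empty list means valid."""
    # Imported lazily because jsonschema is only present when the
    # optional `validation` extra is installed.
    from jsonschema.validators import validator_for

    # Assumption: the schema file is plain JSON.
    schema = json.loads(Path(schema_path).read_text(encoding="utf-8"))

    # Pick the validator class matching the schema's declared draft.
    validator = validator_for(schema)(schema)
    return [error.message for error in validator.iter_errors(annotation)]
```

With `pip install "agent-diagnostics[validation]"`, a call such as `schema_errors(annotation, get_schema_path())` would collect every violation instead of stopping at the first one.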

## Exemplars

25 hand-annotated examples covering all 23 taxonomy categories are bundled with the package under `exemplars/`.

## License

Apache-2.0
