Metadata-Version: 2.4
Name: agentloss
Version: 0.0.8
Summary: Measure the real-world error rate and dollar cost of an AI agent's decisions. OpenTelemetry-native.
License: Apache-2.0
Project-URL: Homepage, https://agentloss.com
Project-URL: Repository, https://github.com/ADMT-ai/agentloss
Project-URL: Organization, https://admt.ai
Keywords: ai-agents,agent-evals,production-evals,llm,reliability,error-rate,cost-of-errors,dollar-loss,ground-truth,outcomes,opentelemetry,openinference,agent-observability
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: System :: Monitoring
Classifier: Development Status :: 3 - Alpha
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Provides-Extra: claude
Requires-Dist: anthropic>=0.40; extra == "claude"
Provides-Extra: mcp
Requires-Dist: mcp>=1.0; extra == "mcp"
Provides-Extra: phoenix
Requires-Dist: arize-phoenix-client>=1.0; extra == "phoenix"
Requires-Dist: pandas>=2.0; extra == "phoenix"
Provides-Extra: langfuse
Requires-Dist: langfuse>=2.0; extra == "langfuse"
Provides-Extra: braintrust
Requires-Dist: braintrust>=0.0.1; extra == "braintrust"
Provides-Extra: stripe
Requires-Dist: stripe>=7.0; extra == "stripe"
Dynamic: license-file

# agentloss

**Your eval tool tells you your AI agent's hallucination rate. `agentloss` tells you what it
costs.** An OpenTelemetry-native SDK that measures the real-world **error rate and dollar
loss** of an AI agent's decisions — by capturing its consequential actions in-process and
joining them to ground truth (real resolved outcomes, not an offline labeled set).

Eval and observability tools score *quality proxies*: LLM-as-judge scores, hallucination rate,
task completion. `agentloss` answers the question those proxies leave open:
*what are my agent's mistakes costing, and is it safe to trust with more autonomy?*

> Part of **ADMT** (Automated Decision-Making Technology) — [admt.ai](https://admt.ai).

## Install

```bash
pip install agentloss
```

Optional integrations ship as extras (`claude`, `mcp`, `phoenix`, `langfuse`, `braintrust`,
`stripe`); for example, `pip install "agentloss[phoenix]"`.

## Quickstart

Instrument only the **consequential action** — the tool call that moves money or commits the
business — not every LLM call.

```python
from agentloss import decision, report_outcome, Decision

@decision                                     # bare decorator; the returned Decision is recorded
def approve_payment(invoice):
    action = run_matching(invoice)            # your own logic: "approve" | "hold" | "reject"
    return Decision(action=action, value_at_risk_usd=invoice.total,
                    business_key=invoice.number, use_case="ap_3way_match")

# when the outcome resolves (correction, dispute, chargeback, audit, human review),
# report it under the same business_key the Decision was recorded with:
report_outcome(business_key="INV-1", ground_truth="duplicate-should-block",
               source="recovery_audit", realized_loss_usd=14200)
```

**Already have the ground truth?** This is the common case (a disputes or chargebacks table),
and it is the default path: each reported outcome is a census observation that counts toward
the error rate, with no flags needed. Join the whole table in one call:

```python
from agentloss import record_outcomes

record_outcomes([
    {"business_key": "INV-1", "ground_truth": "reject", "source": "chargeback",
     "realized_loss_usd": 80.0},
    {"business_key": "INV-2", "ground_truth": "approve", "source": "dispute"},  # a CORRECT one
])
```

Report the outcomes that **agreed** with the agent too, not only the disputes: the rate's
denominator is *reported approvals*, so reporting only errors makes the error rate read ~100%.
`source` must be one of `recovery_audit`, `dispute`, `chargeback`, `refund`, `human_queue`,
or `verification_agent`.
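
To see why the denominator matters, here is a minimal sketch in plain Python. It does not use or reproduce the `agentloss` internals; the `error_rate` helper, the field names, and the invoice counts are all made up for illustration.

```python
# Illustrative only: plain Python, not the agentloss implementation.
def error_rate(outcomes):
    """Share of reported outcomes where ground truth disagrees with the agent."""
    if not outcomes:
        raise ValueError("no outcomes reported")
    errors = sum(1 for o in outcomes if o["agent_action"] != o["ground_truth"])
    return errors / len(outcomes)


# Hypothetical data: the agent approved 100 invoices; 4 later proved wrong.
all_outcomes = (
    [{"agent_action": "approve", "ground_truth": "reject"}] * 4
    + [{"agent_action": "approve", "ground_truth": "approve"}] * 96
)
# What you would report if you only uploaded the disputes table.
only_disputes = [o for o in all_outcomes if o["agent_action"] != o["ground_truth"]]

rate_full = error_rate(all_outcomes)      # 4 errors out of 100 reported
rate_biased = error_rate(only_disputes)   # 4 errors out of 4 reported

print(f"all outcomes reported: {rate_full:.0%}")
print(f"only errors reported:  {rate_biased:.0%}")
```

The same four mistakes read as 4% or 100% depending only on whether the correct approvals were reported alongside them.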

`agentloss` computes the error rate by segment (with confidence intervals), **realized +
expected dollar loss**, and the agent's incremental risk vs. a baseline. Raw prompts and
records stay inside your boundary; only derived metrics leave.
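
For intuition on "error rate by segment with confidence intervals", the sketch below uses a Wilson score interval. This is an assumption for illustration: the README does not say which interval method `agentloss` uses, and the helper names, the `segment` / `is_error` fields, and the counts are hypothetical.

```python
import math
from collections import defaultdict


def wilson_interval(errors, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = errors / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def error_rate_by_segment(rows):
    """Return {segment: (rate, ci_low, ci_high)} from per-decision rows."""
    counts = defaultdict(lambda: [0, 0])  # segment -> [errors, total]
    for row in rows:
        counts[row["segment"]][0] += int(row["is_error"])
        counts[row["segment"]][1] += 1
    return {
        seg: (e / n, *wilson_interval(e, n))
        for seg, (e, n) in counts.items()
    }


# Hypothetical resolved outcomes: 5 errors in 200 AP decisions, 2 in 20 refunds.
rows = (
    [{"segment": "ap_3way_match", "is_error": i < 5} for i in range(200)]
    + [{"segment": "refunds", "is_error": i < 2} for i in range(20)]
)
by_segment = error_rate_by_segment(rows)
for seg, (rate, low, high) in by_segment.items():
    print(f"{seg}: {rate:.1%} (95% CI {low:.1%} to {high:.1%})")
```

The small segment gets a much wider interval than the large one, which is why a per-segment rate is hard to act on without its interval.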

**Confirm the wiring** — `agentloss.doctor()` inspects the store and catches the silent
failures in plain language (outcomes reported but none counted, only-errors reported, a loss
source that won't be summed). Or from a shell: `agentloss doctor --json`.

## Works with your existing traces (Phoenix / Langfuse / Braintrust / OTel)

Already tracing your agent with OpenInference/OpenTelemetry? Don't re-instrument. Add a few
`agentloss.*` attributes to the consequential span, point agentloss at your spans, and it adds
the loss/outcome layer on top of what your tracer already emits:

```python
from agentloss import ingest_spans, sample_and_verify, print_report

ingest_spans(your_spans)       # OTel/OpenInference spans carrying agentloss.* attributes
sample_and_verify(verify_fn)   # Tier A: get a number with no external labels wired
print_report()                 # error rate by segment + dollar loss
```

See [`examples/from_spans.py`](examples/from_spans.py).

## How it works

- **Instrument consequential actions, not the whole agent.** The costly events are the handful
  of tool calls that move money or commit state.
- **Ground truth arrives late, from outside the agent**: a correction, dispute, audit result,
  or human review. Capture it via `report_outcome`, the human-review queue, and active sampling
  plus a verification agent. These are *real resolved outcomes*, not an offline dataset.
- **Honest statistics.** Monetary-unit sampling with a target verifier budget; two-phase
  calibration corrects a fallible verifier's bias back to truth (with confidence intervals).
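
The "monetary-unit sampling with a target verifier budget" idea above can be sketched as follows. This is one common variant (systematic selection with probability proportional to dollar value), shown under the assumption that it resembles what the SDK does; the function name, the dict fields, and the data are hypothetical, not the `agentloss` API.

```python
import random
from bisect import bisect_right
from itertools import accumulate


def monetary_unit_sample(decisions, budget, seed=0):
    """Pick up to `budget` decisions, each with probability proportional
    to its value_at_risk_usd (systematic monetary-unit sampling)."""
    values = [d["value_at_risk_usd"] for d in decisions]
    total = sum(values)
    if budget <= 0 or total <= 0:
        return []
    interval = total / budget                        # dollars per selection point
    start = random.Random(seed).uniform(0, interval)
    cumulative = list(accumulate(values))            # running dollar totals
    picked = []
    for k in range(budget):
        point = start + k * interval                 # the k-th "sampled dollar"
        idx = min(bisect_right(cumulative, point), len(decisions) - 1)
        if not picked or picked[-1] != idx:          # a large item can be hit twice
            picked.append(idx)
    return [decisions[i] for i in picked]


# Hypothetical decisions awaiting verification.
decisions = [
    {"business_key": "INV-1", "value_at_risk_usd": 100.0},
    {"business_key": "INV-2", "value_at_risk_usd": 50.0},
    {"business_key": "INV-3", "value_at_risk_usd": 10_000.0},
    {"business_key": "INV-4", "value_at_risk_usd": 25.0},
]
sample = monetary_unit_sample(decisions, budget=3)
print([d["business_key"] for d in sample])
```

Any decision worth at least one sampling interval is always selected, so the verifier budget concentrates on the decisions where a mistake costs the most.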

See [`docs/SDK-SPEC.md`](docs/SDK-SPEC.md) for the full API, `agentloss.*` semantic conventions,
and the pack/adapter model.
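
As a rough illustration of the two-phase calibration mentioned under "Honest statistics", the sketch below corrects a fallible verifier's error rate using a small audited subset (a simple difference estimator). The README does not state which estimator `agentloss` uses, so treat this as an assumption; the function name and the numbers are hypothetical, and confidence intervals are omitted for brevity.

```python
def calibrated_error_rate(verifier_flags, audited_pairs):
    """Correct a fallible verifier's error rate with an audited subset.

    verifier_flags: 0/1 verifier verdicts on every sampled decision (phase 1).
    audited_pairs:  (verifier_flag, true_flag) for the subset that also has
                    trusted ground truth (phase 2).
    """
    if not verifier_flags or not audited_pairs:
        raise ValueError("both phases need at least one observation")
    naive = sum(verifier_flags) / len(verifier_flags)
    # Average amount by which the verifier over-flags (or under-flags) errors.
    bias = sum(v - t for v, t in audited_pairs) / len(audited_pairs)
    return min(1.0, max(0.0, naive - bias))


# Hypothetical phase 1: the verifier flags 60 of 500 decisions as errors (12%).
verifier_flags = [1] * 60 + [0] * 440

# Hypothetical phase 2: 50 decisions audited against trusted ground truth.
audited_pairs = (
    [(1, 1)] * 4      # verifier and truth agree: error
    + [(1, 0)] * 4    # verifier false alarms
    + [(0, 1)] * 1    # verifier misses
    + [(0, 0)] * 41   # verifier and truth agree: correct
)

corrected = calibrated_error_rate(verifier_flags, audited_pairs)
print(f"naive: {sum(verifier_flags) / len(verifier_flags):.1%}, corrected: {corrected:.1%}")
```

Here the verifier over-flags by 6 percentage points on the audited subset, so the naive 12% is pulled down to roughly 6%.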

## Try the demo

An oracle-validated harness that seeds an accounts-payable environment with *known* errors and
checks that `agentloss` recovers the true error rate and dollar loss:

```bash
python -m dogfood.run                                  # deterministic mock, no deps
AGENTLOSS_VERIFIER_LLM=claude ANTHROPIC_API_KEY=... python -m dogfood.run
```

## For AI coding agents

`agentloss` is built to be discovered and wired by coding agents:
[`llms.txt`](llms.txt), the [`instrument-agent-reliability`](skills/instrument-agent-reliability/SKILL.md)
skill, the [`AGENTS.md`](AGENTS.md) rule, and an [MCP server](mcp/agentloss_mcp.py)
(`how_to_instrument`, `explain_attribute`, `validate_integration`).

## License

Apache-2.0.
