Metadata-Version: 2.4
Name: evalconfidence
Version: 0.1.1
Summary: Decision-grade statistics for AI evals: paired comparisons, cluster-aware uncertainty, and power analysis on top of existing eval frameworks.
Project-URL: Homepage, https://github.com/stephlinds/evalconfidence
Project-URL: Repository, https://github.com/stephlinds/evalconfidence
Project-URL: Issues, https://github.com/stephlinds/evalconfidence/issues
Author: Stephen Chang Lin
License-Expression: Apache-2.0
License-File: LICENSE
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Requires-Dist: numpy>=1.24
Requires-Dist: scipy>=1.10
Provides-Extra: demo
Requires-Dist: jupyter>=1.0; extra == 'demo'
Requires-Dist: matplotlib>=3.7; extra == 'demo'
Provides-Extra: dev
Requires-Dist: pandas>=2.0; extra == 'dev'
Requires-Dist: pytest>=7; extra == 'dev'
Provides-Extra: inspect
Requires-Dist: inspect-ai>=0.3; extra == 'inspect'
Description-Content-Type: text/markdown

# evalconfidence

[![CI](https://github.com/stephlinds/evalconfidence/actions/workflows/ci.yml/badge.svg)](https://github.com/stephlinds/evalconfidence/actions/workflows/ci.yml)
[![PyPI](https://img.shields.io/pypi/v/evalconfidence.svg)](https://pypi.org/project/evalconfidence/)
[![License: Apache 2.0](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](LICENSE)

**Decision-grade statistics for AI evals.** A companion layer — not another framework — that adds paired comparisons, dependence-aware uncertainty, and power analysis on top of the eval stack you already use (Inspect AI, or anything that can produce a dataframe).

> **Status: v0.1.0 — first public release.** The full statistics layer: `standard_error()`, `compare()`, `power()`, the adapters, CI on Python 3.10–3.13, and a demo notebook ([examples/demo.ipynb](examples/demo.ipynb)) on real GPQA Diamond results (198 items × 5 epochs × 2 models) that re-runs from the committed scores CSV with zero API keys.

## The gap, stated honestly

Existing frameworks *do* quantify uncertainty: Inspect AI computes per-eval standard errors via the CLT, offers bootstrapping for non-mean statistics, and — since v0.3.64 (Feb 2025) — supports clustered standard errors via `stderr(cluster=...)` when you declare a grouping field. What they give you is a defensible standard error on a **single** score. What none of them give you (checked against the Inspect changelog and DeepEval metrics list, June 2026):

- **Rigorous comparison between two systems**: paired tests that exploit shared items, a CI on the *difference*, McNemar for binary scores. The prevailing practice is still to eyeball two separate intervals, which amounts to an unpaired test and throws away the variance reduction that shared items provide.
- **Power / sample-size planning** before you spend the inference budget: how many items you need to detect the gap you care about, or the smallest gap your benchmark can detect at all.
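
The planning question in the second bullet can be sketched with the textbook normal-approximation formula for a paired design. This is a dependency-free illustration, not this package's `power()` implementation: the helper names `required_items` and `min_detectable_effect` and the `sd_diff` value are hypothetical.

```python
import math
from statistics import NormalDist


def required_items(sd_diff: float, mde: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Paired items needed to detect a mean difference of `mde` (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(((z_alpha + z_power) * sd_diff / mde) ** 2)


def min_detectable_effect(sd_diff: float, n_items: int, alpha: float = 0.05, power: float = 0.80) -> float:
    """Smallest mean difference detectable with `n_items` paired items (same approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return (z_alpha + z_power) * sd_diff / math.sqrt(n_items)


# Hypothetical SD of per-item score differences (A - B), e.g. estimated from a pilot run.
sd_diff = 0.40
n = required_items(sd_diff, mde=0.06)
print(n)
print(round(min_detectable_effect(sd_diff, n_items=200), 4))
```

The two functions invert each other, which is the "n ↔ MDE" direction switch: fix the gap and solve for items, or fix the items and solve for the gap.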

On dependence-aware uncertainty the gap is narrower and we say so: Inspect can cluster if you name the grouping up front. This package adds the **diagnostic** framing — naive and cluster-robust side by side with the inflation factor, epoch structure auto-detected — and works on results from any framework, not just Inspect tasks configured with custom metrics.
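
To make the "naive and cluster-robust side by side" idea concrete, here is a minimal sketch using one common cluster-robust estimator for a mean (residuals summed within clusters, with a G/(G-1) correction). It is an assumption that this resembles what `standard_error()` does internally; the function name and the data are hypothetical.

```python
import math
from collections import defaultdict


def naive_and_cluster_se(scores, cluster_ids):
    """Return (naive SE, cluster-robust SE) of the mean score."""
    n = len(scores)
    mean = sum(scores) / n

    # Naive i.i.d. SE: sample SD / sqrt(n), treating every observation as independent.
    naive = math.sqrt(sum((s - mean) ** 2 for s in scores) / (n - 1) / n)

    # Cluster-robust SE: sum residuals within each cluster first, so correlated
    # repeats of the same item are not counted as independent evidence.
    resid_sums = defaultdict(float)
    for s, c in zip(scores, cluster_ids):
        resid_sums[c] += s - mean
    g = len(resid_sums)
    cluster = math.sqrt(g / (g - 1) * sum(r ** 2 for r in resid_sums.values())) / n
    return naive, cluster


# Hypothetical data: 4 items x 3 epochs, answered mostly consistently across epochs.
scores = [1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0]
items = ["q1"] * 3 + ["q2"] * 3 + ["q3"] * 3 + ["q4"] * 3

naive, cluster = naive_and_cluster_se(scores, items)
inflation = cluster / naive
print(f"naive={naive:.4f} cluster={cluster:.4f} "
      f"inflation={inflation:.2f}x design_effect={inflation ** 2:.2f}")
```

When repeated epochs of an item agree with each other, the cluster-robust SE exceeds the naive one; the ratio is the inflation factor and its square is the design effect.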

That's the whole scope of this package: results in, rigorous comparison out. No model calls, no orchestration, no tracing.

### Capability matrix

| Capability | Existing frameworks | evalconfidence |
|---|---|---|
| Run / orchestrate / trace / score evals | Yes | No (consumes results) |
| Single-score standard error | Yes (Inspect: CLT, bootstrap) | Re-derives, reported side by side |
| Clustered standard errors | Partial (Inspect `stderr(cluster=...)`, declared field) | **Yes — auto-detected epochs, inflation factor, any framework** |
| **Paired comparison of two systems** | No | **Yes — paired-t / McNemar, CI on the difference** |
| **Power / minimum detectable effect** | No | **Yes — n ↔ MDE, pairing- and cluster-aware** |
| Judge debiasing (PPI) | No | Planned (v2) |
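
For binary scores, the McNemar option in the matrix looks only at discordant items: those where exactly one system was correct. A minimal sketch of the exact (binomial) form of the test follows; the function name and data are hypothetical, and the package may use a different variant.

```python
from math import comb


def mcnemar_exact(a_correct, b_correct):
    """Exact two-sided McNemar test on paired binary outcomes.

    Returns (a_only, b_only, p_value), where a_only counts items only A got right.
    """
    a_only = sum(1 for a, b in zip(a_correct, b_correct) if a and not b)
    b_only = sum(1 for a, b in zip(a_correct, b_correct) if b and not a)
    n = a_only + b_only
    if n == 0:
        return a_only, b_only, 1.0  # no discordant items, no evidence either way

    # Under H0 each discordant item favors A or B with probability 1/2.
    k = min(a_only, b_only)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return a_only, b_only, min(1.0, 2 * tail)


# Hypothetical per-item correctness for two systems on the same 12 items.
a = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0]
b = [1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0]
a_only, b_only, p_value = mcnemar_exact(a, b)
print(a_only, b_only, p_value)
```

Items both systems got right or both got wrong drop out entirely, which is why pairing on shared items matters.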

For the full technical argument — how dependence-blind SEs manufacture false wins at a real α of ~25–30%, how unpaired comparisons silently bury real improvements, and why underpowered evals cause *both* errors — see [docs/why-it-works.md](docs/why-it-works.md).

## How it works: the two-stage flow

This package never makes API calls — `model_id` is just a grouping label, never an endpoint. The flow has two stages, and the package only lives in the second:

1. **Generation (upstream, not this package).** An eval framework runs the model against the benchmark and grades outputs. This is where API calls, keys, and cost live. Inspect AI saves its own durable record automatically — a `.eval` log in `./logs/` with every prompt, response, and score per sample. A homegrown harness's CSV plays the same role.
2. **Analysis (this package).** An adapter reads that already-existing record into the normalized `ItemResult` rows — `from_inspect()` for `.eval` logs, `from_dataframe()` for anything tabular — and the statistics functions compute on those fixed numbers. No model is ever consulted again.

This separation is what makes analyses cheaply reproducible: pay for stage 1 once, keep the log/CSV, and re-run stage 2 forever for free.
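
As an illustration of stage 2 operating purely on a saved record, the sketch below parses a small CSV into typed rows with the standard library only. The `Row` class, its fields, and the column names are hypothetical stand-ins; the real `ItemResult` and adapters may look different.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:  # hypothetical stand-in; the real ItemResult fields may differ
    item_id: str
    model_id: str
    epoch: int
    score: float


# Stand-in for a saved stage-1 CSV; in practice this would be a file on disk.
SAVED_SCORES = """item_id,model_id,epoch,score
q1,model-a,1,1
q1,model-a,2,0
q2,model-a,1,1
q2,model-a,2,1
"""


def load_rows(handle) -> list[Row]:
    """Parse saved scores into typed rows; no model or API is involved."""
    return [
        Row(r["item_id"], r["model_id"], int(r["epoch"]), float(r["score"]))
        for r in csv.DictReader(handle)
    ]


rows = load_rows(io.StringIO(SAVED_SCORES))
mean_score = sum(r.score for r in rows) / len(rows)
print(len(rows), mean_score)
```

Because the input is a fixed file, re-running this gives the same numbers every time at no inference cost.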

**What gets saved:** stage-1 artifacts are saved by whoever produced them (Inspect does this automatically). Stage-2 outputs are returned as in-memory dataclasses (`SEResult`, ...) — print them or serialize with `dataclasses.asdict()`; the package deliberately doesn't persist analysis results, because the saved stage-1 record is the thing worth keeping and the statistics re-run in milliseconds.
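
If you do want to keep an analysis result, `dataclasses.asdict()` plus `json` is enough. The sketch below uses a hypothetical `ExampleResult` class as a stand-in for a result dataclass such as `SEResult`; its field names are assumptions, and the values are taken from the example output in this README.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ExampleResult:  # hypothetical stand-in for a result dataclass such as SEResult
    mean: float
    naive_se: float
    cluster_se: float
    n_clusters: int


result = ExampleResult(mean=0.6758, naive_se=0.0149, cluster_se=0.0276, n_clusters=198)

# asdict() converts the dataclass (recursively) into plain dicts and lists.
payload = json.dumps(asdict(result), indent=2)
restored = json.loads(payload)
print(payload)
```

If a real result holds NumPy integers or arrays, `json.dumps` may need a `default=` converter.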

## Quick example

```python
from evalconfidence import from_inspect, compare, power, standard_error

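# Paths are elided; each call reads a different model's log for the same benchmark.
results_a = from_inspect("logs/full/..._gpqa-diamond_....eval")  # 198 items x 5 epochs
results_b = from_inspect("logs/full/..._gpqa-diamond_....eval")  # second system, same items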

print(compare(results_a, results_b))          # pairs on shared items automatically
print(standard_error(results_a))              # naive vs cluster-robust, side by side
print(power((results_a, results_b), mde=0.06))  # items needed to detect 6 points
```

Output — real GPQA Diamond results, gpt-5-nano vs gpt-5.4-mini at default settings (the committed scores CSV reproduces this without keys; see [the demo notebook](examples/demo.ipynb)):

```
openai/gpt-5-nano-2025-08-07 is estimated to outperform openai/gpt-5.4-mini-2026-03-17
by 5.9 points, 95% CI [0.4, 11.3] (A−B). The difference is significant at alpha=0.05
(p=0.0363, paired_t).
Pairing reduced the comparison variance by 2.1x: the 198 paired items deliver
the precision of ~420 unpaired items.

Mean score: 0.6758  (n=990 observations)
  Naive i.i.d. SE:    0.0149  ->  95% CI [0.6465, 0.7050]
  Cluster-robust SE:  0.0276  ->  95% CI [0.6213, 0.7302]  (198 clusters by item)
  Inflation: 1.85x  (design effect 3.44)

Detecting a 6.0 points gap at alpha=0.05 with 80% power requires ~334 paired items.
```

The same data, compared unpaired (the eyeball-the-two-intervals test), give 95% CI [−2.1, +13.8], p = 0.15 — a real 5.9-point edge written off as noise. The full story, with figures and the pilot-based power analysis that designed the run, is in the [demo notebook](examples/demo.ipynb).
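
The mechanism behind that gap can be sketched on toy data: when both systems find the same items hard, per-item differences vary far less than the raw scores do. The example uses a normal approximation rather than the paired t-test reported above, and the function name and data are hypothetical.

```python
import math
from statistics import NormalDist, mean, variance


def paired_vs_unpaired(a, b, alpha=0.05):
    """Compare CIs on mean(a) - mean(b) with and without pairing (normal approximation)."""
    n = len(a)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    diff = mean(a) - mean(b)

    # Unpaired: variances add, ignoring that both systems saw the same items.
    se_unpaired = math.sqrt(variance(a) / n + variance(b) / n)
    # Paired: work on per-item differences, so shared item difficulty cancels.
    se_paired = math.sqrt(variance([x - y for x, y in zip(a, b)]) / n)

    return {
        "diff": diff,
        "ci_unpaired": (diff - z * se_unpaired, diff + z * se_unpaired),
        "ci_paired": (diff - z * se_paired, diff + z * se_paired),
        "variance_reduction": (se_unpaired / se_paired) ** 2,
    }


# Hypothetical per-item mean scores for two systems on the same 8 items.
a = [0.9, 0.2, 0.8, 0.4, 1.0, 0.1, 0.7, 0.6]
b = [0.8, 0.1, 0.8, 0.2, 0.9, 0.0, 0.6, 0.5]
out = paired_vs_unpaired(a, b)
print(out)
```

On this toy data the unpaired interval straddles zero while the paired interval does not, which is the same pattern as the GPQA Diamond result.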

Not on Inspect? Use the escape hatch:

```python
from evalconfidence import from_dataframe
results = from_dataframe(df, item_id="qid", model_id="system", score="acc")
```

## Install

```bash
pip install evalconfidence              # core: numpy + scipy only
pip install "evalconfidence[inspect]"   # + Inspect AI log reading
```

For development (from a clone):

```bash
pip install -e ".[dev]"     # + pytest, pandas (for tests)
pip install -e ".[demo]"    # + matplotlib, jupyter (for the demo notebook)
```

## What's here (v0.1.0)

- [x] `ItemResult` normalized representation + `from_inspect` + `from_dataframe`
- [x] `standard_error()` — naive vs. cluster-robust side by side, inflation factor
- [x] `compare()` — paired comparison of two systems (paired-t / McNemar), variance-reduction factor, unpaired fallback with warning
- [x] `power()` — required n ↔ minimum detectable effect, pairing- and cluster-aware
- [x] Demo notebook — three figures (wrong winner / false confidence / budget planning) on real GPQA Diamond data, generated for ~$4 and re-runnable from the committed CSV with no keys: [examples/demo.ipynb](examples/demo.ipynb)

### On the roadmap

- PPI (prediction-powered inference) for debiasing LLM-judge scores
- Multiple-comparison correction for task suites
- Possible upstream contribution to Inspect AI ([#4206](https://github.com/UKGovernmentBEIS/inspect_ai/issues/4206) tracks a related proposal)

License: Apache-2.0
