Metadata-Version: 2.4
Name: deltagate
Version: 0.1.0
Summary: Statistical validation for LLM/ML eval comparisons: paired delta CIs, multiple-testing correction, deflated significance, power analysis, and noise diagnostics. Most reported eval deltas are noise — this gates them.
Project-URL: Homepage, https://github.com/yongzhe2160cs/eval-reliability
Project-URL: Source, https://github.com/yongzhe2160cs/eval-reliability
Project-URL: Issues, https://github.com/yongzhe2160cs/eval-reliability/issues
Author: yongzhe2160cs
License: MIT
License-File: LICENSE
Keywords: bca,benchmark,benjamini-hochberg,bootstrap,confidence-interval,evaluation,holm-bonferroni,inspect-ai,llm,lm-evaluation-harness,multiple-testing,power-analysis,reproducibility,significance,statistics
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Typing :: Typed
Requires-Python: >=3.10
Requires-Dist: numpy>=1.23
Provides-Extra: dev
Requires-Dist: pytest>=7.4; extra == 'dev'
Requires-Dist: ruff>=0.5; extra == 'dev'
Description-Content-Type: text/markdown

# deltagate

**Statistical validation for LLM/ML eval comparisons. Most reported eval deltas are noise — this gates them.**

[![CI](https://github.com/yongzhe2160cs/eval-reliability/actions/workflows/ci.yml/badge.svg)](https://github.com/yongzhe2160cs/eval-reliability/actions/workflows/ci.yml)

A model is declared "better" on a one-number delta. A suite of 12 tasks gets
scanned for wins. A sample size is chosen by budget, and a 1-point gap is then
reported as a finding. These claims usually carry no error bars, and the three
ways they go wrong are specific and fixable:

1. **The comparison is paired, but analysed as independent.** Two models run on
   the *same* samples share per-sample difficulty; the correct standard error
   comes from the per-sample *differences*, which can be several times tighter.
   Getting this wrong fails in both directions — the unpaired test misses real
   effects, while eyeballing two means manufactures fake ones.
2. **Multiple comparisons.** Scan 10 null tasks at α=0.05 and at least one
   "significant win" appears in about 40% of suites (1 − 0.95^10 ≈ 0.40), by
   construction.
3. **No power analysis.** If the minimum detectable delta at your n is 3
   points, an observed 1-point delta is unresolvable — more samples needed,
   not more discussion.
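
The gap between the paired and unpaired standard errors in point 1 is easy to reproduce. The sketch below uses synthetic 0/1 scores (an assumption, not data from this repo) in which model A copies model B on most samples and differs on about 5% of them. `sketch_paired_se` and `sketch_unpaired_se` are illustrative helpers, not part of the `deltagate` API.

```python
import numpy as np


def sketch_paired_se(a: np.ndarray, b: np.ndarray) -> float:
    """SE of the mean per-sample difference (the paired estimate)."""
    diff = a - b
    return float(np.std(diff, ddof=1) / np.sqrt(diff.size))


def sketch_unpaired_se(a: np.ndarray, b: np.ndarray) -> float:
    """SE of the difference of two means, treating the runs as independent."""
    return float(np.sqrt(np.var(a, ddof=1) / a.size + np.var(b, ddof=1) / b.size))


# Synthetic data (hypothetical): B is right on ~55% of samples, and A gives
# the same answer as B except on ~5% of samples, where it flips the outcome.
rng = np.random.default_rng(0)
n = 500
scores_b = (rng.random(n) < 0.55).astype(float)
scores_a = scores_b.copy()
differs = rng.random(n) < 0.05
scores_a[differs] = 1.0 - scores_b[differs]

se_paired = sketch_paired_se(scores_a, scores_b)
se_unpaired = sketch_unpaired_se(scores_a, scores_b)
print(f"paired SE   = {se_paired:.4f}")
print(f"unpaired SE = {se_unpaired:.4f}")
print(f"ratio       = {se_unpaired / se_paired:.1f}x")
```

Because the two runs agree on most samples, the per-sample differences are mostly zero, so their variance is far smaller than the sum of the two marginal variances that the unpaired formula uses.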

`deltagate` is a small, framework-agnostic library (numpy + stdlib, nothing
else) that does these statistics correctly and hands you a decomposable
verdict. It is the sibling of [`edgegate`](https://github.com/yongzhe2160cs/edgegate)
(the same kind of statistical gate, for trading backtests) and generalizes the
eval-reliability toolkit the author contributed to
[Inspect AI](https://github.com/UKGovernmentBEIS/inspect_ai) (the `ci()`
metric and paired/multiplicity/power helpers).

## Install

```bash
pip install git+https://github.com/yongzhe2160cs/eval-reliability
# or from a clone:  pip install -e ".[dev]"
```

## Sixty seconds

```python
import deltagate as dg

# Two models, same samples — sequences already aligned, or {sample_id: score} dicts:
report = dg.evaluate_comparison(scores_a, scores_b, name="math_word_problems")
print(report.render())
# == math_word_problems ==
#   n=500  mean A=0.5820  mean B=0.5260  delta=+0.0560
#   paired 95% CI [+0.0298, +0.0822]  p=2.712e-05  (paired SE 0.0133)
#   BCa bootstrap CI [+0.0303, +0.0820]
#   standardized delta=0.188  P(real)=1.000
#   min detectable delta at n=500: 0.0374
#   verdict: REAL at alpha=0.05: delta +0.0560

# A whole suite, with Holm / Benjamini-Hochberg correction across tasks.
# `suite` maps task name -> (scores_a, scores_b) for that task:
sr = dg.reliability_report({task: (a, b) for task, (a, b) in suite.items()})
print(sr.render())
```

Real eval outputs plug in through adapters (all stdlib-only):

```python
from deltagate.adapters import (
    LMEvalHarnessAdapter, InspectLogAdapter, RawScoresAdapter, compare_runs,
)

# lm-evaluation-harness --log_samples JSONL:
report = compare_runs(LMEvalHarnessAdapter(metric="acc"),
                      "samples_gsm8k_modelA.jsonl", "samples_gsm8k_modelB.jsonl")

# Inspect AI logs (JSON or .eval zip; C/I/P/N mapped like Inspect's value_to_float):
report = compare_runs(InspectLogAdapter(scorer="match"), "logs/a.eval", "logs/b.eval")

# Anything that can dump an (id, score) CSV or JSON:
report = compare_runs(RawScoresAdapter(), "a.csv", "b.csv")
```
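
For readers wondering what a plain `id,score` comparison has to get right before any statistics run, here is a stdlib-only sketch of strict id alignment. It is not the code behind `RawScoresAdapter` or `align_paired`; the function names and the `id`/`score` header names are assumptions made for illustration.

```python
import csv
import io


def sketch_read_scores(text: str) -> dict[str, float]:
    """Parse CSV text with an 'id,score' header row into {id: score}."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["id"]: float(row["score"]) for row in reader}


def sketch_strict_align(
    a: dict[str, float], b: dict[str, float]
) -> tuple[list[float], list[float]]:
    """Return both runs in one shared id order; raise if the id sets differ."""
    if a.keys() != b.keys():
        only_a = sorted(a.keys() - b.keys())
        only_b = sorted(b.keys() - a.keys())
        raise ValueError(f"id mismatch: only in A {only_a}, only in B {only_b}")
    ids = sorted(a)
    return [a[i] for i in ids], [b[i] for i in ids]


# Hypothetical file contents; note the rows are in a different order per run.
run_a = sketch_read_scores("id,score\ns2,0\ns1,1\ns3,1\n")
run_b = sketch_read_scores("id,score\ns1,1\ns2,1\ns3,0\n")
aligned_a, aligned_b = sketch_strict_align(run_a, run_b)
print(aligned_a, aligned_b)
```

Raising on a mismatch, rather than intersecting the ids, keeps a truncated or partially failed run from quietly shrinking the comparison.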

## The demo: "A beats B on 5 of 12 tasks!"

`python examples/demo.py` synthesizes realistic lm-evaluation-harness output
for two models across 12 tasks × 500 samples. The generator mirrors how real
model pairs behave: A answers like B on most samples and differs on a few
(fixing some wrong answers, breaking some right ones). **Ground truth: A is
genuinely better on exactly 2 tasks (+5 points); the other 10 are identical
models**, so every other gap is sampling noise. Four readings of the *same
files*:

```text
reading 1 — the leaderboard (compare two means):
  'A beats B' on 5/12 tasks: ['math_word_problems', 'code_completion',
                              'reading_comp', 'translation_fr', 'instruction_following']

reading 2 — unpaired t-test (the textbook test, WRONG for shared samples):
  'significant' on 0/12 tasks: []
  -> MISSES both real effects: ignoring the pairing throws away the per-sample
     difficulty both models share, so the error bars are several times too wide.

reading 3 — paired tests, NO multiplicity control (p < .05 each):
  'significant' on 3/12 tasks: ['math_word_problems', 'code_completion', 'translation_fr']
  -> includes a false positive: scan 10 null tasks at alpha=.05 and flukes
     are expected — this is what suite-level correction is for.

reading 4 — deltagate (paired CIs + Holm/BH across the suite):
  == suite: 12 tasks, alpha=0.05 ==
  naive per-task 'wins'        : 3 ['math_word_problems', 'code_completion', 'translation_fr']
  survive Holm (family-wise)   : 2 ['math_word_problems', 'code_completion']
  survive Benjamini-Hochberg   : 2 ['math_word_problems', 'code_completion']
```

Exactly the two ground-truth effects survive; the fluke dies at the suite
level; and the nulls are reported honestly, with the noise floor attached:

```text
== translation_fr ==                  <- the naive false positive (p=0.027)
  n=500  mean A=0.7460  mean B=0.7200  delta=+0.0260
  paired 95% CI [+0.0029, +0.0491]  p=0.02739
  -> killed by Holm/BH across the 12-task family

== table_qa ==                        <- an honest null
  n=500  delta=-0.0080  p=0.4144
  min detectable delta at n=500: 0.0275  ** observed delta is below this **
  verdict: UNRESOLVED: ... — more samples needed, not more discussion
```
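
The suite-level step that removes `translation_fr` can be sketched in a few lines of plain Python. The two functions below are textbook Holm and Benjamini-Hochberg adjusted p-values, written for illustration rather than copied from the library. Only the `translation_fr` (0.02739) and `table_qa` (0.4144) p-values come from the output above; the other ten are made up.

```python
def sketch_holm(pvals: list[float]) -> list[float]:
    """Holm step-down adjusted p-values (controls family-wise error)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The k-th smallest p-value is multiplied by (m - k + 1), then made monotone.
        running_max = max(running_max, (m - rank) * pvals[i])
        adjusted[i] = min(1.0, running_max)
    return adjusted


def sketch_bh(pvals: list[float]) -> list[float]:
    """Benjamini-Hochberg step-up adjusted p-values (controls FDR)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m - 1, -1, -1):
        i = order[rank]
        # The k-th smallest p-value is scaled by m / k, then made monotone from the top.
        running_min = min(running_min, pvals[i] * m / (rank + 1))
        adjusted[i] = running_min
    return adjusted


# Index 2 is translation_fr and index 3 is table_qa (from the output above);
# every other value is hypothetical.
pvals = [2.7e-5, 3.0e-4, 0.02739, 0.4144, 0.12, 0.21,
         0.33, 0.48, 0.56, 0.67, 0.81, 0.93]
alpha = 0.05
naive = sum(p < alpha for p in pvals)
holm = sum(p < alpha for p in sketch_holm(pvals))
bh = sum(p < alpha for p in sketch_bh(pvals))
print(naive, holm, bh)  # prints: 3 2 2
```

With 12 tests, the third-smallest p-value is multiplied by 10 under Holm (0.02739 becomes 0.2739) and by 4 under Benjamini-Hochberg (about 0.110), so it clears neither threshold.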

The seed is fixed for reproducibility, not mined: across 40 seeds the unpaired
test misses at least one real effect in ~90% of runs, and ~5% of null tasks
per run clear uncorrected significance — exactly what α predicts.

## Run it on your own files

`examples/compare_runs.py` is the reusable entry point — point it at any two
per-sample score files for the same task:

```bash
python examples/compare_runs.py                      # bundled sample data (lm-eval-shaped)
python examples/compare_runs.py A.jsonl B.jsonl --metric acc            # lm-eval --log_samples
python examples/compare_runs.py a.eval b.eval --format inspect          # Inspect AI logs
python examples/compare_runs.py a.csv  b.csv  --format raw              # plain id,score files
python examples/compare_runs.py A.jsonl B.jsonl --n-trials 25 \
       --trial-deltas "0.02,-0.01,..."               # best-of-N selection correction
```

On the bundled sample pair (400 samples, a real +5.5-point effect):

```text
== samples_gsm8k_modelA vs samples_gsm8k_modelB ==
  n=400  mean A=0.6325  mean B=0.5775  delta=+0.0550
  paired 95% CI [+0.0207, +0.0893]  p=0.001657  (paired SE 0.0175)
  BCa bootstrap CI [+0.0225, +0.0900]
  standardized delta=0.157  P(real)=0.999
  min detectable delta at n=400: 0.0490
  verdict: REAL at alpha=0.05: delta +0.0550
```

If you claim a best-of-N selection correction without supplying the other
trials' deltas, the verdict says so explicitly ("UNCORRECTED for selection")
rather than silently pretending the correction was applied: the library
refuses to guess the trial variance.
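
To see why the trial variance matters, here is a sketch of the expected-maximum approximation from Bailey & López de Prado's Deflated Sharpe Ratio work, applied to deltas instead of Sharpe ratios. Whether `expected_max_std_delta` uses exactly this form is an assumption; `sketch_expected_max` and the trial deltas below are hypothetical.

```python
import math
from statistics import NormalDist, stdev

EULER_GAMMA = 0.5772156649015329


def sketch_expected_max(trial_sd: float, n_trials: int) -> float:
    """Approximate expected best delta among n_trials pure-noise trials.

    trial_sd is the standard deviation of deltas across trials; the result
    is in the same units.
    """
    if n_trials < 2:
        raise ValueError("selection correction needs n_trials >= 2")
    z = NormalDist().inv_cdf
    return trial_sd * (
        (1.0 - EULER_GAMMA) * z(1.0 - 1.0 / n_trials)
        + EULER_GAMMA * z(1.0 - 1.0 / (n_trials * math.e))
    )


# Hypothetical deltas from a handful of the trials, used only to estimate
# the spread across trials.
trial_deltas = [0.012, -0.008, 0.003, 0.021, -0.015, 0.006]
hurdle = sketch_expected_max(stdev(trial_deltas), n_trials=25)
print(f"best-of-25 noise hurdle: {hurdle:+.4f}")
```

The hurdle scales linearly with the spread across trials, which is why a correction computed from a guessed variance could be off by any factor.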

## What's in the box

| API | What it gives you |
| --- | --- |
| `paired_delta`, `align_paired` | Paired CI + significance on per-sample differences (the correctness point), with strict id alignment |
| `holm_bonferroni`, `benjamini_hochberg` | Suite-level corrections — family-wise error / false discovery rate — with adjusted p-values |
| `min_samples_for_delta`, `power_for_samples` | Power analysis (textbook check: d/σ = 0.5 at two-sided α = 0.05 and 80% power ⇒ n = 32) |
| `bootstrap_ci`, `bootstrap_delta_ci`, `percentile_stat` | Percentile & **BCa** bootstrap CIs, incl. tail percentiles (p95 score, worst-decile delta) |
| `probabilistic_delta`, `deflated_delta`, `expected_max_std_delta` | Selection-bias-aware significance: "you tried 25 prompt variants and report the best — is the delta still real?" |
| `variance_components`, `minimum_detectable_delta`, `red_flags` | Noise diagnostics: clustered SE + design effect, the eval's noise floor, contamination red flags (identical runs, constant shifts, saturated benchmarks) |
| `evaluate_comparison`, `reliability_report` | One call from two runs (or a suite) to a decomposable verdict |
| `deltagate.adapters` | `LMEvalHarnessAdapter`, `InspectLogAdapter`, `RawScoresAdapter`, and a `ScoresAdapter` protocol for new frameworks |

Design choices worth knowing:

- **Paired everywhere.** Comparison APIs take per-sample scores aligned by id,
  and `align_paired` refuses mismatched id sets rather than silently
  intersecting them (a mismatch usually means a broken run, not a choice).
- **BCa with tie-aware bias correction.** Binary accuracy makes bootstrap
  distributions lumpy; the `z0` estimate half-weights exact ties so discrete
  metrics don't pick up a spurious bias correction.
- **Deflation refuses to guess.** `deflated_delta` with `n_trials > 1` returns
  NaN unless you supply the trial variance/deltas — a silently-guessed
  selection correction would be worse than none.
- **Verdicts decompose.** `ComparisonReport` exposes every number behind the
  verdict (paired stats, BCa bounds, P(real), minimum detectable delta, red
  flags) — "trust me" is the failure mode this library exists to end.
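
The tie-aware bias correction mentioned above can be illustrated as follows. This is a minimal sketch of the idea (count bootstrap replicates strictly below the observed statistic, plus half of the exact ties), not the library's implementation; `sketch_z0`, its `tie_weight` parameter, and the clipping rule are assumptions.

```python
from statistics import NormalDist

import numpy as np


def sketch_z0(boot_stats, observed: float, tie_weight: float = 0.5) -> float:
    """BCa bias-correction term z0, giving exact ties a partial weight."""
    boot = np.asarray(boot_stats, dtype=float)
    below = np.count_nonzero(boot < observed)
    ties = np.count_nonzero(boot == observed)
    frac = (below + tie_weight * ties) / boot.size
    # Keep the fraction strictly inside (0, 1) so the inverse CDF is finite.
    eps = 0.5 / boot.size
    frac = min(max(frac, eps), 1.0 - eps)
    return NormalDist().inv_cdf(frac)


# Binary accuracy on 200 synthetic samples: bootstrap means land on a grid of
# multiples of 1/200, so many replicates tie exactly with the observed mean.
rng = np.random.default_rng(1)
scores = (rng.random(200) < 0.7).astype(float)
observed = scores.mean()
boot = rng.choice(scores, size=(2000, scores.size), replace=True).mean(axis=1)

print(f"z0, ties half-weighted: {sketch_z0(boot, observed):+.3f}")
print(f"z0, ties ignored:       {sketch_z0(boot, observed, tie_weight=0.0):+.3f}")
```

With ties ignored, every replicate that equals the observed value counts as "not below", which pushes `z0` negative even when the bootstrap distribution is symmetric around the estimate.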

## Statistical provenance

The math is ported from two bodies of prior, separately-verified work by the
same author, not invented here:

- the **Inspect AI eval-reliability contribution** — paired delta,
  Holm/Benjamini-Hochberg, power, variance components, each validated against
  hand computations and `scipy.stats` references in that work's test suite;
- the **`edgegate`** trading-validation library — the Probabilistic/Deflated
  Sharpe Ratio machinery (Bailey & López de Prado, with the full
  skew/kurtosis correction) and the normal inverse-CDF, here adapted from
  return series to standardized score deltas.

This package's own 40-test suite re-asserts the same reference numbers: the
hand-computed paired case (δ=0.75, SE=0.25, p=2Φ(−3)), the Holm/BH reference
p-sets, the textbook power n=32, BCa-vs-normal agreement on symmetric data,
BCa tail-percentile coverage, and DSR ≤ PSR monotonicity.
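
Two of these reference numbers can be checked by hand with the standard library. The four differences below are one data set that reproduces δ=0.75 and SE=0.25; they are an assumption chosen for illustration, since the suite's actual fixture is not shown here. The power check uses the usual normal-approximation sample-size formula for a paired comparison.

```python
import math
from statistics import NormalDist, mean, stdev

norm = NormalDist()

# Paired case: hypothetical per-sample differences (A minus B).
diffs = [1.0, 1.0, 1.0, 0.0]
delta = mean(diffs)                          # 0.75
se = stdev(diffs) / math.sqrt(len(diffs))    # 0.5 / 2 = 0.25
z = delta / se                               # 3.0
p_value = 2.0 * norm.cdf(-abs(z))            # 2 * Phi(-3)

# Power case: smallest n that detects a standardized effect of 0.5
# at two-sided alpha = 0.05 with 80% power.
alpha, power, effect = 0.05, 0.80, 0.5
n_required = math.ceil(
    ((norm.inv_cdf(1.0 - alpha / 2.0) + norm.inv_cdf(power)) / effect) ** 2
)

print(f"delta={delta}  se={se}  p={p_value:.6f}  n_required={n_required}")
```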

Honest limits: intervals and p-values use large-sample normal approximations
(fine at eval scale, n ≳ 50 pairs); per-sample scores are assumed exchangeable
within a task (use `variance_components`' cluster support when they aren't);
and red flags are heuristics to investigate, not verdicts.
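
When samples are not exchangeable (for example, several questions drawn from one passage), the usual remedy is a cluster-robust standard error. The sketch below is a standard estimator for the SE of a mean with a G/(G−1) small-sample factor; it is not taken from `variance_components`, and the function names and data are hypothetical.

```python
import numpy as np


def sketch_iid_se(values) -> float:
    """SE of the mean assuming independent samples."""
    x = np.asarray(values, dtype=float)
    return float(np.std(x, ddof=1) / np.sqrt(x.size))


def sketch_clustered_se(values, cluster_ids) -> float:
    """Cluster-robust SE of the mean: residuals are summed within clusters."""
    x = np.asarray(values, dtype=float)
    resid = x - x.mean()
    _, inverse = np.unique(np.asarray(cluster_ids), return_inverse=True)
    cluster_sums = np.bincount(inverse.ravel(), weights=resid)
    g = cluster_sums.size
    var = (g / (g - 1)) * np.sum(cluster_sums**2) / x.size**2
    return float(np.sqrt(var))


# Hypothetical per-sample deltas: 4 clusters of 4 samples that move together.
deltas = [1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0]
clusters = [0] * 4 + [1] * 4 + [2] * 4 + [3] * 4

se_iid = sketch_iid_se(deltas)
se_clustered = sketch_clustered_se(deltas, clusters)
design_effect = (se_clustered / se_iid) ** 2
print(f"iid SE={se_iid:.4f}  clustered SE={se_clustered:.4f}  "
      f"design effect={design_effect:.2f}")
```

A design effect well above 1 means the effective sample size is much smaller than the row count, so intervals computed under the exchangeability assumption would be too narrow.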

## Development

```bash
uv venv && uv pip install -e ".[dev]"
pytest -q                              # 40 tests
ruff check . && ruff format --check .
python examples/demo.py
```

### Publishing

The package is PyPI-ready (`python -m build` produces a wheel + sdist that
pass `twine check`; the name `deltagate` was available at the time of
writing). Publishing requires the maintainer's PyPI token:

```bash
python -m build && twine upload dist/*
```

MIT license.

---

*`deltagate` is part of a statistical-rigor-for-AI-evals toolkit: [agentrel](https://github.com/yongzhe2160cs/agent-eval-reliability) (reliability stats for stochastic agent evals), [calibstats](https://github.com/yongzhe2160cs/calibration-toolkit) (calibration metrics with confidence intervals), [leaderboard-ci](https://github.com/yongzhe2160cs/leaderboard-reliability) (leaderboard re-ranking with CIs and tie bands). Full portfolio: [github.com/yongzhe2160cs](https://github.com/yongzhe2160cs).*
