Metadata-Version: 2.4
Name: holdout-evals
Version: 0.1.0
Summary: An independent significance referee for LLM & agent evals — is your improvement real, or noise?
Project-URL: Homepage, https://holdout.dev
Project-URL: Source, https://github.com/jordan-baillie/holdout
Author: Jordan Baillie
License: MIT
License-File: LICENSE
Keywords: ab-testing,evals,evaluation,llm,mcnemar,overfitting,permutation-test,significance,statistics
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.9
Requires-Dist: numpy>=1.21
Provides-Extra: dev
Requires-Dist: pytest>=7; extra == 'dev'
Description-Content-Type: text/markdown

# holdout

**An independent significance referee for LLM & agent evals.** Is your improvement real — or
noise, multiple-comparisons inflation, or a model that quietly memorized your test set?

Most eval "wins" don't survive a paired significance test. `holdout` runs the three checks your
eval dashboard skips, in your code or in CI:

1. **Is it signal?** A *paired* test (exact McNemar for pass/fail, paired permutation for graded
   scores) with a real confidence interval — not a naked delta.
2. **Or did you just try a lot of things?** The bar rises with how many variants you tried. The
   max of 37 noisy attempts is *expected* to look like a win.
3. **What would change the verdict?** Power analysis: how many tasks you'd actually need.

The stats are open source (this repo). The hosted service ([holdout.dev](https://holdout.dev))
adds the parts code can't promise: **independence**, a **write-once holdout you can't re-tune
against**, a contamination scan, and a verifiable badge.

## Install

```bash
pip install holdout-evals      # PyPI name is holdout-evals; the import name is `holdout`
```

## Quickstart — Python

```python
from holdout import compare

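# per-task scores for the SAME tasks, in the same order (0/1 for pass-fail, or floats)
# toy placeholder data so the snippet runs as-is; substitute your own eval results
baseline_scores = [1, 0, 1, 1, 0, 1, 0, 1]
candidate_scores = [1, 1, 1, 1, 0, 1, 0, 1]
res = compare(baseline_scores, candidate_scores, variants_tried=37)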

print(res.report())
print(res.significant)        # False — gate on this
print(res.p_value, res.ci)    # the honest numbers
```
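
For graded (float) scores, the idea behind a paired permutation test can be sketched in a few lines. This is an independent illustration using only the standard library, not `holdout`'s implementation; the function name `paired_permutation_p`, the resample count, and the seed are all hypothetical choices made for this example.

```python
import random


def paired_permutation_p(baseline, candidate, n_resamples=10_000, seed=0):
    """Two-sided sign-flip permutation p-value for the summed paired difference.

    Illustrative sketch only; the name and defaults are assumptions.
    """
    if len(baseline) != len(candidate):
        raise ValueError("scores must be paired: same tasks, same order")
    diffs = [c - b for b, c in zip(baseline, candidate)]
    if not diffs:
        raise ValueError("need at least one paired task")

    observed = abs(sum(diffs))
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the two systems are exchangeable on each
        # task, so every per-task difference is equally likely to have either sign.
        total = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(total) >= observed - 1e-12:  # small tolerance for float rounding
            hits += 1
    # Add-one correction keeps the Monte Carlo estimate strictly above zero.
    return (hits + 1) / (n_resamples + 1)
```

Because the test only ever flips the sign of per-task differences, it never compares scores across different tasks, which is what makes it a paired test.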

## Quickstart — CLI (drop it in CI)

```bash
python examples/make_example.py     # writes a +4-point "win" that is actually noise

holdout check examples/v2.jsonl --baseline examples/v1.jsonl --variants 37
```

```
  Holdout - significance check                                  [FAIL]
  baseline 73.0%  ->  candidate 77.0%   (n = 200 tasks)
  effect          +4.0 pts        95% CI [-0.5, +8.5]
  test            mcnemar_exact   p = 0.134
  variants tried  37   ->  adjusted p = 1.000   (any-false-win risk 85%)
  paired counts   +15 fixed / -7 broke (net +8)

  VERDICT: WITHIN NOISE - not statistically significant.
  -> Don't ship on this alone; the gain is indistinguishable from sampling noise.
     You'd need ~967 tasks for an effect this size to be detectable.
```
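
Three numbers in the report above can be checked by hand from the paired counts. The sketch below uses only the standard library; the function names are hypothetical, and the Bonferroni-style cap on the last line is an assumption that happens to agree with the displayed `adjusted p = 1.000`, not a statement of which correction `holdout` actually applies.

```python
from math import comb


def mcnemar_exact_p(fixed, broke):
    """Two-sided exact McNemar p-value from the discordant pair counts."""
    n = fixed + broke
    if n == 0:
        return 1.0  # no task changed outcome, so there is no evidence either way
    k = min(fixed, broke)
    # Under the null, each discordant pair is a fair coin flip: Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def any_false_win_risk(variants, alpha=0.05):
    """Chance that at least one of `variants` independent null tests passes at alpha."""
    return 1 - (1 - alpha) ** variants


p = mcnemar_exact_p(fixed=15, broke=7)
print(100 * (15 - 7) / 200)               # 4.0 (effect in points, n = 200)
print(round(p, 3))                        # 0.134
print(round(any_false_win_risk(37), 2))   # 0.85
print(min(1.0, 37 * p))                   # 1.0 (Bonferroni-style cap, an assumption)
```

Only the 22 tasks whose outcome changed carry information for the test; the 178 tasks where both systems agree drop out of the p-value entirely.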

`holdout check` **exits non-zero** when the improvement isn't a significant gain, so it blocks a
"ship the noise" merge. As a step in a GitHub Actions workflow (with `N_VARIANTS` set in `env`):

```yaml
- run: holdout check evals/candidate.jsonl --baseline evals/baseline.jsonl --variants ${{ env.N_VARIANTS }}
```

Input is JSONL, one `{ "task_id": ..., "score": ... }` object per line (`correct`, `pass`, or
`reward` are also accepted in place of `score`; booleans and 0/1 become 0.0/1.0). Pass one file
per system, joined on `task_id`, or a single `--paired` file with `baseline` and `candidate` columns.
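
To make the join on `task_id` concrete, here is a minimal sketch of reading two such files into aligned score lists. It is not the CLI's actual parser: the helper names, the order in which alternative score keys are tried, and the choice to silently drop tasks that appear in only one file are all assumptions for illustration.

```python
import json

# Accepted score keys, per the description above; the lookup order is an assumption.
SCORE_KEYS = ("score", "correct", "pass", "reward")


def parse_scores(lines):
    """Map task_id -> float score from an iterable of JSONL lines."""
    scores = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        key = next((k for k in SCORE_KEYS if k in record), None)
        if key is None:
            raise ValueError(f"no score field in record: {record!r}")
        scores[record["task_id"]] = float(record[key])  # True/False -> 1.0/0.0
    return scores


def pair_by_task(baseline, candidate):
    """Return two aligned score lists covering only the tasks both systems ran."""
    shared = sorted(baseline.keys() & candidate.keys(), key=str)
    return [baseline[t] for t in shared], [candidate[t] for t in shared]
```

With real files, `parse_scores(open(path))` works because a file object iterates line by line; the two aligned lists are then in the shape the Python quickstart expects.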

## How many tasks do I need?

```bash
holdout power --baseline-acc 0.75 --effect 0.03 --variants 37
```
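
For intuition about what drives the answer, the sketch below implements a textbook normal-approximation sample size for a paired pass/fail comparison (Connor's 1987 formula for McNemar's test) with a Bonferroni-adjusted significance level. It is not claimed to reproduce the output of `holdout power`: the function name, the 80% power default, the Bonferroni adjustment, and the `discordant_rate` parameter (the fraction of tasks on which the two systems disagree) are all assumptions of this example.

```python
from math import ceil, sqrt
from statistics import NormalDist


def tasks_needed(effect, discordant_rate, alpha=0.05, power=0.8, variants=1):
    """Approximate number of paired tasks needed to detect `effect`.

    effect          : accuracy gain as a fraction (0.03 means +3 points)
    discordant_rate : fraction of tasks where exactly one system passes
    variants        : number of variants tried (Bonferroni: alpha / variants)
    """
    if not 0 < effect <= discordant_rate <= 1:
        raise ValueError("need 0 < effect <= discordant_rate <= 1")
    z = NormalDist().inv_cdf
    z_alpha = z(1 - (alpha / variants) / 2)  # two-sided critical value
    z_power = z(power)
    numerator = (z_alpha * sqrt(discordant_rate)
                 + z_power * sqrt(discordant_rate - effect ** 2))
    return ceil((numerator / effect) ** 2)


# More variants tried means a stricter threshold, hence more tasks.
print(tasks_needed(0.03, 0.20, variants=1))
print(tasks_needed(0.03, 0.20, variants=37))
```

The required sample size grows roughly with the inverse square of the effect, so halving the gain you want to detect roughly quadruples the number of tasks.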

## Why not just compute it yourself?

You can — that's why the math is free. The point of the [hosted service](https://holdout.dev) is
the four things a local script can't credibly promise: an **independent** verdict (we didn't build
the agent), a **write-once holdout** scored exactly once per config (no quiet re-tuning), a
**variants bar that spans your whole team's submissions**, and a **verifiable badge**.

## Reading

The methodology follows the published literature on eval rigor — paired tests (Dietterich 1998),
multiple-comparisons control (Benjamini–Hochberg 1995), benchmark contamination (Zhang et al.
2024, *GSM1k*), and power for evals (Miller 2024, *Adding Error Bars to Evals*).

MIT licensed. Contributions and corrections welcome — that's the whole point.
