Metadata-Version: 2.3
Name: calibstats
Version: 0.1.0
Summary: Calibration metrics with bootstrap confidence intervals — because a bare ECE is not enough.
Keywords: calibration,ece,reliability,uncertainty,machine-learning,evaluation
Author: yongzhe2160cs
License: MIT
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Typing :: Typed
Requires-Dist: numpy>=1.23
Requires-Dist: scipy>=1.9
Requires-Dist: matplotlib>=3.6 ; extra == 'viz'
Requires-Python: >=3.10
Project-URL: Homepage, https://github.com/yongzhe2160cs/calibration-toolkit
Project-URL: Repository, https://github.com/yongzhe2160cs/calibration-toolkit
Provides-Extra: viz
Description-Content-Type: text/markdown

# calibstats

**Calibration metrics with confidence intervals — because a bare ECE number is not enough.**

[![CI](https://github.com/yongzhe2160cs/calibration-toolkit/actions/workflows/ci.yml/badge.svg)](https://github.com/yongzhe2160cs/calibration-toolkit/actions/workflows/ci.yml)
[![Python 3.10+](https://img.shields.io/badge/python-3.10%2B-blue.svg)](https://www.python.org/)
[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)

A model says it is 90% confident. Is it right 90% of the time? *Calibration* asks
whether predicted probabilities match observed frequencies. The standard metric,
**ECE** (Expected Calibration Error), is almost always reported as a single
number — and that number is **biased and noisy** at the sample sizes real evals
run on.

`calibstats` reports every metric as **`estimate ± CI`**, ships a **bias-corrected
estimator**, and adds reliability diagrams with confidence bands, post-hoc
recalibration, and subgroup-shift significance tests. It is framework-agnostic:
feed it `(predicted_prob, label)` arrays from any model.

## Why calibration needs confidence intervals

On a **perfectly calibrated** model (true ECE = 0), the plug-in ECE estimator
still averages **0.12 at n = 100** and **0.05 at n = 500**. That is pure
estimator bias: each bin's `|accuracy − confidence|` is non-negative, so sampling
noise in a finite bin can only push the sum above zero. So:

- **An "ECE = 0.05" headline can mean a perfectly calibrated model** at small n.
- **ECEs are not comparable across papers** unless n *and* bin count match.
- **A point estimate with no interval hides whether you have signal or noise** —
  at n = 200 the 95% CI for ECE is ~0.10 wide, often wider than the value itself.

The fix is not exotic: attach a bootstrap CI, correct the bias, and test
differences instead of eyeballing them. That is the whole pitch. See
[**STUDY.md**](STUDY.md) for the quantitative demonstration and figures.

<p align="center">
  <img src="study/figures/ece_bias_vs_n.png" width="520" alt="ECE bias vs sample size">
</p>
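
The bias is easy to reproduce without the library. The following is a minimal, standard-library sketch, not the `calibstats` implementation: the helper names `plugin_ece` and `mean_ece_on_calibrated_model` are illustrative, and it assumes 15 equal-width bins and uniformly distributed probabilities. Labels are drawn from the predicted probabilities themselves, so the simulated model is perfectly calibrated by construction.

```python
import random


def plugin_ece(probs, labels, n_bins=15):
    """Plug-in ECE with equal-width bins: sum_b (n_b / n) * |acc_b - conf_b|."""
    # Per bin: [sum of predicted probs, sum of labels, count].
    sums = [[0.0, 0.0, 0] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        sums[b][0] += p
        sums[b][1] += y
        sums[b][2] += 1
    n = len(probs)
    # (n_b / n) * |acc_b - conf_b| simplifies to |sum_y - sum_p| / n.
    return sum(abs(s_y - s_p) / n for s_p, s_y, count in sums if count)


def mean_ece_on_calibrated_model(n, n_trials=100, seed=0):
    """Average plug-in ECE over repeated draws from a calibrated model."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        probs = [rng.random() for _ in range(n)]
        labels = [1 if rng.random() < p else 0 for p in probs]  # true P(y=1) = p
        total += plugin_ece(probs, labels)
    return total / n_trials


for n in (100, 500, 2000):
    print(n, round(mean_ece_on_calibrated_model(n), 3))
```

The printed averages stay above zero and shrink as n grows; the exact values depend on the bin count and on how the probabilities are distributed.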

## Install

```bash
pip install calibstats          # core (numpy, scipy)
pip install calibstats[viz]     # + matplotlib for reliability_diagram
```

Or with [uv](https://docs.astral.sh/uv/) from a clone:

```bash
uv sync --extra viz
```

## Quick start

```python
import calibstats as cs

# Binary: probs is 1-D P(y=1); labels in {0,1}.
# (Here a synthetic overconfident model with a known temperature.)
data = cs.make_binary(2000, temperature=2.0, seed=0)

# The whole picture, every metric with a shared bootstrap CI:
report = cs.calibration_report(data.probs, data.labels, n_boot=2000)
print(report)
```

```
Calibration report  (n=2000, 15 uniform bins)
----------------------------------------------------------
metric            estimate                95% CI
----------------------------------------------------------
ece                 0.0966      [0.0842, 0.1171]
ace                 0.0924      [0.0803, 0.1137]
mce                 0.1928      [0.1570, 0.2812]
debiased_ece        0.0970      [0.0853, 0.1242]
brier               0.1742      [0.1620, 0.1867]
nll                 0.5648      [0.5245, 0.6071]
----------------------------------------------------------
Brier decomposition (cal - res + unc = brier):
  calibration=0.0110  resolution=0.0862  uncertainty=0.2500
```
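
The identity printed under the report can be checked by hand. Below is a minimal sketch of the Murphy (1973) decomposition that groups examples by distinct forecast value; `murphy_decomposition` is an illustrative helper, not the library's `brier_decomposition`, and the eight data points are made up for the example.

```python
from collections import defaultdict


def murphy_decomposition(probs, labels):
    """Return (calibration, resolution, uncertainty), grouping by forecast value."""
    groups = defaultdict(list)
    for p, y in zip(probs, labels):
        groups[p].append(y)
    n = len(labels)
    base_rate = sum(labels) / n
    # Calibration: squared gap between each forecast and its observed frequency.
    calibration = sum(
        len(ys) * (p - sum(ys) / len(ys)) ** 2 for p, ys in groups.items()
    ) / n
    # Resolution: how far the per-forecast frequencies move from the base rate.
    resolution = sum(
        len(ys) * (sum(ys) / len(ys) - base_rate) ** 2 for ys in groups.values()
    ) / n
    uncertainty = base_rate * (1 - base_rate)
    return calibration, resolution, uncertainty


probs = [0.2, 0.2, 0.2, 0.2, 0.8, 0.8, 0.8, 0.8]
labels = [0, 0, 0, 1, 1, 1, 0, 1]
cal, res, unc = murphy_decomposition(probs, labels)
brier = sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)
print(cal - res + unc, brier)
```

The identity is exact when forecasts take a small set of distinct values, as here. When continuous probabilities are grouped into bins, it generally holds only approximately, because forecasts vary within each bin.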

### Individual metrics, each CI-ready

```python
probs, labels = data.probs, data.labels   # arrays from the Quick start above

cs.ece(probs, labels)                 # Expected Calibration Error (equal-width, l1)
cs.ace(probs, labels)                 # Adaptive ECE (equal-mass / quantile bins)
cs.mce(probs, labels)                 # Maximum Calibration Error
cs.debiased_ece(probs, labels)        # bias-corrected (l2) estimator
cs.brier_score(probs, labels)         # mean squared error of probabilities
cs.brier_decomposition(probs, labels) # calibration / refinement / resolution / uncertainty
cs.nll(probs, labels)                 # negative log-likelihood (log loss)

# Wrap ANY of them in a bootstrap CI:
ci = cs.bootstrap_ci(probs, labels, cs.ece, n_boot=2000)
print(ci)                # 0.0966  [0.0842, 0.1171] (95% CI)
print(ci.estimate, ci.ci, ci.se)
```
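
For intuition, a percentile bootstrap can be sketched in a few lines of standard-library Python. This is an assumption-laden illustration rather than the `bootstrap_ci` implementation: the library may use a different interval method, and `percentile_bootstrap_ci` and `brier` are hypothetical helpers written for this example.

```python
import math
import random


def brier(probs, labels):
    """Mean squared error between predicted probabilities and 0/1 labels."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def percentile_bootstrap_ci(probs, labels, metric, n_boot=1000, alpha=0.05, seed=0):
    """Resample (prob, label) pairs with replacement; take empirical quantiles."""
    rng = random.Random(seed)
    n = len(probs)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # one bootstrap resample
        stats.append(metric([probs[i] for i in idx], [labels[i] for i in idx]))
    stats.sort()
    lo = stats[math.floor((alpha / 2) * (n_boot - 1))]
    hi = stats[math.ceil((1 - alpha / 2) * (n_boot - 1))]
    return metric(probs, labels), (lo, hi)


# Synthetic, perfectly calibrated data for the demonstration.
rng = random.Random(1)
probs = [rng.random() for _ in range(300)]
labels = [1 if rng.random() < p else 0 for p in probs]

estimate, (lo, hi) = percentile_bootstrap_ci(probs, labels, brier)
print(f"{estimate:.4f}  [{lo:.4f}, {hi:.4f}]")
```

Resampling pairs, rather than probabilities and labels separately, keeps each prediction attached to its outcome, which is what any calibration metric needs.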

### Reliability diagram with confidence bands

```python
import matplotlib.pyplot as plt
cs.reliability_diagram(probs, labels, n_bins=15)
plt.savefig("reliability.png")
# Or get the bins without plotting (Wilson 95% bands included):
bins = cs.reliability_data(probs, labels, n_bins=15)
```
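
The Wilson score interval behind the bands is a closed-form formula. The sketch below shows the standard formula for a single bin with `successes` positives out of `n` examples; `wilson_interval` is an illustrative helper, and how `reliability_data` handles empty bins is not stated here, so the `n == 0` branch is an assumption.

```python
import math


def wilson_interval(successes, n, z=1.959963984540054):
    """Wilson score interval for a binomial proportion (default z gives 95%)."""
    if n == 0:
        return (0.0, 1.0)  # assumption: an empty bin carries no information
    p_hat = successes / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


print(wilson_interval(5, 10))
print(wilson_interval(0, 10))
```

Unlike the normal-approximation interval, the Wilson interval stays inside [0, 1] and does not collapse to zero width when a bin has 0 or n positives, which matters for sparsely populated bins.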

### Recalibration (fit on a holdout)

```python
ts = cs.TemperatureScaler(input="probs").fit(probs_val, labels_val)
probs_test_cal = ts.transform(probs_test)
print("recovered T:", ts.temperature_)

# Binary-only, two-parameter alternative:
ps = cs.PlattScaler(input="probs").fit(probs_val, labels_val)
```

`TemperatureScaler` handles binary (1-D) and multiclass (2-D) and accepts either
`input="logits"` or `input="probs"`. It is accuracy-preserving — argmax never
changes.
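
The accuracy-preserving property follows from the operation itself: dividing all logits by the same positive temperature keeps their order. A minimal standard-library sketch (the helpers `softmax` and `temperature_scale` are illustrative, not the `TemperatureScaler` internals):

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def temperature_scale(logits, temperature):
    """Divide logits by T > 0, then renormalize with softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    return softmax([v / temperature for v in logits])


logits = [2.0, 0.5, -1.0]
for t in (0.5, 1.0, 2.0):
    scaled = temperature_scale(logits, t)
    print(t, [round(p, 3) for p in scaled], "argmax:", scaled.index(max(scaled)))
```

When only probabilities are available, `log(p)` can stand in for the logits, because `softmax(log(p))` returns `p` unchanged. T > 1 flattens the distribution (less confident) and T < 1 sharpens it.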

### Calibration under shift / across subgroups

```python
groups = cs.compare_subgroups({
    "in_domain":  (p_a, y_a),
    "out_domain": (p_b, y_b),
}, n_boot=2000)
for g in groups:
    print(g.name, g.ece)            # per-group ECE ± CI

# Is the ECE difference real? (Pass paired=True when both sets score the same examples.)
test = cs.ece_difference_test(p_a, y_a, p_b, y_b, n_boot=2000)
print(test)                         # Δ with CI and a bootstrap p-value
```

### Multiclass

Pass a 2-D `(n, n_classes)` probability matrix and integer labels. Metrics use
*top-label* (confidence) calibration: `confidence = max_c p`, `correct =
1[argmax == label]` — the standard multiclass ECE setting.

```python
probs, labels = cs.make_multiclass(3000, n_classes=5, temperature=2.0)
cs.calibration_report(probs, labels)
```
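
The top-label reduction described above turns a multiclass problem into a binary one over `(confidence, correct)` pairs. A minimal sketch with plain lists, where `top_label_reduction` is an illustrative helper and not a `calibstats` function:

```python
def top_label_reduction(prob_rows, labels):
    """Map each row of class probabilities to (max probability, 1 if argmax == label)."""
    confidences, correct = [], []
    for row, label in zip(prob_rows, labels):
        predicted = max(range(len(row)), key=row.__getitem__)  # argmax
        confidences.append(row[predicted])
        correct.append(1 if predicted == label else 0)
    return confidences, correct


rows = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
print(top_label_reduction(rows, [0, 1]))  # -> ([0.7, 0.6], [1, 0])
```

After this step the binned metrics apply unchanged, with `confidences` in place of the predicted probability and `correct` in place of the label.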

## What's implemented (and verified against references)

| Area | Functions | Verified against |
|---|---|---|
| Binned errors | `ece`, `ace`, `mce`, `debiased_ece`, `calibration_error` | hand-computed cases |
| Proper scores | `brier_score`, `brier_decomposition`, `nll` | scikit-learn `brier_score_loss`, `log_loss`; Murphy identity |
| Uncertainty | `bootstrap_ci`, `bootstrap_metrics`, `CIResult` | coverage + `1/√n` shrinkage |
| Diagrams | `reliability_diagram`, `reliability_data` | Wilson score interval |
| Recalibration | `TemperatureScaler`, `PlattScaler` | known-T recovery; sklearn `LogisticRegression` |
| Shift | `compare_subgroups`, `ece_difference_test` | known-gap detection / null calibration |

Full numerical detail and figures: [**STUDY.md**](STUDY.md).

## Design notes

- **Binary convention.** `confidence = probs`, `accuracy = label` — i.e. the
  reliability curve of *predicted P(y=1)* vs *observed frequency*, which is more
  informative than the `max(p, 1−p)` collapse.
- **Binning bias.** Equal-width ECE is the default for comparability with the
  literature; `ace` uses equal-mass bins; `debiased_ece` corrects the
  small-sample inflation. The right move is usually to report all three.
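- **Holdout discipline.** Recalibrators are fit/transform objects, so the split
  used for fitting and the split being evaluated are explicit in the code, which
  makes fitting and evaluating on the same data harder to do by accident.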
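
To make the binning note concrete, the sketch below contrasts the two schemes on probabilities skewed toward 1. It uses only the standard library; `equal_width_counts` and `equal_mass_bins` are illustrative helpers, the skewed distribution is made up for the example, and a production implementation would also need a policy for tied values at bin boundaries.

```python
import random


def equal_width_counts(probs, n_bins):
    """Count examples per equal-width bin on [0, 1]."""
    counts = [0] * n_bins
    for p in probs:
        counts[min(int(p * n_bins), n_bins - 1)] += 1  # p == 1.0 goes last
    return counts


def equal_mass_bins(probs, n_bins):
    """Split the sorted probabilities into n_bins groups of near-equal size."""
    ordered = sorted(probs)
    n = len(ordered)
    return [ordered[(i * n) // n_bins:((i + 1) * n) // n_bins] for i in range(n_bins)]


rng = random.Random(0)
probs = [1 - rng.random() ** 3 for _ in range(1000)]  # mass piled near 1.0

width_counts = equal_width_counts(probs, 5)
mass_bins = equal_mass_bins(probs, 5)
print("equal-width counts:", width_counts)
print("equal-mass counts: ", [len(b) for b in mass_bins])
print("equal-mass ranges: ", [(round(b[0], 3), round(b[-1], 3)) for b in mass_bins])
```

With equal-width bins most examples land in the top bin and the low bins are sparse and noisy; equal-mass bins keep the per-bin sample size constant at the cost of bins with very different widths.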

## What real model probabilities would add

This toolkit is exercised here on **synthetic** predictors — deliberately, since
that gives a *known* ground-truth calibration to validate the estimators against.
The code path for real `(probability, label)` arrays is identical. Plugging in
real LLM or classifier outputs would extend the study in ways synthetic data
cannot fully mimic:

- **Confidence mass piled near 1.0.** Real classifiers (and LLM token
  probabilities) put most of their mass in the top bin, so equal-width ECE is
  dominated by one bin while equal-mass (ACE) and the bias correction matter far
  more — exactly the regime where a bare ECE is most misleading.
- **Genuine distribution shift.** Replacing the tuned-temperature subgroups with
  real in-/out-of-domain slices (e.g. a model evaluated on a new dataset) would
  show miscalibration that *isn't* a single global temperature — where
  temperature scaling helps only partially and the subgroup tests earn their keep.
- **Class imbalance and rare events**, where the Brier *uncertainty* term and the
  reliability/resolution split become the interesting story.
- **Keyless public sources** of `(prob, label)` (released model logits, public
  probabilistic-forecast archives) would let the study use real outcomes; the API
  needs no changes to ingest them.

None of these require new metrics — they are the same estimators on heavier-tailed,
shifted data, which is precisely the setting where reporting `ECE ± CI` instead of
a bare ECE changes the conclusion.

## Development

```bash
uv sync --extra viz
uv run pytest                 # 33 tests
uv run ruff check . && uv run ruff format --check .
uv run mypy src/calibstats
uv run python study/run_study.py
```

## License

MIT — see [LICENSE](LICENSE).

## References

- Naeini, Cooper & Hauskrecht (2015), *Obtaining Well Calibrated Probabilities Using Bayesian Binning* (ECE).
- Guo, Pleiss, Sun & Weinberger (2017), *On Calibration of Modern Neural Networks* (temperature scaling, reliability diagrams).
- Nixon et al. (2019), *Measuring Calibration in Deep Learning* (adaptive ECE).
- Kumar, Liang & Ma (2019), *Verified Uncertainty Calibration* (debiased estimators).
- Murphy (1973), *A New Vector Partition of the Probability Score* (Brier decomposition).
- Platt (1999), *Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods* (Platt scaling).

---

*`calibstats` is part of a statistical-rigor-for-AI-evals toolkit: [deltagate](https://github.com/yongzhe2160cs/eval-reliability) (paired-delta validation for eval comparisons), [agentrel](https://github.com/yongzhe2160cs/agent-eval-reliability) (reliability stats for stochastic agent evals), [leaderboard-ci](https://github.com/yongzhe2160cs/leaderboard-reliability) (leaderboard re-ranking with CIs and tie bands). Full portfolio: [github.com/yongzhe2160cs](https://github.com/yongzhe2160cs).*
