Metadata-Version: 2.4
Name: fmm-fairness-eval
Version: 0.1.0
Summary: SaMD-specific fairness evaluation CLI for foundation-model medical AI; emits AI Act Art. 10 / Art. 9 evidence artifacts
Author: César Pereiro
License: MIT
Project-URL: Homepage, https://github.com/Ces107/fmm-fairness-eval-cli
Project-URL: Repository, https://github.com/Ces107/fmm-fairness-eval-cli
Project-URL: Issues, https://github.com/Ces107/fmm-fairness-eval-cli/issues
Project-URL: Changelog, https://github.com/Ces107/fmm-fairness-eval-cli/blob/main/CHANGELOG.md
Keywords: fairness,samd,medical-ai,foundation-models,eu-ai-act,histopathology,dermatology,conch,bias,equal-opportunity,inter-hospital,multi-site
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Healthcare Industry
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Medical Science Apps.
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.24
Requires-Dist: pandas>=2.0
Requires-Dist: scikit-learn>=1.3
Provides-Extra: dev
Requires-Dist: pytest>=8.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0; extra == "dev"
Requires-Dist: ruff>=0.4; extra == "dev"
Requires-Dist: mypy>=1.8; extra == "dev"
Dynamic: license-file

# fmm-fairness-eval

> SaMD-specific fairness evaluation CLI for foundation-model medical AI. Emits AI Act Art. 10 / Art. 9 evidence artifacts.

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)
[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)

`fmm-fairness-eval` (`fmm-fairness` on the command line) is a small, focused CLI that takes a predictions CSV from any SaMD (Software as a Medical Device) or SaMD-adjacent medical-AI model and produces a regulator-friendly fairness evidence pack: a Markdown report, a machine-readable JSON pack, and a SHA-256 audit chain. It is built around the failure mode regulators actually care about, **inter-hospital / inter-site bias**, and packaged so that the output drops straight into an EU AI Act Art. 10 / Art. 9 dossier.

---

## Why this exists

Modern medical-AI systems are increasingly built on **foundation-model embeddings** (CONCH for histopathology, DINOv2 for general radiology, RadFM-style models for multi-modal radiology) plus a small downstream classifier. The dominant failure mode is no longer "the model is biased against women" or "the model misses dark skin tones" in isolation — it is **inter-hospital generalization collapse**: the model that scores F1=0.89 on one cohort drops to F1=0.70 on the cohort across the river, and the gap is largest in subgroups the training data under-represented.

The author's TFG (Trabajo Fin de Grado, the Spanish bachelor's thesis; Universitat Politècnica de València, 2024) measured exactly this on dermatology AI using CONCH embeddings and multiple-instance learning over the AI4SkIN cohort: weighted F1 = 0.89, but an inter-hospital fairness gap of 0.19 between sites. That is not a corner case: it is the most common failure mode for any SaMD that crosses a hospital network boundary.

Existing fairness libraries (FairLearn, AIF360, Holistic AI, Microsoft Responsible AI Toolbox) are general-purpose ML fairness tools. **None of them ship a SaMD-specific evaluation pipeline** that:

- Treats `site` / `hospital` as a first-class protected attribute distinct from individual demographics.
- Emits AI Act Art. 10 / Art. 9 cross-cited evidence by default.
- Defines a composite SaMD fairness score whose weighting reflects how regulators actually prioritize bias categories.
- Ships a SHA-256 audit chain so the evidence pack is tamper-evident the moment it leaves your pipeline.

This tool fills that gap. Nothing more, nothing less.

Citation for the underlying TFG work: César Pereiro, _Foundation-model-based fairness evaluation in dermatology classification using the AI4SkIN dataset_, Universitat Politècnica de València, 2024. https://riunet.upv.es/handle/10251/226903

---

## Install

```bash
pip install fmm-fairness-eval
```

Or from source:

```bash
git clone https://github.com/<handle>/fmm-fairness-eval
cd fmm-fairness-eval
pip install -e .
```

Requires Python 3.10+, numpy ≥ 1.24, pandas ≥ 2.0, scikit-learn ≥ 1.3. No GPU dependency.

---

## What it does

### 1. Run an evaluation

```bash
fmm-fairness evaluate predictions.csv \
    --protected-attrs site,sex,age_bucket \
    --site-attribute site \
    --output fairness-report/
```

`predictions.csv` must contain these columns:

| column     | type                              | meaning                                            |
|------------|-----------------------------------|----------------------------------------------------|
| `y_true`   | int ∈ {0, 1}                      | Ground-truth label                                  |
| `y_pred`   | int ∈ {0, 1}                      | Thresholded prediction                              |
| `y_score`  | float ∈ [0, 1]                    | Raw model probability / score                       |
| (declared) | str                               | One column per `--protected-attrs` value            |
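
Before running the CLI it can help to check a CSV against the table above. The following is a minimal sketch using pandas; `validate_predictions` is a hypothetical helper written for this README, not part of the `fmm-fairness-eval` API, and the tool's own validation may be stricter or worded differently.

```python
import pandas as pd

REQUIRED = ("y_true", "y_pred", "y_score")


def validate_predictions(df: pd.DataFrame, protected_attrs: list[str]) -> list[str]:
    """Return a list of schema problems; an empty list means the frame looks usable."""
    missing = [c for c in (*REQUIRED, *protected_attrs) if c not in df.columns]
    if missing:
        # Without the columns the remaining checks cannot run.
        return [f"missing columns: {missing}"]

    problems = []
    for col in ("y_true", "y_pred"):
        if not df[col].isin([0, 1]).all():
            problems.append(f"{col} must contain only 0 or 1")
    # between() is inclusive on both ends; NaN scores fail the check.
    if not df["y_score"].between(0.0, 1.0).all():
        problems.append("y_score must lie in [0, 1]")
    for col in protected_attrs:
        if df[col].isna().any():
            problems.append(f"{col} has missing values")
    return problems
```

Typical use would be `validate_predictions(pd.read_csv("predictions.csv"), ["site", "sex", "age_bucket"])`, fixing anything it reports before calling `fmm-fairness evaluate`.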

### 2. Read the output

The CLI produces three files in `fairness-report/`:

- `fairness-report.md` — human-readable, regulator-friendly summary.
- `fairness-evidence.json` — machine-readable evidence pack (stable schema, sorted keys for deterministic SHA).
- `audit.sha256` — SHA-256 of the above two files; pin in your QMS / change-control record.
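
To re-check an evidence pack later, the digests can be recomputed with the standard library. This is a sketch under assumptions: the exact line layout of `audit.sha256` is not documented in this README, so `digest_report` (a hypothetical helper) only returns the hex digests for you to compare against that file. `canonical_json` illustrates why sorted keys make the JSON pack's hash reproducible; the indent and encoding settings are illustrative choices, not the tool's confirmed serialization.

```python
import hashlib
import json
from pathlib import Path


def canonical_json(payload: dict) -> str:
    """Serialize with sorted keys so equal payloads give byte-identical text."""
    return json.dumps(payload, sort_keys=True, indent=2, ensure_ascii=False) + "\n"


def sha256_file(path: Path) -> str:
    """Stream the file in chunks so large reports do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def digest_report(
    report_dir: Path,
    names: tuple[str, ...] = ("fairness-report.md", "fairness-evidence.json"),
) -> dict[str, str]:
    """Recompute SHA-256 for the report files (file names taken from the list above)."""
    return {name: sha256_file(report_dir / name) for name in names}
```

If any recomputed digest differs from the value pinned in your change-control record, the pack was modified after it was generated.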

### 3. Cross-cite to the AI Act

```bash
fmm-fairness evaluate predictions.csv \
    --protected-attrs site,sex,age_bucket \
    --manifest-mode ai-act \
    --output fairness-report/
```

In `ai-act` mode the JSON pack gains a `regulatory_mapping` block that cross-cites each metric to the EU AI Act article it evidences:

- **Art. 9 (Risk management system)** ↔ `samd_fairness_score`, `inter_site_auc_variance`.
- **Art. 10 (Data and data governance)** ↔ `equal_opportunity_gap`, `demographic_parity_gap`, `calibration_gap` (evidences Art. 10(2)(f-g) examination of biases and shortcomings).
- **Art. 15 (Accuracy, robustness)** ↔ `inter_site_auc_variance` (evidences generalization claims).
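
When wiring the pack into a dossier template it is often handier to look up articles by metric than metrics by article. The sketch below transcribes the list above into a plain dictionary and inverts it. The dictionary layout and the names `ARTICLE_TO_METRICS` and `metrics_to_articles` are assumptions made for illustration; the real `regulatory_mapping` schema in the JSON pack is not shown in this README and may differ.

```python
from collections import defaultdict

# Article -> metric names, transcribed from the list above (illustrative layout).
ARTICLE_TO_METRICS = {
    "Art. 9": ["samd_fairness_score", "inter_site_auc_variance"],
    "Art. 10": ["equal_opportunity_gap", "demographic_parity_gap", "calibration_gap"],
    "Art. 15": ["inter_site_auc_variance"],
}


def metrics_to_articles(mapping: dict[str, list[str]]) -> dict[str, list[str]]:
    """Invert the mapping so each metric lists every article it evidences."""
    inverted: dict[str, list[str]] = defaultdict(list)
    for article, metrics in mapping.items():
        for metric in metrics:
            inverted[metric].append(article)
    return dict(inverted)
```

Note that `inter_site_auc_variance` appears under two articles, so a one-to-one lookup would lose information.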

---

## Metrics computed

| Metric                       | Formula (short)                              | When it matters                                       |
|------------------------------|----------------------------------------------|-------------------------------------------------------|
| `equal_opportunity_gap`      | max-min TPR across groups (Hardt et al. 2016) | Under-diagnosis disparity (Seyyed-Kalantari et al. 2021) |
| `demographic_parity_gap`     | max-min P(ŷ=1) across groups                  | Selection-rate disparity                              |
| `calibration_gap`            | max-min ECE across groups                     | Score-trust differs by subgroup                       |
| `inter_site_auc_variance`    | Var(AUC) across sites                         | Inter-hospital generalization risk (the SaMD failure mode) |
| `samd_fairness_score`        | composite ∈ [0,1] (see `docs/samd-fairness-score.md`) | Single-number summary for QMS dashboards    |
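
The short formulas in the table can be written out directly. This is a minimal sketch with pandas, NumPy, and scikit-learn, not the package's implementation: the function names are hypothetical, the ECE uses 10 equal-width bins, and the variance is the population variance (`ddof=0`); all three are assumptions the tool may not share.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score


def equal_opportunity_gap(df: pd.DataFrame, attr: str) -> float:
    """max - min of per-group TPR, i.e. P(y_pred=1 | y_true=1)."""
    tpr = df[df["y_true"] == 1].groupby(attr)["y_pred"].mean()
    return float(tpr.max() - tpr.min())


def demographic_parity_gap(df: pd.DataFrame, attr: str) -> float:
    """max - min of per-group selection rate P(y_pred=1)."""
    rate = df.groupby(attr)["y_pred"].mean()
    return float(rate.max() - rate.min())


def expected_calibration_error(y_true, y_score, n_bins: int = 10) -> float:
    """Equal-width-bin ECE: bin-weighted mean of |observed rate - mean score|."""
    y_true = np.asarray(y_true, dtype=float)
    y_score = np.asarray(y_score, dtype=float)
    # Scores of exactly 1.0 are folded into the last bin.
    bins = np.minimum((y_score * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(y_true[mask].mean() - y_score[mask].mean())
    return float(ece)


def calibration_gap(df: pd.DataFrame, attr: str, n_bins: int = 10) -> float:
    """max - min of per-group ECE."""
    eces = [
        expected_calibration_error(g["y_true"], g["y_score"], n_bins)
        for _, g in df.groupby(attr)
    ]
    return float(max(eces) - min(eces))


def inter_site_auc_variance(df: pd.DataFrame, site_attr: str) -> float:
    """Variance of per-site ROC AUC; roc_auc_score raises if a site has one class only."""
    aucs = [roc_auc_score(g["y_true"], g["y_score"]) for _, g in df.groupby(site_attr)]
    return float(np.var(aucs))
```

Each function takes the predictions frame and one attribute name, so `equal_opportunity_gap(df, "sex")` and `inter_site_auc_variance(df, "site")` read the same columns the CLI expects.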

All gap metrics ship with **percentile bootstrap 95% CIs** computed over a stratified resample.
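
For readers who want to see what a stratified percentile bootstrap looks like, here is a sketch. It assumes strata are defined by protected-attribute value and label together, so every resample keeps each group's positive and negative counts fixed; the README does not say how the tool stratifies, and `stratified_bootstrap_ci` and `tpr_gap` are hypothetical names.

```python
import numpy as np
import pandas as pd


def tpr_gap(df: pd.DataFrame, attr: str) -> float:
    """max - min of per-group TPR (the equal-opportunity gap)."""
    tpr = df[df["y_true"] == 1].groupby(attr)["y_pred"].mean()
    return float(tpr.max() - tpr.min())


def stratified_bootstrap_ci(df, attr, metric, n_boot=1000, alpha=0.05, seed=0):
    """Percentile CI for metric(df, attr), resampling within (group, label) strata."""
    rng = np.random.default_rng(seed)
    strata = [g for _, g in df.groupby([attr, "y_true"])]
    stats = []
    for _ in range(n_boot):
        # Draw len(g) rows with replacement from each stratum, then recombine.
        resampled = pd.concat(
            [g.iloc[rng.integers(0, len(g), size=len(g))] for g in strata]
        )
        stats.append(metric(resampled, attr))
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(lo), float(hi)
```

Calling `stratified_bootstrap_ci(df, "site", tpr_gap)` returns the lower and upper bounds of a 95% interval. Because a max-min gap is non-negative by construction, the interval will not straddle zero the way a signed difference would.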

The composite `samd_fairness_score` is defined explicitly with documented weights and a sensitivity analysis in [`docs/samd-fairness-score.md`](docs/samd-fairness-score.md). It is **not** a black box and is **not** an FDA-blessed metric — it is a transparent aggregate the operator can defend, override, or replace.

---

## Scientific context

- **CONCH** (Lu et al. 2024) is the visual-language pathology foundation model used in the underlying TFG work. Lu, M. Y. et al. "A visual-language foundation model for computational pathology." *Nature Medicine* 30, 863–874 (2024). [doi:10.1038/s41591-024-02856-4](https://doi.org/10.1038/s41591-024-02856-4)
- **AI4SkIN** is the multi-hospital dermatopathology dataset (Spain, multi-site) on which the TFG measured the 0.19 inter-hospital gap.
- **Under-diagnosis bias on chest X-rays** (Seyyed-Kalantari et al. 2021, *Nat. Med.* 27, 2176-2182) is the canonical demonstration that single-site fairness audits miss the dominant failure mode.
- **Pain disparity reduction** (Pierson et al. 2021, *Nat. Med.* 27, 136-140) demonstrates the inverse — that algorithmic predictions can outperform human-graded severity in capturing real disparities, motivating better measurement, not less.
- **Ethical implementation** (Char, Shah, Magnus 2018, *NEJM* 378, 981-983) sets the still-canonical framing for healthcare-ML ethics.

A one-paragraph literature pointer for the formal definitions: the equal-opportunity criterion is Hardt, Price, Srebro (NeurIPS 2016); the demographic-parity definition follows Dwork et al. (ITCS 2012); calibration-by-group follows Pleiss et al. (NeurIPS 2017). The composite weighting is justified from FDA GMLP guidance (2021, updated 2024 IMDRF GMLP) + EU AI Act Art. 10 prioritisation of multi-site data governance.

---

## What it does NOT do

- **Not** a model-training framework. Bring your own predictions.
- **Not** a foundation-model serving stack. Embeddings are outside scope.
- **Not** auto-detection of protected attributes. You must declare them — silent attribute inference is itself a bias risk.
- **Not** a certification. A fairness evaluation is evidence; certification is a regulatory process this tool helps you prepare for.
- **Not** an explainability tool. It surfaces *where* bias lives, not *why*.

---

## Honest scientific caveats (read before quoting numbers)

1. **Threshold sensitivity.** `equal_opportunity_gap` and `demographic_parity_gap` both depend on the operating threshold used to produce `y_pred`. Re-run the evaluation at any threshold you would actually deploy at.
2. **Small-sample bootstrap.** Percentile bootstrap is approximate for small groups; for n < 50 prefer the BCa interval or treat CIs as exploratory. Groups with n < `min_group_n` (default 20) are excluded with a warning rather than silently producing a near-zero gap.
3. **Prevalence confound.** Inter-site bias is frequently confounded with prevalence shift. A site that sees twice the disease prevalence will show a different selection rate, and often a different case mix and therefore a different TPR, even under a perfectly fair model. The tool reports both per-site AUC (less prevalence-sensitive) and per-site rates; interpret jointly.
4. **Composite score is opinionated.** The default weights (`w_site=0.4, w_eo=0.3, w_dp=0.15, w_cal=0.15`) reflect this author's read of regulatory priority. Override with `weights=` in the Python API or treat the components separately. The single number is for dashboards; the components are for decisions.
5. **No causal inference.** A measured gap does not identify the mechanism. Combine with subgroup analysis, training-set provenance audit, and (where possible) prospective evaluation.
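
Caveats 1 and 2 can be explored directly from `y_score` before committing to a deployment threshold. The sketch below recomputes the equal-opportunity gap at several thresholds and drops groups smaller than `min_group_n`. It assumes the size cut-off applies to a group's total row count, which this README does not specify, and `eo_gap_by_threshold` is a hypothetical helper rather than part of the package.

```python
import pandas as pd


def eo_gap_by_threshold(df: pd.DataFrame, attr: str, thresholds, min_group_n: int = 20) -> pd.DataFrame:
    """Equal-opportunity gap at each threshold, ignoring groups below min_group_n rows."""
    sizes = df[attr].value_counts()
    keep = sizes[sizes >= min_group_n].index
    # TPR only involves ground-truth positives, so filter once up front.
    positives = df[df[attr].isin(keep) & (df["y_true"] == 1)]
    rows = []
    for t in thresholds:
        tpr = (positives["y_score"] >= t).groupby(positives[attr]).mean()
        rows.append(
            {"threshold": float(t), "equal_opportunity_gap": float(tpr.max() - tpr.min())}
        )
    return pd.DataFrame(rows)
```

A gap that appears at one threshold and vanishes at another is a sign that the disparity comes from score distributions shifting between groups, which is worth reporting alongside the headline number.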

---

## Pricing

- **CLI**: MIT, free, forever.
- **Hosted "fairness CI" (Phase 2 — not yet shipped)**: planned at €99/month for teams that want every commit to a model repo to fire an evaluation against a frozen multi-site cohort and post the evidence pack as a CI artifact. Mailing list opens at validation green-light.
- **Consulting**: the author is available for SaMD fairness review / AI Act Art. 10 evidence-pack design at €60-100/hour. Contact via the linked GitHub profile; introductions through the academic-DM channel are welcome.

---

## Citing

If you use this tool in published research:

```bibtex
@software{pereiro2026fmmfairness,
  author       = {Pereiro, C{\'e}sar},
  title        = {{fmm-fairness-eval}: SaMD-specific fairness evaluation for foundation-model medical AI},
  year         = {2026},
  publisher    = {Zenodo},
  doi          = {<assigned-on-first-release>},
  url          = {https://github.com/<handle>/fmm-fairness-eval}
}

@thesis{pereiro2024dermfairness,
  author = {Pereiro, C{\'e}sar},
  title  = {Foundation-model-based fairness evaluation in dermatology classification using the AI4SkIN dataset},
  school = {Universitat Polit{\`e}cnica de Val{\`e}ncia},
  year   = {2024},
  url    = {https://riunet.upv.es/handle/10251/226903}
}
```

---

## Roadmap

- v0.1 (this release): CLI, 4 fairness gap metrics, composite score, AI Act manifest mode, SHA-256 audit chain.
- v0.2: BCa bootstrap, subgroup intersectionality (`site × sex`), CSV-of-CSVs batch mode.
- v0.3: HTML report option, hosted fairness-CI (Phase 2 — gated on validation pass).
- v0.4: subgroup-aware threshold optimisation (opt-in, with the appropriate caveats).

---

## License

MIT. See [LICENSE](LICENSE).
