Metadata-Version: 2.4
Name: ehds-anon-kit
Version: 0.1.0
Summary: EHDS-Article-cited anonymization toolkit for secondary-use health data (FHIR + tabular)
Project-URL: Homepage, https://github.com/plusultra/ehds-anon-kit
Project-URL: Repository, https://github.com/plusultra/ehds-anon-kit
Project-URL: Issues, https://github.com/plusultra/ehds-anon-kit/issues
Author-email: plusUltra <ops@plusultra.dev>
License: MIT
License-File: LICENSE
Keywords: anonymization,ehds,eu-regulation,fhir,health-data,pseudonymization
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Healthcare Industry
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Medical Science Apps.
Requires-Python: >=3.10
Requires-Dist: pydantic>=2.0
Requires-Dist: pyyaml>=6.0
Provides-Extra: dev
Requires-Dist: mypy>=1.9; extra == 'dev'
Requires-Dist: pandas>=2.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.0; extra == 'dev'
Requires-Dist: pytest>=7.0; extra == 'dev'
Requires-Dist: ruff>=0.4; extra == 'dev'
Provides-Extra: tabular
Requires-Dist: pandas>=2.0; extra == 'tabular'
Description-Content-Type: text/markdown

# ehds-anon-kit

**EHDS-Article-cited anonymization for secondary-use health data.**

A Python CLI that de-identifies FHIR R4 bundles and tabular EHR data for
[Regulation (EU) 2025/327 (EHDS)](https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:32025R0327)
Chapter IV secondary-use data permits — and emits a manifest that cites the exact Article
and Recital mandating each transformation.

[![CI](https://github.com/plusultra/ehds-anon-kit/actions/workflows/test.yml/badge.svg)](https://github.com/plusultra/ehds-anon-kit/actions)
[![PyPI](https://img.shields.io/pypi/v/ehds-anon-kit)](https://pypi.org/project/ehds-anon-kit/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)

---

## What and why

Chapter IV (Art. 64-72) of Regulation (EU) 2025/327, published in the Official Journal
on 5 March 2025, establishes the EHDS secondary-use framework. Commission implementing
acts specifying technical anonymization standards for HealthData@EU are expected in 2026.
Health data access bodies (HDABs) are already processing permit applications under the
existing Article text.

Existing open-source tools (`synthea`, `ARX`, academic libraries) do not:
- Emit a per-transformation regulatory citation tied to EHDS Art. 64-72
- Implement the Art. 72 pseudonymisation key custody chain of evidence
- Target the HealthData@EU secondary-use submission workflow

`ehds-anon-kit` fills that gap. Every transformation is traceable to its legal basis.

---

## Install

```bash
pip install ehds-anon-kit
```

With tabular (CSV) support:

```bash
pip install "ehds-anon-kit[tabular]"
```

---

## Quickstart

```bash
ehds-anon \
  --fhir-bundle data/bundle.json \
  --profile ehds-secondary-default \
  --key-custody key-custody.yaml \
  --out output/
```

With tabular data:

```bash
ehds-anon \
  --fhir-bundle data/bundle.json \
  --tabular data/patients.csv \
  --profile ehds-secondary-default \
  --key-custody key-custody.yaml \
  --out output/
```

`key-custody.yaml` (choose one key source):

```yaml
# Option 1: environment variable (recommended)
env_var: EHDS_PSEUDO_KEY

# Option 2: HashiCorp Vault
# vault_path: vault://ehds-keys/patient-key

# Option 3: inline key (triggers Art. 72 warning; disclose to HDAB)
# inline_key: "your-secret-key"
```
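
To illustrate how a parsed `key-custody.yaml` could be turned into key material, here is a minimal sketch. The field names (`env_var`, `vault_path`, `inline_key`) come from the example above; the `resolve_key` function, its return shape, and its error handling are assumptions for illustration and are not the package's actual API.

```python
import os
import warnings


def resolve_key(custody: dict, environ=os.environ) -> tuple[bytes, str]:
    """Return (key_bytes, source_label) for a parsed key-custody mapping.

    Hypothetical helper: only the field names are taken from the README.
    """
    candidates = ("env_var", "vault_path", "inline_key")
    sources = [name for name in candidates if custody.get(name)]
    if len(sources) != 1:
        raise ValueError(f"expected exactly one key source, found {sources}")
    source = sources[0]

    if source == "env_var":
        var = custody["env_var"]
        value = environ.get(var)
        if not value:
            raise KeyError(f"environment variable {var} is not set")
        return value.encode("utf-8"), f"env:{var}"

    if source == "inline_key":
        # Inline keys live in the config file, so flag them loudly.
        warnings.warn("inline key in use: disclose to the HDAB (Art. 72)", stacklevel=2)
        return custody["inline_key"].encode("utf-8"), "inline"

    # A Vault lookup needs a real client; this sketch refuses rather than guessing.
    raise NotImplementedError("vault_path is not handled in this sketch")
```

Requiring exactly one source keeps the custody record unambiguous: a config that sets both `env_var` and `inline_key` is rejected instead of silently preferring one.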

---

## Outputs

| File | Description |
|------|-------------|
| `bundle_anon.json` | Anonymized FHIR R4 bundle |
| `tabular_anon.csv` | k-anonymized EHR table (if `--tabular` given) |
| `ehds_evidence.json` | Machine-readable EHDS Art. 64-72 evidence manifest |
| `ehds_evidence.md` | Human-readable manifest for DPO / HDAB submission |
| `audit.sha256` | Tamper-evident hash chain over all inputs + outputs |
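
The exact layout of `audit.sha256` is not specified here. As a rough sketch of what a tamper-evident hash chain over a list of files can look like, the following assumes one line per file, where each digest also covers the previous link. The function name and line format are illustrative assumptions.

```python
import hashlib
from pathlib import Path


def hash_chain(paths: list[Path]) -> list[str]:
    """Return one 'digest  name' line per file; each digest covers the previous link."""
    previous = b""  # the first link has no predecessor
    lines = []
    for path in paths:
        file_digest = hashlib.sha256(path.read_bytes()).digest()
        link = hashlib.sha256(previous + file_digest).hexdigest()
        lines.append(f"{link}  {path.name}")
        previous = bytes.fromhex(link)
    return lines
```

Because every link feeds into the next, changing any input or output file alters its own digest and every digest after it, so a verifier only needs to recompute the chain in the same order.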

---

## Anonymization profiles

| Profile | k-anonymity | Date-shift | Postal code | Target use |
|---------|-------------|-----------|-------------|-----------|
| `ehds-secondary-default` | k=5 | ±90 days | 3 chars (NUTS-3) | Most EHDS Chapter IV permits |
| `ehds-research-strict` | k=10 | ±180 days | 2 chars | High-sensitivity / HealthData@EU cross-border |
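
For readers unfamiliar with k-anonymity, the check behind the `k` column can be sketched in a few lines. The parameter values below are copied from the table; the `Profile` class, the `PROFILES` mapping, and both functions are hypothetical names used only for this illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    k: int                # minimum equivalence-class size
    date_shift_days: int  # maximum absolute date shift, in days
    postal_chars: int     # leading postal-code characters retained


# Values copied from the profile table; the structure itself is an assumption.
PROFILES = {
    "ehds-secondary-default": Profile(k=5, date_shift_days=90, postal_chars=3),
    "ehds-research-strict": Profile(k=10, date_shift_days=180, postal_chars=2),
}


def smallest_class(rows: list[dict], quasi_identifiers: list[str]) -> int:
    """Size of the smallest group of rows sharing all quasi-identifier values."""
    counts = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(counts.values()) if counts else 0


def satisfies_k(rows: list[dict], quasi_identifiers: list[str], profile: Profile) -> bool:
    """True when every equivalence class has at least profile.k members."""
    return smallest_class(rows, quasi_identifiers) >= profile.k
```

A table is k-anonymous with respect to a set of quasi-identifiers when every combination of their values occurs in at least k rows, which is why the check reduces to the size of the smallest group.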

---

## FHIR transformations (with citations)

| Resource | Field | Action | Citation |
|----------|-------|--------|----------|
| Patient | `identifier` | Replace with pseudonym | Art. 72; Rec. 66 |
| Patient | `name` | Remove | Art. 65; Rec. 65 |
| Patient | `birthDate` | Truncate to year | Art. 65; Rec. 71 |
| Patient | `address` | Generalise to postal prefix (3 chars default, 2 strict) | Art. 65 |
| Observation | `effectiveDateTime` | Date-shift (±90 d default, ±180 d strict) | Rec. 71 |
| Encounter | `period` | Date-shift (±90 d default, ±180 d strict) | Rec. 71 |
| Encounter | `participant.individual` | Pseudonymise practitioner | Art. 65 |

See `docs/ehds-citation-map.md` for the full transformation-to-Article mapping.
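
The Patient rows and the date-shift rows in the table can be sketched as follows. This is an illustration under stated assumptions, not the package's implementation: pseudonyms and per-patient offsets are derived with HMAC-SHA256 (one plausible keyed construction), all function names are hypothetical, and `shift_date` handles plain `YYYY-MM-DD` dates only, not full FHIR `dateTime` values.

```python
import copy
import hashlib
import hmac
from datetime import date, timedelta


def pseudonym(key: bytes, value: str) -> str:
    """Keyed, deterministic pseudonym (HMAC-SHA256, truncated to 32 hex chars)."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:32]


def shift_days(key: bytes, patient_id: str, max_days: int = 90) -> int:
    """Per-patient offset in [-max_days, +max_days], derived from the key."""
    msg = b"shift:" + patient_id.encode("utf-8")
    digest = hmac.new(key, msg, hashlib.sha256).digest()
    return int.from_bytes(digest[:4], "big") % (2 * max_days + 1) - max_days


def shift_date(iso_date: str, offset: int) -> str:
    """Shift a YYYY-MM-DD date by a number of days."""
    return (date.fromisoformat(iso_date) + timedelta(days=offset)).isoformat()


def deidentify_patient(patient: dict, key: bytes, postal_chars: int = 3) -> dict:
    """Apply the Patient rows of the table to a copy of the resource."""
    out = copy.deepcopy(patient)
    out.pop("name", None)                       # name: remove
    if "birthDate" in out:
        out["birthDate"] = out["birthDate"][:4]  # birthDate: truncate to year
    for ident in out.get("identifier", []):
        if "value" in ident:                     # identifier: replace with pseudonym
            ident["value"] = pseudonym(key, ident["value"])
    if "address" in out:                         # address: keep postal prefix only
        out["address"] = [
            {"postalCode": addr["postalCode"][:postal_chars]}
            for addr in out["address"]
            if addr.get("postalCode")
        ]
    return out
```

Deriving the offset from the patient identifier keeps intervals between one patient's events intact, since all of that patient's dates move by the same number of days.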

---

## Art. 72 key custody

| Key source | Art. 72 disclosure required |
|-----------|---------------------------|
| `hsm://...` | No — hardware isolation |
| `vault://...` | No — isolated vault |
| `env:VAR` | No — operator-managed |
| inline | **YES** — must disclose to HDAB |

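The key source and custody chain are recorded in `ehds_evidence.json`. Note that in
v0.1.0 the `hsm://` and `vault://` sources are stubs that fall back to a placeholder key
(see Known gaps).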
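
The table above amounts to a small lookup from key-source scheme to disclosure flag. A minimal sketch of that mapping is shown below. The function name and the record's field names are assumptions; the real layout of `ehds_evidence.json` may differ.

```python
def custody_record(key_source: str) -> dict:
    """Classify a key-source string per the table above (hypothetical helper).

    Only the source kind is recorded, never the key material itself.
    """
    if key_source.startswith("hsm://"):
        kind, disclose = "hsm", False
    elif key_source.startswith("vault://"):
        kind, disclose = "vault", False
    elif key_source.startswith("env:"):
        kind, disclose = "env", False
    elif key_source == "inline":
        kind, disclose = "inline", True
    else:
        raise ValueError(f"unrecognised key source: {key_source!r}")
    return {"key_source_kind": kind, "art72_disclosure_required": disclose}
```

Rejecting unrecognised sources, rather than defaulting to "no disclosure", keeps a typo in the configuration from producing a manifest that understates the disclosure obligation.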

---

## Known gaps

The tool is an MVP targeting the most common EHDS secondary-use case. Its known
limitations are:

1. **Parquet not implemented**: tabular anonymization reads/writes CSV only. Parquet
   support requires `pyarrow` or `fastparquet` and is planned for v0.2.
2. **HSM/Vault stubs only**: `hsm://` and `vault://` key sources emit a warning and
   fall back to a placeholder key. Full PKCS#11 and Vault integration is planned for v0.2.
3. **FHIR resource coverage**: only Patient, Observation, and Encounter are de-identified.
   Other resource types (Condition, MedicationRequest, DiagnosticReport, etc.) are
   passed through unchanged.
4. **No differential privacy**: the tool does not implement DP-style noise injection.
5. **No t-closeness**: only k-anonymity and l-diversity are reported for tabular data.
6. **Commission implementing acts pending**: the Art. 65-72 implementing acts specifying
   exact technical standards are expected H2 2026. All citations in `data/ehds_text.yaml`
   are marked `excerpt_type: paraphrase`; the tool will be updated when implementing
   acts are published in the OJ.
7. **Not a legal determination**: this tool produces an engineering evidence artifact.
   It does not constitute a formal GDPR anonymization determination. Review by a DPO
   or legal counsel is required before HDAB submission.

---

## Citations

Regulation (EU) 2025/327 of the European Parliament and of the Council of
11 February 2025 on the European Health Data Space. Official Journal of the
European Union, L 2025/327, 5 March 2025.
https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:32025R0327

---

## License

MIT. See [LICENSE](LICENSE).

---

## Contributing

Issues and PRs welcome. Before contributing, please:
1. Run `ruff check src/ tests/` and `mypy --strict src/`
2. Ensure `pytest` passes with no failures
3. Reference the relevant EHDS Article in any citation-related change
