Metadata-Version: 2.4
Name: ehrextract
Version: 0.2.0
Summary: Structured feature extraction from clinical notes
Author: Chen Zhang, Yibing Xia, Sanjay Mahant, Nathan Taback
License-Expression: Apache-2.0
Project-URL: Homepage, https://github.com/shifosss/ehrextract
Project-URL: Repository, https://github.com/shifosss/ehrextract
Project-URL: Issues, https://github.com/shifosss/ehrextract/issues
Project-URL: Changelog, https://github.com/shifosss/ehrextract/blob/main/CHANGELOG.md
Keywords: clinical,nlp,extraction,llm,structured-output
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Healthcare Industry
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
License-File: NOTICE
Requires-Dist: pandas>=2.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: openpyxl>=3.1
Provides-Extra: hf
Requires-Dist: torch>=2.0; extra == "hf"
Requires-Dist: transformers>=4.56; extra == "hf"
Requires-Dist: peft>=0.10; extra == "hf"
Requires-Dist: accelerate>=0.30; extra == "hf"
Provides-Extra: openai
Requires-Dist: openai>=1.0; extra == "openai"
Provides-Extra: anthropic
Requires-Dist: anthropic>=0.30; extra == "anthropic"
Provides-Extra: dev
Requires-Dist: pytest>=7.4; extra == "dev"
Requires-Dist: pytest-cov>=4.1; extra == "dev"
Dynamic: license-file

# ehrextract

Structured feature extraction from clinical notes. Three steps:

1. **Bring your notes** — CSV, JSONL, JSON, XLSX, plain text, or a pandas
   DataFrame.
2. **Pick a task** — a built-in task (`comorbidity`, `clinical_vars`, `full`)
   or your own YAML file with your own fields and prompt.
3. **Pick a model** — a fine-tuned LoRA adapter on a local base model, your
   own local HuggingFace weights, or an API model (OpenAI-compatible or
   Anthropic).

One command (or one function call) later you have a results table —
CSV, JSONL, JSON, XLSX, or Parquet — with one column per extracted field.

> **Important — read before use.**
> ehrextract is **research-grade software**. It is **NOT a medical device**,
> is **NOT FDA-cleared or Health Canada-approved**, and **MUST NOT** be used
> for clinical decision-making, patient triage, eligibility determination,
> re-identification, surveillance, or any setting where its outputs affect
> a person's access to care, insurance, employment, or legal status.
> Outputs may contain hallucinated values; any research use requires per-row human review.
> The egress-warning system is informational, not a privacy compliance
> control. **Users are solely responsible for HIPAA / PHIPA / PIPEDA / GDPR
> / REB compliance.** See [`NOTICE`](https://github.com/shifosss/ehrextract/blob/main/NOTICE) for the
> full acceptable-use scope.

## Install

Until the PyPI release, install from source (**current method**):

```bash
git clone https://github.com/shifosss/ehrextract
pip install './ehrextract[hf]'          # or [openai], [anthropic]
```

Once published to PyPI:

```bash
pip install ehrextract                  # core (~50 MB)
pip install 'ehrextract[hf]'            # + torch + transformers + peft (~3 GB)
pip install 'ehrextract[openai]'        # + openai SDK
pip install 'ehrextract[anthropic]'     # + anthropic SDK
```
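
To confirm that an extra's dependencies are present before a long run, a small check along these lines may help. It is a sketch, not part of ehrextract: the `EXTRAS` mapping and the `missing_modules` helper are hypothetical names, and the sketch assumes each distribution listed in the extras above is importable under the same name.

```python
from importlib.util import find_spec

# Import names per extra, copied from the package's optional dependencies.
# Assumption: each distribution's import name matches its distribution name.
EXTRAS = {
    "hf": ["torch", "transformers", "peft", "accelerate"],
    "openai": ["openai"],
    "anthropic": ["anthropic"],
}


def missing_modules(extra: str) -> list[str]:
    """Return the modules for `extra` that cannot be found on this interpreter."""
    # find_spec locates a top-level module without importing it,
    # so this stays cheap even for heavy packages such as torch.
    return [name for name in EXTRAS[extra] if find_spec(name) is None]


for extra in EXTRAS:
    missing = missing_modules(extra)
    status = "ok" if not missing else "missing: " + ", ".join(missing)
    print(f"[{extra}] {status}")
```

This only tests whether the modules can be located; it does not verify the minimum versions the extras require.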

## 30-second example

```bash
ehrextract \
  --task comorbidity \
  --model Qwen/Qwen3.5-27B --adapter /path/to/adapter \
  --input notes.csv --output results.csv
```

or, as a library:

```python
from pathlib import Path
from ehrextract import extract

df = extract(
    Path("notes.csv"),
    "comorbidity",
    model="Qwen/Qwen3.5-27B",
    adapter="/path/to/adapter",
    output="results.csv",
)
```

The input needs a `note_text` column (configurable via `--text-column`); a
`note_id` column is added automatically when absent. The output has one
column per task field plus `parse_success`, `validation_errors`,
`raw_response`, `finish_reason`, and token counts.
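
If you would rather normalize the input yourself than pass `--text-column`, the following pandas sketch renames an existing text column to `note_text`. The `prepare_notes` helper and the sample column names (`visit`, `body`) are hypothetical, and dropping empty notes is an optional hygiene step assumed here, not something ehrextract is known to require.

```python
import pandas as pd


def prepare_notes(df: pd.DataFrame, text_column: str) -> pd.DataFrame:
    """Return a copy of `df` with the note text stored in a `note_text` column."""
    if text_column not in df.columns:
        raise KeyError(f"column {text_column!r} not found in input")
    out = df.rename(columns={text_column: "note_text"})
    # Optional hygiene (an assumption, not an ehrextract requirement):
    # drop rows whose text is missing or whitespace-only.
    out = out[out["note_text"].notna()]
    out = out[out["note_text"].astype(str).str.strip() != ""]
    return out.reset_index(drop=True)


# Hypothetical raw export whose text lives in a column named "body".
raw = pd.DataFrame(
    {"visit": [1, 2, 3], "body": ["Pt has asthma.", "   ", None]}
)
notes = prepare_notes(raw, "body")
print(notes)
```

The resulting DataFrame can be written to CSV or, since a DataFrame is listed among the accepted inputs, used directly.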

## Built-in tasks

| Task | Fields | What it extracts |
|---|---|---|
| `comorbidity` | 17 | Free-text diagnosis list + 16 Y/N comorbidity categories |
| `clinical_vars` | 4 | Feeding and neurologic variables (tube/oral feeding, aspiration risk, NI trajectory) |
| `full` | 20 | Joint task: the 16 comorbidity categories + the 4 clinical variables |

Built-in tasks ship inside the package; `--task <name>` works without any
extra files. Define your own task in YAML — see
[`schema-reference.md`](https://github.com/shifosss/ehrextract/blob/main/docs/ehrextract/schema-reference.md).

> **Note on the `full` task.** The research pipeline that produced the
> published evaluation numbers for the joint 20-field task used constrained
> JSON decoding to force the output shape. ehrextract v0.2.0 does **not**
> constrain decoding (planned as a future feature), so `full`-task outputs
> can diverge from the published numbers on hard notes — watch the
> `parse_success` and `validation_errors` columns.
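
One way to watch those columns is to summarize them after a run. The sketch below uses only pandas; the `parse_summary` helper is hypothetical, and the column value types are assumptions (`parse_success` as booleans or their string forms after a CSV round-trip, `validation_errors` as text), so adjust it to what your results file actually contains.

```python
import pandas as pd


def parse_summary(results: pd.DataFrame) -> tuple[dict, pd.DataFrame]:
    """Summarize parse failures and return the rows that need a closer look."""
    # Accept real booleans as well as "True"/"False" strings read back from a file.
    ok = (
        results["parse_success"]
        .map(lambda v: str(v).strip().lower() in {"true", "1"})
        .astype(bool)
    )
    total = len(results)
    summary = {
        "rows": total,
        "parsed": int(ok.sum()),
        "failure_rate": float(1 - ok.mean()) if total else 0.0,
    }
    return summary, results[~ok]


# Hypothetical results table standing in for a real ehrextract output.
results = pd.DataFrame(
    {
        "note_id": [1, 2, 3, 4],
        "parse_success": [True, False, True, "False"],
        "validation_errors": ["", "missing field", "", "bad json"],
    }
)
summary, needs_review = parse_summary(results)
print(summary)
print(needs_review[["note_id", "validation_errors"]])
```

Rows returned in `needs_review` are the ones to inspect first, alongside their `raw_response` values.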

## Data handling

If your input may contain PHI, read [`data-handling.md`](https://github.com/shifosss/ehrextract/blob/main/docs/ehrextract/data-handling.md)
BEFORE running with any API provider. The package writes a data-egress
notice to stderr (once per process per destination) on API use; it never
blocks, and it does not (and cannot) guarantee compliance for you. The
local HuggingFace provider keeps all data on your machine.

## Documentation

- [`quickstart.md`](https://github.com/shifosss/ehrextract/blob/main/docs/ehrextract/quickstart.md) — fine-tuned adapters, custom tasks, API providers
- [`schema-reference.md`](https://github.com/shifosss/ehrextract/blob/main/docs/ehrextract/schema-reference.md) — the task-file YAML reference
- [`data-handling.md`](https://github.com/shifosss/ehrextract/blob/main/docs/ehrextract/data-handling.md) — PHI, egress notice, BAA-eligible providers
- [`extending-providers.md`](https://github.com/shifosss/ehrextract/blob/main/docs/ehrextract/extending-providers.md) — plug in a custom provider

## Authors and institutions

ehrextract was developed by:

- **Chen Zhang** (lead author)
- **Yibing Xia** (co-author)
- **Sanjay Mahant, MD** -- supervisor, The Hospital for Sick Children (SickKids)
- **Nathan Taback, PhD** -- supervisor, University of Toronto

at **The Hospital for Sick Children** (Toronto, Canada) and the
**University of Toronto** (Toronto, Canada). Please cite the project if
you use it in published work.

## License

Licensed under the Apache License, Version 2.0. See [`LICENSE`](https://github.com/shifosss/ehrextract/blob/main/LICENSE)
for the full license text and [`NOTICE`](https://github.com/shifosss/ehrextract/blob/main/NOTICE) for
attribution, the no-endorsement clause, the clinical-use disclaimer, and the
acceptable-use restrictions that supplement (but do not override) the License.
