Metadata-Version: 2.4
Name: tinyehr
Version: 0.2.0
Summary: A tiny EHR dataset for learning, prototyping, and building, with de-identified data from 100 patients
Author: Vidul Ayakulangara Panickan
License-Expression: MIT
Project-URL: Homepage, https://tinyehr.org
Project-URL: Repository, https://github.com/vidulpanickan/TinyEHR
Project-URL: Dataset, https://huggingface.co/datasets/vidulpanickan/TinyEHR
Keywords: ehr,mimic,omop,clinical,healthcare,dataset,medicine,agents,ai,machine learning
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Healthcare Industry
Classifier: Topic :: Scientific/Engineering :: Medical Science Apps.
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas>=1.3.0
Requires-Dist: pyarrow>=10.0.0
Provides-Extra: minimal
Requires-Dist: pandas>=1.3.0; extra == "minimal"
Dynamic: license-file

# TinyEHR: A Tiny Electronic Health Records Dataset for Learning, Prototyping, and Building

TinyEHR is a small, open, reproducible clinical dataset of 100 patients, available in two formats: MIMIC and OMOP. It is derived from the [MIMIC-IV Clinical Database Demo v2.2](https://physionet.org/content/mimic-iv-demo/2.2/), the publicly available subset of MIMIC-IV published by the MIT Laboratory for Computational Physiology.

TinyEHR is openly available: no credentialing or data use agreement is required. Install it and start exploring clinical data in seconds.

| | |
|---|---|
| Website | [tinyehr.org](https://tinyehr.org) |
| GitHub | [github.com/vidulpanickan/TinyEHR](https://github.com/vidulpanickan/TinyEHR) |
| HuggingFace | [datasets/vidulpanickan/TinyEHR](https://huggingface.co/datasets/vidulpanickan/TinyEHR) |
| PyPI | `pip install tinyehr` |

## Install

```bash
pip install tinyehr
```

## Python API

```python
import tinyehr

# Quick reference of all functions
tinyehr.help()

# Overview of all tables with row counts
tinyehr.info()
tinyehr.info(format="tinyehr_omop_format")

# List table names
tinyehr.list_tables()
tinyehr.list_tables(format="tinyehr_omop_format")

# Column names, types, and sample rows for a table
tinyehr.describe_table("patients")
tinyehr.describe_table("person", format="tinyehr_omop_format")

# Find tables by keyword in table and column names
tinyehr.search_tables("lab")
tinyehr.search_tables("drug")

# Load a table as a pandas DataFrame
patients = tinyehr.load_table("patients")
person = tinyehr.load_table("person", format="tinyehr_omop_format")

# All data for one patient across all tables
data = tinyehr.get_patient(10000032)
data["admissions"]    # DataFrame of this patient's admissions
data["labevents"]     # DataFrame of this patient's labs
data["noteevents"]    # DataFrame of this patient's notes

# Build a local SQLite database
mimic_db_path = tinyehr.build_sqlite(format="tinyehr_mimic_format")
omop_db_path = tinyehr.build_sqlite(format="tinyehr_omop_format")

# Query the MIMIC-format database (the admissions table is a MIMIC table)
import sqlite3
conn = sqlite3.connect(mimic_db_path)
conn.execute("SELECT * FROM admissions LIMIT 5").fetchall()
conn.close()
```
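
For repeated queries it can help to bind parameters rather than format SQL strings, and to get rows back as dictionaries. The following is a minimal sketch using only the standard library. `query_rows` is a hypothetical helper, not part of `tinyehr`, and the in-memory table with made-up rows is a toy stand-in for the database file that `build_sqlite` produces:

```python
import sqlite3


def query_rows(conn, sql, params=()):
    """Run a parameterized query and return each row as a plain dict."""
    conn.row_factory = sqlite3.Row  # rows become addressable by column name
    cursor = conn.execute(sql, params)
    return [dict(row) for row in cursor.fetchall()]


# Toy stand-in so the sketch runs without the package. With TinyEHR you would
# instead pass the path returned by build_sqlite() to sqlite3.connect().
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE admissions (subject_id INTEGER, hadm_id INTEGER, admission_type TEXT)"
)
conn.executemany(
    "INSERT INTO admissions VALUES (?, ?, ?)",
    [(1, 101, "EW EMER."), (1, 102, "URGENT"), (2, 201, "URGENT")],  # made-up rows
)

rows = query_rows(
    conn,
    "SELECT hadm_id, admission_type FROM admissions WHERE subject_id = ? ORDER BY hadm_id",
    (1,),
)
conn.close()
```

Binding `subject_id` through the `?` placeholder leaves quoting to SQLite, which matters once patient identifiers come from user input.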

## Direct from HuggingFace

```python
import pandas as pd

patients = pd.read_parquet(
    "hf://datasets/vidulpanickan/tinyehr/tinyehr_mimic_format/patients.parquet"
)
```

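The `tinyehr` package itself has no dependencies beyond `pandas` and `pyarrow`. Reading `hf://` paths with pandas additionally relies on the `huggingface_hub` package being installed.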

## Trouble downloading?

You can download the raw CSV files directly from GitHub:

1. Go to [github.com/vidulpanickan/TinyEHR](https://github.com/vidulpanickan/TinyEHR)
2. Click the green **Code** button
3. Select **Download ZIP**

Or clone via terminal:

```bash
git clone https://github.com/vidulpanickan/TinyEHR.git
```

## Formats

TinyEHR ships in two formats from the same underlying patient cohort:

**MIMIC format** follows the original MIMIC-IV schema, with dates shifted to realistic years, ICD codes reformatted with decimal points, and 4,580 clinical notes generated by an LLM from patient visit profiles.

**OMOP format** follows the OHDSI CDM v5.3.1 schema with hashed person IDs, dates shifted to realistic years, and clinical codes mapped to standardized medical vocabularies. ICD codes in `source_value` fields are stored without decimal points, following the OMOP billing/claims convention.
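
The two ICD conventions can be converted into each other. The sketch below illustrates the general ICD formatting rules (decimal after the 3rd character for diagnoses, after the 4th for ICD-9 E-codes, after the 2nd digit for ICD-9 procedures, none for ICD-10-PCS). It is not the package's actual implementation, and `add_icd_decimal` and `strip_icd_decimal` are hypothetical names:

```python
def add_icd_decimal(code, version, kind="diagnosis"):
    """Insert the conventional decimal point into an undotted ICD code.

    version is 9 or 10; kind is "diagnosis" or "procedure".
    """
    code = code.strip()
    if kind == "procedure":
        if version == 10:
            return code  # ICD-10-PCS codes carry no decimal point
        split = 2  # ICD-9 procedures: decimal after the 2nd digit
    elif version == 9 and code.upper().startswith("E"):
        split = 4  # ICD-9 E-codes: decimal after the 4th character
    else:
        split = 3  # other ICD-9 and all ICD-10 diagnoses: after the 3rd character
    if len(code) <= split:
        return code  # nothing follows the category, so no decimal is needed
    return f"{code[:split]}.{code[split:]}"


def strip_icd_decimal(code):
    """Return the undotted (billing/claims style) form of an ICD code."""
    return code.replace(".", "")


dotted = add_icd_decimal("E119", 10)                 # ICD-10 diagnosis
procedure = add_icd_decimal("3961", 9, "procedure")  # ICD-9 procedure
undotted = strip_icd_decimal(dotted)                 # back to the OMOP source_value style
```

Stripping is lossless, but adding the decimal needs the ICD version and code type, because the same undotted string can split differently as a diagnosis or a procedure.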

For full dataset structure, schema documentation, and table details, visit [About The Data](https://github.com/vidulpanickan/TinyEHR/blob/main/ABOUT_THE_DATA.md).

## Differences from MIMIC-IV Demo

TinyEHR applies four targeted transformations to the original MIMIC-IV Demo data. Apart from these, all clinical values, patient demographics, table structures, referential integrity, and row counts are unchanged.

| Transformation | What changed | Why |
|---------------|-------------|-----|
| **Date shifting** | All dates shifted from synthetic 2100+ range to realistic 2010s-2020s using per-patient offsets derived from `anchor_year_group`. Affects 21 MIMIC tables and 15 OMOP tables. Offsets saved in `metadata/date_offsets.csv`. | Realistic dates for prototyping. |
| **ICD code formatting** (MIMIC only) | Decimal points inserted into ICD codes (`E119` → `E11.9`, `3961` → `39.61`). ICD-10-PCS codes left unchanged. OMOP `source_value` fields are not modified (preserves billing/claims format). | Matches real-world clinical code formatting. |
| **Clinical notes** | 4,580 notes across 14 types, generated from each patient's hospital-visit profile, including demographics, diagnoses, and admission data. Added as `noteevents` (MIMIC) and `note` (OMOP). | The original Demo has no clinical notes. |
| **OMOP note concepts** | 19 note-related concepts added to `2b_concept.csv` (10 Note Type, 7 LOINC Document Ontology, 2 utility). Row count: 3,885 → 3,904. | Required for OMOP note table concept references. |
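
To make the date-shifting row concrete, here is a rough sketch of how a per-patient offset could be applied. It assumes the offset is a whole number of years equal to `anchor_year` minus the first year of `anchor_year_group`. That rule is an assumption for illustration only: the offsets actually used are the ones saved in `metadata/date_offsets.csv`. The helper names and the example values are hypothetical:

```python
from datetime import datetime


def year_offset(anchor_year, anchor_year_group):
    """Years to subtract, ASSUMING the group's first year is the target year."""
    first_year = int(anchor_year_group.split("-")[0])
    return anchor_year - first_year


def shift_back(ts, years):
    """Move a timestamp back by whole years, keeping month, day, and time."""
    try:
        return ts.replace(year=ts.year - years)
    except ValueError:
        # Feb 29 in the source year has no counterpart in a non-leap target year
        return ts.replace(year=ts.year - years, day=28)


# Illustrative values, not taken from the dataset
offset = year_offset(2180, "2014 - 2016")
admit = shift_back(datetime(2180, 5, 6, 22, 23), offset)
discharge = shift_back(datetime(2180, 5, 7, 17, 15), offset)
length_of_stay = discharge - admit
```

Using one offset per patient, rather than one per row, keeps the intervals between that patient's events intact apart from leap-day edge cases, so quantities such as length of stay survive the shift.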

## What's New (v0.2.0)

- **Parquet type enforcement**: all column types now match the official MIMIC-IV and OMOP CDM DDL schemas exactly (integers, floats, timestamps, strings)
- **CSV type enforcement**: when loading CSVs without pyarrow, types are applied from bundled DDL schema files instead of relying on pandas auto-inference
- **OMOP source values**: no longer formatted with decimal points, preserving the billing/claims convention
- **ICD-9 procedure codes**: decimal point now correctly placed after 2nd digit (`3961` → `39.61`)
- **Clinical notes**: regenerated from patient profiles with correct admission dates
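
The idea behind schema-driven CSV typing can be sketched with the standard library alone. This is an illustration under stated assumptions, not the package's loader: the DDL snippet, the type-to-caster mapping, and the helper names `parse_ddl` and `read_typed_csv` are all hypothetical, and the CSV rows are made up:

```python
import csv
import io
import re
from datetime import date

# Hypothetical mapping from SQL type names to Python casters
CASTERS = {
    "INTEGER": int,
    "SMALLINT": int,
    "NUMERIC": float,
    "VARCHAR": str,
    "DATE": date.fromisoformat,
}

# Matches "<column_name> <TYPE>" at the start of each column line
COLUMN_PATTERN = re.compile(r"^\s*(\w+)\s+([A-Za-z]+)", re.MULTILINE)


def parse_ddl(ddl):
    """Map column name -> caster from the column lines of a CREATE TABLE statement."""
    body = ddl[ddl.index("(") + 1 : ddl.rindex(")")]
    return {name: CASTERS[sql_type.upper()] for name, sql_type in COLUMN_PATTERN.findall(body)}


def read_typed_csv(text, casters):
    """Parse CSV text, casting each non-empty field; empty fields become None."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {col: (casters[col](val) if val != "" else None) for col, val in row.items()}
        for row in reader
    ]


ddl = """
CREATE TABLE patients (
    subject_id INTEGER NOT NULL,
    gender VARCHAR(1) NOT NULL,
    anchor_age SMALLINT,
    dod DATE
);
"""
csv_text = "subject_id,gender,anchor_age,dod\n1,F,52,\n2,M,23,2015-03-01\n"

typed_rows = read_typed_csv(csv_text, parse_ddl(ddl))
```

Declaring the types up front avoids a common auto-inference surprise: an integer column with missing values being read as floats.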

## License

- **Code** (this Python package): MIT License
- **Data** (the TinyEHR dataset): ODbL-1.0
