Metadata-Version: 2.4
Name: cadence-core
Version: 1.1.0
Summary: Neural model for next clinical event prediction from EHR sequences using the Narrative Velocity framework
Author-email: Amir Rouhollahi <arouhollahi@bwh.harvard.edu>
License: MIT
Project-URL: Homepage, https://amirrouh.github.io/cadence/
Project-URL: Documentation, https://amirrouh.github.io/cadence/
Project-URL: Repository, https://github.com/amirrouh/cadence
Project-URL: Bug Tracker, https://github.com/amirrouh/cadence/issues
Keywords: clinical prediction,EHR,electronic health records,deep learning,MIMIC-IV
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Scientific/Engineering :: Medical Science Apps.
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.11
Description-Content-Type: text/markdown
Requires-Dist: hdbscan>=0.8.41
Requires-Dist: joblib>=1.3.0
Requires-Dist: matplotlib>=3.10.8
Requires-Dist: numpy>=2.4.3
Requires-Dist: openpyxl>=3.1.5
Requires-Dist: pandas>=3.0.1
Requires-Dist: pyarrow>=23.0.1
Requires-Dist: scikit-learn>=1.4.0
Requires-Dist: sentence-transformers>=5.2.3
Requires-Dist: torch>=2.0.0
Requires-Dist: umap-learn>=0.5.11
Requires-Dist: wfdb>=4.3.1
Requires-Dist: xgboost>=3.2.0

# cadence-core

[![PyPI version](https://img.shields.io/pypi/v/cadence-core.svg)](https://pypi.org/project/cadence-core/)
[![Python](https://img.shields.io/pypi/pyversions/cadence-core.svg)](https://pypi.org/project/cadence-core/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![GitHub release](https://img.shields.io/github/v/release/amirrouh/cadence?label=release)](https://github.com/amirrouh/cadence/releases/tag/v1.0.0)
[![bioRxiv](https://img.shields.io/badge/bioRxiv-10.64898%2F2026.05.06.722409-b31b1b)](https://doi.org/10.64898/2026.05.06.722409)

**cadence-core** is a pretrained neural model for next clinical event prediction from electronic health record (EHR) sequences. Given a patient's longitudinal clinical history, it predicts which of 48 clinical event categories will occur next and how many days until that event.

**[Documentation & Paper →](https://amirrouh.github.io/cadence/)**

---

## Key Features

- **5.86M parameter residual MLP** — lightweight, fast inference, no GPU required
- **Trained on MIMIC-IV v3.1** — 100k patient sequences from a large academic medical center
- **Joint prediction** — simultaneous 48-class event classification and time-to-event regression
- **34.18% top-1 accuracy, 36.95 days MAE** on the 100k training tier (male cohort), ahead of XGBoost and the other baselines evaluated at that tier
- **Self-knowledge distillation** — improved generalization without external teacher models
- **Auto-downloads checkpoint** — model weights fetched from GitHub Releases on first use
- **Drop-in inference** — three lines of code from install to prediction

---

## Installation

```bash
pip install cadence-core
```

Requires Python 3.11+. No GPU needed for inference.

---

## Quick Start

```python
import torch
from cadence import CadenceModel, load_checkpoint

# Load model and pretrained weights (checkpoint auto-downloads on first run)
model = CadenceModel()
load_checkpoint(model)
model.eval()

# Input: 2420-dimensional feature vector per patient visit
# [0:884]    — 884 Narrative Velocity (NV) clinical features
# [884:1652] — 768-dim PubMedBERT mean-pooled cluster-semantic embedding
# [1652:2420] — 768-dim PubMedBERT last-token cluster-semantic embedding
x = torch.randn(1, 2420)  # batch_size=1, feature_dim=2420

with torch.no_grad():
    logits, time_bins = model(x)

# logits    : (batch, 48)  — classification logits over 48 event categories
# time_bins : (batch, 19)  — regression logits over 19 discretized time bins
event_probs = torch.softmax(logits, dim=-1)
# .item() below assumes batch_size=1; for larger batches keep the tensors instead
top1_event  = event_probs.argmax(dim=-1).item()
print(f"Predicted next event class : {top1_event}")
print(f"Top-1 probability          : {event_probs.max().item():.3f}")
```
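
The 2420-dimensional layout described in the comments above can be assembled by concatenating the three blocks in order. The sketch below uses NumPy with random placeholder arrays. `assemble_input` and the variable names are hypothetical helpers for illustration, not part of the cadence API, and real features would come from your own preprocessing.

```python
import numpy as np

NV_DIM = 884    # Narrative Velocity features, slice [0:884]
EMB_DIM = 768   # size of each PubMedBERT embedding block


def assemble_input(nv, emb_mean, emb_last):
    """Concatenate the three blocks into one (batch, 2420) float32 array.

    Hypothetical helper: it only enforces the slice layout from the Quick Start.
    """
    nv = np.asarray(nv, dtype=np.float32)
    emb_mean = np.asarray(emb_mean, dtype=np.float32)
    emb_last = np.asarray(emb_last, dtype=np.float32)
    if nv.shape[-1] != NV_DIM:
        raise ValueError(f"expected {NV_DIM} NV features, got {nv.shape[-1]}")
    for name, emb in (("emb_mean", emb_mean), ("emb_last", emb_last)):
        if emb.shape[-1] != EMB_DIM:
            raise ValueError(f"expected {EMB_DIM}-dim {name}, got {emb.shape[-1]}")
    # Order matters: NV features, then mean-pooled, then last-token embedding
    return np.concatenate([nv, emb_mean, emb_last], axis=-1)


# Random placeholders standing in for real features (batch of 4)
rng = np.random.default_rng(0)
nv = rng.normal(size=(4, NV_DIM))
emb_mean = rng.normal(size=(4, EMB_DIM))
emb_last = rng.normal(size=(4, EMB_DIM))

x = assemble_input(nv, emb_mean, emb_last)
print(x.shape)  # (4, 2420)
```

Passing the result through `torch.from_numpy(x)` gives a tensor with the shape the Quick Start feeds to the model.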

---

## Model Architecture

cadence-core implements the **Narrative Velocity Composite (NV-C)** framework — a residual MLP that fuses structured clinical features with contextual language embeddings.

| Component | Details |
|-----------|---------|
| **Input dimension** | 2420 (884 NV features + 768 PubMedBERT mean + 768 PubMedBERT last) |
| **Backbone** | 3-block MLP with residual skip connections and LayerNorm |
| **Classification head** | Linear → 48 event-class logits |
| **Regression head** | Linear → 19-bin discretized time-to-event logits |
| **Parameters** | 5.86M |
| **Training objective** | Cross-entropy (classification) + ordinal regression loss (time), with self-KD |

The 884 NV features capture structured clinical signals (labs, vitals, medications, procedures) encoded as narrative velocity trajectories. The PubMedBERT inputs are **cluster-semantic embeddings**: encodings of event-category labels (not raw clinical note text) produced by [`microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext`](https://huggingface.co/microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext) and kept frozen at inference. Applying self-knowledge distillation after PubMedBERT cluster-semantic fusion yields a top-1 gain of +0.81 pp, substantially more than self-KD gives on structured features alone.
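
To make the backbone description concrete, here is a minimal NumPy sketch of a three-block residual MLP feeding two linear heads. It is not the released architecture: the hidden width (256), the ReLU activation, the pre-norm placement of LayerNorm, and the random weights are all assumptions for illustration. The source states only that there are three residual blocks with LayerNorm, a 48-way classification head, and a 19-bin time head.

```python
import numpy as np

rng = np.random.default_rng(0)
# HIDDEN is an assumed width; the other sizes come from the table above
IN_DIM, HIDDEN, N_EVENTS, N_BINS = 2420, 256, 48, 19


def layer_norm(h, eps=1e-5):
    """LayerNorm over the feature axis, without learned scale or shift."""
    mean = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return (h - mean) / np.sqrt(var + eps)


def linear(in_dim, out_dim):
    """Random weights and zero bias, standing in for trained parameters."""
    return rng.normal(scale=in_dim ** -0.5, size=(in_dim, out_dim)), np.zeros(out_dim)


def residual_block(h, w1, b1, w2, b2):
    z = layer_norm(h)                   # pre-norm placement (assumed)
    z = np.maximum(z @ w1 + b1, 0.0)    # ReLU activation (assumed)
    return h + (z @ w2 + b2)            # residual skip connection


w_in, b_in = linear(IN_DIM, HIDDEN)
blocks = [linear(HIDDEN, HIDDEN) + linear(HIDDEN, HIDDEN) for _ in range(3)]
w_cls, b_cls = linear(HIDDEN, N_EVENTS)
w_reg, b_reg = linear(HIDDEN, N_BINS)

x = rng.normal(size=(2, IN_DIM))        # placeholder batch of 2 inputs
h = x @ w_in + b_in
for w1, b1, w2, b2 in blocks:
    h = residual_block(h, w1, b1, w2, b2)

logits = h @ w_cls + b_cls              # event-class logits
time_bins = h @ w_reg + b_reg           # time-bin logits
print(logits.shape, time_bins.shape)    # (2, 48) (2, 19)
```

Both heads read the same final hidden state, which is what lets a single forward pass return the event logits and the time-bin logits together.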

---

## Performance

### 100k Training Tier — Male Cohort (MIMIC-IV v3.1)

Results are 3-seed means with bootstrap 95% CIs. XGBoost falls outside Cadence's CI on both metrics.

| Model | Top-1 Accuracy | MAE (days) |
|-------|:--------------:|:----------:|
| **cadence-core (NV-C)** | **34.18%** [33.84%, 34.42%] | **36.95** [36.10, 37.68] |
| XGBoost | 32.35% | 38.58 |
| Random Forest | 24.1% | 53.2 |
| Logistic Regression | 21.3% | 58.7 |
| RETAIN (baseline) | 22.8% | 54.1 |
| Majority-class baseline | 9.25% | — |
| Random baseline | 2.08% | — |

![Cadence vs baselines — 100k training tier, MIMIC-IV v3.1](https://raw.githubusercontent.com/amirrouh/cadence/main/docs/static/figures/fig2_comparison.png)
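
For readers less familiar with the interval notation in the table, a percentile bootstrap for top-1 accuracy can be sketched as follows. The labels and predictions here are synthetic, and the resample count (1000) and the percentile method are assumptions. The source says only that the intervals are bootstrap 95% CIs, so this may differ from the procedure used in the paper.

```python
import numpy as np


def bootstrap_accuracy_ci(y_true, y_pred, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for top-1 accuracy (illustrative helper)."""
    rng = np.random.default_rng(seed)
    correct = (np.asarray(y_true) == np.asarray(y_pred)).astype(float)
    n = correct.size
    # Resample example indices with replacement, one row per bootstrap replicate
    idx = rng.integers(0, n, size=(n_resamples, n))
    accs = correct[idx].mean(axis=1)
    lo, hi = np.quantile(accs, [alpha / 2, 1 - alpha / 2])
    return correct.mean(), lo, hi


# Synthetic data: 48 classes, predictions correct roughly one time in three
rng = np.random.default_rng(1)
y_true = rng.integers(0, 48, size=2000)
y_pred = np.where(rng.random(2000) < 0.34, y_true, (y_true + 1) % 48)

acc, lo, hi = bootstrap_accuracy_ci(y_true, y_pred)
print(f"{acc:.3f} [{lo:.3f}, {hi:.3f}]")
```

A baseline whose point estimate lies outside `[lo, hi]` is what the text above means by falling outside Cadence's CI.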

### Full-Cohort Results (MIMIC-IV v3.1)

On the full cohort, cadence-core leads all models on top-1 accuracy, while FT-Transformer achieves the best MAE.

| Cohort | Model | Top-1 Accuracy | MAE (days) |
|--------|-------|:--------------:|:----------:|
| Male | **cadence-core (NV-C)** | **38.04%** | 29.39 |
| Male | FT-Transformer | — | **27.82** |
| Female | **cadence-core (NV-C)** | **35.66%** | 39.88 |
| Female | FT-Transformer | — | **37.08** |

### External Validation — BWH Dataset (1,120 patients)

External validation on de-identified records from Brigham and Women's Hospital (BWH) — a geographically and demographically distinct population with missing structured features and population shift. BWH events were LLM-extracted and mapped to the MIMIC-IV 48-cluster event schema.

| Model | Top-1 Accuracy |
|-------|:--------------:|
| RETAIN | **20.98%** (best overall) |
| **cadence-core (NV-C)** | **11.88%** (leads structured-feature models) |

Under domain shift with missing structured features, RETAIN achieves the best overall top-1 on BWH. Cadence leads among structured-feature models.

---

## Paper & Citation

> **Cadence: A Benchmark Evaluation of the Narrative Velocity Framework for Next Clinical Event Prediction in MIMIC-IV**
> Rouhollahi A. and Nezami F.R. — *bioRxiv*, 2026. [doi.org/10.64898/2026.05.06.722409](https://doi.org/10.64898/2026.05.06.722409)

If you use cadence-core in your research, please cite:

```bibtex
@article{rouhollahi2026cadence,
  title   = {Cadence: A Benchmark Evaluation of the Narrative Velocity Framework for Next Clinical Event Prediction in {MIMIC-IV}},
  author  = {Rouhollahi, Amir and Nezami, Farhad R.},
  journal = {bioRxiv},
  year    = {2026},
  doi     = {10.64898/2026.05.06.722409},
  url     = {https://doi.org/10.64898/2026.05.06.722409}
}
```

---

## Train on Your Own Data

Starting with v1.1.0, cadence-core ships a complete training pipeline. Provide your
own JSONL sequences and per-event embeddings; no MIMIC data is required.

### Input format

Each line in your JSONL files is one patient prediction record:

```json
{
  "patient_id": "patient_001",
  "history": [
    {
      "date_iso": "2019-03-15",
      "event_index": 42,
      "cluster_id": 7,
      "days_from_start": 0.0
    },
    {
      "date_iso": "2019-04-01",
      "event_index": 17,
      "cluster_id": 3,
      "days_from_start": 17.0
    }
  ],
  "target": {
    "cluster_id": 12,
    "days_from_prev": 14.0
  }
}
```

- `event_index`: row index (0-based) into your `embeddings.npy` file.
- `cluster_id`: integer in `[0, n_clusters-1]` representing the event category.
- `days_from_start`: days since the first event in this patient's history window.
- `days_from_prev`: regression target -- days between the last history event and the target event.
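
Before training, it can help to check each record against the field descriptions above. The validator below is a hypothetical sketch, not part of cadence. The `cluster_id` range check follows the stated format, while the requirements that `days_from_start` never decreases and that `days_from_prev` is non-negative are assumptions about chronological ordering.

```python
import json


def validate_record(record, n_clusters):
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    history = record.get("history", [])
    if not history:
        problems.append("history is empty")
    prev_days = None
    for i, event in enumerate(history):
        if not 0 <= event["cluster_id"] < n_clusters:
            problems.append(f"history[{i}].cluster_id out of range")
        if event["event_index"] < 0:
            problems.append(f"history[{i}].event_index is negative")
        # Assumption: history is stored in chronological order
        if prev_days is not None and event["days_from_start"] < prev_days:
            problems.append(f"history[{i}].days_from_start decreases")
        prev_days = event["days_from_start"]
    target = record.get("target", {})
    if not 0 <= target.get("cluster_id", -1) < n_clusters:
        problems.append("target.cluster_id out of range")
    if target.get("days_from_prev", -1.0) < 0:
        problems.append("target.days_from_prev is missing or negative")
    return problems


record = {
    "patient_id": "patient_001",
    "history": [
        {"date_iso": "2019-03-15", "event_index": 42, "cluster_id": 7, "days_from_start": 0.0},
        {"date_iso": "2019-04-01", "event_index": 17, "cluster_id": 3, "days_from_start": 17.0},
    ],
    "target": {"cluster_id": 12, "days_from_prev": 14.0},
}
line = json.dumps(record)  # one JSONL line
print(validate_record(json.loads(line), n_clusters=50))  # []
```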

### Embeddings

Provide a NumPy array of shape `(N_events, emb_dim)` -- one row per unique event
in your dataset. Any sentence embedding works: PubMedBERT, BERT, domain-specific
encoders, etc. The `emb_dim` can be any size (768, 512, 32, ...).

Pair it with an `event_index.json` file -- a JSON array where element `i` identifies
the patient and event for row `i` of `embeddings.npy`:

```json
[
  {"subject_id": "patient_001", "event_index": 42},
  {"subject_id": "patient_001", "event_index": 17},
  ...
]
```
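
As a rough illustration, the two files can be written and cross-checked like this. The patient IDs, the 32-dim random embeddings, and the temporary output directory are placeholders, and the sketch assumes the simplest layout, in which row `i` of the array holds the event whose `event_index` is `i`.

```python
import json
import tempfile
from pathlib import Path

import numpy as np

# Hypothetical events as (patient, event_index) pairs; row i holds event_index i
events = [("patient_001", 0), ("patient_001", 1), ("patient_002", 2)]
emb_dim = 32

# Random vectors stand in for real sentence embeddings
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(events), emb_dim)).astype(np.float32)
event_index = [{"subject_id": pid, "event_index": idx} for pid, idx in events]

out_dir = Path(tempfile.mkdtemp())
np.save(out_dir / "embeddings.npy", embeddings)
(out_dir / "event_index.json").write_text(json.dumps(event_index))

# Reload and confirm there is exactly one index entry per embedding row
loaded_emb = np.load(out_dir / "embeddings.npy")
loaded_idx = json.loads((out_dir / "event_index.json").read_text())
if loaded_emb.shape[0] != len(loaded_idx):
    raise ValueError("embeddings.npy and event_index.json disagree on row count")
print(loaded_emb.shape, len(loaded_idx))  # (3, 32) 3
```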

### Training

```python
import cadence

classifier = cadence.train(
    train_jsonl="my_data/train.jsonl",
    val_jsonl="my_data/val.jsonl",
    embeddings_path="my_data/embeddings.npy",
    event_index_path="my_data/event_index.json",
    n_clusters=50,
    out_dir="./runs/my_run",
    n_epochs=30,
)
```
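
If you start from a single pool of records, one way to produce the `train.jsonl` and `val.jsonl` files is to split by patient, so that no patient contributes records to both sides. The helpers below are hypothetical and not part of cadence, and the 80/20 split is an arbitrary choice.

```python
import json
import random
from collections import defaultdict


def split_by_patient(records, val_fraction=0.2, seed=0):
    """Split records so that every patient lands entirely in train or in val."""
    by_patient = defaultdict(list)
    for rec in records:
        by_patient[rec["patient_id"]].append(rec)
    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)
    n_val = max(1, int(len(patients) * val_fraction))
    val = [r for p in patients[:n_val] for r in by_patient[p]]
    train = [r for p in patients[n_val:] for r in by_patient[p]]
    return train, val


def to_jsonl(records):
    """Serialize records as one JSON object per line."""
    return "\n".join(json.dumps(r) for r in records) + "\n"


# Placeholder records: 10 patients with 3 prediction records each
records = [
    {"patient_id": f"patient_{i % 10:03d}", "history": [], "target": {}}
    for i in range(30)
]
train, val = split_by_patient(records)
print(len(train), len(val))  # 24 6
```

Writing `to_jsonl(train)` and `to_jsonl(val)` to disk then gives the two paths passed to `cadence.train`.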

### Inference

```python
preds = cadence.predict(
    classifier,
    "my_data/test.jsonl",
    embeddings_path="my_data/embeddings.npy",
    event_index_path="my_data/event_index.json",
)
# preds is a list of dicts:
# [{"patient_id": "...", "top_3_clusters": [7, 3, 12],
#   "top_3_probs": [0.42, 0.31, 0.18], "days_until_next": 14.2}, ...]
```
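
Given predictions in the format shown above, top-1 and top-3 accuracy can be computed with a few lines. This is a hypothetical helper, not part of cadence. It assumes `top_3_clusters` is ordered by descending probability, as the example output suggests, and that you hold the true target clusters in a parallel list.

```python
def summarize_predictions(preds, true_clusters):
    """Top-1 and top-3 accuracy for a list of prediction dicts."""
    top1 = top3 = 0
    for pred, truth in zip(preds, true_clusters):
        top1 += pred["top_3_clusters"][0] == truth   # best guess is correct
        top3 += truth in pred["top_3_clusters"]      # truth is among the top three
    n = len(preds)
    return {"top1": top1 / n, "top3": top3 / n}


# Hand-written predictions in the documented format (values are made up)
preds = [
    {"patient_id": "patient_001", "top_3_clusters": [7, 3, 12],
     "top_3_probs": [0.42, 0.31, 0.18], "days_until_next": 14.2},
    {"patient_id": "patient_002", "top_3_clusters": [5, 12, 1],
     "top_3_probs": [0.50, 0.20, 0.10], "days_until_next": 30.0},
]
metrics = summarize_predictions(preds, true_clusters=[7, 12])
print(metrics)  # {'top1': 0.5, 'top3': 1.0}
```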

### Feature dimensions (public training path)

The public training path uses `5*n_clusters + max_history + 20 + 2*emb_dim` input
features. For `n_clusters=48`, `max_history=10`, `emb_dim=768` this gives
270 + 1536 = 1806 dims (1816 with `n_clusters=50`, as in the training example above). The paper
checkpoint uses 2420 dims (884 base + 768 + 768); the extra 614 base dims (884 vs. 270) require
MIMIC-specific structured/temporal preprocessing pipelines that are not publicly available.
The public model uses the same NVCClean architecture and training schedule (Phase 1
classification + Phase 2 joint cls+reg + SWA + MixUp + ASL + Gaussian soft targets).
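
The dimension formula is easy to evaluate for your own settings. The function name below is hypothetical, and the arithmetic is simply the stated formula.

```python
def public_feature_dim(n_clusters, max_history, emb_dim):
    """Input size of the public training path, per the formula above."""
    return 5 * n_clusters + max_history + 20 + 2 * emb_dim


print(public_feature_dim(48, 10, 768))         # 1806
print(public_feature_dim(50, 10, 768))         # 1816
print(2420 - public_feature_dim(48, 10, 768))  # 614 (gap to the paper checkpoint)
```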

---

## Reproducibility

Reproducing the paper results requires access to MIMIC-IV, which is available to credentialed PhysioNet users who have signed the data use agreement:

```
https://physionet.org/content/mimiciv/3.1/
```

Once access is granted, follow the preprocessing instructions in `src/` to generate the NV feature sequences and PubMedBERT embeddings used for training.

---

## License

This project is released under the [MIT License](https://opensource.org/licenses/MIT). The pretrained model checkpoint is provided for research use only. MIMIC-IV data is subject to its own [PhysioNet Credentialed Health Data License](https://physionet.org/content/mimiciv/view-license/3.1/).

---

## Contact

**Amir Rouhollahi**
Brigham and Women's Hospital / Harvard Medical School
[arouhollahi@bwh.harvard.edu](mailto:arouhollahi@bwh.harvard.edu)
[GitHub](https://github.com/amirrouh/cadence) · [PyPI](https://pypi.org/project/cadence-core/)
