Metadata-Version: 2.4
Name: phenocluster
Version: 0.3.0
Summary: Clinical Phenotype Discovery using Latent Class / Profile Analysis with Automatic Model Selection
Author-email: Ettore Rocchi <ettore.rocchi3@unibo.it>
Maintainer-email: Ettore Rocchi <ettore.rocchi3@unibo.it>
License: MIT
Project-URL: Homepage, https://github.com/EttoreRocchi/phenocluster
Project-URL: Documentation, https://ettorerocchi.github.io/phenocluster
Project-URL: Repository, https://github.com/EttoreRocchi/phenocluster
Project-URL: Bug Tracker, https://github.com/EttoreRocchi/phenocluster/issues
Project-URL: Changelog, https://github.com/EttoreRocchi/phenocluster/blob/main/docs/changelog.rst
Keywords: clinical,phenotype,clustering,latent-class-analysis,latent-profile-analysis,machine-learning,bioinformatics,healthcare,data-science
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Healthcare Industry
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Classifier: Topic :: Scientific/Engineering :: Medical Science Apps.
Classifier: Operating System :: OS Independent
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.21.0
Requires-Dist: pandas>=2.1.0
Requires-Dist: scikit-learn>=1.8.0
Requires-Dist: scipy>=1.11.0
Requires-Dist: pyyaml>=5.4.0
Requires-Dist: joblib>=1.3.0
Requires-Dist: stepmix>=3.0.0
Requires-Dist: statsmodels>=0.14.0
Requires-Dist: plotly>=5.3.0
Requires-Dist: typer>=0.9.0
Requires-Dist: rich>=13.0.0
Requires-Dist: tqdm>=4.62.0
Requires-Dist: lifelines>=0.30.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0; extra == "dev"
Requires-Dist: ruff>=0.4.0; extra == "dev"
Provides-Extra: docs
Requires-Dist: sphinx>=7.0; extra == "docs"
Requires-Dist: sphinx-click>=5.0; extra == "docs"
Requires-Dist: furo; extra == "docs"
Provides-Extra: dashboard
Requires-Dist: streamlit>=1.30; extra == "dashboard"
Requires-Dist: watchdog; extra == "dashboard"
Dynamic: license-file

<p align="center">
  <img src="docs/phenocluster_logo.png" alt="PhenoCluster" width="280"/>
</p>

<p align="center">
  <strong>A flexible data-driven framework for identifying clinical phenotypes using latent class and profile analysis</strong>
</p>

[![PyPI version](https://img.shields.io/pypi/v/phenocluster)](https://pypi.org/project/phenocluster/)
[![Python versions](https://img.shields.io/pypi/pyversions/phenocluster)](https://www.python.org/downloads/)
[![MIT License](https://img.shields.io/badge/license-MIT-green)](https://opensource.org/licenses/MIT)
[![CI](https://github.com/EttoreRocchi/phenocluster/actions/workflows/ci.yml/badge.svg)](https://github.com/EttoreRocchi/phenocluster/actions/workflows/ci.yml)
[![Docs](https://img.shields.io/badge/docs-GitHub%20Pages-blue)](https://ettorerocchi.github.io/phenocluster)

---

## Overview

PhenoCluster is a Python framework for unsupervised discovery of clinical phenotypes from heterogeneous patient data. It implements an end-to-end pipeline, from data preprocessing and latent class identification to outcome association analysis, survival modelling, and multistate transition modelling.

The framework is **domain-agnostic** and can be applied to any clinical cohort study where the goal is to identify latent patient subgroups and characterise their relationship with clinical outcomes. Users supply a dataset and a YAML configuration file; PhenoCluster handles model selection, phenotype assignment, and downstream inference automatically.

### Key capabilities

- **Latent Class / Profile Analysis** via the [StepMix](https://github.com/Labo-Lacourse/stepmix) framework with native support for mixed continuous/categorical data and missing values
- **Automatic model selection** using information criteria (BIC, AIC, ICL, CAIC, SABIC) with configurable cluster-size constraints
- **Classification quality assessment** with per-phenotype Average Posterior Probability (AvePP) and assignment confidence metrics
- **Outcome association analysis** with logistic regression yielding odds ratios, confidence intervals, and FDR-corrected p-values
- **Survival analysis** with Cox proportional hazards models producing hazard ratios and log-rank tests
- **Multistate modelling** with transition-specific Cox PH analysis, Monte Carlo simulation for state occupation probabilities with confidence interval bands, and clinical pathway enumeration
- **Temporal and multi-site generalizability** (v0.3.0) for validating phenotypes across time windows or sites/centers (cutoff, sliding/expanding windows, leave-one-site-out), with apply-only or refit-and-match modes, calibration metrics (Brier, ECE), drift detection (PSI, KS, chi-square), and per-phenotype OR/HR concordance with FDR-corrected delta tests
- **Optional Streamlit dashboard** (v0.3.0) for interactive exploration of saved results: `phenocluster dashboard <results_dir>`
- **Comprehensive output** including an interactive HTML report (toggleable via `generate_html_report` or `--no-html-report`), forest plots with confidence intervals, Kaplan-Meier and Nelson-Aalen curves, heatmaps, and JSON/CSV data exports
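
To make the model selection criteria concrete, the sketch below computes AIC, BIC, CAIC, and SABIC from a fitted model's log-likelihood using their standard textbook definitions, then picks the cluster count with the lowest value among candidates that satisfy a minimum cluster size. It is an illustration only, not PhenoCluster's or StepMix's implementation: the function names, the `min_cluster_fraction` parameter, and the candidate values are hypothetical, and ICL is omitted because its entropy term is defined under more than one convention.

```python
import math


def information_criteria(log_likelihood: float, n_params: int, n_samples: int) -> dict[str, float]:
    """Standard penalised-likelihood criteria (lower is better)."""
    deviance = -2.0 * log_likelihood
    log_n = math.log(n_samples)
    return {
        "AIC": deviance + 2 * n_params,
        "BIC": deviance + n_params * log_n,
        "CAIC": deviance + n_params * (log_n + 1),
        "SABIC": deviance + n_params * math.log((n_samples + 2) / 24),
    }


def select_k(candidates: list[dict], criterion: str = "BIC",
             min_cluster_fraction: float = 0.05) -> int:
    """Return the k with the lowest criterion among size-eligible candidates.

    Each candidate is a dict with the hypothetical keys
    k, log_likelihood, n_params, n_samples, smallest_cluster_fraction.
    """
    eligible = [c for c in candidates
                if c["smallest_cluster_fraction"] >= min_cluster_fraction]
    if not eligible:
        raise ValueError("no candidate satisfies the cluster-size constraint")

    def score(c: dict) -> float:
        return information_criteria(
            c["log_likelihood"], c["n_params"], c["n_samples"])[criterion]

    return min(eligible, key=score)["k"]


# Made-up fit summaries for three candidate cluster counts.
candidates = [
    {"k": 2, "log_likelihood": -1200.0, "n_params": 9, "n_samples": 500,
     "smallest_cluster_fraction": 0.40},
    {"k": 3, "log_likelihood": -1150.0, "n_params": 14, "n_samples": 500,
     "smallest_cluster_fraction": 0.20},
    {"k": 4, "log_likelihood": -1140.0, "n_params": 19, "n_samples": 500,
     "smallest_cluster_fraction": 0.02},  # fails the size constraint
]
best_k = select_k(candidates, criterion="BIC")
```

Applying the size constraint before ranking keeps a solution with a tiny, hard-to-interpret cluster from winning on fit alone.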

## Installation

> **Requires Python >= 3.11**

```bash
pip install phenocluster
```

To enable the optional interactive dashboard:

```bash
pip install 'phenocluster[dashboard]'
```

## Quick start

### 1. Generate a configuration file

```bash
phenocluster create-config -p complete -o config.yaml
```

### 2. Edit the configuration

Open `config.yaml` and fill in your dataset-specific parameters:

```yaml
global:
  project_name: "My Study"
  output_dir: "results"
  random_state: 42

data:
  continuous_columns:
    - age
    - bmi
    - lab_value_1
  categorical_columns:
    - sex
    - smoking_status
    - disease_stage
  split:
    test_size: 0.2

outcome:
  enabled: true
  outcome_columns:
    - mortality_30d
    - readmission_30d

survival:
  enabled: true
  targets:
    - name: "overall_survival"
      time_column: "time_to_death"
      event_column: "death_indicator"
```
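
Before running the pipeline, it can help to confirm that every column named in the configuration exists in the dataset. The sketch below does this check by hand on a config held as a plain Python dict that mirrors the YAML keys shown above. It is a hypothetical illustration and not the implementation of `phenocluster validate-config`; the helper names `required_columns` and `missing_columns` are made up for this example.

```python
import csv
import io
from typing import TextIO


def required_columns(config: dict) -> list[str]:
    """Collect the column names the config refers to, in order, without duplicates."""
    data = config.get("data", {})
    columns = list(data.get("continuous_columns", []))
    columns += data.get("categorical_columns", [])

    outcome = config.get("outcome", {})
    if outcome.get("enabled"):
        columns += outcome.get("outcome_columns", [])

    survival = config.get("survival", {})
    if survival.get("enabled"):
        for target in survival.get("targets", []):
            columns += [target["time_column"], target["event_column"]]

    return list(dict.fromkeys(columns))  # de-duplicate, keep first occurrence


def missing_columns(config: dict, csv_stream: TextIO) -> list[str]:
    """Return configured columns that are absent from the CSV header row."""
    header = set(next(csv.reader(csv_stream)))
    return [name for name in required_columns(config) if name not in header]


# Same structure as the YAML example, written as a dict.
config = {
    "data": {
        "continuous_columns": ["age", "bmi", "lab_value_1"],
        "categorical_columns": ["sex", "smoking_status", "disease_stage"],
    },
    "outcome": {"enabled": True, "outcome_columns": ["mortality_30d", "readmission_30d"]},
    "survival": {
        "enabled": True,
        "targets": [{"name": "overall_survival",
                     "time_column": "time_to_death",
                     "event_column": "death_indicator"}],
    },
}

# A header that lacks two of the configured columns.
header_line = ("age,lab_value_1,sex,smoking_status,disease_stage,"
               "mortality_30d,readmission_30d,time_to_death\n")
missing = missing_columns(config, io.StringIO(header_line))
```

With a real file, pass an open file handle (for example `open("data.csv", newline="")`) instead of the `io.StringIO` stand-in.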

### 3. Run the pipeline

```bash
phenocluster run -d data.csv -c config.yaml
```

### 4. Inspect results

Results are written to the output directory (default: `results/`):

| File | Description |
|------|-------------|
| `analysis_report.html` | Comprehensive HTML report (skip with `generate_html_report: false` or `--no-html-report`) |
| `cluster_statistics.json` | Phenotype sizes, feature distributions, and classification quality |
| `outcome_results.json` | Odds ratios with confidence intervals and p-values |
| `survival_results.json` | Kaplan-Meier estimates and Cox PH hazard ratios |
| `multistate_results.json` | Transition-specific hazard ratios, pathways, and state occupation |
| `data/model_fit_metrics.csv` | Information criteria, entropy, and average posterior probabilities |
| `data/phenotypes_data.csv` | Original data augmented with phenotype assignments |
| `data/posterior_probabilities.csv` | Posterior class membership probabilities |
| `results/model_selection_summary.json` | Model selection comparison table and best model info |
| `results/feature_importance.json` | Feature characterisation per phenotype |
| `results/validation_report.json` | Internal validation metrics (train/test comparison) |
| `results/stability_results.json` | Consensus clustering stability metrics |
| `results/split_info.json` | Train/test split details |
| `results/external_validation_results.json` | External validation results (when enabled) |
| `results/temporal_validation_results.json` | Temporal generalizability results (when enabled, v0.3.0) |
| `results/multisite_validation_results.json` | Multi-site (LOGO / holdout) generalizability results (when enabled, v0.3.0) |
| `results/external_cohorts_results.json` | External-CSV generalizability results (when enabled, v0.3.0) |
| `results/generalizability_summary.json` | Aggregate ARI / PSI summary across cohorts plus `training_scope` flag (v0.3.0) |
| `data/generalizability/` | Per-cohort `cluster_distribution_<label>.csv` and `drift_<label>.csv` (v0.3.0) |
| `phenocluster.log` | Pipeline execution log |
| `artifacts/` | Cached intermediate results for incremental re-runs |
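
Because several of these files appear only when the corresponding analysis is enabled, downstream scripts should not assume they all exist. The following sketch loads whichever of the top-level JSON files from the table are present. The helper name `load_available_results` is hypothetical, and the code makes no assumption about the keys inside each file; consult the documentation for their schema.

```python
import json
from pathlib import Path

# Top-level JSON outputs named in the table above.
TOP_LEVEL_JSON = (
    "cluster_statistics.json",
    "outcome_results.json",
    "survival_results.json",
    "multistate_results.json",
)


def load_available_results(output_dir: str = "results") -> dict[str, object]:
    """Load each listed JSON file that exists, keyed by its file stem."""
    loaded = {}
    for name in TOP_LEVEL_JSON:
        path = Path(output_dir) / name
        if path.is_file():  # optional analyses may not have produced a file
            with path.open(encoding="utf-8") as handle:
                loaded[path.stem] = json.load(handle)
    return loaded
```

A call such as `load_available_results("results")` returns a dict whose keys show at a glance which analyses produced output, for example `cluster_statistics` alone for a descriptive run.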

### 5. Validate phenotypes across time or sites (v0.3.0)

Add a `generalizability` block to the config to enable temporal, multi-site, and/or external-CSV validation. The default `training_scope: per_split` fits a fresh preprocessor and StepMix model on the derivation rows of each in-CSV split and applies it to the validation rows. The pipeline's full-cohort model stays untouched for descriptive analyses.

```yaml
generalizability:
  enabled: true
  training_scope: per_split          # per_split (default) | global
  feature_selector_scope: auto       # auto (default) | global | per_split
  refit: true                        # refit-and-match Hungarian alignment
  min_validation_size_for_refit: 100
  temporal:
    time_column: admission_date
    scheme: cutoff                   # cutoff | fraction | sliding | expanding
    time_cutoff: "2020-12-31"
  multisite:
    site_column: center
    scheme: logo                     # logo | holdout | pairwise
    min_site_size: 30
  external_cohorts:                  # optional, one or more separate CSVs
    - { path: ./cohort_B.csv, label: hospital_X, kind: site }
    - { path: ./cohort_2024.csv, label: era_2024, kind: temporal }
  drift:        { enabled: true, n_bins: 10, top_k: 20 }
  calibration:  { enabled: true, n_bins: 10, strategy: quantile }
  outcome_concordance: { enabled: true, fdr_method: bh, alpha: 0.05 }
```

Each cohort yields a phenotype distribution, drift table, refit-and-match metrics (ARI / NMI / Hungarian-matched accuracy), calibration metrics, and per-phenotype OR/HR concordance with FDR-corrected delta tests. Cohort reports also expose a `fit_mode` field (`per_split` for in-CSV splits under the default scope; `global` for external CSVs and the legacy permissive path) and `derivation_only_ari` showing how the fresh derivation-only fit compares to the global model.
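
As an illustration of one of the drift statistics, the sketch below computes a Population Stability Index (PSI) for a single continuous feature, comparing a validation cohort against the derivation cohort. It uses the common definition, the sum over bins of `(current - reference) * ln(current / reference)` on bin proportions. The binning on derivation-cohort quantiles and the small floor `eps` for empty bins are assumptions made for this example; PhenoCluster's own binning and smoothing may differ.

```python
import math
from bisect import bisect_right


def psi(reference: list[float], current: list[float],
        n_bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index of `current` relative to `reference`."""
    if not reference or not current:
        raise ValueError("both samples must be non-empty")

    # Interior cut points at (approximate) quantiles of the reference sample.
    ordered = sorted(reference)
    edges = sorted({ordered[(i * len(ordered)) // n_bins] for i in range(1, n_bins)})

    def proportions(values: list[float]) -> list[float]:
        counts = [0] * (len(edges) + 1)
        for value in values:
            counts[bisect_right(edges, value)] += 1
        # Floor each proportion at eps so the logarithm is always defined.
        return [max(count / len(values), eps) for count in counts]

    p_ref = proportions(reference)
    p_cur = proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(p_ref, p_cur))


derivation = [float(v) for v in range(100)]
shifted = [v + 50.0 for v in derivation]   # validation cohort with a mean shift

psi_same = psi(derivation, derivation)     # identical distributions
psi_shifted = psi(derivation, shifted)     # clearly drifted feature
```

Each term of the sum is non-negative, so PSI is zero only when the binned distributions coincide and grows as mass moves between bins.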

### 6. Explore results interactively (v0.3.0)

```bash
pip install 'phenocluster[dashboard]'
phenocluster dashboard ./results/
```

By default, Streamlit launches at `http://127.0.0.1:8501` (configurable via `--host` and `--port`), with tabs for Overview, Phenotypes, Outcomes, Survival, Multistate, Generalizability, and a per-cohort Drift explorer.

## Pipeline overview

PhenoCluster executes the following stages in order:

1. **Data quality assessment.** Missingness patterns, correlations, variance, and MCAR testing.
2. **Train/test split.** Stratified splitting with configurable test size, performed before preprocessing to prevent data leakage.
3. **Preprocessing.** Imputation, outlier handling, categorical encoding, standardization, and feature selection -- fit on training data only, then applied to the test set.
4. **Model selection.** Cross-validated information criterion search over cluster counts (training set only).
5. **Full-cohort refit.** Once K is selected, the preprocessing and the LCA/LPA model are refitted on the entire cohort, and phenotypes are reordered by size (largest = Phenotype 0).
6. **Stability analysis.** Consensus clustering over subsampled runs.
7. **Internal validation.** Train/test log-likelihood comparison, cluster proportion stability, and outcome OR consistency.
8. **Outcome association.** Logistic regression for binary outcomes with FDR-corrected p-values (optional).
9. **Survival analysis.** Kaplan-Meier curves, Nelson-Aalen estimators, log-rank tests, and Cox PH hazard ratios (optional).
10. **Multistate modelling.** Transition-specific Cox PH models, transition hazard ratios, and Monte Carlo simulation (optional).
11. **Temporal / multi-site generalizability.** Re-evaluate the derivation phenotypes on later time windows, held-out sites, and external CSVs; report ARI / NMI / matched accuracy, calibration, drift, and OR/HR concordance (optional, v0.3.0).
12. **Report generation.** Interactive HTML report with all figures and tables.
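
The size-based reordering in stage 5 can be sketched as follows: raw cluster labels are mapped to new labels so that the largest cluster becomes Phenotype 0, the next largest Phenotype 1, and so on. This is a minimal illustration with a hypothetical function name, not the package's code. Breaking ties by the original label is an assumption made here to keep the result deterministic.

```python
from collections import Counter


def relabel_by_size(labels: list[int]) -> tuple[list[int], dict[int, int]]:
    """Relabel clusters so that label 0 is the largest cluster.

    Returns the relabelled assignments and the old-to-new label mapping.
    """
    counts = Counter(labels)
    # Sort by descending size; ties fall back to the original label (assumption).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    mapping = {old: new for new, old in enumerate(ordered)}
    return [mapping[label] for label in labels], mapping


raw_labels = [2, 2, 2, 0, 1, 1]            # cluster 2 is the largest
phenotypes, mapping = relabel_by_size(raw_labels)
```

Keeping the mapping alongside the relabelled assignments makes it possible to reorder posterior probability columns in the same way.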

## CLI reference

| Command | Description |
|---------|-------------|
| `phenocluster run -d DATA -c CONFIG [--force-rerun] [-v] [-q] [--html-report/--no-html-report]` | Run the full pipeline |
| `phenocluster create-config [-p PROFILE] [-o OUTPUT]` | Generate a config YAML from a profile template |
| `phenocluster validate-config -c CONFIG [-d DATA]` | Validate config structure; with `-d`, cross-check columns against the data |
| `phenocluster list-profiles` | List available configuration profile templates |
| `phenocluster show-profile NAME` | Print the resolved YAML for a profile with syntax highlighting |
| `phenocluster dashboard RESULTS_DIR [--port 8501] [--host 127.0.0.1] [--headless/--browser]` | Launch the optional Streamlit dashboard (requires `pip install 'phenocluster[dashboard]'`) |
| `phenocluster version` | Show version, repository link, and documentation link |
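
For scripted or batch runs, the `run` command can be driven from Python. The sketch below assembles the argument list using only the flags listed in the table above and hands it to `subprocess.run`. The helper names `build_run_command` and `run_pipeline` are hypothetical, and the example assumes the `phenocluster` executable is on `PATH`.

```python
import subprocess


def build_run_command(data: str, config: str, *,
                      force_rerun: bool = False,
                      html_report: bool = True) -> list[str]:
    """Assemble the argument list for `phenocluster run`."""
    command = ["phenocluster", "run", "-d", str(data), "-c", str(config)]
    if force_rerun:
        command.append("--force-rerun")
    if not html_report:
        command.append("--no-html-report")
    return command


def run_pipeline(data: str, config: str, **options: bool) -> None:
    """Run the pipeline and raise CalledProcessError on a non-zero exit code."""
    subprocess.run(build_run_command(data, config, **options), check=True)


command = build_run_command("data.csv", "config.yaml", html_report=False)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with paths that contain spaces.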

## Configuration profiles

Profiles set sensible defaults for common use-cases. Generate one with `phenocluster create-config -p <profile>`:

| Profile | Description | Inference | Stability | Multistate |
|---------|-------------|:---------:|:---------:|:----------:|
| `descriptive` | Phenotype discovery only, no statistical inference | off | on | off |
| `complete` | All analyses enabled (outcomes, survival, multistate) | on | on | on |
| `quick` | Fast iteration for development | on | off | off |

## Configuration reference

See the full [Configuration Reference](https://ettorerocchi.github.io/phenocluster/configuration.html) in the documentation.

## Documentation

Full documentation (statistical methods, configuration reference, output descriptions) is available at **[ettorerocchi.github.io/phenocluster](https://ettorerocchi.github.io/phenocluster)**.

## License

This project is licensed under the [MIT License](LICENSE).

## Citation

If you use **PhenoCluster** in your research, please cite it. A BibTeX entry will be available soon.

## Acknowledgment

This project relies on **StepMix**, a Python package for pseudo-likelihood estimation of generalized mixture models with external variables. We thank the authors for making their work openly available.

If you use this framework, please also cite:

> Morin, S., Legault, R., Laliberté, F., Bakk, Z., Giguère, C.-É., de la Sablonnière, R., & Lacourse, É. (2025). StepMix: A Python Package for Pseudo-Likelihood Estimation of Generalized Mixture Models with External Variables. Journal of Statistical Software, 113(8), 1-39. doi: [10.18637/jss.v113.i08](https://doi.org/10.18637/jss.v113.i08)
