Metadata-Version: 2.4
Name: carve-validate
Version: 1.0.0
Summary: Cluster Analysis with Resampling for Validation and Exploration: stability- and generalizability-based clustering validation.
Author-email: Kai Wycik <kai.wycik@columbia.edu>
License-Expression: MIT
Project-URL: Homepage, https://github.com/DataSlingers/CARVE
Project-URL: Repository, https://github.com/DataSlingers/CARVE
Project-URL: Issues, https://github.com/DataSlingers/CARVE/issues
Keywords: clustering,validation,stability,consensus,scRNA-seq,scikit-learn
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Requires-Python: <3.13,>=3.12
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy<2.1,>=2.0
Requires-Dist: scipy<1.17,>=1.16
Requires-Dist: scikit-learn<1.8,>=1.7.2
Requires-Dist: pandas<2.3,>=2.2
Requires-Dist: joblib<1.6,>=1.5
Requires-Dist: tqdm<5,>=4.67
Requires-Dist: numba<0.61,>=0.60
Requires-Dist: llvmlite<0.44,>=0.43
Requires-Dist: umap-learn>=0.5.6
Requires-Dist: pynndescent<0.6,>=0.5.10
Requires-Dist: matplotlib>=3.9.4
Provides-Extra: dev
Requires-Dist: ruff; extra == "dev"
Requires-Dist: pytest; extra == "dev"
Requires-Dist: pytest-cov; extra == "dev"
Provides-Extra: notebooks
Requires-Dist: jupyterlab<5,>=4.2; extra == "notebooks"
Requires-Dist: ipykernel<7,>=6.29; extra == "notebooks"
Requires-Dist: ipywidgets>=8.1.7; extra == "notebooks"
Requires-Dist: plotly>=6.3.0; extra == "notebooks"
Requires-Dist: rpy2>=3.6.3; extra == "notebooks"
Requires-Dist: pyreadr>=0.5.3; extra == "notebooks"
Requires-Dist: scanpy>=1.11.4; extra == "notebooks"
Requires-Dist: scikit-misc>=0.5.1; extra == "notebooks"
Requires-Dist: anywidget>=0.9.18; extra == "notebooks"
Requires-Dist: igraph>=1.0.0; extra == "notebooks"
Requires-Dist: leidenalg>=0.11.0; extra == "notebooks"
Requires-Dist: pyteomics>=4.7.5; extra == "notebooks"
Requires-Dist: lxml>=6.0.2; extra == "notebooks"
Requires-Dist: scvi>=0.6.8; extra == "notebooks"
Requires-Dist: cmap>=0.7.0; extra == "notebooks"
Requires-Dist: glasbey>=0.3.0; extra == "notebooks"
Dynamic: license-file

[![CI](https://github.com/DataSlingers/CARVE/actions/workflows/ci.yml/badge.svg)](https://github.com/DataSlingers/CARVE/actions/workflows/ci.yml)
[![R CMD check](https://github.com/DataSlingers/CARVE/actions/workflows/r-ci.yml/badge.svg)](https://github.com/DataSlingers/CARVE/actions/workflows/r-ci.yml)
![Python 3.12](https://img.shields.io/badge/python-3.12-blue.svg)
![Version](https://img.shields.io/badge/version-1.0.0-orange.svg)

# CARVE

**Cluster Analysis with Resampling for Validation and Exploration**

Choosing the number of clusters is hard, especially for high-dimensional biological data where standard internal clustering validation indices (CVIs) are often unreliable. CARVE measures clustering robustness through two resampling-based concepts: **stability** (reproducibility of cluster assignments under data subsampling) and **generalizability** (agreement between held-out cluster labels and predictions from a classifier trained on a subsample of the data). CARVE reports global, cluster-level, and sample-level diagnostics with visualizations, all through a scikit-learn-compatible API.

<p align="center">
  <img src="https://raw.githubusercontent.com/DataSlingers/CARVE/main/carve_overview.png" width="700" alt="CARVE overview">
</p>
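
The two resampling concepts described above can be illustrated with plain scikit-learn. The following is a minimal sketch, not CARVE's implementation: the helper names `stability_ari` and `generalizability_ari` are hypothetical, `KMeans` and `KNeighborsClassifier` are stand-ins for whatever clusterer and classifier CARVE uses internally, and clustering the held-out points separately is an assumption about how held-out labels are obtained.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.neighbors import KNeighborsClassifier


def stability_ari(X, k, subsample_ratio=0.7, seed=0):
    """ARI between two subsample clusterings, on the points they share."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    m = int(subsample_ratio * n)
    idx_a = rng.choice(n, size=m, replace=False)
    idx_b = rng.choice(n, size=m, replace=False)
    labels_a = np.full(n, -1)
    labels_b = np.full(n, -1)
    km_a = KMeans(n_clusters=k, n_init=10, random_state=seed)
    km_b = KMeans(n_clusters=k, n_init=10, random_state=seed + 1)
    labels_a[idx_a] = km_a.fit_predict(X[idx_a])
    labels_b[idx_b] = km_b.fit_predict(X[idx_b])
    # Compare assignments only on points present in both subsamples.
    shared = np.intersect1d(idx_a, idx_b)
    return adjusted_rand_score(labels_a[shared], labels_b[shared])


def generalizability_ari(X, k, subsample_ratio=0.7, seed=0):
    """ARI between held-out cluster labels and classifier predictions."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    perm = rng.permutation(n)
    m = int(subsample_ratio * n)
    train, held_out = perm[:m], perm[m:]
    # Assumption: training and held-out points are clustered separately.
    km_train = KMeans(n_clusters=k, n_init=10, random_state=seed)
    km_held = KMeans(n_clusters=k, n_init=10, random_state=seed + 1)
    train_labels = km_train.fit_predict(X[train])
    held_out_labels = km_held.fit_predict(X[held_out])
    # A classifier trained on the subsample predicts held-out labels.
    clf = KNeighborsClassifier(n_neighbors=5).fit(X[train], train_labels)
    predicted = clf.predict(X[held_out])
    # ARI is invariant to label permutations, so no label matching is needed.
    return adjusted_rand_score(held_out_labels, predicted)


X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)
stab = stability_ari(X, k=3)
gen = generalizability_ari(X, k=3)
print(f"stability ARI={stab:.2f}, generalizability ARI={gen:.2f}")
```

In practice both quantities would be averaged over many resamples and compared across candidate values of *k*; a single resample is shown here only to keep the sketch short.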

## Features

- Scikit-learn-compatible API: `CARVE` extends `BaseEstimator` with a `fit` / `get_labels` / `get_k` workflow
- Stability (adjusted Rand index, ARI, on subsample overlap) and generalizability (ARI on held-out predictions) metrics
- Diagnostics at the global, per-cluster, and per-sample level
- Metrics: stability and generalizability ARIs, consensus PAC (proportion of ambiguous clustering), Gini, cross-entropy, and predictive accuracy
- Selection rules: `max`, `1se` (one-standard-error), and `quantile`
- Custom spectral clustering with self-tuning affinity (based on Zelnik-Manor & Perona, *Self-Tuning Spectral Clustering*, NeurIPS 2004)
- Plots: metric-over-*k* curves, consensus heatmaps, box plots, violin plots, and scatter plots
- Parallel resampling via joblib (`n_jobs`)
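
For readers unfamiliar with the self-tuning affinity mentioned in the list above, here is a minimal sketch of the local scaling idea from Zelnik-Manor & Perona: each point gets its own scale, the distance to its K-th nearest neighbor, and the affinity is `exp(-d_ij^2 / (sigma_i * sigma_j))`. The function name `self_tuning_affinity` is hypothetical, and CARVE's own spectral clustering may differ in details such as the neighbor count or the distance metric.

```python
import numpy as np
from scipy.spatial.distance import cdist


def self_tuning_affinity(X, n_neighbors=7):
    """Locally scaled affinity: A_ij = exp(-d_ij^2 / (sigma_i * sigma_j))."""
    dist = cdist(X, X)  # pairwise Euclidean distances
    # sigma_i is the distance from point i to its n_neighbors-th nearest
    # neighbor. Column 0 of each sorted row is the point itself (distance 0).
    # Note: duplicate points can make sigma_i zero; this sketch ignores that.
    sigma = np.sort(dist, axis=1)[:, n_neighbors]
    affinity = np.exp(-(dist**2) / np.outer(sigma, sigma))
    np.fill_diagonal(affinity, 0.0)  # no self-affinity
    return affinity


rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
A = self_tuning_affinity(X_demo)
print(A.shape)
```

The resulting matrix can be passed to any spectral method that accepts a precomputed affinity, for example scikit-learn's `SpectralClustering(affinity="precomputed")`.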

## Installation

CARVE requires **Python 3.12**.

```bash
pip install carve-validate
```

The distribution is named `carve-validate`; the import name is `carve`:

```python
from carve import CARVE
```

### From source (development)

```bash
git clone https://github.com/DataSlingers/CARVE.git
cd CARVE
pip install -e ".[dev]"        # linting + testing
pip install -e ".[notebooks]"  # Jupyter, Scanpy, scVI, etc.
```

## Quick Start

```python
from carve import CARVE
from sklearn.datasets import make_blobs

# Generate synthetic data
X, y_true = make_blobs(n_samples=500, n_features=10, centers=5, random_state=42)

# Fit CARVE
carve = CARVE(n_clusters=10, n_resamples=120, subsample_ratio=0.7, n_jobs=4)
carve.fit(X)

# Select best k and retrieve labels
k = carve.get_k(measure="generalizability", rule="1se")
labels = carve.get_labels(measure="generalizability", rule="1se")
print(f"Selected k={k}")
```
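
To show the idea behind `rule="1se"`, here is a minimal sketch of a one-standard-error rule on hand-made scores. It assumes the rule picks the smallest *k* whose mean score is within one standard error of the best mean score; CARVE's exact definition (direction of parsimony, tie-breaking, how the standard error is estimated) may differ. The helper `select_k_1se` and the score values are hypothetical.

```python
import numpy as np


def select_k_1se(ks, scores):
    """Pick k by a one-standard-error rule.

    scores has shape (len(ks), n_resamples); higher is better.
    """
    scores = np.asarray(scores, dtype=float)
    means = scores.mean(axis=1)
    # Standard error of the mean across resamples, per k.
    ses = scores.std(axis=1, ddof=1) / np.sqrt(scores.shape[1])
    best = np.argmax(means)
    threshold = means[best] - ses[best]
    # Assumption: prefer the smallest k whose mean reaches the threshold.
    eligible = [k for k, m in zip(ks, means) if m >= threshold]
    return min(eligible)


ks = [2, 3, 4, 5]
scores = [
    [0.60, 0.62, 0.58, 0.60],  # k=2
    [0.90, 0.86, 0.94, 0.90],  # k=3
    [0.91, 0.89, 0.95, 0.89],  # k=4 has the highest mean
    [0.70, 0.75, 0.72, 0.71],  # k=5
]
k_max = ks[int(np.argmax(np.mean(scores, axis=1)))]
k_1se = select_k_1se(ks, scores)
print(f"max rule: k={k_max}, 1se rule: k={k_1se}")
```

In this toy example the highest mean belongs to k=4, but k=3 falls within one standard error of it, so the 1se rule prefers the simpler solution.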

See [`notebooks/Tutorial.ipynb`](notebooks/Tutorial.ipynb) for a walkthrough, and [`notebooks/case_studies/`](notebooks/case_studies/) for real-world analyses on scRNA-seq and mass cytometry datasets.

## Visualization

```python
# Metric curves across k
carve.plot_metric_over_n_clusters(measure="stability", rule="1se")

# Consensus heatmap for the selected solution
carve.plot_consensus_matrix(measure="generalizability", rule="1se")

# Per-cluster violin plot of Gini scores
carve.plot_cluster_violin(source="gini", measure="generalizability", rule="1se")

# 2D scatter with score-encoded marker size and opacity
carve.plot_cluster_scatter(source="gini", measure="generalizability", rule="1se")
```

All plotting methods return a matplotlib `Axes` object and accept `save` and `dpi` parameters for export.
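
As background for the consensus heatmap, here is a minimal sketch of how a consensus matrix and its PAC (proportion of ambiguous clustering) score are commonly computed. This is not CARVE's code: the function names are hypothetical, `KMeans` is a stand-in clusterer, and the 0.1 and 0.9 thresholds are the values often used in the consensus clustering literature rather than CARVE's confirmed defaults.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs


def consensus_matrix(X, k, n_resamples=20, subsample_ratio=0.7, seed=0):
    """Fraction of co-sampled resamples in which each pair shares a cluster."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    m = int(subsample_ratio * n)
    together = np.zeros((n, n))
    sampled = np.zeros((n, n))
    for r in range(n_resamples):
        idx = rng.choice(n, size=m, replace=False)
        km = KMeans(n_clusters=k, n_init=10, random_state=r)
        labels = km.fit_predict(X[idx])
        same = (labels[:, None] == labels[None, :]).astype(float)
        together[np.ix_(idx, idx)] += same
        sampled[np.ix_(idx, idx)] += 1.0
    # Pairs that were never sampled together are left at 0.
    out = np.zeros_like(together)
    return np.divide(together, sampled, out=out, where=sampled > 0)


def pac(consensus, lower=0.1, upper=0.9):
    """Share of distinct pairs whose consensus is strictly between bounds."""
    values = consensus[np.triu_indices_from(consensus, k=1)]
    return float(np.mean((values > lower) & (values < upper)))


X_demo, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)
C = consensus_matrix(X_demo, k=3)
pac_score = pac(C)
print(f"PAC={pac_score:.3f}")
```

A low PAC means most pairs are either almost always or almost never clustered together, which shows up as crisp blocks in a consensus heatmap.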

## Citation

If you use CARVE in your research, please cite:

> Wycik, K. R., Tang, T. M., Zikry, T. M., & Allen, G. I. (2026). *CARVE: Cluster
> Analysis with Resampling for Validation and Exploration.* Zenodo.
> https://doi.org/10.5281/zenodo.20448965

```bibtex
@software{wycik2026carve,
  author    = {Wycik, Kai R. and Tang, Tiffany M. and Zikry, Tarek M. and Allen, Genevera I.},
  title     = {{CARVE}: Cluster Analysis with Resampling for Validation and Exploration},
  year      = {2026},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.20448965},
  url       = {https://doi.org/10.5281/zenodo.20448965}
}
```

## Authors

- [Kai R. Wycik](mailto:kai.wycik@columbia.edu) — Columbia University
- Tiffany M. Tang — University of Notre Dame
- Tarek M. Zikry — UNC Chapel Hill
- Genevera I. Allen — Columbia University

## Contributing

Contributions are welcome! Please open an [issue](https://github.com/DataSlingers/CARVE/issues) or submit a pull request.

This project uses [Ruff](https://docs.astral.sh/ruff/) for linting and formatting, and [pytest](https://docs.pytest.org/) for testing:

```bash
ruff check src/       # lint
ruff format src/      # format
pytest -v             # run tests
```
