Metadata-Version: 2.4
Name: doubletfinder-py
Version: 1.1.0
Summary: Python port of R DoubletFinder for scRNA-seq doublet detection
Author: dam2452
License-Expression: MIT
Project-URL: Repository, https://github.com/dam2452/pydoubletfinder
Project-URL: Issues, https://github.com/dam2452/pydoubletfinder/issues
Keywords: doublet-detection,scrna-seq,single-cell,bioinformatics,seurat,doubletfinder
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=2.0
Requires-Dist: pandas>=2.2
Requires-Dist: scanpy>=1.10
Requires-Dist: scipy>=1.14
Requires-Dist: scikit-learn>=1.5
Requires-Dist: statsmodels>=0.14
Provides-Extra: loess
Requires-Dist: scikit-misc>=0.5; extra == "loess"
Provides-Extra: dev
Requires-Dist: pytest>=8; extra == "dev"
Requires-Dist: scikit-misc>=0.5; extra == "dev"
Dynamic: license-file

<h1 align="center">pyDoubletFinder</h1>

<p align="center">
  <strong>Faithful Python port of the R DoubletFinder algorithm for scRNA-seq doublet detection</strong>
</p>

<p align="center">
  <a href="https://pypi.org/project/doubletfinder-py/"><img src="https://img.shields.io/pypi/v/doubletfinder-py.svg" alt="PyPI"/></a>
  <a href="LICENSE"><img src="https://img.shields.io/badge/license-MIT-green.svg" alt="License"/></a>
  <img src="https://img.shields.io/badge/python-3.10%2B-blue.svg" alt="Python 3.10+"/>
</p>

---

**pyDoubletFinder** is a line-by-line Python port of the R [DoubletFinder](https://github.com/chris-mcginnis-ucsf/DoubletFinder) algorithm, designed as a drop-in replacement for projects using [scanpy](https://scanpy.readthedocs.io/) / [AnnData](https://anndata.readthedocs.io/) without requiring an R environment. It replicates the Seurat preprocessing pipeline (LogNormalize, VST, ScaleData), then computes the full Euclidean distance matrix and the pANN scores.

## Features

- **Line-by-line port** - replicates the exact R DoubletFinder algorithm
- **Native VST** - reimplementation of Seurat v3's `FindVariableFeatures(method="vst")` on raw counts
- **R-matching loess** - uses `scikit-misc` (degree=2) to match R's `stats::loess` exactly
- **Full preprocessing pipeline** - LogNormalize, VST, ScaleData, PCA, distance matrix, pANN
- **94.3% classification agreement** with R on matched data (4926 cells)
- **99.5% HVG overlap** confirms faithful VST reproduction
- **Parameter sweep** - `param_sweep_and_summarize()` for automatic pK selection via bimodality coefficient
- **SCTransform approximation** - experimental support via Pearson residuals
- **scanpy / AnnData native** - no R dependencies required

## Installation

```bash
pip install doubletfinder-py
```

For exact R-matching loess (recommended):

```bash
pip install "doubletfinder-py[loess]"
```

This installs `scikit-misc`, which provides `skmisc.loess`, a degree-2 loess matching R's `stats::loess`. Without it, the library falls back to the `lowess` smoother from `statsmodels` (degree-1, local linear), which is a close but not identical approximation.

## Quick Start

```python
import scanpy as sc
from pydoubletfinder import doublet_finder, model_homotypic

adata = sc.read_10x_h5("sample.h5")
adata.var_names_make_unique()
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
adata.layers["counts"] = adata.X.copy()

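pK   = 0.09                        # neighbourhood proportion; a reasonable starting point
nExp = int(0.075 * adata.n_obs)    # assumes a 7.5% expected doublet rate

# Optional: adjust for homotypic doublets.
# Requires cell-type labels in adata.obs["cell_type"], which a freshly loaded
# 10x file does not contain; add them (e.g. from clustering) before this step.
if "cell_type" in adata.obs:
    homo_prop = model_homotypic(adata.obs["cell_type"].values)
    nExp = int(nExp * (1 - homo_prop))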

adata = doublet_finder(adata, PCs=10, pK=pK, nExp=nExp)

col_class = f"DF.classifications_0.25_{pK}_{nExp}"
print(adata.obs[col_class].value_counts())
```

For pK tuning, annotations, reuse and sparse data see **[docs/usage.md](docs/usage.md)**.

## Gallery

<table>
  <tr>
    <td align="center"><b>pANN Distribution</b></td>
    <td align="center"><b>PC Selection</b></td>
    <td align="center"><b>Multi-Sample Batch</b></td>
  </tr>
  <tr>
    <td><img src="https://raw.githubusercontent.com/dam2452/pydoubletfinder/main/example_output/4_pann_distribution.svg" width="300"/></td>
    <td><img src="https://raw.githubusercontent.com/dam2452/pydoubletfinder/main/example_output/6_pc_selection.svg" width="300"/></td>
    <td><img src="https://raw.githubusercontent.com/dam2452/pydoubletfinder/main/example_output/10_multi_sample.svg" width="300"/></td>
  </tr>
</table>

## Examples

Ten runnable scripts cover all features; see **[docs/examples.md](docs/examples.md)** for the full list with previews.

```bash
cd examples && python generate_all.py
```

### Automatic pK selection

```python
from pydoubletfinder import param_sweep_and_summarize

sweep_df = param_sweep_and_summarize(adata, PCs=10)
best_pK  = float(sweep_df.loc[sweep_df["BCreal"].idxmax(), "pK"])
```

Note: the parameter sweep is computationally expensive. For most datasets, a fixed `pK=0.09` is a reasonable starting point.

## Benchmark vs R

Tested on snRNA-seq mouse EAM data (sample42, D0, 4926 cells) using identical doublet pairs (same random seed exported from R):

| Metric | Value |
|---|---|
| Classification agreement | **94.32%** |
| pANN Pearson r | 0.8236 |
| pANN Spearman ρ | 0.8477 |
| HVG overlap (VST) | 1990 / 2000 (99.5%) |
| Cohen's κ | 0.5899 |

### Where the ~6% discrepancy comes from

| Source | Impact | Details |
|---|---|---|
| PCA solver | ~5.5% | R uses `irlba` (Seurat), Python uses ARPACK (`scanpy.tl.pca`) |
| HVG selection (VST) | ~0.5% | 10 different genes out of 2000 — negligible |

The ~6% discrepancy (280 of 4926 cells) is inherent to the port: R's `irlba` and Python's SVD solvers follow different numerical paths, so the principal components, and therefore the pANN scores, differ slightly. All 280 cells classified differently (140 in each direction of the confusion matrix) have pANN values within ~0.01 of the decision threshold, where a small shift in score is enough to flip the label. **No solver swap can reliably fix this without reimplementing irlba line-for-line in Python.**

## API

### `doublet_finder(adata, PCs, pK, nExp, pN=0.25, ...)`

Core doublet prediction function. Adds two columns to `adata.obs`:
- `pANN_{pN}_{pK}_{nExp}` — doublet score (proportion of artificial nearest neighbours)
- `DF.classifications_{pN}_{pK}_{nExp}` — `"Singlet"` or `"Doublet"`

**Parameters:**

| Parameter | Type | Default | Description |
|---|---|---|---|
| `adata` | `AnnData` | — | Input object. Raw counts in `adata.layers["counts"]`, `adata.raw.X`, or `adata.X`. |
| `PCs` | `int` or `list[int]` | — | Number of PCs or list of 1-based PC indices. |
| `pK` | `float` | — | Neighbourhood proportion for pANN computation. |
| `nExp` | `int` | — | Expected number of doublets (classification threshold). |
| `pN` | `float` | `0.25` | Proportion of artificial doublets to generate. |
| `reuse_pANN` | `str` or `None` | `None` | Existing `adata.obs` column with precomputed pANN — skips heavy computation. |
| `sct` | `bool` | `False` | Use SCTransform-like normalisation (experimental). |
| `annotations` | `array` or `None` | `None` | Cell-type labels. Adds `DF.doublet.contributors_*` columns. |
| `scale_factor` | `float` | `1e4` | Target sum for normalisation. |
| `n_top_genes` | `int` | `2000` | Number of HVGs for VST. |
| `loess_span` | `float` | `0.3` | Span for loess in VST. |
| `scale_max` | `float` | `10` | Clip value for ScaleData. |
| `random_state` | `int` | `0` | PCA seed. |
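
How `pK` and `nExp` act on the pANN score can be illustrated with a minimal NumPy sketch. It assumes the R DoubletFinder convention that the neighbourhood size is `k = round(n_total * pK)`, where `n_total` counts real plus artificial cells, and that the `nExp` real cells with the highest pANN are labelled doublets. The helper names `pann_scores` and `classify_by_pann` are hypothetical and not part of this package's API; the actual implementation may differ in details such as tie handling.

```python
import numpy as np


def pann_scores(pca_coords, n_real, pK):
    """Proportion of artificial nearest neighbours for each real cell.

    Assumes rows [0, n_real) are real cells and the remaining rows are
    artificial doublets, all embedded in the same PCA space.
    """
    n_total = pca_coords.shape[0]
    k = max(1, int(round(n_total * pK)))  # neighbourhood size (assumed convention)

    # Euclidean distances from every real cell to every cell (real + artificial).
    diff = pca_coords[:n_real, None, :] - pca_coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))

    # A cell is not its own neighbour.
    dist[np.arange(n_real), np.arange(n_real)] = np.inf

    nearest = np.argsort(dist, axis=1)[:, :k]
    # Indices >= n_real point at artificial doublets.
    return (nearest >= n_real).mean(axis=1)


def classify_by_pann(pann, n_exp):
    """Label the n_exp highest-scoring cells as doublets."""
    labels = np.full(pann.shape[0], "Singlet", dtype=object)
    top = np.argsort(-pann, kind="stable")[:n_exp]
    labels[top] = "Doublet"
    return labels


# Toy 1-D embedding: three real cells near 0, one real cell near the
# two artificial doublets at ~10.
coords = np.array([[0.0], [0.1], [0.2], [10.0], [10.1], [10.2]])
pann = pann_scores(coords, n_real=4, pK=0.34)
labels = classify_by_pann(pann, n_exp=1)
```

In this toy case only the real cell sitting next to the artificial doublets receives a high score, so it is the one selected when `n_exp=1`.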

### `model_homotypic(annotations)`

Estimates the proportion of homotypic doublets from cell type annotations. Returns `sum(p_i^2)` where `p_i` is the proportion of cell type `i`. Replicates R's `modelHomotypic`.
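
As a worked illustration of the `sum(p_i^2)` formula, the sketch below recomputes the homotypic proportion with plain NumPy and applies the same `nExp` adjustment used in the Quick Start. The function name `homotypic_proportion` and the toy labels are hypothetical; in practice you would call `model_homotypic` directly.

```python
import numpy as np


def homotypic_proportion(annotations):
    """Sum of squared cell-type proportions, i.e. sum(p_i^2)."""
    _, counts = np.unique(np.asarray(annotations), return_counts=True)
    props = counts / counts.sum()
    return float((props ** 2).sum())


# Toy annotation vector: 50% T cells, 30% B cells, 20% NK cells.
cell_types = ["T"] * 50 + ["B"] * 30 + ["NK"] * 20
homo_prop = homotypic_proportion(cell_types)  # 0.5^2 + 0.3^2 + 0.2^2

# Shrink an expected doublet count (here 75, chosen for illustration)
# by the homotypic fraction, truncating as in the Quick Start.
n_exp = 75
n_exp_adj = int(n_exp * (1 - homo_prop))
```

The more a sample is dominated by one cell type, the closer `sum(p_i^2)` gets to 1 and the fewer heterotypic doublets remain to be detected.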

### `param_sweep_and_summarize(adata, PCs, ...)`

Runs a pN–pK parameter sweep and returns a `DataFrame` with columns `pN`, `pK`, `BCreal` (bimodality coefficient). Select the `pK` that maximises `BCreal`.
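
For intuition about `BCreal`, here is a minimal sketch of the standard sample bimodality coefficient, `(skewness^2 + 1) / (excess_kurtosis + 3(n-1)^2 / ((n-2)(n-3)))`, written with NumPy only. That the package uses exactly this finite-sample correction, and how it feeds pANN values into it, are assumptions; the function name is hypothetical.

```python
import numpy as np


def bimodality_coefficient(x):
    """Sample bimodality coefficient with bias-corrected skewness and kurtosis."""
    x = np.asarray(x, dtype=float)
    n = x.size
    d = x - x.mean()
    m2, m3, m4 = np.mean(d ** 2), np.mean(d ** 3), np.mean(d ** 4)

    # Bias-corrected sample skewness and excess kurtosis.
    skew = np.sqrt(n * (n - 1)) / (n - 2) * m3 / m2 ** 1.5
    kurt = (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * (m4 / m2 ** 2 - 3) + 6)

    return (skew ** 2 + 1) / (kurt + 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))


rng = np.random.default_rng(0)
unimodal = rng.normal(0.0, 1.0, 5000)
bimodal = np.concatenate([rng.normal(-3.0, 1.0, 2500), rng.normal(3.0, 1.0, 2500)])

bc_unimodal = bimodality_coefficient(unimodal)
bc_bimodal = bimodality_coefficient(bimodal)
```

Values above 5/9 (the coefficient of a uniform distribution) are conventionally read as evidence of bimodality, which is why a higher `BCreal` indicates a `pK` that separates singlets from doublets more cleanly.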

## Differences from R DoubletFinder

| Aspect | R | Python |
|---|---|---|
| Normalisation | `NormalizeData` (Seurat) | `sc.pp.normalize_total` + `sc.pp.log1p` |
| HVG selection | `FindVariableFeatures(method="vst")` | Native reimplementation (`_seurat_vst`) |
| Scaling | `ScaleData` (Seurat) | `sc.pp.scale` |
| PCA | `irlba` via `RunPCA` | ARPACK via `sc.tl.pca` |
| Distance matrix | `fields::rdist` | `scipy.spatial.distance.cdist` |
| Loess (VST) | `stats::loess` (degree=2) | `skmisc.loess` (degree=2) or `statsmodels.lowess` fallback |
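
To make the first and third rows concrete, the following NumPy sketch shows the arithmetic behind LogNormalize and ScaleData on a dense cells-by-genes matrix. It is an illustration under assumptions, not the package's code: the function names are hypothetical, the standard deviation uses `ddof=1`, and clipping is applied from above only, which is our reading of Seurat's `scale.max`.

```python
import numpy as np


def log_normalize(counts, scale_factor=1e4):
    """log1p(count / cell_total * scale_factor), computed per cell (row)."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    return np.log1p(counts / totals * scale_factor)


def scale_data(x, scale_max=10.0):
    """Z-score each gene (column), then clip large values at scale_max."""
    mean = x.mean(axis=0)
    sd = x.std(axis=0, ddof=1)
    sd[sd == 0] = 1.0  # leave constant genes at zero instead of dividing by zero
    return np.minimum((x - mean) / sd, scale_max)


# Toy matrix: 3 cells x 3 genes of raw counts.
counts = np.array([[1, 1, 2],
                   [0, 5, 5],
                   [3, 0, 1]], dtype=float)
norm = log_normalize(counts)
scaled = scale_data(norm, scale_max=10.0)
```

After `log_normalize`, undoing the log with `np.expm1` gives rows that each sum to `scale_factor`, which is a quick sanity check when comparing against Seurat output.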

## Benchmarks

To reproduce the benchmark comparing this implementation against R DoubletFinder:

```bash
SAMPLE_H5=/path/to/sample.h5 bash benchmarks/benchmark.sh
```

Requires Docker. On first run, builds an image with R 4.4 + Seurat + Python (~10 min). Subsequent runs reuse the cached image.

Results are written to `benchmarks/results/`:
- `comparison_report.txt` — full metrics summary
- `plots/pann_scatter.png` — pANN correlation scatter
- `plots/pann_hist.png` — pANN distribution overlay
- `plots/confusion.png` — classification confusion matrix
- `plots/hvg_overlap.png` — HVG overlap bar chart

## Citation

If you use **pyDoubletFinder** in a publication, please cite both this package and the original DoubletFinder paper:

**APA:**

> dam2452. (2026). pyDoubletFinder: Python port of the R DoubletFinder algorithm (Version 1.1.0). https://github.com/dam2452/pydoubletfinder

> McGinnis, C.S., Murrow, L.M. & Gartner, Z.J. (2019). DoubletFinder: Doublet Detection in Single-Cell RNA Sequencing Data Using Artificial Nearest Neighbors. *Cell Systems*, 8, 329–337.e4. https://doi.org/10.1016/j.cels.2019.03.003

**BibTeX:**

```bibtex
@software{pydoubletfinder2026,
  title   = {pyDoubletFinder: Python port of the R DoubletFinder algorithm},
  author  = {dam2452},
  year    = {2026},
  version = {1.1.0},
  url     = {https://github.com/dam2452/pydoubletfinder}
}

@article{mcginnis2019doubletfinder,
  title     = {{DoubletFinder}: Doublet Detection in Single-Cell {RNA} Sequencing Data Using Artificial Nearest Neighbors},
  author    = {McGinnis, Christopher S. and Murrow, Lydia M. and Gartner, Zev J.},
  journal   = {Cell Systems},
  volume    = {8},
  number    = {4},
  pages     = {329--337.e4},
  year      = {2019},
  doi       = {10.1016/j.cels.2019.03.003}
}
```

## Contributing

Contributions are welcome! Here's how you can help:

1. **Bug reports** - Open an issue with a minimal reproducible example
2. **Feature requests** - Open an issue describing the use case
3. **Code contributions** - Fork, create a feature branch, and open a pull request

### Development setup

```bash
git clone https://github.com/dam2452/pydoubletfinder.git
cd pydoubletfinder
pip install -e ".[dev]"
pytest tests/
```

## License

This project is licensed under the **MIT License** - see [LICENSE](LICENSE) for full details.

## Reference

McGinnis, C.S., Murrow, L.M. & Gartner, Z.J. DoubletFinder: Doublet Detection in Single-Cell RNA Sequencing Data Using Artificial Nearest Neighbors. *Cell Systems* 8, 329–337.e4 (2019). https://doi.org/10.1016/j.cels.2019.03.003
