Metadata-Version: 2.4
Name: scRBP
Version: 0.1.0
Summary: Single-cell RNA Binding Protein Regulon Inference
Author: Yunlong Ma
License: MIT License
        
        Copyright (c) 2024 scRBP Development Team
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
        
Project-URL: Homepage, https://github.com/mayunlong89/scRBP
Project-URL: Bug Tracker, https://github.com/mayunlong89/scRBP/issues
Keywords: single-cell,RNA-binding protein,gene regulatory network,scRNA-seq,bioinformatics
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy
Requires-Dist: pandas
Requires-Dist: scipy
Requires-Dist: scikit-learn
Requires-Dist: anndata
Requires-Dist: scanpy
Requires-Dist: polars
Requires-Dist: pyarrow
Requires-Dist: loompy
Requires-Dist: geosketch
Requires-Dist: arboreto
Requires-Dist: ctxcore
Requires-Dist: pyscenic
Requires-Dist: tqdm
Provides-Extra: dev
Requires-Dist: pytest; extra == "dev"
Requires-Dist: pytest-cov; extra == "dev"
Requires-Dist: black; extra == "dev"
Requires-Dist: isort; extra == "dev"
Requires-Dist: flake8; extra == "dev"
Dynamic: license-file

# scRBP — Single-cell RNA Binding Protein regulon inference

**scRBP** is a command-line toolkit for comprehensive analysis of RNA-binding proteins (RBPs) in single-cell RNA-seq data. It provides a systematic, scalable, and integrative framework to infer RBP-mediated gene and isoform regulatory networks (“regulons”) from single-cell transcriptomes and to prioritize networks underlying complex genetic traits and disorders. scRBP comprises six main modules:

- Developing a comprehensive compendium of RBPs and their associated motif clusters from diverse public resources
- Inferring RBP targets transcriptome-wide, guided by motifs, at both gene- and isoform-level resolution
- Constructing RBP-gene and/or RBP-isoform co-expression networks from short- or long-read single-cell transcriptomic data, respectively
- Defining high-fidelity regulons by integrating RBP-target interactions, and quantifying cell type-specific regulon activity scores (RAS)
- Integrating GWAS results to compute regulon-level genetic association scores (RGS)
- Combining RAS and RGS into a unified trait-relevance score (TRS) for each regulon in a given cellular context, with statistical significance assessed using Monte Carlo (MC) sampling

---

## What scRBP Does

RBPs are key post-transcriptional regulators that control mRNA splicing, stability, and translation. scRBP enables you to:

- **Infer** which RBPs regulate which genes or isoforms in your single-cell data
- **Prune** raw RBP–gene associations using motif-binding evidence to obtain high-confidence regulons
- **Score** each cell or cell type for regulon activity (RAS) using the AUCell algorithm
- **Link** RBP regulons to human disease through GWAS genetic enrichment (RGS via MAGMA)
- **Integrate** RAS and RGS into a unified Trait Relevance Score (TRS) that ranks disease-relevant RBPs

---

## Pipeline at a Glance

```
Raw single-cell data (.h5ad / .feather)
          │
          ▼
[Step 1]  scRBP getSketch        ── Stratified GeoSketch cell downsampling
          │
          ▼
[Step 2]  scRBP getGRN           ── GRNBoost2/GENIE3 RBP→Gene/Isoform inference
          │                          (run N seeds for robustness, default 30 times)
          ▼
[Step 3]  scRBP getMerge_GRN     ── Merge N-seed GRNs → consensus network
          │
          ▼
[Step 4]  scRBP getModule        ── Extract regulon candidates (Top-N / percentile)
          │
          ▼
[Step 5]  scRBP getPrune         ── Motif-enrichment pruning via ctxcore
          │
          ▼
[Step 6]  scRBP getRegulon       ── Export pruned regulons to GMT format
          │
          ▼
[Step 7]  scRBP mergeRegulons    ── Merge region-specific GMT files
          │                          (3'UTR / 5'UTR / CDS / Introns)
          ▼
[Step 8]  scRBP ras              ── Regulon Activity Score (AUCell) per cell / cell type
          │
          ▼
[Step 9]  scRBP rgs              ── Regulon Genetic association Score (MAGMA gene-set analysis)
          │
          ▼
[Step 10] scRBP trs              ── Trait Relevance Score (integrates RAS and RGS)
```

---

## Installation

### Requirements

- Python **3.9, 3.10, or 3.11** (Python 3.12+ not yet supported by `pyscenic`/`arboreto`)
- MAGMA binary (external, required only for Step 9 — `scRBP rgs`)

### Option 1 — Install from PyPI (recommended)

```bash
pip install scRBP
```

This installs scRBP together with all Python dependencies in one step.

### Option 2 — Install from source (development)

```bash
git clone https://github.com/mayunlong89/scRBP.git
cd scRBP/scRBP_package
pip install -e .
```

### Option 3 — Install via conda (recommended for HPC / cluster)

```bash
git clone https://github.com/mayunlong89/scRBP.git
cd scRBP/scRBP_package

conda env create -f environment.yml
conda activate scrbp

pip install -e .
```

### Install MAGMA (for Step 9 only)

MAGMA is a standalone binary that is not available on PyPI. Download it from https://cncr.nl/research/magma, unpack it, and make it executable:

```bash
# Linux example
wget https://cncr.nl/research/magma/software/magma_v1.10_static_linux.zip
unzip magma_v1.10_static_linux.zip -d ~/tools/magma/
chmod +x ~/tools/magma/magma
```

### Verify installation

```bash
scRBP --help
scRBP getGRN --help
```

---

## Quick Start

### Step 1 — Downsample cells with GeoSketch

Large single-cell datasets (>500K cells) should be downsampled before GRN inference. scRBP uses GeoSketch to retain transcriptional diversity while reducing cell count.

```bash
scRBP getSketch \
    --input  PBMC_full.h5ad \
    --output PBMC_sketch_15K.feather \
    --n_cells 15000 \
    --celltype_col celltype \
    --min_cells_per_type 500 \
    --n_pca 100 \
    --seed 42
```

### Step 2 — Infer gene regulatory networks (GRN)

Run GRNBoost2 with multiple random seeds. Each seed produces an independent GRN; these are later merged into a consensus network (Step 3).

> For robustness, run `getGRN` with 30 different random seeds.

**Gene mode** (RBP → Gene):

```bash
for SEED in $(seq 1 30); do
  scRBP getGRN \
      --matrix    PBMC_sketch_15K.feather \
      --rbp_list  human_RBP_list.txt \
      --output    grn_seed${SEED} \
      --mode      gene \
      --method    grnboost2 \
      --n_workers 20 \
      --correlation True \
      --seed      ${SEED}
done
# Output: grn_seed1_scRBP_gene_GRNs.tsv, grn_seed2_scRBP_gene_GRNs.tsv, ...
```

**Isoform mode** (RBP → Isoform, requires isoform annotation):

```bash
scRBP getGRN \
    --matrix                     PBMC_isoform.feather \
    --rbp_list                   human_RBP_list.txt \
    --output                     iso_grn_seed1 \
    --mode                       isoform \
    --isoform_annotation         gencode_v44_isoform_gene_map.tsv \
    --rbp_agg_method             sum \
    --remove_self_targets        True \
    --min_target_cells_expressed 10 \
    --min_target_mean_expr       0.01 \
    --method                     grnboost2 \
    --n_workers                  20 \
    --seed                       1
# Output: iso_grn_seed1_scRBP_isoform_GRNs.tsv (+ 4 auxiliary files)
```

### Step 3 — Merge GRN seeds into a consensus network

```bash
scRBP getMerge_GRN \
    --pattern "grn_seed*_scRBP_gene_GRNs.tsv" \
    --output  grn_consensus.tsv \
    --n_present 15 \
    --present_rate 0.5
```

With the settings above, edges appearing in fewer than 50% of the seeds (15 of the 30 runs from Step 2) are discarded, yielding a stable consensus network.
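
The presence-rate rule can be sketched in plain Python. This is only an illustration of the idea, not scRBP's implementation: the function name `consensus_edges`, the in-memory edge dictionaries, and the choice to average importance across runs are all assumptions, and the edge names are toy placeholders.

```python
from collections import defaultdict


def consensus_edges(runs, present_rate=0.5):
    """Keep edges seen in at least `present_rate` of the runs.

    `runs` holds one dict per seed, mapping (rbp, target) to an importance
    score. Returns {edge: (n_present, mean_importance)} (hypothetical layout).
    """
    scores = defaultdict(list)
    for run in runs:
        for edge, importance in run.items():
            scores[edge].append(importance)

    min_present = present_rate * len(runs)
    return {
        edge: (len(vals), sum(vals) / len(vals))
        for edge, vals in scores.items()
        if len(vals) >= min_present
    }


# Toy example: three seeds, two candidate edges.
runs = [
    {("RBP_A", "GENE_1"): 2.0, ("RBP_B", "GENE_2"): 0.4},
    {("RBP_A", "GENE_1"): 4.0},
    {("RBP_A", "GENE_1"): 3.0},
]
kept = consensus_edges(runs, present_rate=0.5)
print(kept)
```

Here the `RBP_B` edge appears in only one of three runs, so it falls below the 50% threshold and is dropped.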

### Step 4 — Extract regulon candidate modules

```bash
scRBP getModule \
    --input              grn_consensus.tsv \
    --output_merged      modules.tsv \
    --importance_threshold 0.005 \
    --top_n_list         "5,10,50" \
    --target_top_n       "50" \
    --percentile         "0.75,0.9"
```
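
As a rough illustration of a Top-N selection, the sketch below keeps the highest-importance targets for each RBP after applying an importance cutoff. It is a guess at what options such as `--target_top_n` and `--importance_threshold` express, not a description of how `getModule` works internally; the function `top_n_targets` and the tuple layout are hypothetical.

```python
from collections import defaultdict


def top_n_targets(edges, n, importance_threshold=0.0):
    """Return {rbp: [targets]} with the n highest-importance targets per RBP.

    `edges` is an iterable of (rbp, target, importance) tuples (assumed layout).
    """
    by_rbp = defaultdict(list)
    for rbp, target, importance in edges:
        if importance >= importance_threshold:
            by_rbp[rbp].append((importance, target))

    modules = {}
    for rbp, scored in by_rbp.items():
        # Sort by descending importance; ties are broken by target name.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        modules[rbp] = [target for _, target in scored[:n]]
    return modules


# Toy edges with placeholder names.
edges = [
    ("RBP_A", "GENE_1", 0.90),
    ("RBP_A", "GENE_2", 0.50),
    ("RBP_A", "GENE_3", 0.001),
    ("RBP_B", "GENE_1", 0.30),
]
modules = top_n_targets(edges, n=2, importance_threshold=0.005)
print(modules)
```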

### Step 5 — Prune with motif-binding evidence

```bash
scRBP getPrune \
    --rbp_targets        modules.tsv \
    --motif_rbp_links    motif2rbp.csv \
    --motif_target_ranks rankings.feather \
    --save_dir           ./pruned/ \
    --rank_threshold     1500
```

### Step 6 — Export regulons to GMT format

```bash
scRBP getRegulon \
    --input       pruned/ctx_scores.csv \
    --out-symbol  regulons_symbol.gmt \
    --out-entrez  regulons_entrez.gmt \
    --map-custom  NCBI38.gene.loc \
    --min_genes   5
```
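
For orientation, a GMT file stores one gene set per line: the set name, a description field, and then the member genes, all tab-separated. The sketch below writes that layout and drops small sets in the spirit of `--min_genes`. The helper `write_gmt`, the `"na"` description, and the sorting of genes are assumptions made for the example, not scRBP's actual output conventions.

```python
import io


def write_gmt(regulons, handle, min_genes=5, description="na"):
    """Write {name: genes} as GMT lines, skipping sets smaller than min_genes."""
    written = 0
    for name, genes in sorted(regulons.items()):
        unique = sorted(set(genes))
        if len(unique) < min_genes:
            continue
        handle.write("\t".join([name, description, *unique]) + "\n")
        written += 1
    return written


# Toy regulons with placeholder names.
regulons = {
    "RBP_A": ["G1", "G2", "G3", "G4", "G5"],
    "RBP_B": ["G1", "G2"],  # fewer than min_genes, so it is dropped
}
buffer = io.StringIO()
n_written = write_gmt(regulons, buffer, min_genes=5)
gmt_text = buffer.getvalue()
print(gmt_text, end="")
```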

### Step 7 — Merge region-specific GMT files

```bash
scRBP mergeRegulons \
    --base_dir ./analysis/ \
    --input    regulons_symbol.gmt \
    --output   regulons_combined.gmt \
    --recursive
```

### Step 8 — Compute Regulon Activity Scores (RAS)

Uses the AUCell algorithm to score each cell or cell type for regulon activity. Also computes the Jensen–Shannon divergence-based Regulon Specificity Score (RSS).

```bash
scRBP ras \
    --mode         ct \
    --matrix       PBMC_sketch_15K.feather \
    --regulons     regulons_symbol.gmt \
    --out          ras_output/ \
    --celltypes-csv cell_to_celltype.csv
```
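
To make the specificity score concrete, the sketch below follows the Jensen–Shannon definition commonly used with SCENIC: RSS = 1 − sqrt(JSD(P_regulon, P_celltype)), where P_regulon is the regulon's activity normalized over cells and P_celltype is the normalized indicator of cells in the cell type. Whether scRBP uses exactly this form (for example, the base-2 logarithm chosen here) is an assumption, and `regulon_specificity` is a hypothetical helper.

```python
import math


def _kl(p, q):
    # Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0.
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def regulon_specificity(activity, labels, cell_type):
    """RSS for one regulon and one cell type.

    `activity` holds per-cell regulon activity (must not sum to zero) and
    `labels` holds the per-cell cell-type labels.
    """
    total = sum(activity)
    p = [a / total for a in activity]
    indicator = [1.0 if lab == cell_type else 0.0 for lab in labels]
    n_in = sum(indicator)
    q = [i / n_in for i in indicator]
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    jsd = 0.5 * _kl(p, m) + 0.5 * _kl(q, m)
    return 1.0 - math.sqrt(max(jsd, 0.0))


# Toy data: the regulon is active only in the two "T" cells.
activity = [0.5, 0.5, 0.0, 0.0]
labels = ["T", "T", "B", "B"]
rss_t = regulon_specificity(activity, labels, "T")
rss_b = regulon_specificity(activity, labels, "B")
print(rss_t, rss_b)
```

A regulon whose activity is confined to one cell type scores close to 1 for that type and close to 0 for a type in which it is inactive.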

### Step 9 — Compute Regulon Genetic Association Scores (RGS)

Links each regulon to GWAS traits using MAGMA gene-set analysis with a 4D null distribution for empirical p-values.

```bash
scRBP rgs \
    --mode      ct \
    --magma     ~/tools/magma/magma \
    --genes-raw gwas.genes.raw \
    --sets      regulons_entrez.gmt \
    --id-type   entrez \
    --out       rgs_output/rgs
```

### Step 10 — Compute Trait Relevance Score (TRS)

Integrates RAS and RGS into a unified score:

```
TRS = norm(RAS) + norm(RGS) − λ × |norm(RAS) − norm(RGS)|
```

Here λ sets how strongly disagreement between the normalized RAS and RGS is penalized. RBPs with a high TRS are both highly active in the cell type **and** genetically linked to the trait.

```bash
scRBP trs \
    --mode       ct \
    --ras        ras_output/aucell_ct.csv \
    --rgs_csv    rgs_output/rgs_real.csv \
    --out_prefix trs_output/trs
```
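
The TRS formula can be worked through on toy numbers. The sketch below assumes min-max scaling for `norm(·)` and λ = 1; the README does not specify either, so both are placeholders, and `min_max` and `trait_relevance` are hypothetical helpers rather than scRBP functions.

```python
def min_max(values):
    """Rescale values to [0, 1]; a constant vector maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def trait_relevance(ras, rgs, lam=1.0):
    """TRS = norm(RAS) + norm(RGS) - lam * |norm(RAS) - norm(RGS)| per regulon."""
    ras_n, rgs_n = min_max(ras), min_max(rgs)
    return [a + g - lam * abs(a - g) for a, g in zip(ras_n, rgs_n)]


# Three regulons in one cell type (toy values).
ras = [0.0, 0.5, 1.0]
rgs = [0.0, 4.0, 2.0]
trs = trait_relevance(ras, rgs, lam=1.0)
print(trs)
```

With λ = 1 the expression reduces to twice the smaller of the two normalized scores, so a regulon ranks highly only when both RAS and RGS are high. With λ = 0 it is a plain sum.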

---

## Command Reference

| Step | Command | Key Inputs | Key Output |
|------|---------|-----------|------------|
| 1 | `scRBP getSketch` | `.h5ad` / `.feather` | Downsampled cells |
| 2 | `scRBP getGRN` | Expression matrix, RBP list | `*_scRBP_gene_GRNs.tsv` or `*_scRBP_isoform_GRNs.tsv` |
| 3 | `scRBP getMerge_GRN` | Multiple GRN TSV files (glob) | Consensus GRN TSV |
| 4 | `scRBP getModule` | Consensus GRN TSV | Modules TSV |
| 5 | `scRBP getPrune` | Modules TSV, motif files | Pruned scores (CSV) |
| 6 | `scRBP getRegulon` | Pruned scores | Regulons GMT (symbol + Entrez) |
| 7 | `scRBP mergeRegulons` | Multiple GMT files | Merged GMT |
| 8 | `scRBP ras` | Expression matrix, GMT | AUCell scores, RSS matrix |
| 9 | `scRBP rgs` | MAGMA `.genes.raw`, GMT | RGS scores CSV |
| 10 | `scRBP trs` | RAS CSV, RGS CSV | TRS scores CSV |

Use `scRBP <command> --help` to see all parameters for any step.

---

## Dependencies

| Category | Packages |
|----------|---------|
| Core numerics | `numpy`, `pandas`, `scipy`, `scikit-learn` |
| Single-cell I/O | `anndata`, `scanpy`, `loompy` |
| Fast I/O | `polars`, `pyarrow` |
| Cell downsampling | `geosketch` |
| GRN inference | `arboreto` (GRNBoost2 / GENIE3) |
| Motif enrichment | `ctxcore`, `pyscenic` |
| Progress display | `tqdm` |
| GWAS enrichment | **MAGMA** binary (external, user-provided) |

---

## Citation

If you use scRBP in your research, please cite:

> Ma Y. *et al.* *Decoding disease-associated RNA-binding protein-mediated regulatory networks through polygenic enrichment across diverse cellular contexts.* (2026)

---

## License

MIT License. See [LICENSE](LICENSE) for details.

---

## Links

- **GitHub**: https://github.com/mayunlong89/scRBP
- **Issues**: https://github.com/mayunlong89/scRBP/issues
- **Full documentation**: see `scRBP_readme.md` in the repository
