Metadata-Version: 2.4
Name: nanocompass
Version: 0.1.0
Summary: Multi-aligner consensus framework for nanopore RNA-seq splice junction detection
Author-email: "Kevin R. Roy" <kevinrjroy@gmail.com>
Maintainer-email: "Kevin R. Roy" <kevinrjroy@gmail.com>
License: MIT
Project-URL: Homepage, https://github.com/kevinroykt/nanocompass
Project-URL: Documentation, https://github.com/kevinroykt/nanocompass#readme
Project-URL: Repository, https://github.com/kevinroykt/nanocompass
Project-URL: Issues, https://github.com/kevinroykt/nanocompass/issues
Keywords: nanopore,RNA-seq,splice junctions,alternative splicing,bioinformatics,long-read sequencing
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: POSIX :: Linux
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Requires-Python: >=3.9
Description-Content-Type: text/markdown
Requires-Dist: pysam>=0.20.0
Requires-Dist: pandas>=1.5.0
Requires-Dist: numpy>=1.21.0
Requires-Dist: pyyaml>=6.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0; extra == "dev"
Requires-Dist: black>=23.0; extra == "dev"
Requires-Dist: ruff>=0.1.0; extra == "dev"

# Nanopore COMPASS

**COMPASS** (Comparison Of Multiple alignment Programs for Alternative Splice site discovery) adapted for Oxford Nanopore long-read RNA sequencing data.

## Overview

Nanopore COMPASS is a multi-aligner consensus framework for detecting splice junctions and alternative splice sites (ASS) in nanopore direct RNA sequencing data. It addresses nanopore-specific challenges including high error rates (~5-15%), homopolymer errors, and alignment artifacts.

### Key Features

- **4-aligner consensus strategy** for robust junction detection
- **Nanopore error-tolerant scoring** (reduced penalties for indels, homopolymer errors)
- **Quality-based filtering** using perfect flanking sequence matches
- **Ambiguous junction resolution** for homopolymer regions
- **RPG (ribosomal protein gene) analysis** comparing highly-expressed genes vs genome-wide
- **Organism-specific configuration** (yeast, human, extensible)

### Original COMPASS

This implementation builds on the original COMPASS framework for short-read data:
- **Paper**: Roy et al., RNA (2023), PMID: 37956322
- **GitHub**: https://github.com/k-roy/COMPASS

## Multi-Aligner Strategy

Nanopore COMPASS uses 4 complementary aligners with orthogonal algorithms:

| Aligner | Algorithm Type | Primary Strength | Speed |
|---------|---------------|------------------|-------|
| **Minimap2** | Seed-and-chain | High sensitivity, general-purpose | Fast (9-40x vs GraphMap2) |
| **uLTRA** | Graph + collinear chaining | Small exons (11-20nt), annotation-guided | Medium |
| **deSALT** | De Bruijn graph | Complex isoforms, alternative splicing | Medium-slow |
| **GraphMap2** | 3-phase mapping | High accuracy validation | Slow |

**Consensus Selection Hierarchy**:
1. Lowest error-weighted alignment score
2. Annotation match (if tie)
3. Most junctions detected per read
4. Aligner precedence: GraphMap2 > uLTRA > deSALT > Minimap2
5. Random (if still tied)

## Installation

### Dependencies

**Python** (≥3.8):
- pysam
- pandas
- numpy
- pyyaml
- matplotlib
- seaborn

**Aligners**:
- [minimap2](https://github.com/lh3/minimap2) - Fast splice-aware aligner
- [uLTRA](https://github.com/ksahlin/ultra) - Annotation-guided alignment
- [deSALT](https://github.com/ydLiu-HIT/deSALT) - De Bruijn graph aligner
- [GraphMap2](https://github.com/lbcb-sci/graphmap2) - High-accuracy 3-phase mapping

**R packages** (optional, for plots):
- ggplot2
- UpSetR
- ComplexHeatmap

### Setup

```bash
# Clone or navigate to repository
cd /oak/stanford/groups/larsms/Users/kevinroy/software/nanopore-compass

# Add to Python path
export PYTHONPATH=$PYTHONPATH:/oak/stanford/groups/larsms/Users/kevinroy/software/nanopore-compass/src

# Install Python dependencies (conda recommended)
conda create -n nanopore-compass python=3.10 pysam pandas pyyaml matplotlib seaborn
conda activate nanopore-compass

# Install aligners (via conda or from source)
conda install -c bioconda minimap2 graphmap2
# uLTRA and deSALT: see respective GitHub repositories for installation
```

## Quick Start

### 1. Prepare Configuration

Edit `config/yeast.yaml` to specify:
- Reference genome FASTA path
- Annotation GFF path
- Introns file path (COMPASS format)
- Aligner paths and parameters
- Filtering thresholds

### 2. Run Alignments

```bash
# Option A: Run all aligners sequentially
./bin/align_and_process.sh \
  --config config/yeast.yaml \
  --reads sample.fastq \
  --output output_dir \
  --sample sample_name

# Option B: Run individual aligners
minimap2 -ax splice -uf -k14 -G 5000 ref.fa reads.fq > minimap2.sam
ultra -i annotations.gtf --ont ref.fa reads.fq -o ultra.sam
deSALT aln -r ref.fa -a annotations.gtf -k 15 reads.fq > desalt.sam
graphmap2 align -r ref.fa -d reads.fq -o graphmap2.sam --extcigar
```

### 3. Extract and Merge Junctions

```python
from pathlib import Path
from compass_functions import JunctionExtractor
from junction_comparison import aggregate_multi_aligner_junctions
from utils import load_intron_annotations

# Extract junctions from each aligner
aligners = ['minimap2', 'ultra', 'desalt', 'graphmap2']
junction_tables = {}

for aligner in aligners:
    bam_path = Path(f"output_dir/{aligner}.sorted.bam")
    fasta_path = Path("ref.fa")

    extractor = JunctionExtractor(bam_path, fasta_path)
    junctions = extractor.extract_all_junctions()
    junction_tables[aligner] = junctions
    extractor.close()

# Aggregate across aligners
aggregated = aggregate_multi_aligner_junctions(junction_tables)

# Load annotations
annotated_introns = load_intron_annotations(Path("introns.tsv"))

# Filter for high confidence
from junction_comparison import filter_high_confidence_junctions

high_confidence = filter_high_confidence_junctions(
    aggregated,
    min_reads=5,
    min_aligners=2,
    min_perfect_matches=5
)

print(f"Total junctions: {len(aggregated)}")
print(f"High-confidence: {len(high_confidence)}")
```

### 4. Analyze Results

```python
from junction_comparison import compare_aligner_performance

# Compare aligner performance
stats = compare_aligner_performance(junction_tables, annotated_introns)

for aligner, metrics in stats.items():
    print(f"\n{aligner}:")
    print(f"  Total junctions: {metrics['total_junctions']}")
    print(f"  Exact annotated: {metrics['exact_annotated_pct']:.1f}%")
    print(f"  Novel: {metrics['novel_pct']:.1f}%")
```

## Filtering Strategy

### Multi-Tier Filtering for Alternative Splice Sites

**Tier 1: Minimum Read Support**
- Threshold: ≥5 reads per junction
- Eliminates random alignment errors

**Tier 2: Multi-Sample Consistency**
- Threshold: Junction in ≥2 biological replicates
- Eliminates sample-specific artifacts

**Tier 3: Flanking Sequence Quality** (PRIMARY FILTER)
- Threshold: ≥5-6bp perfect matches on BOTH sides of junction
- Distinguishes true junctions from alignment artifacts
- Key insight: True ASS even 2bp from annotation should have clean flanks

**Tier 4: Splice Site Motif Scoring** (for classification, not exclusion)
- Score motifs but don't exclude based on non-canonical
- Non-canonical 3'SS (non-YAG) can be genuine (Meyer et al. 2023)

**Tier 5: Multi-Aligner Consensus**
- High confidence: ≥3 aligners OR ≥2 aligners including GraphMap2
- Medium confidence: ≥2 aligners with orthogonal algorithms
- Low confidence: Single aligner only

### Recommended Thresholds

**Conservative (default)**:
- Min reads: 5
- Min samples: 2
- Perfect matches: 5bp both sides
- Min aligners: 2

**Publication-quality (stringent)**:
- Min reads: 10
- Min samples: 3
- Perfect matches: 6bp both sides
- Min aligners: 2 (with GraphMap2 validation)

## Project Structure

```
nanopore-compass/
├── bin/                          # Executable scripts
│   ├── nanopore-compass          # Main pipeline
│   ├── align_and_process.sh      # Multi-aligner wrapper
│   └── combine_junctions.py      # Merge sample junctions
│
├── src/                          # Core modules
│   ├── utils.py                  # Shared utilities
│   ├── sequence_analysis.py      # Splice site scoring, motifs
│   ├── compass_functions.py      # Junction extraction, ambiguity resolution
│   └── junction_comparison.py    # Multi-aligner consensus
│
├── scripts/                      # Analysis scripts
│   ├── rpg_analysis.py           # RPG vs non-RPG comparison
│   ├── ass_detection.py          # Alternative splice site detection
│   ├── quality_control.R         # QC plots
│   └── aligner_comparison.R      # Multi-aligner visualization
│
├── config/                       # Configuration files
│   ├── yeast.yaml                # S. cerevisiae parameters
│   └── human.yaml                # H. sapiens parameters
│
├── reference/                    # Reference data
│   ├── S_cerevisiae_all_introns.tsv
│   └── S_cerevisiae_rpg_genes.txt
│
└── test/                         # Test data and scripts
    └── test_data/
```

## Configuration

Configuration files use YAML format with organism-specific parameters:

**Key Sections**:
- `reference files`: Genome FASTA, annotations, intron catalog
- `junction_parameters`: Intron size constraints, quality thresholds
- `splice_sites`: Organism-specific consensus sequences
- `aligners`: Paths, parameters, and settings for each aligner
- `filtering`: Multi-tier thresholds for junction validation
- `error_weights`: Nanopore-specific error tolerance

See `config/yeast.yaml` for complete example.

## Advanced Usage

### RPG (Ribosomal Protein Gene) Analysis

Compare splicing characteristics between highly-expressed RPG genes and genome-wide:

```python
from scripts.rpg_analysis import compare_rpg_vs_nonrpg

results = compare_rpg_vs_nonrpg(
    junctions_file="high_confidence_junctions.tsv",
    rpg_genes_file="reference/S_cerevisiae_rpg_genes.txt"
)

# Results include:
# - Fidelity comparison (exact annotated %)
# - Alternative splicing frequency
# - Splice site motif strength
# - Condition-specific effects (WT vs mutant)
```

### Alternative Splice Site Detection

```python
from scripts.ass_detection import detect_alternative_splice_sites

ass_events = detect_alternative_splice_sites(
    junctions_file="high_confidence_junctions.tsv",
    annotated_introns="reference/S_cerevisiae_all_introns.tsv",
    min_distance=10  # bp from annotated sites
)

# ASS types:
# - alt_5ss: Alternative 5' splice site
# - alt_3ss: Alternative 3' splice site
# - cryptic: Cryptic introns (within exons)
# - exon_skip: Exon skipping events
```

### Ambiguous Junction Resolution

```python
from compass_functions import resolve_ambiguous_junctions
from sequence_analysis import SpliceMotifScorer

# Initialize motif scorer
scorer = SpliceMotifScorer(
    five_ss_consensus="GTATGT",
    five_ss_penalties=[4, 3, 1, 1, 2, 1],
    three_ss_consensus="YYYAG",
    three_ss_penalties=[1, 1, 1, 3, 4]
)

# Resolve ambiguous junctions
resolved = resolve_ambiguous_junctions(
    junction_dict=raw_junctions,
    fasta_path=Path("ref.fa"),
    motif_scorer=scorer,
    max_shift=10
)
```

## Nanopore-Specific Adaptations

### Error-Weighted Scoring

```python
# Nanopore error model
score = (
    mismatches * 1.0 +                # Full weight
    indels * 0.5 +                    # Reduced (common in nanopore)
    indels_in_homopolymer * 0.2 +     # Minimal (expected errors)
    soft_clips * 0.3                  # Moderate penalty
)
```

### Quality-Based Filtering

Instead of distance-based filtering, use **perfect flanking sequence matches**:

```python
# Extract from BAM alignment
upstream_matches, downstream_matches = count_perfect_matches_at_junction(read, junction_pos)

# Filter criteria
if upstream_matches >= 5 and downstream_matches >= 5:
    # High-quality junction (true biological event)
    pass
else:
    # Likely alignment artifact
    reject()
```

This approach correctly identifies true ASS even when only 2-3bp from annotated sites.

---

## Parallel Processing for Large Datasets

For datasets with 100M+ reads, use the parallelization pipeline for ~25x speedup.

### Design Principle

**Pre-alignment FASTQ chunking** is required for COMPASS 4-way consensus:
- Same read can map to different genomic locations in different aligners
- COMPASS compares the SAME read across all 4 aligners
- Pre-alignment chunking guarantees same read order in all 4 output BAMs

### Workflow

```
BAM → FASTQ → chunk (25x) → align with 4 aligners → 4-way consensus → merge
```

### Usage

```bash
# Run complete parallel pipeline
bash scripts/run_parallel_pipeline.sh \
    --bam input.bam \
    --sample sample_name \
    --chunks 25 \
    --output parallel_output/

# This will:
# 1. Convert BAM to FASTQ (bedtools bamtofastq)
# 2. Split FASTQ into 25 chunks
# 3. Submit SLURM array job to align each chunk with all 4 aligners
# 4. Submit SLURM array job to process each chunk with 4-way consensus
# 5. Merge all chunks into final junction table
```

### Performance

**Single aligner (100M reads)**:
- Serial: 250 min (4.2 hours)
- Parallel (25 cores): 10 minutes
- **Speedup: ~25x**

**4 aligners (100M reads)**:
- Serial: 1000 min (16.7 hours)
- Parallel (25 cores): 40-50 minutes
- **Speedup: ~20-25x**

### Manual Workflow (Step-by-Step)

For more control, run steps individually:

```bash
# 1. Convert BAM to FASTQ
bedtools bamtofastq -i input.bam -fq input.fastq

# 2. Split FASTQ
python scripts/split_fastq.py \
    --fastq input.fastq \
    --chunks 25 \
    --output fastq_chunks/

# 3. Align each chunk (SLURM array job)
sbatch --array=1-25 slurm/align_chunk_4way.sh

# 4. Process with consensus (SLURM array job)
sbatch --array=1-25 slurm/process_chunk_consensus.sh

# 5. Merge results
python scripts/merge_junction_chunks.py \
    --input junction_chunks/ \
    --output final_junctions.tsv
```

**See [docs/PARALLELIZATION.md](docs/PARALLELIZATION.md) for complete technical details.**

---

## Output Files

### Junction Tables

**Per-aligner junctions**: `{aligner}_junctions.tsv`
- Columns: chrom, five_ss, three_ss, count, five_dinuc, three_dinuc, is_canonical, is_exact_annotated, etc.

**Aggregated junctions**: `consensus_junctions.tsv`
- Additional columns: num_aligners, aligner_support, confidence, validators

**Filtered junctions**: `high_confidence_junctions.tsv`
- Subset passing all filtering criteria

### Statistics

**Aligner comparison**: `aligner_performance.tsv`
- Metrics per aligner: total junctions, exact annotated %, novel %

**Filtering summary**: `filtering_stats.tsv`
- Junction counts at each filtering tier

**RPG analysis** (if enabled): `rpg_vs_nonrpg_summary.tsv`
- Comparative metrics between RPG and non-RPG introns

## Performance

**Computational Requirements** (yeast genome, 1M reads):
- Memory: 16-32 GB RAM
- Storage: ~50 GB for intermediate files
- Time: 4-8 hours (4 aligners, 8 threads each)

**Speed by Aligner**:
- Minimap2: ~5-10 minutes
- uLTRA: ~30-60 minutes
- deSALT: ~1-2 hours
- GraphMap2: ~4-6 hours (slowest but most accurate)

**Optimization**:
- Run aligners in parallel (separate SLURM jobs)
- Use Minimap2 + uLTRA for fast initial analysis
- Add GraphMap2 for high-confidence validation

## Citation

If you use Nanopore COMPASS, please cite:

**Original COMPASS**:
> Roy KR, et al. (2023) COMPASS: Comparison Of Multiple Alignment Programs for
> Alternative Splice site discovery. *RNA*, 29(11):1833-1847. PMID: 37956322

**Nanopore Adaptations**:
> [Your publication using this framework]

**Aligners**:
- Minimap2: Li H. (2018) Bioinformatics 34(18):3094-3100
- uLTRA: Sahlin K & Mäkinen V. (2021) Genome Biol 22:142
- deSALT: Liu B, et al. (2019) Genome Biol 20:274
- GraphMap2: Sović I, et al. (2016) Nat Commun 7:11307

## License

[Specify license - consider MIT or GPL to match original COMPASS]

## Contact

Kevin R. Roy
Larsmon Lab, Stanford University
[Contact information]

## References

- Original COMPASS: https://github.com/k-roy/COMPASS
- Non-canonical splice sites: Meyer et al. (2023), PMC10711555
- Yeast genome: Saccharomyces Genome Database (SGD)
- Nanopore basecalling: Oxford Nanopore Technologies

## Version History

**v1.0.0** (2026-03-09)
- Initial implementation
- 4-aligner consensus strategy
- Nanopore error model
- Yeast configuration
- RPG analysis module
- Alternative splice site detection

---

Last updated: 2026-03-09
