Metadata-Version: 2.4
Name: svphaser
Version: 2.2.2
Summary: Structural-variant phasing from HP-tagged long-read BAMs
Project-URL: Homepage, https://github.com/SFGLab/SvPhaser
Project-URL: Issues, https://github.com/SFGLab/SvPhaser/issues
Project-URL: Source, https://github.com/SFGLab/SvPhaser
Author-email: SvPhaser Team <you@lab.org>
License: MIT
License-File: LICENSE
Keywords: BAM,ONT,VCF,genomics,long-reads,phasing,structural-variants
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Requires-Python: >=3.9
Requires-Dist: cyvcf2>=0.30
Requires-Dist: pandas>=2.1
Requires-Dist: pysam>=0.23
Requires-Dist: typer>=0.14
Provides-Extra: bench
Requires-Dist: py-spy>=0.3; extra == 'bench'
Requires-Dist: pytest-benchmark>=4.0; extra == 'bench'
Provides-Extra: dev
Requires-Dist: black>=24.3; extra == 'dev'
Requires-Dist: build>=1.2; extra == 'dev'
Requires-Dist: hypothesis>=6.90; extra == 'dev'
Requires-Dist: mypy>=1.8; extra == 'dev'
Requires-Dist: pandas-stubs>=2.0; extra == 'dev'
Requires-Dist: pre-commit>=3.6; extra == 'dev'
Requires-Dist: pytest-cov>=5; extra == 'dev'
Requires-Dist: pytest-xdist>=3.5; extra == 'dev'
Requires-Dist: pytest>=8; extra == 'dev'
Requires-Dist: ruff>=0.5; extra == 'dev'
Requires-Dist: tox>=4.10; extra == 'dev'
Requires-Dist: twine>=5.0; extra == 'dev'
Provides-Extra: plots
Requires-Dist: matplotlib>=3.7; extra == 'plots'
Description-Content-Type: text/markdown

# SvPhaser

> **Haplotype-aware structural-variant (SV) phasing and genotyping from long-read data**

[![PyPI version](https://img.shields.io/pypi/v/svphaser.svg?logo=pypi)](https://pypi.org/project/svphaser/)
[![Python](https://img.shields.io/pypi/pyversions/svphaser.svg)](https://pypi.org/project/svphaser/)
[![License](https://img.shields.io/github/license/SFGLab/SvPhaser.svg)](LICENSE)

---

**SvPhaser** assigns **haplotype-aware genotypes** to **pre-called structural variants (SVs)** using **HP-tagged long-read alignments** (PacBio HiFi, ONT Q20+, etc.).

It fills a critical gap in long-read SV analysis:

* SV callers (e.g. Sniffles2) **discover variants**
* SvPhaser **phases and genotypes them** (`1|0`, `0|1`, `1|1`, or `./.`)
* with explicit **read-level evidence** and a quantitative **genotype quality (GQ)** score

SvPhaser is:
* **Caller-agnostic** — works with any SV VCF format
* **Deterministic** — no random sampling or HMMs; reproducible results
* **Designed for large-scale benchmarking and biological interpretation** — CSV-first output for transparent analysis

---

## Key features

* **Post-hoc SV phasing** from HP-tagged BAM/CRAM — no re-calling needed
* **Per-chromosome parallelization** — efficiently scales on HPC and multi-core systems
* **SV-type-aware evidence detection** — specialized logic for DEL / INS / INV / BND / DUP
* **Deterministic Δ-based decision logic** — haplotype imbalance thresholds, no sampling
* **Strict size consistency controls** — optional size-matching for DEL/INS variants
* **Explicit confidence scoring** — Phred-scaled GQ capped at 99, with derivable binning
* **CSV-first design** — transparent per-SV metrics for benchmarking and debugging
* **VCF-compliant output** — rich `SVP_*` INFO annotations for downstream analysis
* **Read-level evidence tracking** — counts by haplotype (HP1, HP2, untagged) with reason codes
* **Hybrid support counting** — combines HP-tagged + untagged reads with configurable thresholds

---

## Installation

### From PyPI (recommended)

```bash
# Requires Python >= 3.9
pip install svphaser
```

Optional extras:

```bash
pip install "svphaser[plots]"   # plotting utilities
pip install "svphaser[bench]"   # benchmarking helpers
pip install "svphaser[dev]"     # development + linting
```

### From source

```bash
git clone https://github.com/SFGLab/SvPhaser.git
cd SvPhaser
pip install -e .
```

---

## Inputs & requirements

SvPhaser requires **two inputs only**:

1. **Unphased SV VCF** (`.vcf` / `.vcf.gz`)

   * Produced by an SV caller (e.g. Sniffles2)
   * May optionally contain `RNAMES` INFO for precise read support

2. **HP-tagged BAM/CRAM**

   * Long-read alignments carrying the SAM haplotype tag (`HP:i:1` / `HP:i:2`)
   * Generated by an upstream read-phasing step (e.g. `whatshap haplotag`)

> ⚠️ If the BAM does not contain HP tags, SvPhaser cannot assign haplotypes.
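
Before a full run it can be worth checking that the alignments really carry HP tags. The sketch below is not part of SvPhaser: `count_hp_tags` and `FakeRead` are hypothetical names used for illustration. It relies only on the `has_tag` / `get_tag` methods that `pysam.AlignedSegment` provides, so it can be tried without a BAM file.

```python
from collections import Counter


def count_hp_tags(reads):
    """Tally reads by HP tag value; reads without a usable HP tag count as 'nohp'."""
    counts = Counter({"hp1": 0, "hp2": 0, "nohp": 0})
    for read in reads:
        hp = read.get_tag("HP") if read.has_tag("HP") else None
        if hp == 1:
            counts["hp1"] += 1
        elif hp == 2:
            counts["hp2"] += 1
        else:
            counts["nohp"] += 1
    return counts


class FakeRead:
    """Stand-in for pysam.AlignedSegment, used only in this illustration."""

    def __init__(self, hp=None):
        self._hp = hp

    def has_tag(self, tag):
        return tag == "HP" and self._hp is not None

    def get_tag(self, tag):
        if not self.has_tag(tag):
            raise KeyError(tag)
        return self._hp


hp_demo = count_hp_tags([FakeRead(1), FakeRead(2), FakeRead(1), FakeRead()])
print(dict(hp_demo))  # {'hp1': 2, 'hp2': 1, 'nohp': 1}
```

With a real file, the same function could be fed the first few thousand reads, for example `itertools.islice(pysam.AlignmentFile("sample.sorted_phased.bam"), 10000)`. If `hp1` and `hp2` are both zero, the file has most likely not been haplotagged.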

---

## Quick start (CLI)

```bash
svphaser phase \
  sample_unphased.vcf.gz \
  sample.sorted_phased.bam \
  --out-dir results/ \
  --min-support 10 \
  --min-tagged-support 3 \
  --major-delta 0.60 \
  --equal-delta 0.10 \
  --support-mode hybrid \
  --dynamic-window \
  --tie-to-hom-alt \
  --gq-bins "30:High,10:Moderate" \
  --threads 32
```

### Key parameters

| Parameter | Default | Meaning |
|-----------|---------|---------|
| `--min-support` | 10 | Minimum total supporting reads (HP1+HP2+NOHP) for an SV to be genotyped; SVs below this are set to `./.` |
| `--min-tagged-support` | 3 | Minimum HP-tagged reads (HP1+HP2) needed for directional phasing (`1\|0` or `0\|1`) |
| `--major-delta` | 0.60 | Haplotype imbalance threshold (max HP count / tagged total) for strong consensus |
| `--equal-delta` | 0.10 | Tie threshold on \|HP1-HP2\| / tagged total; at or below this, both haplotypes are treated as supporting the SV (→ `1\|1`) |
| `--tie-to-hom-alt` | True | When a tie is detected and both haplotypes carry reads, emit `1\|1`; otherwise emit `./.` |
| `--support-mode` | hybrid | Count method: `hybrid` (HP tagged preferred), `tagged-only`, or `all` |
| `--gq-bins` | "30:High,10:Moderate" | Confidence cutoffs for soft binning into labels (e.g., High≥30, Moderate≥10) |
| `--threads` | 1 | Number of parallel workers (one per chromosome) |
| `--no-svp-info` | — | Disable writing `SVP_*` INFO annotations to output VCF |
| `--size-match-required` | True | For DEL/INS: enforce size consistency between VCF record and read evidence |
| `--size-tol-abs` | 10 | Absolute size tolerance (bp) for DEL/INS matching |
| `--size-tol-frac` | 0.0 | Fractional size tolerance for DEL/INS matching |
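
The `--gq-bins` string and the list-of-tuples form used by the lower-level API describe the same cutoffs. As a minimal sketch of how such a string could be parsed and applied (the helper names `parse_gq_bins` and `gq_label` are hypothetical, and returning no label below the lowest cutoff is an assumption rather than documented behaviour):

```python
from typing import Optional


def parse_gq_bins(spec: str) -> list[tuple[int, str]]:
    """Turn '30:High,10:Moderate' into [(30, 'High'), (10, 'Moderate')]."""
    bins = []
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        cutoff, label = item.split(":", 1)
        bins.append((int(cutoff), label.strip()))
    # Highest cutoff first, so the first match is the strongest label.
    return sorted(bins, key=lambda b: b[0], reverse=True)


def gq_label(gq: int, bins: list[tuple[int, str]]) -> Optional[str]:
    """Return the label of the first bin whose cutoff the GQ reaches."""
    for cutoff, label in bins:
        if gq >= cutoff:
            return label
    return None  # assumption: GQ below every cutoff gets no label


bins = parse_gq_bins("30:High,10:Moderate")
print(bins)                # [(30, 'High'), (10, 'Moderate')]
print(gq_label(45, bins))  # High
print(gq_label(12, bins))  # Moderate
```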

---

## Outputs

For an input `sample.vcf.gz`, SvPhaser produces:

### Primary: `sample_phased.csv`

A tabular summary with per-SV analysis, including:

* **Metadata**: `chrom`, `pos`, `id`, `end`, `svtype` (DEL/INS/INV/BND/DUP)
* **Evidence counts**: `hp1`, `hp2`, `nohp` (haplotype-tagged and untagged supporting reads)
* **Totals**: `tagged_total` (HP1+HP2), `support_total` (HP1+HP2+NOHP)
* **Decision metrics**:
  - `delta` — haplotype imbalance (max/tagged_total)
  - `equal_delta` — absolute difference (|HP1-HP2|/tagged_total)
  - `tag_frac` — fraction of support that is HP-tagged
* **Final calls**:
  - `gt` — phased genotype (`1|0`, `0|1`, `1|1`, or `./.`)
  - `gq` — Phred-scaled genotype quality (0–99)
  - `gq_label` — optional binned confidence level (e.g., "High", "Moderate")
  - `reason` — explanation code (e.g., "MinSupport", "Tie", "LowTagged")
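
Because the CSV is plain text, it can be summarised with the standard library alone. The sketch below tallies genotypes and lists confidently phased SVs. The three rows are made-up illustrations, not real SvPhaser output, and only a subset of the columns listed above is shown.

```python
import csv
import io
from collections import Counter

# Illustrative rows only; values are invented for the example.
example_csv = """chrom,pos,id,svtype,hp1,hp2,nohp,gt,gq
chr1,10500,sv1,DEL,12,0,3,1|0,36
chr1,88200,sv2,INS,6,7,2,1|1,4
chr2,40100,sv3,DEL,1,0,2,./.,0
"""


def summarize(handle, min_gq=30):
    """Count genotypes and collect IDs of phased SVs with GQ >= min_gq."""
    gt_counts = Counter()
    confident = []
    for row in csv.DictReader(handle):
        gt_counts[row["gt"]] += 1
        if row["gt"] != "./." and float(row["gq"]) >= min_gq:
            confident.append(row["id"])
    return gt_counts, confident


gt_counts, confident = summarize(io.StringIO(example_csv))
print(dict(gt_counts))  # {'1|0': 1, '1|1': 1, './.': 1}
print(confident)        # ['sv1']
```

For a real run, replace the `io.StringIO(...)` handle with an opened file, for example `open("results/sample_phased.csv", newline="")`.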

### Secondary: `sample_phased.vcf`

Interoperability output with:

* **FORMAT fields**: `GT` (phased), `GQ` (quality)
* **INFO annotations** (written unless `--no-svp-info` is passed):
  - `SVP_HP1`, `SVP_HP2`, `SVP_NOHP` — read counts
  - `SVP_TAGFRAC` — fraction tagged
  - `SVP_DELTA` — haplotype imbalance
  - `SVP_GQBIN` — confidence level label

The CSV is the **primary artifact for analysis**; the VCF is for compatibility and downstream tools.
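
Downstream tools can read the `SVP_*` annotations like any other INFO key. The sketch below uses plain string handling so it runs without a VCF library. The INFO string is a made-up example, and `parse_info` is a hypothetical helper, not an SvPhaser function.

```python
def parse_info(info_field):
    """Split a VCF INFO string into a dict; flag keys map to True."""
    out = {}
    for item in info_field.split(";"):
        if not item or item == ".":
            continue
        key, sep, value = item.partition("=")
        out[key] = value if sep else True
    return out


# Illustrative INFO string; the values are invented for the example.
info = parse_info(
    "SVTYPE=DEL;SVLEN=-250;SVP_HP1=12;SVP_HP2=0;SVP_NOHP=3;SVP_GQBIN=High"
)
svp = {k: v for k, v in info.items() if k.startswith("SVP_")}
tagged_total = int(svp["SVP_HP1"]) + int(svp["SVP_HP2"])
print(svp)
print(tagged_total)  # 12
```

With `cyvcf2`, the same values are available per record through `variant.INFO.get("SVP_HP1")` and similar calls.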

---

## Phasing decision logic (quick reference)

For each SV, SvPhaser counts reads by haplotype tag (HP=1, HP=2, or missing) and applies a **deterministic decision tree**:

1. **Minimum support gate**: If `support_total (HP1+HP2+NOHP) < min_support` → emit `./.` and drop SV
2. **Tagged support gate**: If `tagged_total (HP1+HP2) < min_tagged_support` → emit `./.`
3. **Tie detection**: If `|HP1 - HP2| / tagged_total ≤ equal_delta`
   - If `tie_to_hom_alt=True` and both HP1 > 0 and HP2 > 0 → emit `1|1` (both haplotypes carry the ALT allele)
   - Else → emit `./.` (ambiguous)
4. **Strong majority**: If `max(HP1, HP2) / tagged_total ≥ major_delta`
   - If HP1 > HP2 → emit `1|0` (ALT on haplotype 1)
   - If HP2 > HP1 → emit `0|1` (ALT on haplotype 2)
5. **Else**: → emit `./.` (weak or no signal)
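
The five steps above can be written as a short function. This is a sketch that follows the list literally, not a copy of `classify_haplotype()`: the name `sketch_genotype` and its signature are assumptions, it ignores `--support-mode` and size matching, and it returns only the genotype string.

```python
def sketch_genotype(
    hp1,
    hp2,
    nohp=0,
    *,
    min_support=10,
    min_tagged_support=3,
    major_delta=0.60,
    equal_delta=0.10,
    tie_to_hom_alt=True,
):
    """Apply the documented decision tree to one SV's read counts."""
    tagged_total = hp1 + hp2
    support_total = tagged_total + nohp

    # 1. Minimum support gate
    if support_total < min_support:
        return "./."
    # 2. Tagged support gate (also guards the divisions below)
    if tagged_total == 0 or tagged_total < min_tagged_support:
        return "./."
    # 3. Tie detection
    if abs(hp1 - hp2) / tagged_total <= equal_delta:
        if tie_to_hom_alt and hp1 > 0 and hp2 > 0:
            return "1|1"
        return "./."
    # 4. Strong majority
    if max(hp1, hp2) / tagged_total >= major_delta:
        return "1|0" if hp1 > hp2 else "0|1"
    # 5. Weak or no signal
    return "./."


print(sketch_genotype(12, 0, 3))   # 1|0
print(sketch_genotype(0, 9, 2))    # 0|1
print(sketch_genotype(10, 10))     # 1|1
print(sketch_genotype(23, 17))     # ./.
```

The last call shows the gap between the two thresholds: 23 vs 17 tagged reads is too unbalanced to count as a tie (0.15 > 0.10) but not unbalanced enough for a strong majority (0.575 < 0.60), so no genotype is assigned.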

**Genotype Quality (GQ)** is calculated from a **Phred-scaled binomial tail probability**:
* For shallow coverage (N ≤ 200): exact binomial test
* For deep coverage (N > 200): continuity-corrected normal approximation (avoids overflow)
* Capped at 99 (Phred scale)
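
To make the GQ description concrete, here is a minimal sketch of a Phred-scaled binomial tail. It is not the `phasing_gq()` implementation: the null hypothesis (each tagged read equally likely to come from either haplotype), the one-sided tail on the larger count, and the rounding are all assumptions made for illustration. Only the exact/approximate split at N = 200 and the cap at 99 come from the description above.

```python
import math


def binom_tail_exact(k, n):
    """P(X >= k) for X ~ Binomial(n, 0.5), using exact integer arithmetic."""
    numerator = sum(math.comb(n, i) for i in range(k, n + 1))
    return numerator / 2 ** n


def binom_tail_normal(k, n):
    """Continuity-corrected normal approximation to the same tail."""
    mean = n / 2
    sd = math.sqrt(n) / 2
    z = (k - 0.5 - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2))


def sketch_gq(n1, n2, cap=99, exact_limit=200):
    """Phred-scale the tail probability of the larger haplotype count."""
    n = n1 + n2
    if n == 0:
        return 0
    k = max(n1, n2)
    p = binom_tail_exact(k, n) if n <= exact_limit else binom_tail_normal(k, n)
    if p <= 0.0:  # underflow in the approximation: treat as maximal confidence
        return cap
    return min(cap, int(round(-10 * math.log10(p))))


print(sketch_gq(10, 0))   # 30
print(sketch_gq(500, 0))  # 99
```

Under these assumptions, ten reads all on one haplotype give a tail probability of 1/1024 and therefore a GQ of 30, while very deep, fully one-sided support saturates at the cap.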

A full, implementation-faithful description of the algorithm is provided in **`docs/Methodology.md`**. It covers:

* evidence collection
* haplotype decision logic
* pseudocode
* workflow diagram

That document is the authoritative reference for reviewers and users seeking algorithmic clarity.

---

## Python API

```python
from pathlib import Path
from svphaser import phase

# Simple usage
out_vcf, out_csv = phase(
    "sample.vcf.gz",
    "sample.sorted_phased.bam",
    out_dir="results",
)

# Full control
out_vcf, out_csv = phase(
    "sample.vcf.gz",
    "sample.sorted_phased.bam",
    out_dir="results",
    min_support=10,
    min_tagged_support=3,
    major_delta=0.60,
    equal_delta=0.10,
    support_mode="hybrid",
    bp_window=100,
    dynamic_window=True,
    tie_to_hom_alt=True,
    gq_bins="30:High,10:Moderate",
    threads=8,
    size_match_required=True,
    size_tol_abs=10,
    size_tol_frac=0.0,
)

print(f"Phased VCF: {out_vcf}")
print(f"Summary CSV: {out_csv}")
```

Returns a tuple: `(phased_vcf_path, summary_csv_path)`

Alternatively, use the lower-level API directly:

```python
from pathlib import Path

from svphaser.phasing.io import phase_vcf
from svphaser.phasing.types import WorkerOpts

opts = WorkerOpts(
    min_support=10,
    min_tagged_support=3,
    major_delta=0.60,
    equal_delta=0.10,
    tie_to_hom_alt=True,
    support_mode="hybrid",
    bp_window=100,
    dynamic_window=True,
    size_match_required=True,
    size_tol_abs=10,
    size_tol_frac=0.0,
    gq_bins=[(30, "High"), (10, "Moderate")],
)

phase_vcf(
    Path("sample.vcf.gz"),
    Path("sample.bam"),
    out_dir=Path("results"),
    worker_opts=opts,
    threads=8,
)
```

---

## Repository structure

```
SvPhaser/
├─ src/svphaser/            # main package
│  ├─ cli.py               # CLI interface (Typer app)
│  ├─ __init__.py          # public API (phase() function)
│  ├─ logging.py           # logging configuration
│  ├─ phasing/             # core algorithms & I/O
│  │  ├─ algorithms.py     # haplotype classification, GQ calculation (pure math)
│  │  ├─ io.py            # orchestration, CSV/VCF writing (per-chromosome workers)
│  │  ├─ _workers.py      # internal: per-chromosome worker, read evidence counting
│  │  ├─ types.py         # WorkerOpts, CallTuple, type aliases
│  │  └─ __init__.py      # public API exports
│  └─ py.typed            # PEP 561 marker for type information
│
├─ tests/                   # unit & regression tests
│  ├─ test_algorithms.py   # GQ, classification logic
│  ├─ test_cli_smoke.py    # CLI smoke tests
│  ├─ test_io.py          # CSV/VCF output validation
│  ├─ test_workers.py     # BAM parsing, read counting
│  └─ data/               # minimal test fixtures
│
├─ docs/                    # documentation
│  ├─ Methodology.md       # algorithmic deep-dive (implementation-faithful)
│  └─ Presentation/        # slide decks & figures
│
├─ Benchmarking_Analysis/   # perf analysis & results
├─ pyproject.toml          # PEP 621 metadata, build config
├─ requirements.txt        # runtime dependencies (mirror of pyproject)
├─ requirements-dev.txt    # dev/test dependencies
├─ README.md              # this file
├─ CONTRIBUTING.md        # contributor guidelines
├─ CODE_OF_CONDUCT.md     # community standards
├─ LICENSE                # MIT
└─ CHANGELOG.md           # version history
```

### Core modules

**`algorithms.py`** — Pure mathematics (no I/O)
* `phasing_gq(n1, n2)` — Phred-scaled genotype quality (binomial tail + normal approx)
* `classify_haplotype(n1, n2, ...)` — GT decision tree (returns `("1|0"|"0|1"|"1|1"|"./.", gq)`)
* Threshold logic: `major_delta`, `equal_delta`, `min_support`, `tie_to_hom_alt`

**`_workers.py`** — Per-chromosome logic
* Read BAM for each chromosome, count HP tags
* Apply size-consistency filters (DEL/INS)
* Call `classify_haplotype()` for each SV
* Return formatted results (gt, gq, reason)

**`io.py`** — Orchestration & I/O
* Parse VCF header, spawn workers (one per chromosome)
* Merge per-chromosome results, apply global filters
* Write phased VCF + CSV summary
* Backfill optional columns (gq_label, tag_frac, etc.)
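
The one-worker-per-chromosome pattern can be illustrated with the standard library. This sketch is not SvPhaser's `io.py`: `phase_chromosome` is a hypothetical stub that emits placeholder genotypes, and a thread pool is used here only to keep the example self-contained. Whether SvPhaser uses threads or processes is not stated in this README.

```python
from concurrent.futures import ThreadPoolExecutor


def phase_chromosome(chrom, positions):
    """Hypothetical worker: returns placeholder calls in position order."""
    return [(chrom, pos, "./.") for pos in sorted(positions)]


def phase_all(records_by_chrom, threads=1):
    """Run one task per chromosome and merge results in input order."""
    chroms = list(records_by_chrom)
    with ThreadPoolExecutor(max_workers=max(1, threads)) as pool:
        # Executor.map yields results in input order, so the merged output
        # does not depend on which worker finishes first.
        per_chrom = pool.map(
            lambda c: phase_chromosome(c, records_by_chrom[c]), chroms
        )
        return [row for rows in per_chrom for row in rows]


merged = phase_all({"chr1": [300, 100], "chr2": [50]}, threads=2)
print(merged)
# [('chr1', 100, './.'), ('chr1', 300, './.'), ('chr2', 50, './.')]
```

Merging in a fixed chromosome order keeps the output reproducible regardless of scheduling, which fits the deterministic design described earlier.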

---

## Citing SvPhaser

If SvPhaser contributes to your research, please cite:

```bibtex
@software{svphaser2026,
  author  = {Pranjul Mishra and Sachin Gadakh},
  title   = {SvPhaser: Haplotype-aware phasing of structural variants from long-read data},
  version = {2.2.2},
  year    = {2026},
  url     = {https://github.com/SFGLab/SvPhaser},
  note    = {PyPI: https://pypi.org/project/svphaser/}
}
```

For maximum reproducibility, include the exact git commit hash used.

---

## License

SvPhaser is released under the **MIT License** — see [LICENSE](LICENSE).

---

## Contact

Developed at **SFG Lab (BioAI)**.

* **Pranjul Mishra** — [pranjul.mishra@proton.me](mailto:pranjul.mishra@proton.me)
* **Sachin Gadakh** — [s.gadakh@cent.uw.edu.pl](mailto:s.gadakh@cent.uw.edu.pl)

Bug reports and feature requests: please open a GitHub issue.
