Metadata-Version: 2.4
Name: saltshaker
Version: 1.1.1
Summary: Pattern classification and visualization for MitoSAlt
License: GPL-3.0
Classifier: Programming Language :: Python :: 3
Classifier: Operating System :: OS Independent
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas>=1.3.0
Requires-Dist: numpy>=1.20.0
Requires-Dist: matplotlib>=3.3.0
Requires-Dist: biopython>=1.78
Provides-Extra: coverage
Requires-Dist: pyBigWig>=0.3.18; extra == "coverage"
Provides-Extra: all
Requires-Dist: pyBigWig>=0.3.18; extra == "all"
Dynamic: license-file

# SAltShaker

A Python package for classifying and visualizing mitochondrial structural variants from MitoSAlt pipeline output.

![SAltShaker Logo](saltshaker/assets/logo.png)

## Overview

SAltShaker is a Python port and extension of the original MitoSAlt ([Basu et al. PLoS Genetics 2020](https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1009242)) `delplot.R` visualization script. The package provides three modular commands for a flexible analysis workflow:

- **Event calling**: Direct Python port of the original R script's deletion/duplication classification logic
- **Pattern classification**: Rule-based decision tree to distinguish single and multiple type of events from background
- **Visualization**: Circular genome plotting based on the original R script visualization with spatial grouping and annotations

## Installation

### From PyPI (recommended)

```bash
pip install saltshaker
```

For BigWig coverage track support (optional, see [Optional dependencies](#optional-dependencies)):

```bash
pip install saltshaker[coverage]
```

### From GitLab package registry

If you need a pre-release or an internal build not yet on PyPI:

```bash
# pip
pip install saltshaker --index-url https://gitlab.com/api/v4/groups/16758292/-/packages/pypi/simple

# uv
uv add --index https://gitlab.com/api/v4/groups/16758292/-/packages/pypi/simple saltshaker

# Poetry
poetry source add --priority=primary dpipe https://gitlab.com/api/v4/groups/16758292/-/packages/pypi/simple
poetry add saltshaker
```

### From source

```bash
git clone https://gitlab.com/GenomeDX/annotation/saltshaker.git
cd saltshaker
pip install -e .
```

### Docker

```bash
# Build the image
docker build -t saltshaker .

# Run with mounted data directory
docker run -v /path/to/your/data:/data saltshaker [command] [options]
```

### Optional dependencies

SAltShaker's core functionality (call / classify / plot) has no optional dependencies. The following extras unlock additional features:

- **`coverage`** — `pip install saltshaker[coverage]`
  Enables the BigWig coverage track (`saltshaker plot --coverage`).
  Requires [pyBigWig](https://github.com/deeptools/pyBigWig), which has a compiled C extension and is only available on Linux/macOS.

> **Why optional?** `pyBigWig` requires a C build environment and is not available on Windows. Making it optional keeps the base install lightweight and cross-platform.

## Commands

### 1. `saltshaker call` - Event calling (R script port)

**Python port of the original MitoSAlt R script event calling logic.** Calls detected breakpoint clusters as deletions or duplications based on replication origin overlap, following the exact algorithm from `delplot.R`.

**Usage:**

```bash
saltshaker call \
    --prefix sample \                       # Sample identifier
    --output-dir results/ \                 # Output directory
    -c sample.cluster \                     # cluster file from MitoSAlt
    -p sample.breakpoint \                  # breakpoint file from MitoSAlt
    -r reference.fasta \                    # mitochondrial reference genome
    -g 16569 \                              # genome length
    --ori-h-start 16081 --ori-h-end 407 \   # heavy strand origin
    --ori-l-start 5730 --ori-l-end 5763 \   # light strand origin
    -H 0.01 \                               # heteroplasmy threshold
    -f 15 \                                 # flanking sequence size (bp)
    --blacklist                             # Optional: enable with default MT blacklist, OR
    --blacklist custom_bl.bed               # Optional: enable with custom BED file
```

**Outputs:**

- `results/sample.saltshaker_call.tsv` - Human-readable results matching original R script output format
- `results/sample.saltshaker_call_metadata.tsv` - Metadata for downstream processing (classify/plot)

**Algorithm (from original R script):**

The  logic is a direct port of the original MitoSAlt R implementation:

1. **Data loading**: Parses cluster and breakpoint files, merges data to link clusters with D-loop crossing information
2. **Deletion size calculation**: Handles circular genome wraparound for events crossing position 1
3. **Origin-based event calling**: Events overlapping replication origins (OriH or OriL) are called as duplications; non-overlapping events are deletions
4. **Coordinate handling**: Implements the R script's coordinate swapping logic for D-loop crossing events
5. **Flanking sequence analysis**: Uses Biostrings-equivalent pattern matching to find microhomology sequences near breakpoints

This approach identifies the arc complementary to the actual structural change, consistent with the biological interpretation that origin-overlapping events represent duplications of the non-deleted arc.

**Original R script functionality preserved:**

- Exact deletion/duplication calling algorithm
- D-loop crossing detection and coordinate handling
- Flanking sequence extraction and microhomology analysis
- Output TSV format and column names
- Heteroplasmy calculation and filtering

**Python enhancements in call step:**

- Blacklist region detection and flagging if `--blacklist` flag is supplied.

### 2. `saltshaker classify` - Pattern classification

**Extended analysis beyond the original R script.** Performs spatial grouping and classifies the overall pattern as Single, Multiple or background based on heteroplasmy levels and spatial event distribution.

**Usage:**

```bash
saltshaker classify \
    --prefix sample1 \              # Sample identifier (matches call output)
    --input-dir results/ \          # Input directory with .saltshaker_call.tsv
    --output-dir results/ \         # Output directory (default: same as input-dir)
    --blacklist                     # Optional: enable with default MT blacklist, OR
    --blacklist custom_bl.bed       # Optional: enable with custom BED file
    --vcf \                         # Optional: also output VCF format
    --chr-format \                  # Optional: specify one of MT or chrM (default: chrM)
    --high-het 10 \                 # Optional: high heteroplasmy threshold % (default: 20)
    --noise 0.3 \                   # Optional: noise threshold % (default: 1.0)
    --radius 600 \                  # Optional: spatial clustering radius bp (default: 600)
    --multiple-threshold 5 \       # Optional: event count for Multiple pattern (default: 10)
    --dominant-fraction 0.5        # Optional: fraction for dominant group (default: 0.70)
```

**Outputs:**

- `results/sample.saltshaker_classify.txt` - Detailed analysis report with classification reasoning
- `results/sample.saltshaker_classify_metadata.tsv` - Events with spatial group assignments (for plotting)
- `results/sample.saltshaker.vcf` - events in VCF format (if `--vcf` specified)

**Classification criteria:**

**Single pattern**:

- One or few high-heteroplasmy events (≥`high-het`)
- Dominant spatial group (≥`dominant-fraction` of events)
- Few total events (≤`multiple-threshold`)
- Consistent with pathogenic single deletion/duplication

**Multiple pattern**:

- Many events (>`multiple-threshold`)
- Dispersed spatial distribution (no dominant group)
- No high-heteroplasmy events
- Consistent with mtDNA maintenance defects

**Spatial grouping:**
Events within `radius` bp are grouped together.

### 3. `saltshaker plot` - Visualization

Generates circular genome plots based on the original R script visualization with enhanced spatial grouping and other features.

**Usage:**

```bash
saltshaker plot \
    --prefix sample1 \
    --input-dir results/ \
    --output-dir results/plots/ \   # Optional: default is input-dir
    --genes \                       # Optional: enable with default MT genes, OR
    --genes custom_genes.bed \      # Optional: enable with custom BED file
    --blacklist \                   # Optional: enable with default MT blacklist, OR
    --blacklist custom_bl.bed \     # Optional: enable with custom BED file
    --coverage sample1.bw \         # Optional: BigWig file for coverage track (requires saltshaker[coverage])
    --figsize 16 10 \               # Optional: width height (default: 16 10)
    --direction clockwise \         # Optional: clockwise or counterclockwise (default: counterclockwise)
    --del-color red \               # Optional: red or blue (default: blue)
    --dup-color blue \              # Optional: red or blue (default: red)
    --scale fixed                   # Optional: dynamic or fixed (default: dynamic)
```

**Output:**

- `results/plot/sample.saltshaker.png` - Circular genome visualization

**Visualization features:**

- Circular genome with arc-based event display
- Heteroplasmy gradient coloring for all event types (del/dup/blacklist-crossing events - BL)
- Dynamic (min-max %) or fixed (0-100%) heteroplasmy scale
- Spatial grouping for overlapping events
- Optional gene annotations track and labels
- Optional blacklist region marking: light-grey sector shading with dashed boundary lines (BL-crossing events in lime-green gradient)
- Replication origin markers (OriH / OriL): dashed dark-purple radial lines drawn from the circle centre to the genome ring, propagated automatically from `call` through `classify` to `plot`
- Optional BigWig coverage track: step-function ring drawn outside the gene/genome circle, percentile-normalised, with a scale indicator bracket
- Configurable colors and polar direction

#### Color scaling considerations

```bash
Example sample with 5-25% heteroplasmy:

Dynamic scale:
[████████████████████] ← Colors span full gradient from 5% to 25%
5%                 25%

Fixed scale:
[█████░░░░░░░░░░░░░░░] ← Colors span from 0% to 100%, events appear lighter
0%               100%
```

## Input files

### Required (from MitoSAlt pipeline)

**Cluster file** (`.cluster`)

- Tab-separated clustered breakpoint data
- Generated by MitoSAlt's clustering step
- Columns: cluster ID, read counts, positions, heteroplasmy levels

**Breakpoint file** (`.breakpoint`)

- Tab-separated raw breakpoint data
- Generated by MitoSAlt's breakpoint detection step
- Columns: read names, positions, D-loop crossing flags

**Reference genome** (`.fasta`)

- Mitochondrial reference sequence
- Used for flanking sequence extraction and coordinate validation

### Optional

**Blacklist file** (`.bed`)

- BED format regions to flag (e.g., artifacts, repetitive sequences)
- Format: chromosome, start position, end position, name

**Genes file** (`.bed`)

- BED format gene regions to plot around the genomic axis
- Format: chromosome, start position, end position, name, score (0), strand (+), thick start (same as start), think end (same as end), color in rgb code (e.g. `255,255,0`)

## Output formats

### Display TSV (`{prefix}.saltshaker_call.tsv`)

Human-readable format matching original R script output with columns:

- `sample`: Sample identifier
- `cluster_id`: Cluster identifier
- `alt_reads`, `ref_reads`: Read counts
- `heteroplasmy`: Heteroplasmy percentage
- `del_start_range`, `del_end_range`: Coordinate ranges
- `del_size`: Event size in base pairs
- `final_event`: Event type (del/dup)
- `final_start`, `final_end`: Final coordinates
- `blacklist_crossing`: Flag for blacklist overlap
- `seq1`, `seq2`, `seq`: Flanking sequences and microhomology

### Metadata files

**Call metadata** (`{prefix}.saltshaker_call_metadata.tsv`):
Internal format preserving all columns for downstream processing. Header contains `genome_length`, `ori_h_start`, `ori_h_end`, `ori_l_start`, `ori_l_end` so downstream steps do not need to re-specify origin coordinates.

**Classify metadata** (`{prefix}.saltshaker_classify_metadata.tsv`):
Internal format with additional `group` column for spatial group assignments. Origin coordinate headers are forwarded from call metadata. Used by plot command.

### Analysis summary (`{prefix}.saltshaker_classify.txt`)

Human-readable analysis report including:

- Pattern classification (Single/Multiple) with reasoning
- Event statistics and heteroplasmy distribution
- Spatial clustering metrics
- Classification criteria scores

### VCF format (`{prefix}.saltshaker.vcf`)

Standard VCF 4.3 format with structural variant fields:

- `SVTYPE`: DEL or DUP
- `END`: Variant end position
- `SVLEN`: Variant length
- `HF`: Heteroplasmy fraction (0-1)
- `GROUP`: Spatial group identifier
- `CLUSTER`: Original cluster ID
- `DLOOP`: Flag for D-loop crossing
- `BLCROSS`: Flag for blacklist crossing

### Circular plot (`{prefix}.saltshaker.png`)

## Complete workflow example

**Single-sample pipeline:**

```bash
# Step 1: Call events from MitoSAlt output (R script port)
saltshaker call \
    --prefix sample1 \
    --output-dir results/ \
    -c sample1.cluster -p sample1.breakpoint \
    -r reference.fasta \
    -g 16569 --ori-h-start 16081 --ori-h-end 407 \
    --ori-l-start 5730 --ori-l-end 5763 \
    --blacklist

# Step 2: Classify pattern and perform spatial grouping (extended analysis)
saltshaker classify \
    --prefix sample1 \
    --input-dir results/ \
    --blacklist \
    --vcf

# Step 3: Generate visualization (enhanced R script plotting)
saltshaker plot \
    --prefix sample1 \
    --input-dir results/ \
    --output-dir results/plot/ \
    --blacklist \
    --genes
```

**Batch processing multiple samples:**

```bash
for sample in sample1 sample2 sample3; do
    mkdir -p results/${sample}
    # Call events
    saltshaker call --prefix ${sample }--output-dir results/${sample} \
        -c ${sample}.cluster -p ${sample}.breakpoint \
        -r reference.fasta
        -g 16569 --ori-h-start 16081 --ori-h-end 407 \
        --ori-l-start 5730 --ori-l-end 5763 \
        --blacklist
    # Classify events
    saltshaker classify --prefix ${sample} \
        --input-dir results/${sample} \
        --blacklist
    # Plot events
    saltshaker plot --prefix ${sample} \
        --input-dir results/${sample} \
        --blacklist \
        --genes
done
```

## Configuration

Default classification thresholds are defined in `saltshaker/config.py`:

```python
# Heteroplasmy thresholds
HIGH_HET_THRESHOLD = 10.0        # High heteroplasmy threshold (%), --high-het
NOISE_THRESHOLD = 0.3            # Noise threshold (%), --noise

# Spatial clustering
CLUSTER_RADIUS = 600             # Spatial grouping radius (bp), --radius
MIN_CLUSTER_SIZE = 2             # Minimum events per cluster (not configurable via CLI)

# Pattern classification
MULTIPLE_EVENT_THRESHOLD = 5     # Event count for Multiple pattern, --multiple-threshold
DOMINANT_GROUP_FRACTION = 0.5    # Fraction for dominant group (70%), --dominant-fraction
```

These can be customized by modifying the configuration file or via CLI arguments (see `saltshaker classify --help`).

## Command reference

### Global options

All commands support:

- `-h, --help`: Show help message
- `--blacklist [FILE]`: Enable blacklist regions (default: built-in MT blacklist, use as a flag; optional: custom BED file)

### `call` command

**Required:**

- `--prefix STR`: Sample prefix for output files
- `-c, --cluster FILE`: Cluster file from MitoSAlt
- `-p, --breakpoint FILE`: Breakpoint file from MitoSAlt
- `-r, --reference FILE`: Reference genome FASTA
- `-g, --genome-length INT`: Mitochondrial genome length
- `--ori-h-start INT`: Heavy strand origin start
- `--ori-h-end INT`: Heavy strand origin end
- `--ori-l-start INT`: Light strand origin start
- `--ori-l-end INT`: Light strand origin end

**Optional:**

- `--output-dir DIR`: Output directory (default: .)
- `-H, --het-limit FLOAT`: Heteroplasmy threshold (default: 0.01)
- `-f, --flank-size INT`: Flanking sequence size in bp (default: 15)

### `classify` command

**Required:**

- `--prefix STR`: Sample prefix (matches call output)
- `--input-dir DIR`: Input directory containing saltshaker_call_metadata.tsv from call

**Optional:**

- `--output-dir DIR`: Output directory (default: input-dir)
- `--vcf`: Also output VCF format
- `--chr-format`: Choose chromosome format for VCF output: MT or chrM (default: chrM)
- `--high-het FLOAT`: High heteroplasmy threshold % (default: 20)
- `--noise FLOAT`: Noise threshold % (default: 1)
- `--radius INT`: Spatial clustering radius bp (default: 600)
- `--multiple-threshold INT`: Event count for Multiple pattern (default: 10)
- `--dominant-fraction FLOAT`: Fraction for dominant group (default: 0.70)

### `plot` command

**Required:**

- `--prefix STR`: Sample prefix (matches classify output)
- `--input-dir DIR`: Input directory containing saltshaker_classify_metadata.tsv from classify

**Optional:**

- `--output-dir DIR`: Output directory (default: input-dir)
- `--genes [FILE]`: Enable gene annotations (default: built-in MT genes; optional: custom BED file)
- `--figsize WIDTH HEIGHT`: Figure dimensions (default: 16 10)
- `--direction STR`: Plot direction - 'clockwise' or 'counterclockwise' (default: counterclockwise, MitoSAlt original)
- `--del-color STR`: Deletion color - 'red' or 'blue' (default: blue, MitoSAlt original)
- `--dup-color STR`: Duplication color - 'red' or 'blue' (default: red, MitoSAlt original)
- `--scale STR`: Heteroplasmy scale - 'dynamic' (min-max per category) or 'fixed' (0-100%) (default: dynamic)
- `--show-origins STR`: Draw replication origin markers - 'on' or 'off' (default: on); origin coordinates are propagated automatically from `call` via the metadata file
- `--coverage BW_FILE`: Path to a BigWig (`.bw`) file; draws a coverage ring outside the genome circle, normalised to percentile scale. Requires `pip install saltshaker[coverage]`.

## Dependencies

### Core (always installed)

```text
pandas>=1.3.0
numpy>=1.20.0
matplotlib>=3.3.0
biopython>=1.78
```

### Optional - `coverage` extra

```text
pyBigWig        # BigWig coverage track; Linux/macOS only (C extension)
```

Install with: `pip install saltshaker[coverage]`

## Package structure

```text
saltshaker/
├── __init__.py
├── __main__.py                  # CLI entry point with subcommands
├── config.py                    # Configuration and thresholds
├── types.py                     # Type definitions and data structures
├── event_caller.py              # Event calling (R script port)
├── classifier.py                # Pattern classification (single vs multiple vs background)
├── spatial.py                   # Spatial grouping
├── visualizer.py                # Circular plotting
├── utils.py                     # Utility functions
│
├── assets/
│   └── logo.png
│
├── layout/                      # Layout engine
│   ├── __init__.py
│   ├── engine.py                # LayoutEngine class
│   └── types.py                 # Layout-specific types
│
├── cli/
│   ├── __init__.py
│   ├── call.py                  # Call subcommand
│   ├── classify.py              # Classify subcommand
│   └── plot.py                  # Plot subcommand
│
├── io/
│   ├── __init__.py
│   ├── readers.py               # File readers (blacklist, intermediate, gene annotations, BigWig)
│   ├── writers.py               # TSV and summary output
│   └── vcf_writer.py            # VCF format output
│
├── data/
│   ├── __init__.py                              # Default file paths
│   ├── gencode.v49.annotation.MT_genes.bed      # Default hg38 MT gene annotations
│   └── mt_blacklist_regions.bed                 # Default MT blacklist regions
│
├── docs/
│   ├── classification_algorithm.md              # Classification logic documentation
│   └── visualizer_description.md               # Visualizer architecture documentation
│
└── tests/
    ├── conftest.py                              # Shared fixtures
    ├── pytest.ini
    ├── fixtures/
    │   ├── inputs/
    │   │   ├── human_mt_rCRS.fasta              # Reference FASTA
    │   │   ├── test_breakpoints.breakpoint      # Raw breakpoints for integration tests
    │   │   ├── test_clusters.cluster            # Raw clusters for integration tests
    │   │   ├── viz_sample_small.tsv             # Small test dataset (15 events)
    │   │   └── viz_sample_large.tsv             # Large test dataset (80 events)
    │   └── expected/
    │       ├── ground_truth.json                # Expected call/classify outputs
    │       └── visualizer_layouts.json          # Expected visualization layout characteristics
    ├── unit/
    │   ├── test_bigwig_reader.py                # BigWig reader unit tests
    │   ├── test_helpers.py                      # Utility function tests
    │   ├── test_intermediate_io.py              # Intermediate file read/write roundtrip tests
    │   ├── test_layout_engine.py                # Layout engine unit tests
    │   └── test_label_positioning.py            # Label collision detection tests
    └── integration/
        ├── test_saltshaker_output.py            # End-to-end event calling tests
        └── test_visualizer.py                   # End-to-end visualization tests
```

## Documentation

- [Classification algorithm](saltshaker/docs/classification_algorithm.md) — Detailed explanation of the Single vs Multiple pattern classification logic
- [Visualizer architecture](saltshaker/docs/visualizer_description.md) — Layout engine, visual elements, configuration reference
