Metadata-Version: 2.4
Name: cnvis
Version: 0.1.0
Summary: A Python toolkit for copy number visualization and multi-sample comparison
Home-page: https://github.com/yelingqun/cnvis
Author: Lingqun Ye
Author-email: yelingqun@gmail.com
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.7
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas
Requires-Dist: numpy
Requires-Dist: pathlib
Requires-Dist: pyBigWig
Requires-Dist: bioframe
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: license-file
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# CNVis

A lightweight Python toolkit for **C**opy **N**umber **Vis**ualization and multi-sample comparison. Designed for publication-quality genome-wide plots with minimal dependencies.

## Features

- **Binned coverage analysis** from BedGraph, CSV, or BigWig files
- **Multi-sample coverage matrices** at gene, chromosome arm, or fixed-bin resolution
- **Publication-quality genome-wide plots** with chromosome-proportional layouts
- **Segment-based smoothing** using ASCAT or other segmentation results
- **Built-in segmentation** using PELT or CBS algorithms for quick exploration
- **Gap filtering** with multiple methods (constant fill, neighbor interpolation, removal)
- **Bundled reference data** including hg38 gap regions for easy filtering

## Requirements

- Python 3.7 or later
- pandas, numpy, matplotlib, seaborn
- bioframe, pyBigWig
- ruptures (optional, for PELT segmentation)

## Installation

```bash
pip install git+https://github.com/yelingqun/cnvis.git
```

Or download the repository zip from GitHub and install it locally:

```bash
pip install cnvis-main.zip
```

---

## Quick Start

```python
import cnvis as cv
import pandas as pd

# Build a coverage matrix from multiple samples
matrix = cv.coverage_matrix_bins(
    input_files=['sample1.bedgraph', 'sample2.bedgraph'],
    names=['sample1', 'sample2'],
    bins_size=2_000_000
)

# Plot genome-wide coverage
genome_size = pd.read_csv('hg38_genome_size.tsv', sep='\t')
cv.plot_coverage(matrix, genome_size, y_column='sample1')
```

---

## Workflow Guide

### Workflow 1: Multi-Sample Coverage Matrix Analysis

This workflow creates binned coverage matrices for comparing multiple samples.

#### Step 1: Prepare Input Files

CNVis accepts coverage files in these formats:
- **BedGraph**: `chrom start end value` (tab-separated)
- **CSV**: Must contain `chrom`, `start`, `end`, and a value column
- **BigWig**: Standard bigWig format (`.bw`)
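
As a quick check of the BedGraph layout, the sketch below parses a few rows with the standard library only. The sample rows and the `parse_bedgraph` helper are made up for illustration and are not part of CNVis; for real files, `cv.load_coverage_file` is the loader to use.

```python
import csv
import io

# Three made-up BedGraph rows: chrom, start, end, value (tab-separated, no header)
BEDGRAPH_TEXT = "chr1\t0\t1000\t1.8\nchr1\t1000\t2000\t2.1\nchr2\t0\t1000\t3.9\n"


def parse_bedgraph(handle):
    """Hypothetical helper: yield (chrom, start, end, value) tuples."""
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and optional 'track'/'browser'/comment header lines
        if not row or row[0].startswith(("track", "browser", "#")):
            continue
        chrom, start, end, value = row[:4]
        yield chrom, int(start), int(end), float(value)


records = list(parse_bedgraph(io.StringIO(BEDGRAPH_TEXT)))
print(records[0])  # ('chr1', 0, 1000, 1.8)
```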

#### Step 2: Build Coverage Matrix

```python
import cnvis as cv

# List your coverage files and sample names
input_files = [
    'sample1.bedgraph',
    'sample2.bedgraph',
    'sample3.bedgraph'
]
names = ['sample1', 'sample2', 'sample3']

# Create coverage matrix with 2Mb bins
matrix = cv.coverage_matrix_bins(
    input_files=input_files,
    names=names,
    bins_size=2_000_000,      # 2Mb bins
    max_value=8,               # Clip outliers above 8
    normalize_median=True      # Normalize each sample to median=2
)
```
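
To make "fixed-size bins" concrete, here is a minimal sketch of one common way to aggregate coverage intervals into bins: an overlap-weighted mean over the covered bases. How `coverage_matrix_bins` aggregates internally is not documented here, so treat `bin_coverage` and the sample intervals as assumptions for illustration.

```python
from collections import defaultdict


def bin_coverage(intervals, bin_size):
    """Hypothetical helper: overlap-weighted mean value per fixed-size bin.

    intervals: iterable of (chrom, start, end, value), half-open, end > start.
    Returns {(chrom, bin_start): mean value over the covered bases}.
    """
    weighted_sum = defaultdict(float)
    covered = defaultdict(int)
    for chrom, start, end, value in intervals:
        first_bin = (start // bin_size) * bin_size
        for bin_start in range(first_bin, end, bin_size):
            # Number of bases of this interval that fall inside the bin
            overlap = min(end, bin_start + bin_size) - max(start, bin_start)
            weighted_sum[(chrom, bin_start)] += value * overlap
            covered[(chrom, bin_start)] += overlap
    return {key: weighted_sum[key] / covered[key] for key in weighted_sum}


# Made-up intervals on one chromosome
intervals = [
    ("chr1", 0, 1_500_000, 2.0),
    ("chr1", 1_500_000, 2_000_000, 4.0),
    ("chr1", 2_000_000, 3_000_000, 1.0),
]
binned = bin_coverage(intervals, bin_size=2_000_000)
print(binned)  # {('chr1', 0): 2.5, ('chr1', 2000000): 1.0}
```

The first bin mixes 1.5 Mb at 2.0 with 0.5 Mb at 4.0, giving 2.5. The second bin is only half covered, so its mean is taken over the covered 1 Mb.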

**Alternative binning options:**
```python
import pandas as pd

# By chromosome arms (p/q arms)
matrix_arms = cv.coverage_matrix_arms(input_files, names, genome='hg38')

# By gene regions
# As written, read_csv expects a header row naming the columns: chrom, start, end, name
genes_df = pd.read_csv('genes.bed', sep='\t')
matrix_genes = cv.coverage_matrix_genes(input_files, names, genes=genes_df)
```

#### Step 3: Filter Genomic Gaps

Remove or interpolate coverage in problematic regions (centromeres, gaps, etc.):

```python
import pandas as pd
from importlib.resources import files  # Python 3.9+; on 3.7/3.8 use the importlib_resources backport

# Load bundled hg38 gap regions (included with cnvis)
gap_file = files('cnvis.data').joinpath('hgTables_gap_hg38.tsv')
gap_df = pd.read_csv(gap_file, sep='\t')[['chrom', 'chromStart', 'chromEnd']]

# Or load your own gap file
# gap_df = pd.read_csv('hg38_gaps.tsv', sep='\t')[['chrom', 'chromStart', 'chromEnd']]

# Filter gaps with 100kb buffer
matrix_filtered = cv.filter_gaps(
    matrix,
    gap_df,
    buffer=100_000,        # Extend gap regions by 100kb
    method='neighbor',     # 'neighbor', 'constant', or 'remove'
    gap_value=2,           # Value to use if method='constant'
    window=3               # Window size for neighbor interpolation
)
```
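
The `buffer` argument widens each gap before bins are matched against it. A minimal sketch of that idea, assuming half-open coordinates; `overlaps_gap` and the gap coordinates are hypothetical and only illustrate the interval arithmetic, not the CNVis implementation.

```python
def overlaps_gap(bin_start, bin_end, gaps, buffer=100_000):
    """Hypothetical helper: True if a half-open bin overlaps any buffered gap."""
    for gap_start, gap_end in gaps:
        # Pad the gap on both sides, without going below position 0
        padded_start = max(0, gap_start - buffer)
        padded_end = gap_end + buffer
        if bin_start < padded_end and bin_end > padded_start:
            return True
    return False


# Made-up gap on one chromosome (not real hg38 coordinates)
gaps = [(1_000_000, 1_200_000)]

print(overlaps_gap(850_000, 950_000, gaps))            # True: inside the 100kb buffer
print(overlaps_gap(850_000, 950_000, gaps, buffer=0))  # False: outside the raw gap
```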

**Gap filtering methods:**
- `'neighbor'`: Interpolate using neighboring bin values (recommended)
- `'constant'`: Fill with a fixed value (default: 2)
- `'remove'`: Drop gap bins entirely from the DataFrame
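
The three methods can be compared on a toy list of bin values. This is a sketch under stated assumptions: `fill_gaps` is a hypothetical stand-in, and using the median of nearby non-gap bins for `'neighbor'` is one plausible choice; CNVis may interpolate differently.

```python
from statistics import median


def fill_gaps(values, is_gap, method="neighbor", gap_value=2, window=3):
    """Hypothetical illustration of the three strategies on one chromosome.

    values: list of bin values; is_gap: list of booleans of the same length.
    """
    if method == "remove":
        return [v for v, gap in zip(values, is_gap) if not gap]
    if method == "constant":
        return [gap_value if gap else v for v, gap in zip(values, is_gap)]
    if method == "neighbor":
        filled = []
        for i, (v, gap) in enumerate(zip(values, is_gap)):
            if not gap:
                filled.append(v)
                continue
            # Non-gap bins within `window` positions on either side
            lo, hi = max(0, i - window), min(len(values), i + window + 1)
            neighbors = [values[j] for j in range(lo, hi) if not is_gap[j]]
            # Fall back to the constant if no usable neighbor exists
            filled.append(median(neighbors) if neighbors else gap_value)
        return filled
    raise ValueError(f"unknown method: {method!r}")


values = [2.0, 2.1, 0.1, 0.0, 1.9, 2.0]
is_gap = [False, False, True, True, False, False]

print(fill_gaps(values, is_gap, method="neighbor"))  # [2.0, 2.1, 2.0, 2.0, 1.9, 2.0]
print(fill_gaps(values, is_gap, method="constant"))  # [2.0, 2.1, 2, 2, 1.9, 2.0]
print(fill_gaps(values, is_gap, method="remove"))    # [2.0, 2.1, 1.9, 2.0]
```

Note that `'remove'` changes the number of rows, which matters if the result is later aligned with another matrix.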

#### Step 4: Visualize Coverage

**Single sample plot:**
```python
import pandas as pd

# Load genome size file (chrom, size columns)
genome_size = pd.read_csv('hg38_genome_size.tsv', sep='\t')

# Simple plot without color mapping
cv.plot_coverage_points(
    matrix_filtered,
    genome_size,
    y_column='sample1',
    ylim=(0, 4.5),
    alpha=0.3,
    ylabel='Copy Number'
)

# Or with copy number color mapping
palette = cv.get_cn_palette()  # Get default CN color palette
matrix_filtered['color'] = matrix_filtered['sample1'].apply(cv.categorize_cn_color)

cv.plot_coverage_points(
    matrix_filtered,
    genome_size,
    y_column='sample1',
    hue_column='color',
    palette=palette,
    ylim=(0, 4.5),
    alpha=0.3,
    ylabel='Copy Number'
)
```
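
The genome size table is what lets a plot give each chromosome a width proportional to its length. One common way to do this is to shift every position by the total length of the preceding chromosomes. The sketch below shows that arithmetic with made-up sizes; `genome_x` is hypothetical, and CNVis may lay out its axes differently.

```python
# Made-up chromosome sizes for illustration (not real hg38 lengths)
chrom_sizes = {"chr1": 1000, "chr2": 800, "chr3": 600}

# Offset of each chromosome = total length of all chromosomes before it
offsets = {}
running_total = 0
for chrom, size in chrom_sizes.items():
    offsets[chrom] = running_total
    running_total += size


def genome_x(chrom, position):
    """Hypothetical helper: map (chrom, position) to a genome-wide x coordinate."""
    return offsets[chrom] + position


print(offsets)               # {'chr1': 0, 'chr2': 1000, 'chr3': 1800}
print(genome_x("chr2", 50))  # 1050
```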

**Multi-sample comparison:**
```python
cv.plot_coverage_multi(
    matrix_filtered,
    genome_size,
    y_columns=['sample1', 'sample2', 'sample3'],
    ylabels=['Sample 1', 'Sample 2', 'Sample 3'],
    chrom_column='chrom',
    x1_column='start',
    hue_column='color',
    palette=palette,
    ylim=(0, 4.5),
    alpha=0.3,
    showX=False
)
```

**Plot specific chromosomes:**
```python
cv.plot_coverage_multi(
    matrix_filtered,
    genome_size,
    y_columns=['sample1', 'sample2'],
    chrom=['chr1', 'chr2', 'chr3'],  # Only these chromosomes
    chrom_column='chrom',
    x1_column='start',
    hue_column='color',
    palette=palette
)
```

---

### Workflow 2: Segment-Based Smoothing with ASCAT

This workflow integrates ASCAT segmentation results to smooth coverage data.

#### Step 1: Load and Smooth Coverage

```python
import cnvis as cv
import pandas as pd

# Load coverage data
cov = cv.load_coverage_file('sample.csv')

# Load segment data
segment = pd.read_csv('sample.segments.txt', sep='\t')

# Smooth toward segment medians
# smooth=0.9 moves each value 90% of the way toward its segment median (10% of the original value is kept)
cov = cv.smooth_with_segments(
    cov,                         # Coverage DataFrame
    segment,                     # Segment DataFrame
    column='value',              # Input column name
    result_column='value_smoothed',  # Output column name
    smooth=0.9                   # Smoothing factor (0-1)
)
```
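
Read literally, the `smooth` factor describes a weighted blend of each bin's value with the median of its segment. The sketch below works through that arithmetic for a single segment; `smooth_toward_median` is a hypothetical helper written from the description above, not the CNVis source.

```python
from statistics import median


def smooth_toward_median(values, smooth=0.9):
    """Hypothetical helper: blend each value with its segment's median."""
    seg_median = median(values)
    return [smooth * seg_median + (1 - smooth) * v for v in values]


segment_values = [0.8, 1.0, 1.6]  # made-up bins inside one segment
smoothed = smooth_toward_median(segment_values, smooth=0.9)
print([round(v, 2) for v in smoothed])  # [0.98, 1.0, 1.06]
```

With `smooth=0.9` the spread around the median shrinks to a tenth of its original size, while `smooth=0` would leave the values unchanged.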

#### Step 2: Filter with Blood/Normal Control (Optional)

```python
# Load blood/normal coverage for filtering
blood_cov = cv.load_coverage_file('blood_sample.csv')

# Filter out bins with abnormal blood coverage
cov_filtered = cv.filter_cov(
    cov,
    blood_cov,
    value_column='value',
    chrom_column='chrom'
)
```

#### Step 3: Convert to Copy Number and Assign Colors

```python
# Convert normalized coverage to copy number (diploid = 2)
cov_filtered['cn'] = (cov_filtered['value_smoothed'] * 2).clip(upper=8)

# Calculate segment median for color assignment
cov_filtered['segment_median'] = cov_filtered.groupby('segment')['cn'].transform('median')

# Assign colors based on copy number state
cov_filtered['color'] = cov_filtered['segment_median'].apply(cv.categorize_cn_color)
```

#### Step 4: Plot Smoothed Coverage

```python
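# Load genome size file (chrom, size columns), as in Workflow 1
genome_size = pd.read_csv('hg38_genome_size.tsv', sep='\t')

palette = cv.get_cn_palette()  # Get default CN color palette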

cv.plot_coverage(
    cov_filtered,
    genome_size,
    y_column='cn',
    s=1,                        # Point size
    ylim=(0, 4.5),
    alpha=0.3,
    hue_column='color',
    palette=palette,
    figsize=(5, 0.8),
    ylabel='Copy Number'
)
```

---

### Workflow 3: Quick Segmentation with Built-in Algorithms

For quick exploration without external tools like ASCAT, CNVis provides built-in segmentation.

#### Step 1: Load and Normalize Coverage

```python
import cnvis as cv

# Load coverage data
cov = cv.load_coverage_file('sample.bedgraph')

# Normalize (clip outliers, normalize to median=1)
cov = cv.normalize_coverage(cov, max_value=8, normalize_median=True)
```
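
For intuition, clipping followed by median normalization can be sketched in a few lines. The order of operations (clip first, then rescale) and the `normalize` helper are assumptions made for illustration; `cv.normalize_coverage` may differ in detail.

```python
from statistics import median


def normalize(values, max_value=8, target_median=1.0):
    """Hypothetical helper: clip high outliers, then rescale to a target median.

    Assumes the median of the clipped values is non-zero.
    """
    clipped = [min(v, max_value) for v in values]
    scale = target_median / median(clipped)
    return [v * scale for v in clipped]


raw = [10.0, 20.0, 20.0, 30.0, 500.0]  # made-up raw depths with one outlier
normalized = normalize(raw, max_value=100, target_median=1.0)
print([round(v, 2) for v in normalized])  # [0.5, 1.0, 1.0, 1.5, 5.0]
```

After normalization a typical diploid bin sits near 1.0, so multiplying by 2 gives an approximate copy number.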

#### Step 2: Run Segmentation

```python
# PELT algorithm (fast, recommended for exploration)
segments = cv.segment_coverage(cov, method='pelt', penalty=3)

# Or CBS algorithm (classic CNV method, slower but well-established)
segments = cv.segment_coverage(cov, method='cbs', alpha=0.01)
```

**Method comparison:**
- `'pelt'`: Fast change-point detection using the ruptures library. Good for quick exploration.
- `'cbs'`: Circular Binary Segmentation, the classic algorithm for array CGH data (Olshen et al., 2004). Uses permutation tests for significance.

**Common parameters:**
- `penalty`: For PELT, higher values = fewer breakpoints (default: 3)
- `alpha`: For CBS, significance level (default: 0.01)
- `min_size`: Minimum segment size in bins (default: 5)
- `merge_segments`: Merge adjacent segments that aren't statistically different (default: True)
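
To see why a higher `penalty` yields fewer breakpoints, the sketch below solves the penalized change-point problem by exhaustive dynamic programming (optimal partitioning). PELT minimizes the same kind of objective but prunes candidates to run much faster. The squared-error cost, the absence of a `min_size` constraint, and the `optimal_breakpoints` helper are assumptions for illustration; this is not the ruptures or CNVis implementation.

```python
def segment_cost(prefix, prefix_sq, i, j):
    """Sum of squared deviations from the mean for values[i:j]."""
    n = j - i
    total = prefix[j] - prefix[i]
    return (prefix_sq[j] - prefix_sq[i]) - total * total / n


def optimal_breakpoints(values, penalty):
    """Hypothetical O(n^2) optimal partitioning; returns sorted breakpoint indices."""
    n = len(values)
    prefix, prefix_sq = [0.0], [0.0]
    for v in values:
        prefix.append(prefix[-1] + v)
        prefix_sq.append(prefix_sq[-1] + v * v)
    best = [0.0] * (n + 1)  # best[j] = minimal penalized cost of values[:j]
    last = [0] * (n + 1)    # last[j] = start of the final segment in that optimum
    best[0] = -penalty      # so the first segment is not charged a penalty
    for j in range(1, n + 1):
        candidates = (
            (best[i] + segment_cost(prefix, prefix_sq, i, j) + penalty, i)
            for i in range(j)
        )
        best[j], last[j] = min(candidates)
    # Walk back through the stored segment starts
    breaks, j = [], n
    while last[j] > 0:
        j = last[j]
        breaks.append(j)
    return sorted(breaks)


# Made-up signal: a gain of one copy in the middle third
signal = [1.0] * 10 + [2.0] * 10 + [1.0] * 10

print(optimal_breakpoints(signal, penalty=3))   # [10, 20]
print(optimal_breakpoints(signal, penalty=10))  # []
```

With `penalty=3`, paying for two breakpoints (cost 6) is cheaper than fitting one flat segment (squared error about 6.67). With `penalty=10` the flat fit wins and the gain is missed.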

#### Step 3: Visualize Segments

```python
import pandas as pd

genome_size = pd.read_csv('hg38_genome_size.tsv', sep='\t')

# Plot segments as horizontal lines
cv.plot_segments(segments, genome_size, y_column='cn', ylim=(0, 4.5))
```

---

## API Reference

### Coverage Processing Functions

| Function | Description |
|----------|-------------|
| `normalize_coverage(track, max_value=8, normalize_median=True, target_median=1.0)` | Clip and/or normalize coverage values |
| `filter_gaps(df, gap, buffer=500_000, method='constant')` | Filter genomic gap regions (methods: 'constant', 'neighbor', 'remove') |
| `filter_cov(cov, blood_cov)` | Filter using control sample |
| `smooth_with_segments(cov, segment, smooth=0.9)` | Segment-based smoothing |
| `segment_coverage(cov, method='pelt')` | Segment coverage using PELT or CBS algorithm |
| `merge_similar_segments(segments, p_threshold=0.05)` | Merge adjacent segments that aren't statistically different |

### Coverage Matrix Functions

| Function | Description |
|----------|-------------|
| `coverage_matrix_bins(input_files, names, bins_size=2_000_000)` | Create matrix with fixed-size bins |
| `coverage_matrix_arms(input_files, names, genome='hg38')` | Create matrix by chromosome arms |
| `coverage_matrix_genes(input_files, names, genes)` | Create matrix by gene regions |
| `coverage_by_bins(input_file, name, bins)` | Process single sample |
| `matrix2comut(matrix, low=1.25, high=2.75)` | Convert to CoMut format |

### Plotting Functions

| Function | Description |
|----------|-------------|
| `plot_coverage(df, genome_size, y_column, ...)` | Single-sample genome-wide plot (main function) |
| `plot_coverage_points(df, genome_size, y_column, ...)` | Scatter plot wrapper (simplified API) |
| `plot_coverage_lines(df, genome_size, y_column, ...)` | Line segment wrapper (simplified API) |
| `plot_segments(segments, genome_size, y_column='cn', ...)` | Plot segmentation results as horizontal lines |
| `plot_coverage_multi(df, genome_size, y_columns, ...)` | Multi-sample stacked plots |
| `categorize_cn_color(value)` | Map CN value to color category |
| `get_cn_palette()` | Get default CN color palette |
| `extract_highlighted_coverage(df, highlight_df, ...)` | Extract coverage from highlighted regions |

### Utility Functions

| Function | Description |
|----------|-------------|
| `load_coverage_file(input_file, chrom_col, start_col, end_col, value_col)` | Load BedGraph/CSV/TSV/BigWig file |
| `genome_range(version='GRCh38')` | Get chromosome ranges |
| `genome_bins(coord_df, bin_size)` | Generate genomic bins |

### Bundled Reference Data

CNVis includes reference data files for hg38:

```python
from importlib.resources import files  # Python 3.9+; on 3.7/3.8 use the importlib_resources backport

# hg38 gap regions (centromeres, telomeres, scaffold gaps)
gap_file = files('cnvis.data').joinpath('hgTables_gap_hg38.tsv')

# GRCh38 chromosome sizes
genome_file = files('cnvis.data').joinpath('GRCh38.genome.size.tsv')
```

---

## Plot Customization

### Styling Options

```python
# Style: spine separators between chromosomes
cv.plot_coverage_multi(df, genome_size, y_columns, style='spine')

# Style: alternating background colors
cv.plot_coverage_multi(
    df, genome_size, y_columns,
    style='facecolor',
    facecolor_odd='#e6f2ff',
    facecolor_even='#ffffff'
)
```

### Common Parameters

| Parameter | Description |
|-----------|-------------|
| `ylim` | Y-axis limits, e.g., `(0, 4.5)` |
| `alpha` | Point transparency (0-1) |
| `s` | Point size |
| `figsize` | Figure size as `(width, height)` |
| `showX` | Show x-axis labels |
| `ylabel` | Y-axis label |
| `highlight_df` | DataFrame of regions to highlight |
| `highlight_color` | Color for highlighted regions |

### Plot Type Selection

CNVis provides wrapper functions for common plot types:

```python
# Scatter plot (points) - best for binned coverage data
cv.plot_coverage_points(df, genome_size, y_column='value', alpha=0.3)

# Line segments - best for segment-level data with start/end coordinates
cv.plot_coverage_lines(df, genome_size, y_column='value', x2_column='end')

# Full control - use the main function directly (it accepts further keyword arguments)
cv.plot_coverage(df, genome_size, y_column='value', x2_column='end')
```

---

## Example Notebooks

See the `notebooks/` directory for complete examples:

- `segmentation_algorithms_explained.ipynb` - **In-depth guide to PELT and CBS algorithms**
- `test_segments.ipynb` - Quick segmentation usage examples
- `test_coverage_matrix_2m.ipynb` - Multi-sample coverage analysis
- `test_coverage_plot_smoothed.ipynb` - Segment-based smoothing
- `test_coverage_matrix_plot_hic_vs_wgs.ipynb` - Hi-C vs. WGS comparison
- `test_pacbio_coverage_plot.ipynb` - Long-read coverage plotting
