Metadata-Version: 2.4
Name: exomeflow
Version: 1.0.0
Summary: Production-quality Whole Exome Sequencing analysis pipeline
Author-email: Robin Tomar <robin@aiims.ac.in>
License-Expression: MIT
Project-URL: Homepage, https://github.com/robintomar/exomeflow
Project-URL: Repository, https://github.com/robintomar/exomeflow
Project-URL: Bug Tracker, https://github.com/robintomar/exomeflow/issues
Keywords: bioinformatics,WES,NGS,genomics,exome,variant-calling
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Operating System :: POSIX :: Linux
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: typer>=0.12.0
Requires-Dist: rich>=13.0.0
Requires-Dist: pandas>=2.0.0
Requires-Dist: pysam>=0.22.0
Dynamic: license-file

# ExomeFlow

**Production-quality Whole Exome Sequencing (WES) analysis pipeline**

> Author: Robin Tomar, AIIMS New Delhi  
> License: MIT

---

## Overview

ExomeFlow is a Python package that wraps a complete WES analysis workflow into a single, reproducible CLI command. It processes every sample in a cohort in one run (each sample gets its own BAM and VCF) and provides checkpointing for resumable runs, structured logging, and parallel execution.

```
FASTQ
 └─ fastp (QC + trimming)
     └─ BWA MEM (alignment)
         └─ GATK SortSam (coordinate sort)
             └─ samtools flagstat (alignment QC)
                 └─ GATK MarkDuplicates
                     └─ GATK BuildBamIndex
                         └─ GATK BQSR (BaseRecalibrator + ApplyBQSR)
                             └─ GATK HaplotypeCaller (variant calling)
                                 └─ GATK VariantFiltration (hard filters)
                                     └─ ANNOVAR (functional annotation)
```

---

## Requirements

### System dependencies (must be on `PATH`)

| Tool | Version tested / required script |
|------|----------------------------------|
| `bwa` | ≥ 0.7.17 |
| `samtools` | ≥ 1.17 |
| `gatk` | 4.6.x |
| `fastp` | ≥ 0.23 |
| Perl + ANNOVAR | `table_annovar.pl` |

### Python

- Python ≥ 3.9
- See `requirements.txt` for Python dependencies

---

## Installation

### From PyPI

```bash
pip install exomeflow
```

### From source

```bash
git clone https://github.com/robintomar/exomeflow.git
cd exomeflow
pip install -e .
```

---

## Reference files required

| File | Description |
|------|-------------|
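| `hg38.fa` | BWA-indexed reference genome (GATK also needs the `.fai` index and `.dict` sequence dictionary) |
| `dbsnp.vcf.gz` | dbSNP (bgzipped + tabix-indexed) |
| `Mills_and_1000G_gold_standard.indels.hg38.vcf.gz` | Mills and 1000G gold-standard indels (bgzipped + tabix-indexed) |
| `Homo_sapiens_assembly38.known_indels.vcf.gz` | Known indels (bgzipped + tabix-indexed) |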
| Exome capture BED | e.g. `Illumina_Exome_TargetedRegions_v1.2.hg38.bed` |
| ANNOVAR humandb | `hg38` annotation databases |

---

## Input FASTQ naming convention

ExomeFlow automatically detects samples from paired-end FASTQ files:

```
fastq/
├── sample1_1.fastq.gz
├── sample1_2.fastq.gz
├── sample2_1.fastq.gz
└── sample2_2.fastq.gz
```

Pattern: `<sample_id>_1.fastq.gz` / `<sample_id>_2.fastq.gz`
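
The pairing rule can be sketched in a few lines of Python. This is an illustration of the naming convention only, not ExomeFlow's actual detection code; `pair_fastq_names` and `FASTQ_RE` are hypothetical names, and dropping unpaired files is an assumption of this sketch.

```python
import re

# Hypothetical helper: matches <sample_id>_1.fastq.gz / <sample_id>_2.fastq.gz.
FASTQ_RE = re.compile(r"^(?P<sample>.+)_(?P<read>[12])\.fastq\.gz$")


def pair_fastq_names(filenames):
    """Return {sample_id: (read1_name, read2_name)} for complete pairs only."""
    mates = {}
    for name in sorted(filenames):
        match = FASTQ_RE.match(name)
        if match:  # files that do not follow the convention are ignored
            mates.setdefault(match["sample"], {})[match["read"]] = name
    return {
        sample: (reads["1"], reads["2"])
        for sample, reads in mates.items()
        if "1" in reads and "2" in reads  # assumption: skip samples missing a mate
    }
```

For a real directory, the file names could be supplied with `os.listdir("fastq/")`.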

---

## Usage

### Minimal example

```bash
exomeflow run \
  --input-dir fastq/ \
  --output results/ \
  --reference /refs/hg38.fa \
  --dbsnp /refs/dbsnp.vcf.gz \
  --mills /refs/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
  --known-indels /refs/Homo_sapiens_assembly38.known_indels.vcf.gz \
  --annovar-bin /tools/annovar \
  --annovar-db /tools/annovar/humandb
```

### Full example with all options

```bash
exomeflow run \
  --input-dir fastq/ \
  --output results/ \
  --reference /refs/hg38.fa \
  --dbsnp /refs/dbsnp.vcf.gz \
  --mills /refs/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
  --known-indels /refs/Homo_sapiens_assembly38.known_indels.vcf.gz \
  --intervals /refs/Illumina_Exome_TargetedRegions_v1.2.hg38.bed \
  --interval-padding 100 \
  --annovar-bin /tools/annovar \
  --annovar-db /tools/annovar/humandb \
  --threads 32 \
  --fastp-threads 8 \
  --annovar-threads 24 \
  --max-workers 2 \
  --java-opts "-Xmx80g"
```

### Check version

```bash
exomeflow --version
```

### Help

```bash
exomeflow run --help
```

---

## Output files

After a successful run, the `results/` directory contains:

```
results/
├── QC/                          # reserved; fastp reports are written to filtered_fastp/
├── filtered_fastp/
│   ├── <sample>_1_filtered.fastq.gz
│   ├── <sample>_2_filtered.fastq.gz
│   ├── <sample>_fastp.html
│   └── <sample>_fastp.json
├── Mapsam/
│   ├── <sample>_recalibrated.bam   ← use in IGV for variant validation
│   └── <sample>_recalibrated.bam.bai
├── VCF/
│   ├── <sample>.vcf                          ← raw HaplotypeCaller output
│   ├── <sample>_PASS.vcf                     ← PASS-only hard-filtered variants
│   ├── <sample>.annovar.hg38_multianno.vcf   ← annotated VCF
│   └── <sample>.annovar.hg38_multianno.txt   ← annotated tab-delimited table
├── logs/
│   ├── analysis_<timestamp>.log   ← full pipeline log
│   ├── errors_<timestamp>.log     ← errors only
│   └── <sample>_<timestamp>.log   ← per-sample log
└── .checkpoints/                  ← resume state (do not delete during a run)
```
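
To check that a sample finished, the per-sample paths in the tree above can be enumerated and tested for existence. The sketch below is not part of ExomeFlow; `expected_outputs` and `missing_outputs` are hypothetical helpers that only mirror the file names listed above.

```python
from pathlib import Path


def expected_outputs(results_dir, sample):
    """List the per-sample files shown in the output tree (logs excluded)."""
    root = Path(results_dir)
    return [
        root / "filtered_fastp" / f"{sample}_1_filtered.fastq.gz",
        root / "filtered_fastp" / f"{sample}_2_filtered.fastq.gz",
        root / "filtered_fastp" / f"{sample}_fastp.html",
        root / "filtered_fastp" / f"{sample}_fastp.json",
        root / "Mapsam" / f"{sample}_recalibrated.bam",
        root / "Mapsam" / f"{sample}_recalibrated.bam.bai",
        root / "VCF" / f"{sample}.vcf",
        root / "VCF" / f"{sample}_PASS.vcf",
        root / "VCF" / f"{sample}.annovar.hg38_multianno.vcf",
        root / "VCF" / f"{sample}.annovar.hg38_multianno.txt",
    ]


def missing_outputs(results_dir, sample):
    """Return the expected files that do not exist yet."""
    return [p for p in expected_outputs(results_dir, sample) if not p.exists()]
```

An empty list from `missing_outputs("results/", "sample1")` would suggest that every listed file for `sample1` was written.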

---

## Checkpointing & resuming

ExomeFlow writes a checkpoint file for every completed step; these files live
in `.checkpoints/` inside the output directory. If the pipeline is interrupted
(power failure, wall-time limit, etc.), simply re-run the
**exact same command**; completed steps are skipped automatically.
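
The general idea behind this kind of resumable run can be sketched with marker files. This is a minimal illustration, not ExomeFlow's implementation: the `run_step` helper and the `<sample>.<step>.done` marker name are assumptions, and the real checkpoint format may differ.

```python
from pathlib import Path


def run_step(checkpoint_dir, sample, step, action):
    """Run `action` unless a marker file says this step already finished.

    Returns True if the step ran, False if it was skipped.
    """
    # Hypothetical marker layout: one empty file per (sample, step).
    marker = Path(checkpoint_dir) / f"{sample}.{step}.done"
    if marker.exists():
        return False  # completed in an earlier run, so skip it
    action()  # an exception here propagates, so no marker is written
    marker.parent.mkdir(parents=True, exist_ok=True)
    marker.touch()
    return True
```

Because the marker is written only after `action` returns, a step that was interrupted midway is re-run from its start on the next invocation.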

---

## GATK hard-filter thresholds

### SNPs

| Filter name | Expression |
|-------------|-----------|
| `SNP_LowQD` | `QD < 2.0` |
| `SNP_StrandBias` | `FS > 60.0` |
| `SNP_StrandOddsRatio` | `SOR > 3.0` |
| `SNP_LowMQ` | `MQ < 40.0` |
| `SNP_MQRankSum` | `MQRankSum < -12.5` |
| `SNP_ReadPosRankSum` | `ReadPosRankSum < -8.0` |
| `LowDepth` | `DP < 10` |
| `LowGQ` *(genotype)* | `GQ < 20` |

### INDELs

| Filter name | Expression |
|-------------|-----------|
| `INDEL_LowQD` | `QD < 2.0` |
| `INDEL_StrandBias` | `FS > 200.0` |
| `INDEL_StrandOddsRatio` | `SOR > 10.0` |
| `INDEL_ReadPosRankSum` | `ReadPosRankSum < -20.0` |
| `LowDepth` | `DP < 10` |
| `LowGQ` *(genotype)* | `GQ < 20` |
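
To see how the site-level thresholds behave, they can be applied to a dictionary of INFO annotations in plain Python. This is a sketch for checking values by hand and is not how ExomeFlow or GATK evaluates the expressions; `failed_filters` is a hypothetical helper, the genotype-level `LowGQ` filter is left out, and skipping missing annotations is an assumption of this sketch.

```python
import operator

# Thresholds copied from the tables above: name -> (annotation, comparison, cutoff).
SNP_FILTERS = {
    "SNP_LowQD": ("QD", operator.lt, 2.0),
    "SNP_StrandBias": ("FS", operator.gt, 60.0),
    "SNP_StrandOddsRatio": ("SOR", operator.gt, 3.0),
    "SNP_LowMQ": ("MQ", operator.lt, 40.0),
    "SNP_MQRankSum": ("MQRankSum", operator.lt, -12.5),
    "SNP_ReadPosRankSum": ("ReadPosRankSum", operator.lt, -8.0),
    "LowDepth": ("DP", operator.lt, 10),
}
INDEL_FILTERS = {
    "INDEL_LowQD": ("QD", operator.lt, 2.0),
    "INDEL_StrandBias": ("FS", operator.gt, 200.0),
    "INDEL_StrandOddsRatio": ("SOR", operator.gt, 10.0),
    "INDEL_ReadPosRankSum": ("ReadPosRankSum", operator.lt, -20.0),
    "LowDepth": ("DP", operator.lt, 10),
}


def failed_filters(info, is_snp):
    """Return the names of the site-level filters that `info` triggers."""
    filters = SNP_FILTERS if is_snp else INDEL_FILTERS
    return [
        name
        for name, (key, compare, cutoff) in filters.items()
        if key in info and compare(info[key], cutoff)  # missing values are skipped
    ]
```

For example, a site with `FS = 100.0` fails `SNP_StrandBias` when treated as a SNP but passes the looser INDEL threshold of `FS > 200.0`.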

---

## ANNOVAR annotation databases (default)

```
refGene, dbnsfp47a, clinvar_20240416, gnomad41_exome,
gnomad41_genome, avsnp150, cosmic84_coding, exac03
```

---

## Publishing to PyPI

```bash
pip install build twine

# Build source + wheel distributions
python -m build

# Upload to PyPI (requires ~/.pypirc or TWINE_USERNAME / TWINE_PASSWORD env vars)
twine upload dist/*
```

To publish to TestPyPI first:

```bash
twine upload --repository testpypi dist/*
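# TestPyPI may not host the dependencies, so fall back to PyPI for them
pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ exomeflow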
```

---

## Development

```bash
# Install in editable mode, then the linters used below
# (the package metadata declares no "dev" extra)
pip install -e .
pip install flake8 mypy

# Lint
flake8 exomeflow/
mypy exomeflow/
```

---

## Citation

If you use ExomeFlow in your research, please cite:

> Robin Tomar. *ExomeFlow: a production-quality whole exome sequencing pipeline*. AIIMS New Delhi, 2025.
