Metadata-Version: 2.4
Name: vflank
Version: 0.5.0
Summary: Variant-aware flanking-sequence extraction and masking for ddPCR assay design
Project-URL: Homepage, https://github.com/rhshah/vFlank
Project-URL: Documentation, https://rhshah.github.io/vFlank/
Project-URL: Repository, https://github.com/rhshah/vFlank
Project-URL: Issues, https://github.com/rhshah/vFlank/issues
Author: Ronak Shah
License-Expression: Apache-2.0
License-File: LICENSE
Keywords: MAF,bioinformatics,ddPCR,flanking-sequence,fusion,gnomAD,primer-design
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: MacOS
Classifier: Operating System :: POSIX :: Linux
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Requires-Python: >=3.10
Requires-Dist: pandas>=2.0
Requires-Dist: pysam>=0.22
Requires-Dist: rich>=13
Requires-Dist: typer>=0.12
Provides-Extra: dev
Requires-Dist: mypy; extra == 'dev'
Requires-Dist: pytest-cov; extra == 'dev'
Requires-Dist: pytest>=8; extra == 'dev'
Requires-Dist: ruff; extra == 'dev'
Provides-Extra: docs
Requires-Dist: mike>=2.1; extra == 'docs'
Requires-Dist: mkdocs-git-revision-date-localized-plugin>=1.2; extra == 'docs'
Requires-Dist: mkdocs-glightbox>=0.4; extra == 'docs'
Requires-Dist: mkdocs-material>=9.5; extra == 'docs'
Requires-Dist: mkdocs-mermaid2-plugin>=1.1; extra == 'docs'
Requires-Dist: mkdocs-panzoom-plugin>=0.5; extra == 'docs'
Requires-Dist: mkdocs-section-index>=0.3; extra == 'docs'
Requires-Dist: mkdocs-typer2>=0.1; extra == 'docs'
Requires-Dist: mkdocstrings[python]>=0.25; extra == 'docs'
Description-Content-Type: text/markdown

# vflank

[![CI](https://github.com/rhshah/vFlank/actions/workflows/ci.yml/badge.svg)](https://github.com/rhshah/vFlank/actions/workflows/ci.yml)
[![PyPI](https://img.shields.io/pypi/v/vflank)](https://pypi.org/project/vflank/)
[![GHCR](https://img.shields.io/badge/ghcr.io-vflank-2496ED?logo=docker&logoColor=white)](https://github.com/rhshah/vFlank/pkgs/container/vflank)
[![Docs](https://img.shields.io/badge/docs-mkdocs--material-blue)](https://rhshah.github.io/vFlank/)
[![License: Apache 2.0](https://img.shields.io/badge/License-Apache_2.0-green.svg)](LICENSE)
[![Python](https://img.shields.io/badge/python-3.10%2B-blue)](pyproject.toml)
[![Ask DeepWiki](https://deepwiki.com/badge.svg)](https://deepwiki.com/rhshah/vFlank)

**Variant-aware flanking-sequence extraction and masking for ddPCR assay design.**

`vflank` is the *front-end* of a ddPCR assay-design pipeline. It takes genomic
variants — small variants (SNPs/indels) and structural variants (fusions) — and
emits the sequence an assay is designed around: the masked flanks of each variant
or the chimeric junction of a fusion. Primer/probe design itself is delegated
downstream to established tools.

📖 **Documentation: <https://rhshah.github.io/vFlank/>**

## Features

- **Small variants** (`vflank small`) — ±N bp flanks from a MAF, raw + masked
  FASTA, deduplicated per unique variant (`CHROM_POS_REF_ALT`).
- **Fusions / SVs** (`vflank fusion`) — reverse-complement-aware junction
  sequences from an iCallSV / iAnnotateSV breakpoint table (columns by name).
- **SNP masking, two backends** — local gnomAD VCFs *or* the gnomAD GraphQL API
  (no download), each with `--pop-data {genome,exome,both}`.
- **Reference, two backends** — a local indexed FASTA *or* the UCSC API
  (`--ref-source api`, no download) for runs with no reference on disk.
- **Patient consensus from a BAM** (`--bam`/`--bam-map`) — build the flank/junction
  from the patient's own reads (hom-ALT corrected, het/low-cov handled) so primers
  match the real template; for both small variants and fusions.
- **No silent failures** — genome-build guard, flank-truncation detection, and a
  categorised skip summary + optional TSV report.

Planned: VCF input (small + BND SV) and downstream emit formats.
See [`docs/ARCHITECTURE.md`](docs/ARCHITECTURE.md).

## Install

```bash
pip install vflank                                   # from PyPI (released versions)
pip install git+https://github.com/rhshah/vFlank.git # latest from GitHub
# development:
git clone https://github.com/rhshah/vFlank.git && cd vFlank
pip install -e ".[dev]"
```

Requires Python ≥ 3.10 (Linux/macOS) and `pysam`, `pandas`, `typer`, `rich`.

### Docker

Images are published to GHCR on each release:

```bash
docker run --rm -v "$PWD:/data" ghcr.io/rhshah/vflank \
    small run /data/variants.maf -r /data/GRCh37.fasta -g hg19 -o /data/out.fasta
```

## Quick start

```bash
vflank small run variants.maf \
    --ref-genome /path/to/GRCh37.fasta \
    --pop-vcf-dir /path/to/gnomad_v2.1.1/ \
    --genome-build hg19 \
    --flank 200 \
    --output flanking_sequences.fasta
```

`--genome-build` defaults to **hg19** (GRCh37 / gnomAD v2.1.1); pass `-g hg38`
for GRCh38 / gnomAD v4. gnomAD v4 has no GRCh37 build.

### Masking sources

Common-SNP masking can come from local gnomAD VCFs or the gnomAD API:

- `--pop-source vcf` (default) — local per-chromosome gnomAD VCFs in
  `--pop-vcf-dir`. Reproducible, offline, unlimited scale.
- `--pop-source api` — the public [gnomAD GraphQL API](https://gnomad.broadinstitute.org/api),
  **no download**. Best for small cohorts (rate-limited to ~10 requests/min).

```bash
# No-download masking via the API (small cohorts):
vflank small run variants.maf -r GRCh37.fasta -g hg19 --pop-source api
```

Either source honours `--pop-data {genome,exome,both}` (default `genome`).
`both` masks a position if it is a common SNP in *either* the genome or exome
cohort. Flanks often fall in non-coding regions where only genomes have data,
so `genome` is the default.
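
The `--pop-data` selection above can be pictured with a minimal Python sketch. This is not vflank's actual implementation: `mask_flank`, its parameters, the use of `N` as the masking character, and the idea that "common" positions arrive as precomputed sets are all assumptions made for illustration.

```python
def mask_flank(flank, flank_start, genome_snps, exome_snps,
               pop_data="genome", mask_char="N"):
    """Mask common-SNP positions in a flank (hypothetical helper).

    flank_start is the 1-based genomic coordinate of flank[0].
    genome_snps / exome_snps are sets of 1-based positions that were
    already classified as common SNPs in each cohort (assumed inputs).
    """
    if pop_data == "genome":
        positions = set(genome_snps)
    elif pop_data == "exome":
        positions = set(exome_snps)
    elif pop_data == "both":
        # "both" masks a position seen in either cohort (set union).
        positions = set(genome_snps) | set(exome_snps)
    else:
        raise ValueError(f"unknown pop_data: {pop_data!r}")

    return "".join(
        mask_char if flank_start + offset in positions else base
        for offset, base in enumerate(flank)
    )


# Flank covering positions 100..105, one common SNP per cohort.
masked = mask_flank("ACGTAC", 100, {101}, {104}, pop_data="both")
```

With `pop_data="genome"` only position 101 would be masked in this toy input, which is why `both` is the more conservative choice when exome data covers the flank.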

### Reference sources

The reference can likewise come from a local file or an API:

- `--ref-source file` (default) — a local indexed FASTA via `--ref-genome`.
  Reproducible, offline, unlimited scale; build sanity-checked by chr1 length.
- `--ref-source api` — the [UCSC API](https://api.genome.ucsc.edu/), **no
  download** (`--ref-genome` not needed). Best for one-off / hosted runs;
  throttled to ~1 request/second, so not for bulk.

```bash
# Fully no-download (reference + masking from APIs):
vflank small run variants.maf -g hg19 --ref-source api --pop-source api
```

Each variant yields two FASTA records (the `__{CHROM}_{POS}_{REF}_{ALT}` suffix
is what keys deduplication; the `{SAMPLE}__` prefix appears only with `--bam`):

```
>[{SAMPLE}__]{GENE}__{HGVSp}__{HGVSc}__{CHROM}_{POS}_{REF}_{ALT}
{left_flank}[REF/ALT]{right_flank}
>Masked__[{SAMPLE}__]{GENE}__{HGVSp}__{HGVSc}__{CHROM}_{POS}_{REF}_{ALT}
{left_flank_masked}[REF/ALT]{right_flank_masked}
```

Chromosome notation (`chr1` vs `1`) is auto-detected from the FASTA and VCFs.
With a local FASTA the genome build is sanity-checked against its chr1 length;
with `--ref-source api` the requested `--genome-build` is trusted (a wrong build
surfaces as a UCSC error rather than as silently wrong sequence).
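
The two checks described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not vflank's code: the helper names are hypothetical, and the contig lengths are passed in as a plain dict (in practice they could be read from a FASTA index). The chr1 lengths are the published values for GRCh37/hg19 and GRCh38/hg38.

```python
# chr1 lengths of the two supported human assemblies.
CHR1_LENGTH = {"hg19": 249_250_621, "hg38": 248_956_422}


def uses_chr_prefix(contig_names):
    """True if the contigs are named 'chr1', 'chr2', ... rather than '1', '2', ..."""
    return any(name.startswith("chr") for name in contig_names)


def check_build(contig_lengths, genome_build):
    """Raise ValueError if chr1's length does not match the requested build.

    contig_lengths maps contig name -> length (assumed input shape).
    """
    chr1 = "chr1" if uses_chr_prefix(contig_lengths) else "1"
    if chr1 not in contig_lengths:
        raise ValueError("reference has no chr1 contig to check")
    expected = CHR1_LENGTH[genome_build]
    observed = contig_lengths[chr1]
    if observed != expected:
        raise ValueError(
            f"chr1 length {observed} does not match {genome_build} ({expected})"
        )


# A GRCh37-style reference without the 'chr' prefix passes the hg19 check.
check_build({"1": 249_250_621, "2": 243_199_373}, "hg19")
```

Comparing a single well-known contig length is cheap and catches the most common mistake, pairing hg19 coordinates with a GRCh38 FASTA or the reverse.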

## Project layout

```
src/vflank/
├── core/   chrom · variant · flanks · popfreq   (pure, testable domain logic)
├── io/     maf · reference · fasta              (file access)
└── cli/    app · small                          (Typer commands)
```

## Documentation

- [docs/DEVELOPER.md](docs/DEVELOPER.md) — setup, running, testing, using vflank
  as a library, and extending it (new flank sources, CLI commands).
- [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) — design, scope boundary, and the
  milestone roadmap.
- `CLAUDE.md` — repository conventions and the quality gate.
