Metadata-Version: 2.1
Name: darkprofiler
Version: 0.2.2
Summary: DarkProfiler: Alignment and Classification of Peptides from Reference-Independent De Novo Peptide Sequencing Experiments.
Author-email: Hanjun Lee <hanjun@alum.mit.edu>
License: MIT
Keywords: proteomics,immunopeptidomics,neoantigen,bioinformatics
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Requires-Python: >=3.7
Description-Content-Type: text/markdown
License-File: LICENSE.txt
Requires-Dist: biopython>=1.78
Requires-Dist: matplotlib>=3.3

# DarkProfiler

**DarkProfiler: Alignment and Classification of Peptides from Reference‑Independent De Novo Peptide Sequencing Experiments**

[![PyPI version](https://badge.fury.io/py/darkprofiler.svg)](https://badge.fury.io/py/darkprofiler)

![DarkProfiler](https://hanjun.group/wp-content/uploads/2025/12/DarkProfiler.png)

DarkProfiler takes peptide sequences (e.g., from reference‑independent de novo peptide sequencing) and classifies them into distinct categories using reference genomes and optional sample‑specific SNVs:

- **Canonical proteome**
- **Alternative splicing**
- **Neoantigens (SNV‑derived mutanome)**
- **Alternative reading frame peptides**
- **Amino acid mismatch**
- **Unknown / unaligned**

DarkProfiler is intended to be the *post‑processing / annotation* step after de novo peptide calling or customized peptide discovery in proteomics or immunopeptidomics experiments.

Supported reference assemblies:

- Human: `hg19` (GENCODE release 19), `hg38` (GENCODE release 37)
- Mouse: `mm10` (GENCODE release M19), `mm39` (GENCODE release M37)

The same logic is available both as a **command‑line tool** and as a **Python API**.

---

## Table of contents

1. [Installation](#installation)
   - [Requirements](#requirements)
   - [Install with pip](#install-with-pip-pypi)
   - [Install with conda](#install-with-conda-bioconda)
2. [Reference genome data](#reference-genome-data)
   - [Supported references](#supported-references)
   - [What gets downloaded](#what-gets-downloaded)
3. [Input data](#input-data)
   - [Peptide FASTA](#peptide-fasta)
   - [VCF with SNVs (optional)](#vcf-with-snvs-optional)
   - [Precomputed database directory (optional)](#precomputed-database-directory-optional)
4. [Command‑line usage](#command-line-usage)
   - [`download` subcommand](#download-subcommand)
   - [`run` subcommand](#run-subcommand)
5. [Python API](#python-api)
6. [Classification pipeline details](#classification-pipeline-details)
   - [Overview of steps](#overview-of-steps)
   - [Category definitions](#category-definitions)
7. [Outputs](#outputs)
   - [FASTA category files](#fasta-category-files)
   - [`pieChart.tsv`](#piecharttsv)
   - [`pieChart.pdf`](#piechartpdf)
8. [Database reuse and performance tips](#database-reuse-and-performance-tips)
9. [Troubleshooting](#troubleshooting)
10. [License](#license)
11. [Citation](#citation)

---

## Installation

### Requirements

- **Python**: 3.7+ (tested on modern CPython versions)
- **Operating systems**: Linux, macOS, and other UNIX‑like systems should work. On Windows, running under WSL is recommended.
- **Python dependencies** (installed automatically via pip/conda):
  - [Biopython](https://biopython.org/) (FASTA parsing and sequence utilities)
  - [matplotlib](https://matplotlib.org/) (for `pieChart.pdf`)
  - No other third‑party packages; everything else comes from the Python standard library

You also need sufficient disk space to store:

- A **reference genome bundle** per assembly (hundreds of MB)
- The **database directory** (translated proteomes + fast indices) per output folder
- The final classification FASTA files and plots

### Install with pip (PyPI)

```bash
pip install darkprofiler
```

This installs:

- The Python package `darkprofiler`
- The command‑line entry point `darkprofiler`

You should then be able to run:

```bash
darkprofiler --help
```

### Install with conda (bioconda)

```bash
conda install bioconda::darkprofiler
```

This will install DarkProfiler together with all dependencies into the active conda environment.

---

## Reference genome data

### Supported references

DarkProfiler currently supports human and mouse reference assemblies, each paired with a specific GENCODE release:

```text
hg19 (GENCODE release 19)
hg38 (GENCODE release 37)
mm10 (GENCODE release M19)
mm39 (GENCODE release M37)
```

The reference is specified by one of these **lower‑case** strings:

- `hg19`
- `hg38`
- `mm10`
- `mm39`

Internally the reference is normalized to lower case, so `HG38` and `hg38` are treated the same in the Python API, but the CLI restricts choices to the canonical lower‑case names.
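
The normalization described above can be pictured with a minimal sketch. The helper name `normalize_reference` and the whitespace stripping are assumptions made for illustration; they are not taken from DarkProfiler's source.

```python
# Hypothetical helper illustrating the lower-casing rule described above.
SUPPORTED_REFERENCES = ("hg19", "hg38", "mm10", "mm39")


def normalize_reference(reference: str) -> str:
    """Lower-case a reference name and check that it is supported."""
    name = reference.strip().lower()  # stripping whitespace is an assumption
    if name not in SUPPORTED_REFERENCES:
        raise ValueError(
            f"Unsupported reference {reference!r}; "
            f"expected one of {', '.join(SUPPORTED_REFERENCES)}"
        )
    return name
```

With this sketch, `normalize_reference("HG38")` and `normalize_reference("hg38")` give the same result, while an unsupported name such as `hg18` raises `ValueError`.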

### What gets downloaded

Reference data are distributed as versioned ZIP bundles hosted online. You do **not** need to download or unpack them manually. Use:

```bash
darkprofiler download hg38
```

This will:

1. Check that the requested reference is supported.
2. Download a ZIP bundle (for example `darkprofiler_hg38.zip`) into the installed package directory under `darkprofiler/genome/`.
3. Extract the contents to:

   ```text
   <python-site-packages>/darkprofiler/genome/hg38/
   ```

4. Print progress messages such as:

   ```text
   [darkprofiler] Downloading ...
   [darkprofiler] Extracting to ...
   [darkprofiler] Finished. Reference 'hg38' is now available.
   ```

The extracted directory contains at least the following files (names may include version tags):

- `transcriptome.<reference>.fa` – all reference transcripts (FASTA)
- `transcriptome.<reference>.cds.bed` – CDS segments per transcript
- `knownCanonical.<reference>.list` – list of canonical transcript IDs
- `gencode.<reference>.gff` – GENCODE annotation (GFF/GTF‑like)
- `exome.<reference>.bed` – exome intervals used to filter SNVs

These files are used internally by the pipeline; you normally don’t need to interact with them directly.

> **Note:** If the `download` step has not been run for a given reference, `darkprofiler run` will fail with an error such as *“Could not find file ... in genome root”*.
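
To check up front whether a reference bundle looks complete, something like the following sketch may help. The glob patterns are derived from the file list above; the function names are hypothetical, and the genome directory is assumed to sit at `darkprofiler/genome/<reference>/` inside the installed package, as described in this section.

```python
import importlib.util
from pathlib import Path

# Patterns follow the file names listed above; "*" stands for the
# reference name plus any version tag.
EXPECTED_PATTERNS = (
    "transcriptome.*.fa",
    "transcriptome.*.cds.bed",
    "knownCanonical.*.list",
    "gencode.*.gff",
    "exome.*.bed",
)


def missing_genome_files(genome_dir):
    """Return the patterns that match no file inside genome_dir."""
    root = Path(genome_dir)
    if not root.is_dir():
        return list(EXPECTED_PATTERNS)
    return [p for p in EXPECTED_PATTERNS if not any(root.glob(p))]


def installed_genome_dir(reference):
    """Best-effort guess at the genome directory of an installed package.

    Returns None when the darkprofiler package cannot be located.
    """
    spec = importlib.util.find_spec("darkprofiler")
    if spec is None or spec.origin is None:
        return None
    return Path(spec.origin).parent / "genome" / reference
```

An empty list from `missing_genome_files(...)` suggests the bundle was extracted; any returned pattern points at a file that `darkprofiler download` would normally provide.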

---

## Input data

### Peptide FASTA

The primary input is a FASTA file containing **peptide sequences** to classify:

```text
>peptide_1
LLLLGIGGTFK
>peptide_2
EAVAEQAALR
...
```

Requirements and recommendations:

- Each record is interpreted as a **peptide** (amino‑acid sequence).
- FASTA IDs are kept as‑is and propagated to the output files.
- Sequences are upper‑cased internally; non‑standard characters receive no special handling.
- Empty sequences are silently ignored.
- There is no hard limit on peptide length, but short peptides may match many locations and very long peptides may be rare.

A peptide sequence is assigned to **at most one output category**, corresponding to the first category that matches in the pipeline
(canonical → alternative splicing → neoantigen → alternative reading frame → amino acid mismatch → unknown).
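
The first‑match rule can be sketched as follows. This is an illustration only: the category labels are borrowed from `pieChart.tsv`, and the `matchers` mapping of category name to a test function is a hypothetical stand‑in for the real database searches.

```python
# Order mirrors the pipeline priority described above.
CATEGORY_ORDER = (
    "canonical",
    "alternativeSplicing",
    "neoantigen",
    "alternativeReadingFrame",
    "aminoAcidMismatch",
)


def assign_category(peptide, matchers):
    """Return the first category whose matcher accepts the peptide.

    `matchers` maps a category name to a callable(peptide) -> bool.
    Categories without a matcher are skipped; peptides that match
    nothing fall through to "unknown".
    """
    for category in CATEGORY_ORDER:
        matcher = matchers.get(category)
        if matcher is not None and matcher(peptide):
            return category
    return "unknown"
```

Because the loop returns on the first hit, a peptide that would match both the canonical proteome and an alternative reading frame is reported only as canonical.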

### VCF with SNVs (optional)

To classify **neoantigens** (peptides derived from sample‑specific single nucleotide variants), you can provide a VCF file via `--vcf-path` / `vcf_path`:

- Accepts plain or gzipped VCF: `*.vcf` or `*.vcf.gz`.
- Only **SNVs** (single‑base reference and single‑base alternate) are used.
- Multi‑allelic entries are expanded and processed per ALT allele.
- Non‑SNV variants (indels, MNVs, etc.) are ignored.
- Coordinates are matched to the reference via chromosome names that are normalized to strip the `chr` prefix (`chr1` → `1`).

DarkProfiler additionally filters SNVs to the **coding exome** using the `exome.<reference>.bed` file if present:

- Only SNVs whose positions overlap the exome intervals are retained.
- If no exome BED is available, all SNVs are accepted.
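
A minimal sketch of these SNV rules is shown below. It is not DarkProfiler's implementation: the function names are hypothetical, the exome BED is assumed to use standard 0‑based half‑open intervals, and VCF positions are taken as 1‑based.

```python
import gzip


def open_vcf(path):
    """Open a plain or gzipped VCF as text."""
    path = str(path)
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "r")


def normalize_chrom(chrom):
    """Strip a leading 'chr' so that 'chr1' and '1' compare equal."""
    return chrom[3:] if chrom.startswith("chr") else chrom


def iter_snvs(lines):
    """Yield (chrom, pos, ref, alt) for every single-base REF/ALT pair."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue
        chrom, pos, _id, ref, alts = fields[:5]
        for alt in alts.split(","):  # expand multi-allelic records
            ref_u, alt_u = ref.upper(), alt.upper()
            if len(ref_u) == 1 and len(alt_u) == 1 \
                    and ref_u in "ACGT" and alt_u in "ACGT":
                yield (normalize_chrom(chrom), int(pos), ref_u, alt_u)


def load_bed(lines):
    """Read BED intervals into {chrom: [(start, end), ...]}."""
    intervals = {}
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        key = normalize_chrom(chrom)
        intervals.setdefault(key, []).append((int(start), int(end)))
    return intervals


def filter_to_exome(snvs, intervals):
    """Keep SNVs inside any interval; keep everything if intervals is None."""
    snvs = list(snvs)
    if intervals is None:
        return snvs
    return [
        snv for snv in snvs
        # 1-based VCF position vs. 0-based half-open BED interval
        if any(start < snv[1] <= end for start, end in intervals.get(snv[0], ()))
    ]
```

The linear scan over intervals keeps the sketch short; a real exome BED would likely call for sorted intervals and binary search.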

If `vcf_path` is omitted or points to a nonexistent file:

- The SNV list is empty.
- The mutanome and neoantigen steps still run, but the mutanome then represents the unmodified reference sequence.

### Precomputed database directory (optional)

By default, each `darkprofiler run` invocation builds a **database** in:

```text
<output_dir>/database/
```

The database contains translated and derived proteomes as FASTA files:

- `canonicalProteome.fa`
- `alternativeSplicing.fa`
- `mutanome.fa`
- `mutatedCanonicalTranscriptome.fa`
- `mutatedAlternativeTranslatome.fa`
- `mutatedAlternativeORFeome.fa`

DarkProfiler also creates **persistent fast indices** under the same database directory to accelerate Hamming‑distance peptide search, for example:

- `canonicalProteome.idx/`
- `alternativeSplicing.idx/`
- `mutanome.idx/`
- `mutatedAlternativeORFeome.idx/`

If you run DarkProfiler repeatedly with the **same reference and SNV set**, you can re‑use a prebuilt database to avoid recomputation by passing `--database-path` / `database_path`:

```bash
darkprofiler run hg38 peptides.fa out --database-path prebuilt_db/
```

The directory is accepted **only if all required files are present**. Otherwise:

- DarkProfiler prints a warning that the directory is missing files or is invalid.
- The directory is ignored.
- A new database is built from scratch under `<output_dir>/database`.
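
The acceptance rule can be approximated with the sketch below. It assumes that the "required files" are exactly the six FASTA files listed earlier in this section; the function names and the warning text are hypothetical.

```python
import sys
from pathlib import Path

# Assumption: the required files are the six FASTA files listed above.
REQUIRED_FASTA = (
    "canonicalProteome.fa",
    "alternativeSplicing.fa",
    "mutanome.fa",
    "mutatedCanonicalTranscriptome.fa",
    "mutatedAlternativeTranslatome.fa",
    "mutatedAlternativeORFeome.fa",
)


def missing_database_files(database_dir):
    """Return the required FASTA names that are absent from database_dir."""
    root = Path(database_dir)
    return [name for name in REQUIRED_FASTA if not (root / name).is_file()]


def resolve_database(database_path, output_dir):
    """Return (directory, reused) following the fallback rule above."""
    if database_path is not None:
        missing = missing_database_files(database_path)
        if not missing:
            return Path(database_path), True
        print(
            f"warning: {database_path} is missing {', '.join(missing)}; "
            "a new database would be built instead",
            file=sys.stderr,
        )
    return Path(output_dir) / "database", False
```

Running `missing_database_files("prebuilt_db/")` before a long job is a cheap way to see whether a prebuilt directory is likely to be accepted.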

---

## Command‑line usage

The installed CLI is called `darkprofiler`.

Run `darkprofiler --help` to see the top‑level usage:

```text
usage: darkprofiler [-h] {download,run} ...
```

Two subcommands are available:

- [`darkprofiler download`](#download-subcommand) – download reference genome bundles.
- [`darkprofiler run`](#run-subcommand) – run the classification pipeline.

### `download` subcommand

```bash
darkprofiler download hg38
```

### `run` subcommand

```bash
darkprofiler run hg38 peptides.fa output_dir \
  --vcf-path sample.vcf.gz \
  --database-path /path/to/database \
  --num-threads 8 \
  --hamming 2
```

**Optional arguments**

- `--vcf-path FILE`

  Optional path to a VCF or VCF.GZ file with SNVs.

- `--database-path DIR`

  Optional path to an existing database directory containing the required FASTA files listed above.

- `--num-threads N` (default: `1`)

  Number of worker threads used during peptide search / verification.

- `-k, --hamming {0,1,2}` (default: `0`)

  Maximum Hamming distance allowed for peptide matching.  
  `0` performs exact matches only; `1` and `2` allow up to one or two amino‑acid substitutions.
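
To make the meaning of `-k` concrete, here is a naive sliding‑window sketch of Hamming‑distance matching. It only illustrates the matching criterion; the function names are hypothetical and the real search uses on‑disk indices rather than a full scan.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


def find_matches(peptide, protein, k=0):
    """Return (start, window, distance) for every window within distance k.

    `start` is a 0-based offset into `protein`.
    """
    n = len(peptide)
    if n == 0:
        return []
    hits = []
    for start in range(len(protein) - n + 1):
        window = protein[start:start + n]
        distance = hamming_distance(peptide, window)
        if distance <= k:
            hits.append((start, window, distance))
    return hits
```

With `k=0` only identical windows are reported; with `k=1` a query such as `GILGFVYTL` also matches a reference window `GILGFVFTL` that differs by one substitution.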

---

## Python API

```python
from darkprofiler.run import classify_peptides

classify_peptides(
    reference="hg38",
    peptide_fasta="peptides.fa",
    output_dir="output",
    vcf_path=None,
    database_path=None,
    num_threads=4,
    hamming_distance=0,
)
```
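
When several peptide FASTA files share the same reference and SNV set, the call above can be wrapped in a small loop that reuses the database from the first run. The sketch below is an assumption‑laden illustration: `run_batch` is a hypothetical helper, it relies on the first run leaving its database in `<output_dir>/database/` as described under "Precomputed database directory", and the `classify` argument exists only so that the loop can be exercised without DarkProfiler installed.

```python
from pathlib import Path


def run_batch(peptide_fastas, reference, output_root, vcf_path=None,
              num_threads=1, hamming_distance=0, classify=None):
    """Classify several peptide FASTA files, reusing the first database."""
    if classify is None:
        # Imported lazily so the sketch can be read without the package.
        from darkprofiler.run import classify_peptides as classify

    output_root = Path(output_root)
    shared_db = None
    output_dirs = []
    for fasta in peptide_fastas:
        out_dir = output_root / Path(fasta).stem
        classify(
            reference=reference,
            peptide_fasta=str(fasta),
            output_dir=str(out_dir),
            vcf_path=vcf_path,
            database_path=shared_db,
            num_threads=num_threads,
            hamming_distance=hamming_distance,
        )
        if shared_db is None:
            # Assumption: the first run builds its database here.
            shared_db = str(out_dir / "database")
        output_dirs.append(out_dir)
    return output_dirs
```

A call such as `run_batch(["runA.fa", "runB.fa"], "hg38", "results", vcf_path="sample.vcf.gz")` would then write to `results/runA/` and `results/runB/`, with the second run pointed at `results/runA/database`.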

---

## Classification pipeline details

### Overview of steps

1. Filter VCF to exome  
2. Load transcriptome, CDS annotations, canonical transcript list  
3. Build canonical / non‑canonical transcript sets  
4. Build canonical proteome (CDS must start with `ATG`) and classify peptides  
5. Build alternative splicing proteome (CDS must start with `ATG`) and classify peptides  
6. Apply SNVs, build mutanome (CDS must start with `ATG`) and classify peptides  
7. Build alternative ORFs (3 frames) and classify peptides  
8. Identify amino acid mismatch using Hamming distance (`k` in 0–2)  
9. Write unaligned peptides and summary plots  
10. Finalize
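
The translation logic behind steps 4 to 7 can be sketched with the standard library alone. This is an illustration under stated assumptions, not the package's code: it uses the standard genetic code, truncates CDS translations at the first stop codon, and all function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(nucleotides):
    """Translate complete codons; unknown codons become 'X'."""
    seq = nucleotides.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")
        for i in range(0, len(seq) - 2, 3)
    )


def translate_cds(cds):
    """Translate a CDS only if it begins with ATG, else return None."""
    if not cds.upper().startswith("ATG"):
        return None
    # Assumption: translation ends at the first stop codon.
    return translate(cds).split("*")[0]


def three_frame_translations(transcript):
    """Translate the forward strand of a transcript in frames 0, 1 and 2."""
    return {frame: translate(transcript[frame:]) for frame in range(3)}
```

`translate_cds` mirrors the `ATG` filter applied to the canonical, alternative splicing, and mutanome proteomes, while `three_frame_translations` corresponds to the alternative ORF step.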

### Category definitions

- **CDS translation filter (`ATG`)**  
  For CDS‑based proteomes (canonical proteome, alternative splicing, mutanome), CDS translations are included only when the CDS begins with `ATG`. This reduces false positives from incomplete or mis‑annotated CDS records.

- **ORF region labels**  
  For alternative ORF hits, DarkProfiler labels the peptide start as:
  - `uORF` (upstream of CDS start)
  - `intORF` (inside annotated CDS span)
  - `dORF` (downstream of CDS end)
  - `lncRNA` (no CDS annotation)
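
The ORF region labels can be read as a simple comparison against the annotated CDS span. The sketch below assumes 1‑based inclusive transcript coordinates and treats the CDS boundaries themselves as `intORF`; both choices, and the function name, are assumptions for illustration.

```python
def label_orf_region(peptide_start, cds_start=None, cds_end=None):
    """Label a peptide start relative to the annotated CDS span.

    Coordinates are assumed to be 1-based and inclusive. Transcripts
    without a CDS annotation are labelled 'lncRNA'.
    """
    if cds_start is None or cds_end is None:
        return "lncRNA"
    if peptide_start < cds_start:
        return "uORF"
    if peptide_start > cds_end:
        return "dORF"
    return "intORF"
```

For a transcript with a CDS spanning positions 200 to 800, a peptide starting at 50 would be labelled `uORF`, one at 500 `intORF`, and one at 900 `dORF`.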

---

## Outputs

All outputs live in the specified `output_dir`.

### FASTA category files

Each category is represented by a separate FASTA file in `output_dir`:

- `canonicalProteome.fa`
- `alternativeSplicing.fa`
- `neoantigen.fa`
- `alternativeReadingFrame.fa`
- `aminoAcidMismatch.fa`
- `unknown.fa`

For classification FASTAs (all except `unknown.fa`), each record uses:

```text
> referencePeptide | TranscriptID | nucleotide coordinate on transcript | uORF/intORF/dORF/lncRNA/CDS
queryPeptide
```

- **referencePeptide**: matched reference peptide sequence (substring from the reference proteome/ORF; same length as the query)
- **TranscriptID**: transcript identifier (for alternative ORFs, this is the underlying transcript)
- **nucleotide coordinate on transcript**: 1‑based transcript coordinate of the peptide start codon (frame‑aware for alternative ORFs)
- **uORF/intORF/dORF/lncRNA/CDS**:
  - `CDS` for canonical proteome / alternative splicing / neoantigen hits
  - `uORF`, `intORF`, `dORF`, `lncRNA` for alternative ORF hits

Example:

```text
> GILGFVFTL | ENST00000335137.4 | 1234 | CDS
GILGFVFTL
```

`unknown.fa` uses the original peptide IDs and sequences without additional fields.
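
For downstream analysis, the classification FASTA headers can be split back into their four fields. The parser below is a minimal sketch based on the record layout shown above; it assumes one sequence line per record, and the `Hit` type and function names are hypothetical.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    reference_peptide: str
    transcript_id: str
    position: int       # 1-based transcript coordinate
    region: str         # CDS, uORF, intORF, dORF or lncRNA
    query_peptide: str


def parse_record(header, sequence):
    """Parse one '> ref | transcript | position | region' record."""
    fields = [field.strip() for field in header.lstrip(">").split("|")]
    if len(fields) != 4:
        raise ValueError(f"Expected 4 header fields, got {len(fields)}: {header!r}")
    reference_peptide, transcript_id, position, region = fields
    return Hit(reference_peptide, transcript_id, int(position), region,
               sequence.strip())


def read_hits(lines):
    """Yield Hit tuples from the lines of a classification FASTA file."""
    header = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line
        elif header is not None:
            yield parse_record(header, line)
            header = None
```

Passing an open file handle for `neoantigen.fa` to `read_hits` yields one `Hit` per record. `unknown.fa` has a different header layout and should not be fed to this parser.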

### `pieChart.tsv`

A tab‑separated summary file with one line per category:

```text
Category    Count
canonical   123
alternativeSplicing 45
neoantigen  7
alternativeReadingFrame 32
aminoAcidMismatch 10
unknown     83
```
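
The summary table is easy to post‑process. The sketch below reads the counts and converts them to fractions; it splits each line on the last run of whitespace so that it tolerates both tabs and spaces, and the function names are hypothetical.

```python
def read_counts(lines):
    """Read {category: count} from the lines of pieChart.tsv."""
    counts = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        parts = line.rsplit(None, 1)  # category, count
        if len(parts) != 2 or not parts[1].isdigit():
            continue  # header row or malformed line
        counts[parts[0]] = int(parts[1])
    return counts


def to_fractions(counts):
    """Convert counts into fractions of the total number of peptides."""
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in counts}
    return {name: value / total for name, value in counts.items()}
```

Applied to the example table above, the counts sum to 300 peptides, so the canonical category accounts for 123 / 300 = 41% of the input.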

### `pieChart.pdf`

A pie chart illustrating the fraction of peptides in each category is saved as `pieChart.pdf`.

---

## Database reuse and performance tips

- **Reuse databases**  
  Use `--database-path` to reuse a database directory containing the required FASTA files.

- **Persistent fast indices**  
  DarkProfiler builds on‑disk indices (`*.idx/`) for fast peptide lookup with Hamming distance ≤ 2 using a pigeonhole (seed‑and‑verify) strategy.
  When an index directory exists, it is reused automatically.

- **Multi‑threading**  
  Increase `--num-threads` to speed up peptide search / verification on multi‑core machines.
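
The pigeonhole idea mentioned above can be illustrated in a few lines: if a peptide is cut into `k + 1` pieces and matches a reference window with at most `k` substitutions, at least one piece must match exactly. The sketch below is a simplified stand‑in that uses `str.find` instead of an on‑disk index; the function names are hypothetical and nothing here reflects the actual index layout.

```python
def split_into_seeds(peptide, k):
    """Cut the peptide into k + 1 contiguous (offset, seed) pieces."""
    n = len(peptide)
    pieces = k + 1
    bounds = [i * n // pieces for i in range(pieces + 1)]
    return [(bounds[i], peptide[bounds[i]:bounds[i + 1]]) for i in range(pieces)]


def seed_and_verify(peptide, protein, k):
    """Return sorted (start, distance) hits with Hamming distance <= k."""
    n = len(peptide)
    if n <= k:
        raise ValueError("peptide must be longer than k")

    # Seed step: every exact seed occurrence proposes a candidate start.
    candidates = set()
    for offset, seed in split_into_seeds(peptide, k):
        pos = protein.find(seed)
        while pos != -1:
            start = pos - offset
            if 0 <= start <= len(protein) - n:
                candidates.add(start)
            pos = protein.find(seed, pos + 1)

    # Verify step: compute the full Hamming distance for each candidate.
    hits = []
    for start in sorted(candidates):
        window = protein[start:start + n]
        distance = sum(a != b for a, b in zip(peptide, window))
        if distance <= k:
            hits.append((start, distance))
    return hits
```

Only windows that share an exact seed are verified, which is why larger `k` values (shorter seeds, more candidates) tend to cost more time than exact matching.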

---

## Troubleshooting

**Unsupported reference**

- The reference must be one of `hg19`, `hg38`, `mm10`, `mm39`.

**Missing genome files**

- Run `darkprofiler download <reference>` in the same environment.

**Large runtime**

- Increase `--num-threads`.
- Use `-k/--hamming 0` (exact matching only) when mismatch detection is not needed.
- Reuse databases and indices between runs.

---

## License

DarkProfiler is released under the **MIT License**.

---

## Citation

If you use DarkProfiler in a scientific publication, please cite it as:

(Updated citation information will be provided once an associated preprint or manuscript is available.)
