Metadata-Version: 2.4
Name: decryptamp
Version: 2.1.0
Summary: In silico mining of encrypted antimicrobial peptides from proteomes
Author-email: Madson Aragão <madsondeluna@gmail.com>
License: decryptAMP — Software License
        
        Copyright (c) 2026
          Madson Allan de Luna-Aragão (Universidade Federal de Minas Gerais, UFMG)
          Rafael Lucas da Silva (Universidade Federal de Pernambuco, UFPE)
          João Pacífico (Universidade de Pernambuco, UPE)
          Denys Ewerton da Silva Santos (Universidade Federal de Pernambuco, UFPE)
          Luca Monticelli (Centre National de la Recherche Scientifique, CNRS)
          Ana Maria Benko-Iseppon (Universidade Federal de Pernambuco, UFPE)
        
        This software is registered with the Instituto Nacional da Propriedade
        Industrial of Brazil (INPI). Registration is currently pending; the
        registration number will be appended to this file once issued.
        
        ==============================================================================
        Permitted use
        ==============================================================================
        
        1. Academic and non-commercial research use is permitted free of charge,
           provided that:
        
           a) The software is cited in any resulting publication, presentation, or
              derivative work. The recommended citation is provided in the
              CITATION.cff file shipped with this repository.
        
           b) Modifications, redistributions, or derivative works must retain this
              LICENSE file unmodified, give visible credit to the original authors,
              and clearly state that the modified work is not the original.
        
           c) The software is provided "AS IS", without warranty of any kind, express
              or implied, including but not limited to the warranties of
              merchantability, fitness for a particular purpose, and non-infringement.
              In no event shall the authors or copyright holders be liable for any
              claim, damages, or other liability arising from the use of this
              software.
        
        ==============================================================================
        Restricted use
        ==============================================================================
        
        2. Commercial use, including (but not limited to) integration into
           commercial products, services, pipelines sold to third parties, or any
           activity that generates direct or indirect revenue, requires prior written
           authorization from the authors. Requests should be directed to the lead
           author through the project repository:
        
              https://github.com/madsondeluna/decryptAMP
        
        ==============================================================================
        Bundled third-party components
        ==============================================================================
        
        This distribution vendors the AMPidentifier classifier (`ampidentifier/`).
        That component is governed by its own license; refer to the upstream
        AMPidentifier repository for terms.
        
        This distribution also depends on the following open-source libraries,
        each governed by its own license: biopython, modlAMP, scikit-learn,
        xgboost, lightgbm, numpy, pandas, scipy, joblib, tqdm.
        
        ==============================================================================
        INPI registration status
        ==============================================================================
        
        Status:    pending
        Filed at:  Instituto Nacional da Propriedade Industrial (INPI), Brazil
        Number:    to be appended once the registration is issued
        
Project-URL: Homepage, https://github.com/madsondeluna/decryptAMP
Project-URL: Repository, https://github.com/madsondeluna/decryptAMP
Project-URL: Issues, https://github.com/madsondeluna/decryptAMP/issues
Project-URL: Changelog, https://github.com/madsondeluna/decryptAMP/blob/main/CHANGELOG.md
Keywords: antimicrobial peptides,AMP,bioinformatics,proteomics,machine learning,cryptic peptides
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Science/Research
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: biopython
Requires-Dist: pandas
Requires-Dist: numpy
Requires-Dist: modlamp
Requires-Dist: tqdm
Requires-Dist: scikit-learn<2.0,>=1.8.0
Requires-Dist: joblib
Requires-Dist: scipy
Requires-Dist: xgboost>=1.6.0
Requires-Dist: lightgbm>=4.0.0
Provides-Extra: dev
Requires-Dist: pytest>=8.0; extra == "dev"
Dynamic: license-file

# decryptAMP

> Bioinformatics tool for the identification and prediction of encrypted Antimicrobial Peptides (ecAMPs) from proteome data.

<table>
  <tr>
    <td align="center">
<pre>
 ██████╗ ███████╗ ██████╗██████╗ ██╗   ██╗██████╗ ████████╗ █████╗ ███╗   ███╗██████╗ 
 ██╔══██╗██╔════╝██╔════╝██╔══██╗╚██╗ ██╔╝██╔══██╗╚══██╔══╝██╔══██╗████╗ ████║██╔══██╗
 ██║  ██║█████╗  ██║     ██████╔╝ ╚████╔╝ ██████╔╝   ██║   ███████║██╔████╔██║██████╔╝
 ██║  ██║██╔══╝  ██║     ██╔══██╗  ╚██╔╝  ██╔═══╝    ██║   ██╔══██║██║╚██╔╝██║██╔═══╝ 
 ██████╔╝███████╗╚██████╗██║  ██║   ██║   ██║        ██║   ██║  ██║██║ ╚═╝ ██║██║     
 ╚═════╝ ╚══════╝ ╚═════╝╚═╝  ╚═╝   ╚═╝   ╚═╝        ╚═╝   ╚═╝  ╚═╝╚═╝     ╚═╝╚═╝     
</pre>
    </td>
  </tr>
</table>

decryptAMP is an end-to-end pipeline that mines proteomes for **encrypted antimicrobial peptides (ecAMPs)**. It performs in silico proteolytic digestion, computes 22 physicochemical and compositional descriptors per peptide, and classifies each peptide using **AMPidentifier** (a tuned soft-voting ensemble of five base classifiers). All results are saved with a complete provenance manifest (JSON) and a self-contained HTML report.

## Table of contents

- [Quick start](#quick-start)
- [Pipeline overview](#pipeline-overview)
- [Installation](#installation)
  - [PyPI](#pypi)
  - [Local install from source](#local-install-from-source)
  - [Docker](#docker)
- [Usage](#usage)
  - [Command-line interface](#command-line-interface)
  - [Examples with bacteria.faa](#examples-with-bacteriafaa)
- [Output layout](#output-layout)
  - [The 22 features](#the-22-features)
  - [Manifest JSON](#manifest-json)
  - [HTML report](#html-report)
- [Scientific notes](#scientific-notes)
- [AMPidentifier models](#ampidentifier-models)
- [Troubleshooting](#troubleshooting)
- [Testing](#testing)
- [Citation](#citation)

## Quick start

```bash
pip install decryptamp

# Run on the bundled E. coli K-12 MG1655 demo proteome (4298 proteins)
decryptamp
```

Outputs land in `results/bacteria/`:

```
results/bacteria/
├── encrypted_peptides_results.csv             # ecAMP candidates with 22 features + probability
├── encrypted_peptides_results.fasta           # ecAMP sequences with score in header
├── encrypted_peptides_results_manifest.json   # full provenance (versions, hashes, counts, parameters)
├── encrypted_peptides_results_report.html     # human-readable summary
└── encrypted_peptides_results_dedup_stats.txt # deduplication breakdown

results/bacteria.zip                           # compressed archive of the run directory
```

## Pipeline overview

```
proteome FASTA
      │
      ▼  in silico digestion (trypsin / chymotrypsin / caspase / pseudoenzyme)
encrypted peptides (8-50 aa, canonical residues only)
      │
      ▼  22 physicochemical + compositional descriptors (AMPidentifier)
feature matrix
      │
      ▼  exact deduplication (always) + optional CD-HIT clustering
unique encrypted peptides
      │
      ▼  AMPidentifier classifier (voting / rf / svm / gb / xgb / lgbm)
ecAMP candidates above the decision threshold
```

## Installation

### PyPI

Installing from PyPI is the recommended route. Python ≥ 3.10 is required.

```bash
pip install decryptamp
```

The package includes all AMPidentifier model weights (~63 MB). No additional downloads are needed.

Optional: install `cd-hit` for sequence-identity deduplication (`--dedup-cdhit`):

```bash
brew install cd-hit               # macOS
sudo apt-get install cd-hit       # Debian/Ubuntu
conda install -c bioconda cd-hit  # conda
```

### Local install from source

Requirements: Python ≥ 3.10, optional `cd-hit` binary for `--dedup-cdhit`.

```bash
git clone https://github.com/madsondeluna/decryptAMP.git
cd decryptAMP
python -m venv venv
source venv/bin/activate          # Linux/macOS  (Windows: venv\Scripts\activate)
pip install .
```

Optional, for CD-HIT clustering:

```bash
brew install cd-hit               # macOS
sudo apt-get install cd-hit       # Debian/Ubuntu
conda install -c bioconda cd-hit  # conda
```

#### Tested versions

The bundled AMPidentifier weights were generated and validated against the package versions below. `pyproject.toml` declares minimum constraints for installation flexibility, but if a `.pkl` fails to deserialize or numeric results drift unexpectedly, pin to these exact versions:

| Package | Version |
|---|---|
| Python | 3.13.7 |
| biopython | 1.86 |
| joblib | 1.5.2 |
| lightgbm | 4.6.0 |
| modlamp | 4.3.2 |
| numpy | 2.3.4 |
| pandas | 2.3.3 |
| scikit-learn | 1.8.0 |
| scipy | 1.16.3 |
| tqdm | 4.67.1 |
| xgboost | 3.2.0 |
| pytest (dev only) | 9.0.3 |

Quick install of the exact tested set:

```bash
pip install \
  biopython==1.86 joblib==1.5.2 lightgbm==4.6.0 modlamp==4.3.2 \
  numpy==2.3.4 pandas==2.3.3 scikit-learn==1.8.0 scipy==1.16.3 \
  tqdm==4.67.1 xgboost==3.2.0
```

#### Native shell alias (no Docker)

After `pip install .` inside a virtual environment, the `decryptamp` entry point is placed at `<venv>/bin/decryptamp`. To call it from any directory without activating the environment, add an alias pointing to that binary. Adjust the path to where you cloned the repository.

```bash
# Linux/macOS (zsh)
echo "alias decryptamp='/abs/path/to/decryptAMP/venv/bin/decryptamp'" >> ~/.zshrc
source ~/.zshrc

# Linux/macOS (bash)
echo "alias decryptamp='/abs/path/to/decryptAMP/venv/bin/decryptamp'" >> ~/.bashrc
source ~/.bashrc
```

After that, the tool is available everywhere:

```bash
cd /any/working/dir
decryptamp --input myproteome.faa --high-discovery-mode
# Output goes to ./results/myproteome/ in the current working directory.
```

### Docker

The bundled `Dockerfile` is multi-stage, slim, and includes `cd-hit` and the AMPidentifier model weights.

```bash
docker build -t decryptamp .
```

Run the demo proteome (results land in `./results` on the host):

```bash
docker run --rm -v "$PWD/results:/work/results" decryptamp
```

Run on your own proteome (mounted read-only):

```bash
docker run --rm \
  -v "/abs/path/to/proteomes:/data:ro" \
  -v "$PWD/results:/work/results" \
  decryptamp --input /data/myproteome.faa --model voting --high-discovery-mode
```

Pass any decryptAMP flag after the image name; it is forwarded directly to the `decryptamp` entry point.

#### Docker shell alias

For an experience identical to the native install, add an alias that bind-mounts the current working directory as `/data` inside the container. Both the input FASTA and the `results/` output directory then resolve transparently to your host CWD.

```bash
# Linux/macOS (zsh)
echo "alias decryptamp='docker run --rm -v \"\$PWD:/data\" -w /data decryptamp'" >> ~/.zshrc
source ~/.zshrc

# Linux/macOS (bash)
echo "alias decryptamp='docker run --rm -v \"\$PWD:/data\" -w /data decryptamp'" >> ~/.bashrc
source ~/.bashrc
```

Use exactly like the native command:

```bash
cd /any/working/dir
decryptamp --input myproteome.faa --high-discovery-mode
# Output appears in ./results/myproteome/ on the host.
```

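This alias keeps containers ephemeral (`--rm`) and produces no Docker-specific footprint in the output directory. With Docker Desktop on macOS, files end up owned by your host user. On Linux, ownership follows the UID the container process runs as, so if outputs come out owned by root, add `--user "$(id -u):$(id -g)"` to the `docker run` command.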

## Usage

### Command-line interface

```
usage: decryptamp [-h] [--input FASTA] [--output NAME] [--results-dir DIR]
                  [--force] [--workers N]
                  [--enzyme {trypsin,chymotrypsin,caspase,pseudoenzyme}]
                  [--model {voting,rf,svm,gb,xgb,lgbm}] [--threshold FLOAT]
                  [--high-discovery-mode] [--no-prediction]
                  [--dedup-cdhit FLOAT] [--keep-redundant] [--list-thresholds]

Mine encrypted antimicrobial peptides (ecAMPs) from proteome data.

input / output:
  --input FASTA         proteome FASTA (default: bundled E. coli demo)
  --output NAME         output CSV name or explicit path
  --results-dir DIR     parent dir for run outputs (default: results)
  --force               overwrite the run directory if it exists
  --workers N           parallel worker processes (default: 8)

digestion:
  --enzyme              cleavage rule (default: trypsin)

prediction:
  --model               AMPidentifier model (default: voting)
  --threshold FLOAT     decision threshold (default: per-model MCC-optimized)
  --high-discovery-mode override threshold to 0.9 (high precision)
  --no-prediction       skip prediction; save all unique peptides with features only

deduplication:
  --dedup-cdhit FLOAT   optional CD-HIT clustering at this identity (e.g. 0.95)
  --keep-redundant      also save the pre-deduplication CSV

utilities:
  --list-thresholds     print per-model MCC thresholds and exit
```

Run `decryptamp --help` to see the live grouped help in your terminal (with ANSI colours when stdout is a TTY).

| Flag | Default | Description |
|---|---|---|
| `--input FASTA` | bundled E. coli demo | Input proteome FASTA. Aborts with a clear error if the file looks like nucleotide data (>90% A/C/G/T/U/N). Reports duplicate IDs and suffixes them with `__dup1`, `__dup2`, etc., without losing data. |
| `--output NAME` | `encrypted_peptides_results.csv` | Output CSV name. If it has no path separator, the file is placed inside the run directory (see `--results-dir`). If it contains a path (e.g. `/tmp/x.csv`), the path is respected literally. |
| `--results-dir DIR` | `results` | Parent directory for run outputs. A subdirectory named after the input filename (without FASTA extension) is created inside. |
| `--force` | off | Overwrite the run directory if it already exists. Without this flag, decryptAMP aborts with a clear error to prevent accidental data loss. |
| `--workers N` | `os.cpu_count()` | Parallel worker processes for digestion. |
| `--enzyme {trypsin,chymotrypsin,caspase,pseudoenzyme}` | `trypsin` | In silico cleavage rule. See [Scientific notes](#scientific-notes) for the regex of each enzyme. |
| `--model {voting,rf,svm,gb,xgb,lgbm}` | `voting` | AMPidentifier model. The voting ensemble (Acc=92.9%, AUC=0.977, MCC=0.859 on validation) is recommended. |
| `--threshold FLOAT` | per-model MCC-optimized | Decision threshold for `ecAMP_Probability`. If omitted, uses the AMPidentifier MCC-optimized threshold for the selected model (e.g. 0.56 for voting). |
| `--high-discovery-mode` | off | Override the threshold with the high-precision discovery setting (0.9). Reduces false positives at the cost of recall. Calibrated for voting; emits a warning when used with other models. Ignored if `--threshold` is given explicitly. |
| `--no-prediction` | off | Skip the AMPidentifier prediction step. Saves all unique encrypted peptides with their 22 features only. |
| `--dedup-cdhit FLOAT` | off | Apply CD-HIT clustering at the given identity threshold (e.g. `0.95`) after exact deduplication. Requires the `cd-hit` binary in `PATH`. |
| `--keep-redundant` | off | Also save the pre-deduplication CSV (one row per peptide occurrence) as `<output>_redundant.csv`. |
| `--list-thresholds` | off | Print the per-model MCC-optimized threshold table and exit without running the pipeline. |
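
The two `--input` safeguards in the table (nucleotide detection and duplicate-ID suffixing) can be pictured with a small sketch. The function names are hypothetical, and two details are assumptions rather than documented behaviour: that the 90% check is applied to the pooled residues of the whole file, and that the first occurrence of an ID keeps its original name.

```python
from collections import Counter

NUCLEOTIDE_CODES = set("ACGTUN")


def looks_like_nucleotide(sequences, cutoff=0.90):
    """Return True if more than `cutoff` of all residues are A/C/G/T/U/N."""
    total = sum(len(seq) for seq in sequences)
    if total == 0:
        return False
    hits = sum(ch in NUCLEOTIDE_CODES for seq in sequences for ch in seq.upper())
    return hits / total > cutoff


def suffix_duplicate_ids(ids):
    """Rename repeated IDs to <id>__dup1, <id>__dup2, ... (first one unchanged)."""
    seen = Counter()
    renamed = []
    for protein_id in ids:
        count = seen[protein_id]
        renamed.append(protein_id if count == 0 else f"{protein_id}__dup{count}")
        seen[protein_id] += 1
    return renamed


print(looks_like_nucleotide(["ACGTACGTNN"]))        # DNA-like input
print(looks_like_nucleotide(["MKRLLILARETGR"]))     # protein-like input
print(suffix_duplicate_ids(["P1", "P2", "P1", "P1"]))
```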

### Examples with bacteria.faa

The bundled demo proteome (`bacteria.faa`) is a 4298-protein RefSeq proteome of *Escherichia coli* str. K-12 substr. MG1655. Numbers below are reproducible with the default seeds and the AMPidentifier weights shipped in this repository.

#### 1. Default run (trypsin + voting + MCC threshold)

```bash
decryptamp
```

```
Run directory: /abs/path/results/bacteria
Selected enzyme for digestion: Trypsin
Loading proteome from: /path/to/decryptamp/example-data/bacteria.faa
Successfully loaded 4298 protein sequences (1330117 aa total).
  Organism (consensus): Escherichia coli str. K-12 substr. MG1655
  Source database: RefSeq
Computing AMPidentifier features for 257845 peptides...
Deduplicating 257845 encrypted peptides...
  Exact dedup: 257845 -> 251756 (2.36% reduction).
Predicting AMP activity with AMPidentifier (VOTING)...
AMPidentifier model loaded: VOTING (threshold=0.56, 22 features).
Found 25784 ecAMPs (out of 251756 unique encrypted peptides) with ecAMP_Probability >= 0.56.
```

| metric | value |
|---|---|
| Proteins input | 4 298 |
| Encrypted peptides generated | 257 845 |
| After exact deduplication | 251 756 |
| ecAMPs predicted (threshold 0.56) | 25 784 |
| Yield per protein | 6.00 |
| Yield per kb of proteome | 19.39 |

#### 2. High-precision discovery (threshold 0.9)

```bash
decryptamp --high-discovery-mode
```

Use this when downstream synthesis or screening is expensive and you want to triage the highest-confidence candidates only. The `voting` ensemble shifts from its MCC-optimized threshold (0.56) to a fixed 0.9 cutoff.

| metric | default (0.56) | --high-discovery-mode (0.9) |
|---|---|---|
| ecAMPs predicted | 25 784 | 2 711 |
| Yield per protein | 6.00 | 0.63 |
| Yield per kb of proteome | 19.39 | 2.04 |

#### 3. Use a single base classifier instead of the ensemble

```bash
decryptamp --model rf --threshold 0.7
```

Available models with their MCC-optimized thresholds:

| Model | MCC-optimized threshold | Notes |
|---|---|---|
| `voting` | 0.56 | Soft-voting ensemble (recommended) |
| `rf` | 0.56 | Random Forest |
| `svm` | 0.47 | Support Vector Machine (RBF) |
| `gb` | 0.55 | Gradient Boosting |
| `xgb` | 0.48 | XGBoost |
| `lgbm` | 0.71 | LightGBM |
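
The precedence between `--threshold`, `--high-discovery-mode`, and the per-model defaults, as described in the flag table, can be summarized in a few lines. This is a minimal sketch, not the packaged implementation: `resolve_threshold` is a hypothetical name, and the returned labels reuse the threshold-source strings that the manifest section lists.

```python
# Per-model MCC-optimized thresholds, copied from the table above.
MCC_THRESHOLDS = {
    "voting": 0.56, "rf": 0.56, "svm": 0.47,
    "gb": 0.55, "xgb": 0.48, "lgbm": 0.71,
}
HIGH_DISCOVERY_THRESHOLD = 0.9


def resolve_threshold(model, threshold=None, high_discovery=False):
    """Return (threshold, source) following the documented precedence."""
    if threshold is not None:          # an explicit --threshold always wins
        return threshold, "user-override"
    if high_discovery:                 # --high-discovery-mode comes next
        return HIGH_DISCOVERY_THRESHOLD, "high-discovery"
    return MCC_THRESHOLDS[model], "mcc-optimized"


print(resolve_threshold("voting"))
print(resolve_threshold("lgbm", high_discovery=True))
print(resolve_threshold("rf", threshold=0.7, high_discovery=True))
```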

#### 4. Try a different enzyme

```bash
decryptamp --enzyme chymotrypsin
decryptamp --enzyme caspase            # cleaves after D (aspartic acid)
decryptamp --enzyme pseudoenzyme       # random control, fixed seed=42
```

The `pseudoenzyme` setting generates non-overlapping fragments of length sampled uniformly from `[8, 50]` using a fixed-seed RNG (seed=42) for reproducibility. It serves as a negative control: comparing its output with an enzyme run shows how much of the signal depends on the cleavage rule rather than on random fragmentation.
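
A rough sketch of this kind of seeded random fragmentation is shown below. It is an illustration only: `pseudo_digest` is a hypothetical name, and both the treatment of a too-short tail (dropped here) and the use of one RNG per call are assumptions that the README does not specify.

```python
import random


def pseudo_digest(sequence, min_len=8, max_len=50, seed=42):
    """Cut `sequence` into consecutive fragments of random length in [min_len, max_len].

    Returns (1-based start, fragment) pairs. A trailing remainder shorter
    than `min_len` is dropped (assumption).
    """
    rng = random.Random(seed)          # fixed seed gives identical output on every run
    fragments = []
    start = 0
    while start < len(sequence):
        length = rng.randint(min_len, max_len)   # inclusive on both ends
        fragment = sequence[start:start + length]
        if len(fragment) >= min_len:
            fragments.append((start + 1, fragment))
        start += length
    return fragments


protein = "ACDEFGHIKLMNPQRSTVWY" * 10
for position, fragment in pseudo_digest(protein):
    print(position, len(fragment))
```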

#### 5. Remove near-duplicate peptides with CD-HIT

```bash
decryptamp --dedup-cdhit 0.95
```

After exact deduplication, near-duplicates differing in 1-2 residues (e.g. missed-cleavage variants of the same core) are collapsed at the given identity threshold. Output gains `Cluster_ID`, `Cluster_Size`, and `Cluster_Members` columns. Typical reduction on bacterial proteomes is 60-80% at 0.95 identity.
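
For readers who want to reproduce the clustering step by hand, the sketch below assembles a plausible `cd-hit` command line. How decryptAMP itself invokes CD-HIT is not documented here, so every choice is an assumption: the function names and file names are hypothetical, `-n 5` is the word size CD-HIT recommends for identities between 0.7 and 1.0, and `-l 7` lowers CD-HIT's minimum-length cutoff (default 10) so that 8-residue peptides are not thrown away.

```python
import shutil
import subprocess


def build_cdhit_command(fasta_in, fasta_out, identity=0.95):
    """Build an argument list for clustering short peptides with cd-hit."""
    if not 0.7 <= identity <= 1.0:
        raise ValueError("word size 5 is only suitable for identity 0.7-1.0")
    return [
        "cd-hit",
        "-i", str(fasta_in),       # input FASTA
        "-o", str(fasta_out),      # representative sequences (plus <out>.clstr)
        "-c", f"{identity:.2f}",   # sequence identity threshold
        "-n", "5",                 # word size for 0.7-1.0 identity
        "-l", "7",                 # keep peptides longer than 7 residues
        "-d", "0",                 # keep full sequence names in the .clstr file
    ]


def run_cdhit(fasta_in, fasta_out, identity=0.95):
    """Run cd-hit if the binary is available on PATH."""
    if shutil.which("cd-hit") is None:
        raise FileNotFoundError("cd-hit binary not found in PATH")
    subprocess.run(build_cdhit_command(fasta_in, fasta_out, identity), check=True)


command = build_cdhit_command("unique_peptides.fasta", "clustered.fasta")
print(" ".join(command))
```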

#### 6. Audit redundancy before deduplication

```bash
decryptamp --dedup-cdhit 0.95 --keep-redundant
```

Adds `<output>_redundant.csv` with one row per peptide occurrence (before any dedup), useful for tracing each ecAMP back to all source proteins and start positions.

#### 7. Skip prediction (feature-only mode)

```bash
decryptamp --no-prediction
```

Computes the 22 features for every unique encrypted peptide and saves them without filtering. Useful for downstream analyses (PCA, UMAP, clustering, custom classifiers).

#### 8. Override output destination

```bash
decryptamp --output /tmp/my_results.csv
```

When `--output` contains a path separator, the run directory is **not** managed automatically. Sibling artifacts (manifest, HTML report, dedup stats) are written next to the CSV.

#### 9. Multiple proteomes side by side

```bash
decryptamp --input proteomes/Ecoli.faa
decryptamp --input proteomes/Athaliana.faa
decryptamp --input proteomes/Hsapiens.faa
```

Each produces its own subdirectory under `results/` (`Ecoli/`, `Athaliana/`, `Hsapiens/`), so multiple proteomes coexist without overwriting each other.
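
The mapping from input file to run directory can be sketched as follows. `run_directory` is a hypothetical helper, and the set of extensions treated as "FASTA extensions" is an assumption; the README only states that the FASTA extension is stripped.

```python
from pathlib import Path

# Assumed list of extensions; the real set used by decryptAMP may differ.
FASTA_EXTENSIONS = {".faa", ".fasta", ".fa", ".fas"}


def run_directory(input_path, results_dir="results"):
    """Return <results_dir>/<input name without FASTA extension>."""
    path = Path(input_path)
    stem = path.stem if path.suffix.lower() in FASTA_EXTENSIONS else path.name
    return Path(results_dir) / stem


for fasta in ["proteomes/Ecoli.faa", "proteomes/Athaliana.faa", "proteomes/Hsapiens.faa"]:
    print(fasta, "->", run_directory(fasta).as_posix())
```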

#### 10. Override the parent results directory

```bash
decryptamp --input data/myproteome.faa --results-dir /scratch/runs --force
```

Useful in HPC setups where outputs should land outside the working directory.

#### 11. Full feature combination on bacteria.faa

A reference command that combines most flags in a single run. Useful as a smoke test of a fresh installation.

```bash
decryptamp \
    --output ecoli_k12_full.csv \
    --results-dir results \
    --force \
    --workers 8 \
    --enzyme trypsin \
    --model voting \
    --high-discovery-mode \
    --dedup-cdhit 0.95 \
    --keep-redundant
```

This will generate, inside `results/bacteria/`:

```
ecoli_k12_full.csv                   # high-confidence ecAMPs with 22 features
ecoli_k12_full.fasta                 # same candidates as FASTA, score in header
ecoli_k12_full_manifest.json         # full provenance
ecoli_k12_full_report.html           # one-page HTML summary
ecoli_k12_full_dedup_stats.txt       # exact + CD-HIT 0.95 breakdown
ecoli_k12_full_redundant.csv         # pre-deduplication CSV (one row per occurrence)
```

Expected (rounded) on the bundled E. coli K-12 MG1655 demo:

| stage | count |
|---|---|
| Input proteins | 4 298 |
| Encrypted peptides generated | 257 845 |
| After exact deduplication | 251 756 |
| After CD-HIT @ 0.95 | ~50-80 thousand |
| ecAMPs (voting + threshold 0.9) | a few hundred to ~1 thousand |

## Output layout

By default every run creates `results/<input_stem>/`:

```
results/<input_stem>/
├── encrypted_peptides_results.csv             # main output, full feature table (always)
├── encrypted_peptides_results.fasta           # ecAMP sequences with score in header (always)
├── encrypted_peptides_results_manifest.json   # full provenance JSON (always)
├── encrypted_peptides_results_report.html     # self-contained HTML report (always)
├── encrypted_peptides_results_dedup_stats.txt # dedup breakdown (always)
├── encrypted_peptides_results_failed.csv      # only if any peptide was dropped
└── encrypted_peptides_results_redundant.csv   # only if --keep-redundant

results/<input_stem>.zip                       # compressed archive of the run directory (always)
```

The FASTA file is ready for downstream tools (alignment, BLAST, structure prediction) and for synthesis ordering. Header format:

```
>ecAMP_000001 ecAMP_score=0.9876 source=NP_414543.1:682 multiplicity=1 length=11
KLLILARETGR
>ecAMP_000002 ecAMP_score=0.9742 source=NP_414544.1:35 multiplicity=3 length=18
KWKLFKKIEKVGQNVRDG
```

The main CSV contains, for each ecAMP candidate:

| Column | Meaning |
|---|---|
| `Peptide` | amino-acid sequence (8-50 aa, canonical residues only) |
| `Length` | number of residues |
| `Multiplicity` | number of times this peptide was generated across the proteome |
| `Source_Proteins` | semicolon-separated list of source protein IDs |
| `Source_Positions` | parallel list of 1-based start positions |
| `Cluster_ID` | CD-HIT cluster ID (only if `--dedup-cdhit` was used) |
| `Cluster_Size` | number of peptides in the cluster (only with `--dedup-cdhit`) |
| `Cluster_Members` | semicolon-separated peptide sequences in the cluster |
| `Charge`, `pI`, `InstabilityInd`, ... | the 22 AMPidentifier features |
| `ecAMP_Probability` | model probability of being an ecAMP (range 0-1) |
| `ecAMP_Prediction` | binary call (1 if probability ≥ threshold, else 0) |
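
Because `Source_Proteins` and `Source_Positions` are parallel lists, they can be zipped back into per-occurrence pairs. The sketch below uses only the standard library. The CSV row is made up for illustration and shows only a subset of columns, and the semicolon separator for `Source_Positions` is an assumption (the table states it only for `Source_Proteins`).

```python
import csv
import io

# Illustrative row only; not taken from a real decryptAMP run.
SAMPLE_CSV = (
    "Peptide,Length,Multiplicity,Source_Proteins,Source_Positions,"
    "ecAMP_Probability,ecAMP_Prediction\n"
    "KLLILARETGR,11,2,NP_414543.1;NP_414544.1,682;35,0.9876,1\n"
)


def source_locations(row):
    """Pair each source protein with its 1-based start position."""
    proteins = row["Source_Proteins"].split(";")
    positions = [int(p) for p in row["Source_Positions"].split(";")]
    return list(zip(proteins, positions))


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
locations = source_locations(rows[0])
print(rows[0]["Peptide"], locations)
```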

### The 22 features

| Group | Count | Names |
|---|---|---|
| Global descriptors (modlAMP) | 6 | `Charge`, `pI`, `InstabilityInd`, `AliphaticInd`, `BomanInd`, `HydrophRatio` |
| Hydrophobic moment (modlAMP, Eisenberg, angle 100°) | 1 | `HydrophobicMoment` |
| Grouped amino-acid composition | 9 | `f_acidic`, `f_basic`, `f_polar`, `f_nonpolar`, `f_aliphatic`, `f_aromatic`, `f_charged`, `f_small`, `f_tiny` |
| Free Energy of Transition local (D1) | 3 | `FET_low_D1`, `FET_mid_D1`, `FET_high_D1` |
| Solvent accessibility local (D1) | 3 | `SA_buried_D1`, `SA_exposed_D1`, `SA_inter_D1` |

Charges are computed at pH 7.0 with `amide=True` (matching the AMPidentifier training convention).

### Manifest JSON

Every run writes a complete `_manifest.json` covering tool version, git commit, full command line, input file SHA-256, proteome organism and source database (extracted from FASTA headers), digestion parameters, feature parameters, deduplication statistics, model SHA-256, decision threshold and its source (`mcc-optimized` / `high-discovery` / `user-override` / `deprecated-min-prob`), and SHA-256 of every output artifact.

A typical `pipeline_summary` block:

```json
{
  "n_proteins_input": 4298,
  "n_encrypted_peptides_generated": 257845,
  "n_encrypted_peptides_dropped_nonfinite": 0,
  "n_encrypted_peptides_after_exact_dedup": 251756,
  "n_encrypted_peptides_after_cdhit": null,
  "n_ecamps_predicted": 25784,
  "ecamps_yield_per_protein": 5.999069,
  "ecamps_yield_per_kb_proteome": 19.386758
}
```

The manifest is sufficient to bit-identically reproduce the run from the same input.
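
The derived numbers in `pipeline_summary` can be rechecked from the raw counts. The sketch below parses the block shown above and recomputes the exact-deduplication reduction and the per-protein yield; the variable names are illustrative.

```python
import json

pipeline_summary = json.loads("""
{
  "n_proteins_input": 4298,
  "n_encrypted_peptides_generated": 257845,
  "n_encrypted_peptides_dropped_nonfinite": 0,
  "n_encrypted_peptides_after_exact_dedup": 251756,
  "n_encrypted_peptides_after_cdhit": null,
  "n_ecamps_predicted": 25784,
  "ecamps_yield_per_protein": 5.999069,
  "ecamps_yield_per_kb_proteome": 19.386758
}
""")

generated = pipeline_summary["n_encrypted_peptides_generated"]
unique = pipeline_summary["n_encrypted_peptides_after_exact_dedup"]

# Share of peptides removed by exact deduplication, in percent.
dedup_reduction_pct = 100 * (generated - unique) / generated

# ecAMPs predicted per input protein.
yield_per_protein = (
    pipeline_summary["n_ecamps_predicted"] / pipeline_summary["n_proteins_input"]
)

print(f"exact dedup reduction: {dedup_reduction_pct:.2f}%")
print(f"yield per protein: {yield_per_protein:.6f}")
```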

### HTML report

A self-contained HTML page (no JavaScript, no external resources, plain CSS) is written next to every CSV. It renders the manifest as a one-page summary with KPI cards, the proteome → encrypted-peptides → unique → ecAMPs flow, organism and source-database metadata extracted from the FASTA, and tables for every parameter used. Suitable for sharing with collaborators or attaching to a manuscript as supplementary material.

Open with any browser:

```bash
open results/bacteria/encrypted_peptides_results_report.html
```

The CSV and JSON outputs are structured as direct inputs for [ecAMPdb](https://github.com/madsondeluna/ecAMPdb), an open database of encrypted antimicrobial peptides covering organisms from all six kingdoms and viruses.

## Scientific notes

**Canonical residues only.** Peptides containing any residue outside the 20 canonical amino acids (A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y) are discarded silently during digestion. This avoids the silent feature bias that arises when ambiguous codes (X, B, Z, J, U, O) are substituted with arbitrary canonical residues.

**Enzymatic cleavage rules.**

| Enzyme | Regex | Description |
|---|---|---|
| `trypsin` | `(?<=[RK])(?!P)` | Cleaves after R or K, not before P |
| `chymotrypsin` | `(?<=[FWY])(?!P)` | Cleaves after F, W, Y, not before P |
| `caspase` | `(?<=D)` | Cleaves after any D (aspartic acid) |
| `pseudoenzyme` | random, seed=42 | Negative control: uniform random fragmentation |

**Length filter.** Generated peptides are kept only if `8 ≤ length ≤ 50` (configurable in `src/decryptamp/config.py`).

**Missed cleavages.** Up to 2 missed cleavages allowed by default (configurable in `src/decryptamp/config.py`).
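
The cleavage regexes, the length filter, the canonical-residue filter, and missed cleavages fit together roughly as in the sketch below. It is a simplified illustration under stated assumptions, not the code in `src/decryptamp/`: the `digest` function and its parameter names are hypothetical, and it relies on `re.split` accepting zero-width matches (Python 3.7 or later).

```python
import re

CLEAVAGE_RULES = {
    "trypsin": r"(?<=[RK])(?!P)",
    "chymotrypsin": r"(?<=[FWY])(?!P)",
    "caspase": r"(?<=D)",
}
CANONICAL = set("ACDEFGHIKLMNPQRSTVWY")


def digest(sequence, enzyme="trypsin", missed_cleavages=2, min_len=8, max_len=50):
    """Return (1-based start, peptide) pairs for an in silico digest."""
    # Zero-width split at every cleavage site; drop the empty trailing piece.
    pieces = [p for p in re.split(CLEAVAGE_RULES[enzyme], sequence) if p]

    # 1-based start position of each piece in the parent protein.
    starts, position = [], 1
    for piece in pieces:
        starts.append(position)
        position += len(piece)

    peptides = []
    for i in range(len(pieces)):
        # Join 1..(missed_cleavages + 1) consecutive pieces.
        for extra in range(missed_cleavages + 1):
            if i + extra >= len(pieces):
                break
            peptide = "".join(pieces[i:i + extra + 1])
            if min_len <= len(peptide) <= max_len and set(peptide) <= CANONICAL:
                peptides.append((starts[i], peptide))
    return peptides


toy_protein = "AAAAAAAAKGGGGGGGGRPCCCCCCCCK"   # note: no cut between R and P
for start, peptide in digest(toy_protein):
    print(start, peptide)
```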

**Charge calculation.** `pH=7.0`, `amide=True`. The amidation flag matches the AMPidentifier training convention; several natural AMPs (e.g. cecropins, melittin) are C-terminally amidated in vivo, which removes the negative charge of the C-terminal carboxylate and so raises the net charge by about +1.

**Hydrophobic moment.** Computed with the Eisenberg scale and a 100° angle (canonical α-helix amphipathicity).

**Failure handling.** Peptides whose feature vector contains any NaN or Inf value are dropped before classification and logged to `<output>_failed.csv`. The classifier itself raises `ValueError` if NaN/Inf reaches it (defense in depth). Zero-vectors are never silently fed to the model.
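
The drop-and-log behaviour amounts to partitioning rows by whether every feature is finite. A minimal sketch follows; `split_finite` is a hypothetical name, and the two-feature rows are illustrative rather than real pipeline output.

```python
import math


def split_finite(rows, feature_names):
    """Separate rows whose features are all finite from rows with NaN/Inf."""
    kept, failed = [], []
    for row in rows:
        is_finite = all(math.isfinite(row[name]) for name in feature_names)
        (kept if is_finite else failed).append(row)
    return kept, failed


features = ["Charge", "pI"]
rows = [
    {"Peptide": "KLLILARETGR", "Charge": 2.0, "pI": 10.5},
    {"Peptide": "AAAAAAAA", "Charge": float("nan"), "pI": 6.0},
    {"Peptide": "GGGGGGGG", "Charge": 0.0, "pI": float("inf")},
]
kept, failed = split_finite(rows, features)
print(len(kept), "kept;", len(failed), "would go to <output>_failed.csv")
```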

**Reproducibility.** All randomness is seeded (pseudoenzyme: 42). Per-model MCC-optimized thresholds are loaded from `src/ampidentifier/models/threshold_<model>.txt`. The manifest records SHA-256 of input, model file, and output CSV.
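
To check a file against a hash recorded in the manifest, recompute its SHA-256 and compare the hex digests. The sketch below shows the hashing side with the standard library. The demo file is a temporary stand-in, and the manifest's key names are not given in this README, so look up the recorded value in your own `_manifest.json`.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo on a throwaway file; replace with a real input or output path.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.csv"
    demo_path.write_bytes(b"abc")
    demo_digest = sha256_of_file(demo_path)

print(demo_digest)
```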

## AMPidentifier models

The classifier is the bundled AMPidentifier (vendored under `src/ampidentifier/`). The voting ensemble is a soft average of five base learners, each tuned via 5-fold StratifiedKFold and RandomizedSearchCV (`n_iter=50`, `scoring='roc_auc'`).

| Model | Accuracy | AUC-ROC | MCC | Notes |
|---|---|---|---|---|
| Voting (default) | 92.9% | 0.977 | 0.859 | Soft-voting ensemble of the five below |
| Random Forest | 91.9% | 0.972 | 0.839 | |
| Support Vector Machine (RBF) | 91.9% | 0.969 | 0.839 | Uses `StandardScaler` |
| Gradient Boosting | 92.0% | 0.974 | 0.839 | |
| XGBoost | 92.2% | 0.974 | 0.843 | |
| LightGBM | 92.7% | 0.975 | 0.855 | |

Metrics were computed on a 20% holdout of the AMPidentifier training set (13 246 peptides total, balanced 6 623 AMP / 6 623 non-AMP).
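
Soft voting means averaging the class probabilities of the base learners and thresholding the mean. The sketch below assumes equal weights, which is what "soft average" suggests but is not stated explicitly; the per-model probabilities are invented for illustration.

```python
def soft_vote(probabilities):
    """Average the AMP probabilities of the base learners (equal weights assumed)."""
    return sum(probabilities) / len(probabilities)


def classify(probabilities, threshold=0.56):
    """Return (ensemble probability, binary call) at the given threshold."""
    probability = soft_vote(probabilities)
    return probability, int(probability >= threshold)


# Hypothetical per-model outputs for one peptide.
base_outputs = {"rf": 0.62, "svm": 0.48, "gb": 0.58, "xgb": 0.51, "lgbm": 0.66}

probability, call = classify(list(base_outputs.values()))
print(f"ensemble probability = {probability:.2f}, prediction = {call}")

# The same peptide is rejected under the 0.9 high-discovery cutoff.
_, strict_call = classify(list(base_outputs.values()), threshold=0.9)
print("prediction at 0.9 =", strict_call)
```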

## Troubleshooting

**`Error: 'X.faa' looks like a nucleotide sequence`** — The input FASTA contains too many A/C/G/T/U/N residues to be a protein. Translate it first (e.g. Prodigal, six-frame translation) or pass a protein FASTA.

**`Error: run directory '...' already exists and is not empty`** — Pass `--force` to overwrite, `--results-dir` to write elsewhere, or `--output PATH` (with separators) to fully override.

**`cd-hit binary not found in PATH`** — Install CD-HIT (`brew install cd-hit`, `apt-get install cd-hit`, `conda install -c bioconda cd-hit`) or omit `--dedup-cdhit`.

**`AmpPredictor received N rows with NaN/Inf in feature columns`** — A feature calculation produced non-finite values for some peptides. The orchestrator should have dropped them upstream; this error indicates a bug. Check `<output>_failed.csv` for context and please open an issue.

**`Warning: --high-discovery-mode applies a fixed threshold of 0.9 calibrated for the voting ensemble`** — You combined `--high-discovery-mode` with a non-voting model. The 0.9 cutoff is calibrated for voting; per-model probability distributions differ. For per-model calibrated cutoffs use `--threshold` explicitly.

**Sklearn version warning when loading models** — The bundled `.pkl` files were trained with scikit-learn ≥ 1.8.0. Older versions still load but may produce slightly different numeric results in edge cases. `pip install --upgrade scikit-learn` to silence.

## Testing

A pytest suite covers the scientific contract of the digestion module, the AMP classifier input validation, and the manifest schema. The default invocation runs only the fast unit tests; opt-in flags expand coverage.

```bash
pip install ".[dev]"   # only needs pytest

# Default: fast unit tests, no model loading (~3 s, 49 tests)
pytest

# Add the slow tests that load the AMPidentifier weights (~30 s)
pytest --run-slow

# Full suite, including end-to-end runs against bacteria.faa (~10 min)
pytest --run-all
```

Test layout:

| File | Coverage | Marker |
|---|---|---|
| `tests/test_peptide_processor.py` | enzyme regexes (trypsin/chymotrypsin/caspase), canonical-AA filter, pseudoenzyme determinism, missed cleavages, 1-based positions | none (fast) |
| `tests/test_amp_predictor.py` | NaN/Inf input validation, missing feature columns, MCC threshold values per model | mostly fast; model-loading tests marked `@slow` |
| `tests/test_manifest.py` | JSON schema completeness, SHA-256 validity, `--no-prediction` handling | none (fast) |

## Citation

If decryptAMP supports your research, please cite:

> Luna-Aragão, M. A., da Silva, R. L., Santos, D. E., Pacífico, J., & Benko-Iseppon, A. M.
> decryptAMP: A bioinformatics tool for the identification and prediction of encrypted Antimicrobial Peptides (ecAMPs) from proteome data.

Repository: https://github.com/madsondeluna/decryptAMP
