Metadata-Version: 2.4
Name: METACARP
Version: 1.0.0
Summary: Workflow to screen shotgun metagenomic samples for the presence of biological impurities.
Author-email: Jolien D'aes <bioit@sciensano.be>
License-Expression: GPL-3.0-or-later
Project-URL: Homepage, https://github.com/BioinformaticsPlatformWIV-ISP/METACARP
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Classifier: Programming Language :: Python :: 3
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: PuLP==2.7.0
Requires-Dist: PyYAML>=6.0
Requires-Dist: beautifulsoup4>=4.11.1
Requires-Dist: biopython
Requires-Dist: numpy>=1.26.4
Requires-Dist: pandas>=2.1.0
Requires-Dist: pysam==0.23.3
Requires-Dist: nanofilt==2.8.0
Requires-Dist: snakemake==7.18.2
Requires-Dist: yattag>=1.14.0
Requires-Dist: pydantic==2.12.5
Requires-Dist: humanize==4.15.0
Requires-Dist: click==8.3.1
Requires-Dist: bs4==0.0.2
Requires-Dist: screed==1.1.2
Dynamic: license-file

# MetaCARP

MetaCARP (Metagenomic Contamination-Assessment-of-Retail-Products) is a workflow to screen shotgun metagenomic sequencing data
for the presence of biological impurities, including allergens. The workflow was developed for the analysis
of sequencing data derived from commercial vitamin-containing food products.

Fun fact: the common carp scavenges food by rooting in sediment and mouthing the contents to identify food items ([source](https://en.wikipedia.org/wiki/Common_carp)).

DISCLAIMER: This pipeline comes without any guarantees, and no legal claims can be made based on the results obtained
with this pipeline. Potential contaminations reported by this pipeline can be false positives,
and their presence must always be verified through additional (validated) assays.

### MetaCARP is also available on our public [Galaxy instance](https://galaxy.sciensano.be/) (registration required).

----

## INSTALLATION

### CONDA installation

```
conda install -c bioconda -c conda-forge metacarp
```

If the above command fails, MetaCARP can be installed in a new environment using the following commands:

```
conda create -n metacarp python=3.10
conda activate metacarp
conda install bioconda::metacarp -c bioconda -c conda-forge
```

Note that the conda installation does not include a Kraken2 database.
Instructions for building one are available in the [Kraken2 Manual](https://github.com/DerrickWood/kraken2/wiki/Manual#kraken-2-databases).

### Manual installation

The MetaCARP workflow has the following dependencies:
- [Kraken2 2.1.1](https://github.com/DerrickWood/kraken2/releases/tag/v2.1.1)
- [Minimap2 2.26](https://github.com/lh3/minimap2)
- [NCBI Datasets 16.40.1](https://github.com/ncbi/datasets/releases/tag/v16.40.1)
- [Fastp 0.23.4](https://github.com/OpenGene/fastp/releases/tag/v0.23.4)
- [Samtools 1.17](https://github.com/samtools/samtools/releases/tag/1.17)

The GMM screening pipeline has the following additional dependencies:
- [KMA 1.4.18](https://bitbucket.org/genomicepidemiology/kma)

The corresponding binaries should be in your PATH to run the workflow. 
Other versions of these tools may work, but have not been tested.
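
As a quick pre-flight check, the sketch below reports which of these tools cannot be found on PATH. The executable names (`kraken2`, `minimap2`, `datasets`, `fastp`, `samtools`, `kma`) are assumptions based on how these tools are commonly installed, and the script does not verify versions.

```python
import shutil

# Executable names are assumptions: they are the names these tools
# commonly install under, not names confirmed by the MetaCARP docs.
METACARP_TOOLS = ["kraken2", "minimap2", "datasets", "fastp", "samtools"]
GMM_TOOLS = ["kma"]


def missing_tools(tools, which=shutil.which):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools(METACARP_TOOLS + GMM_TOOLS)
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All expected executables were found on PATH.")
```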

Python 3.10 is recommended.

```
virtualenv metacarp_env --python=python3.10;
. metacarp_env/bin/activate;
pip install metacarp;
```

## USAGE

```
usage: METACARP [--ilmn-in ILMN_IN] [--ont-in ONT_IN] [--dir-working DIR_WORKING] --output OUTPUT [--output-html OUTPUT_HTML] 
            --kraken-db KRAKEN_DB [--usual-suspects USUAL_SUSPECTS] [--allergens ALLERGENS] 
            [--cutoff-allergens CUTOFF_ALLERGENS] [--cutoff-unclassified CUTOFF_UNCLASSIFIED] 
            [--cutoff-prok-fungi CUTOFF_PROK_FUNGI] [--cutoff-euk CUTOFF_EUK] 
            [--cutoff-confidence CUTOFF_CONFIDENCE] [--threads THREADS] [--version]

options:
  --ilmn-in ILMN_IN     Directory with Illumina input FASTQ files
  --ont-in ONT_IN       Directory with ONT input FASTQ files
  --dir-working DIR_WORKING
                        Working directory
  --output OUTPUT       Output directory
  --output-html OUTPUT_HTML 
                        Output report name
  --kraken-db KRAKEN_DB Directory with Kraken DB
  --usual-suspects USUAL_SUSPECTS   
                        TSV file with species, taxid and group (Bacteria, Fungi, Plants, or Animals) of usual suspects 
                        (default: resources/usual_suspects.tsv)
  --allergens ALLERGENS 
                        TSV file with taxids of allergens (default: resources/allergens.tsv)
  --cutoff-allergens CUTOFF_ALLERGENS
                        Minimal relative abundance (in %) of allergens detected with Kraken2 to report (default: 1)
  --cutoff-unclassified CUTOFF_UNCLASSIFIED
                        Warning threshold for relative abundance (in %) of unclassified reads detected with Kraken2 (default: 5)
  --cutoff-prok-fungi CUTOFF_PROK_FUNGI
                        Minimal relative abundance (in %) of prokaryotic or fungal species detected with Kraken2 to select for read mapping (default: 0.1)
  --cutoff-euk CUTOFF_EUK
                        Minimal relative abundance (in %) of eukaryotic (non-fungal) species detected with Kraken2 to select for read mapping (default: 1)
  --cutoff-confidence CUTOFF_CONFIDENCE
                        JSON configuration file with cutoffs for high and low confidence detection (default: resources/confidence_cutoffs.json)                  
  --threads THREADS
  --version             Print version and exit
```
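
To illustrate how the two read-mapping cutoffs could interact, here is a minimal sketch and not MetaCARP's actual implementation. The input shape, the `PROK_FUNGI_GROUPS` set, the inclusive comparison, and the example abundances are all assumptions made for illustration.

```python
# Group labels treated as "prokaryotic or fungal". "Archaea" is an assumption;
# the README only names Bacteria, Fungi, Plants and Animals.
PROK_FUNGI_GROUPS = {"Bacteria", "Archaea", "Fungi"}


def select_for_mapping(rows, cutoff_prok_fungi=0.1, cutoff_euk=1.0):
    """Return species whose relative abundance (in %) reaches its group's cutoff.

    `rows` is an iterable of (species, group, abundance_pct) tuples, a
    hypothetical input shape chosen for this example.
    """
    selected = []
    for species, group, abundance_pct in rows:
        cutoff = cutoff_prok_fungi if group in PROK_FUNGI_GROUPS else cutoff_euk
        # ">=" assumes the documented "minimal" abundance is inclusive.
        if abundance_pct >= cutoff:
            selected.append(species)
    return selected


# Made-up abundances, for illustration only.
example = [
    ("Escherichia coli", "Bacteria", 0.5),
    ("Aspergillus niger", "Fungi", 0.05),
    ("Triticum aestivum", "Plants", 0.5),
    ("Bos taurus", "Animals", 2.0),
]
print(select_for_mapping(example))
```

With the default cutoffs, a bacterium at 0.5% is selected while a plant at the same abundance is not, because non-fungal eukaryotes must reach 1%.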

### Basic usage example

The MetaCARP workflow processes Illumina and/or ONT FASTQ files as input.
Illumina data are provided with the `--ilmn-in` option and ONT data with the `--ont-in` option.
All input files should be gzipped, and Illumina file names should be formatted as `{samplename}_S*_R1_*.fastq.gz` and `{samplename}_S*_R2_*.fastq.gz`, or `{samplename}_1.fastq.gz` and `{samplename}_2.fastq.gz`.
```
METACARP \
    --ilmn-in in/ilmn/ \
    --ont-in in/ont/ \
    --output output/ \
    --kraken-db kraken_db/ \
    --dir-working work/ \
    --threads 8
```
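
Before starting a run, it can help to confirm that the Illumina file names fit the patterns described above. The sketch below is one reading of those patterns as regular expressions, treating each `*` as "any text". How MetaCARP itself derives sample names is not documented here, so the helper names and the matching rules are assumptions.

```python
import re
from pathlib import Path
from typing import Optional, Tuple

# Regex readings of the two documented patterns (each "*" read as "any text"):
#   {samplename}_S*_R1_*.fastq.gz / {samplename}_S*_R2_*.fastq.gz
#   {samplename}_1.fastq.gz       / {samplename}_2.fastq.gz
_PATTERNS = [
    re.compile(r"(?P<sample>.+)_S.*_R(?P<read>[12])_.*\.fastq\.gz"),
    re.compile(r"(?P<sample>.+)_(?P<read>[12])\.fastq\.gz"),
]


def parse_illumina_name(path: str) -> Optional[Tuple[str, str]]:
    """Return (sample name, read number), or None if the name fits neither pattern."""
    name = Path(path).name
    for pattern in _PATTERNS:
        match = pattern.fullmatch(name)
        if match:
            return match.group("sample"), match.group("read")
    return None


def unmatched_files(directory: str):
    """List files in `directory` whose names fit neither documented pattern."""
    return sorted(
        p.name
        for p in Path(directory).iterdir()
        if p.is_file() and parse_illumina_name(p.name) is None
    )
```

For example, `unmatched_files("in/ilmn/")` would list any files in the Illumina input directory that should be renamed before running the workflow.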

### GMM screening workflow

An additional script is available to screen shotgun metagenomic samples for the presence of known genetically modified microorganisms (GMM) based on a set of 'junction' sequences and marker genes.
This workflow employs KMA for read-based gene detection and relies on the 'Camel' code base developed by the Bioinformatics Platform of Sciensano.
 
```
usage: GMM_screening [--ilmn-in ILMN_IN] [--ont-in ONT_IN] [--dir-working DIR_WORKING] --output OUTPUT
            [--output-tsv OUTPUT_TSV] [--db DB] [--min-identity MIN_IDENTITY] [--min-coverage MIN_COVERAGE] [--threads THREADS]

options:
  --ilmn-in ILMN_IN     Directory with Illumina input FASTQ files
  --ont-in ONT_IN       Directory with ONT input FASTQ files
  --dir-working DIR_WORKING
                        Working directory
  --output OUTPUT       Output directory
  --output-tsv OUTPUT_TSV 
                        Output report name
  --db DB               Directory with GMM detection database: either 'junctions' (resources/DB_GMM/V6/junctions)
                        or 'genes-vectors', with complete genes and vectors (resources/DB_GMM/V6/genes-vectors)
  --min-identity MIN_IDENTITY
                        Minimal % identity with template to report kma hit (default: 90)
  --min-coverage MIN_COVERAGE
                        Minimal % coverage of template to report kma hit (default: 90)
  --threads THREADS
```


## CONTACT
[Create an issue](https://github.com/BioinformaticsPlatformWIV-ISP/METACARP/issues) to report bugs, propose new features, or ask for help.

## CITATION
If you use this tool, please consider citing our [publication](https://doi.org/10.1016/j.lwt.2025.118371).

-----

Copyright - 2026 Jolien D'aes <jolien.daes@sciensano.be>
