Metadata-Version: 2.4
Name: eplacer
Version: 0.1.0
Summary: Machine learning platform for taxonomic classification
Author: Christopher C Powers
Author-email: christopher.powers@noaa.gov
Classifier: Programming Language :: Python :: 3
Classifier: License :: Public Domain
Classifier: Operating System :: OS Independent
Requires-Python: >=3.12
Description-Content-Type: text/markdown
License-File: LICENSE.txt
Requires-Dist: argh
Requires-Dist: docopt>=0.6.2
Requires-Dist: pytorch>=2.5
Requires-Dist: torchvision
Requires-Dist: torchinfo
Requires-Dist: numpy
Requires-Dist: pandas
Requires-Dist: scipy
Requires-Dist: scikit-learn
Requires-Dist: networkx
Requires-Dist: shapely
Requires-Dist: matplotlib
Requires-Dist: sympy
Requires-Dist: tqdm
Requires-Dist: pyyaml
Requires-Dist: requests
Requires-Dist: click
Requires-Dist: pygeohash
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: license-file
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

## ePlacer

ePlacer is a taxonomic classification tool that uses deep-learning approaches to incorporate both sequence information and biogeographic information into taxonomic assignment of DNA sequences.

### Why use ePlacer

ePlacer's machine learning architecture, developed specifically for metabarcoding data, extends prediction beyond sequence-only classification tools (e.g. sequence alignment with BLAST or naive Bayes classifiers) by directly incorporating additional data into the probabilistic estimate of taxonomy. This matters because metabarcoding data often contain cases where two reference species have 100% sequence overlap but distinct geographic ranges. ePlacer discriminates between these cases and provides additional data for downstream taxonomic curation. As a result, ePlacer improves interoperability between metabarcoding datasets.

Currently, ePlacer offers pre-trained models for two popular metabarcoding regions: the [MiFish](https://doi.org/10.1007/s12562-020-01461-x) and the [ecoPrimer, or Riaz,](https://doi.org/10.1093/nar/gkr732) marker gene regions. For these two regions, ePlacer offers the following benefits:

* **Interoperability.** ePlacer is trained on global datasets, allowing for direct comparison between metabarcoding datasets, regardless of geographic region.
* **Portability.** Pre-trained models for both the MiFish and Riaz marker gene regions are containerized and available for out-of-the-box use.
* **Interactive Visualization.** ePlacer provides an interactive GUI and curation tool.
* **Increased Accuracy.** The ePlacer model architecture provides increased accuracy, precision, and recall compared to BLAST, naive Bayes, or least common ancestor approaches.
* **Trainability.** In addition to models for the two provided barcodes, this code repository provides tools for training new models.

For other barcode regions, training a new model should offer significant advantages. If you are interested in training a new model for ePlacer, please do not hesitate to reach out!

### Installation
Users can install the current version of ePlacer with conda.
```bash
conda install bioconda::eplacer
```

### Using ePlacer for classification
The ePlacer taxonomic assignment tool can be run in two ways: natively (through the ePlacer CLI or API) or with a [QIIME2](https://github.com/NEFSC/PEMAD-PBB-q2-ePlacer) plugin. This documentation covers native usage; details on the QIIME2 plugin can be found in the linked git repository.

ePlacer taxonomically classifies ASV sequences using two distinct types of information:
- Sequence information (inferred from ASVs)
- Biogeography (inferred from sample metadata and count tables)

Although not strictly required for assignment, BLAST results are also used in an automated curation step that checks "solvable" taxonomic assignments and resolves them more accurately.

Using this information, ePlacer generates a raw confidence of presence across all possible taxonomic labels. 

To run classification, ePlacer takes four input files (the BLAST output is used for curation and is not strictly required for assignment). Properly formatted examples are shown here:
- A fasta file of ASVs
```text
>ASV1
CCGTAAACTTAGATAAATTAGTACAACAAATATCGGCCCGGGAACT
>ASV2
CGGTAAACTTAGATATATTAGTACAACAAATATCGGCCCGGGAACT
>ASV3
CGGTAAACTTAGATATATTAGTACAACAAATATCGGCCCGGGAACT
```
- A geography metadata file
```text
#SampleID	Latitude	Longitude
Sample1	39.645946	-71.746641
Sample2	39.645946	-71.746641
```
- A count table
```text
#OTU ID	Sample1	Sample2
ASV1	15	0
ASV2	5	22
ASV3	0	10
```
- BLAST tabular output (generated with `-outfmt "6 qseqid sseqid pident evalue length qlen slen qstart qend sstart send sseq"`)
```text
ASV1	SubjectRef_A	100.00	1.45e-45	98	98	98	1	98	1	98	GCCGTAAACTTAGATAAATTAGTACAACAAATATCGGCCCGGGAACTACGAGCGCCAGCTTATAACCCAAAGGACTTGGCGCTGCTTCAGACCCCCCT
ASV2	SubjectRef_B	99.00	2.12e-42	98	98	98	1	98	1	98	GCGGTAAACTTAGATATATTAGTACAACAAATATCGGCCCGGGAACTACGAGCGCCTGCTTAAAACCCAAAGGTCTTGGCGGTGCTTCAGACCCCCCT
ASV3	SubjectRef_C	100.00	1.45e-45	98	98	98	1	98	1	98	GCGGTAAACTTAGATATATTAGTACAACAAATATCGGCCCGGGAACTACGAGCGCCTGCTTAAAACCCAAAGGTCTTGGCGGTGCTTCAGACCCCCCT
```
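
Because the four files refer to each other by ASV and sample ID, it can help to check that they agree before running classification. The following is a minimal sketch of such a check, not part of ePlacer itself: the function names (`fasta_ids`, `read_table`, `check_inputs`) are hypothetical, and it assumes tab-separated files laid out exactly as in the examples above.

```python
import csv
import io

# Column order taken from the -outfmt string shown above.
BLAST_COLUMNS = ("qseqid sseqid pident evalue length qlen slen "
                 "qstart qend sstart send sseq").split()


def fasta_ids(handle):
    """Return the set of record IDs (first word of each '>' header line)."""
    ids = set()
    for line in handle:
        if line.startswith(">") and line[1:].strip():
            ids.add(line[1:].split()[0])
    return ids


def read_table(handle):
    """Read a tab-separated table; return (header row, data rows)."""
    rows = [row for row in csv.reader(handle, delimiter="\t") if row]
    return rows[0], rows[1:]


def check_inputs(fasta, metadata, counts, blast=None):
    """Return a list of problems; an empty list means the files agree."""
    problems = []
    asvs = fasta_ids(fasta)

    # Every sample needs numeric coordinates within valid ranges.
    _, meta_rows = read_table(metadata)
    samples = set()
    for row in meta_rows:
        samples.add(row[0])
        try:
            lat, lon = float(row[1]), float(row[2])
        except (IndexError, ValueError):
            problems.append(f"sample {row[0]!r}: coordinates missing or not numeric")
            continue
        if not (-90 <= lat <= 90 and -180 <= lon <= 180):
            problems.append(f"sample {row[0]!r}: coordinates out of range")

    # Count table columns must be known samples; rows must be known ASVs.
    count_header, count_rows = read_table(counts)
    for sample in count_header[1:]:
        if sample not in samples:
            problems.append(f"sample {sample!r} is in the count table but has no metadata")
    for row in count_rows:
        if row[0] not in asvs:
            problems.append(f"{row[0]!r} is in the count table but not in the fasta file")

    # BLAST rows (optional) must have 12 columns and refer to known ASVs.
    if blast is not None:
        for row in csv.reader(blast, delimiter="\t"):
            if not row:
                continue
            if len(row) != len(BLAST_COLUMNS):
                problems.append(f"BLAST row for {row[0]!r} has {len(row)} columns, "
                                f"expected {len(BLAST_COLUMNS)}")
            elif row[0] not in asvs:
                problems.append(f"BLAST query {row[0]!r} is not in the fasta file")
    return problems


if __name__ == "__main__":
    fasta = io.StringIO(">ASV1\nCCGT\n>ASV2\nCGGT\n")
    metadata = io.StringIO("#SampleID\tLatitude\tLongitude\n"
                           "Sample1\t39.645946\t-71.746641\n")
    counts = io.StringIO("#OTU ID\tSample1\tSample2\nASV1\t15\t0\nASV3\t0\t10\n")
    for problem in check_inputs(fasta, metadata, counts):
        print(problem)
```

With real data, pass open file handles (for example `open("asvs.fasta")`) in place of the `io.StringIO` objects used in the demonstration.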
#### Acquiring pre-trained models
Pre-trained models are available from Zenodo (doi:10.5281/zenodo.20820029). Currently, models exist only for the MiFish and 12S-V5 ecoPrimer (Riaz) regions; others will be created and stored in the future. If you develop your own model, please don't hesitate to reach out.

Models for native use are distributed as compressed directories and can be downloaded and unpacked as follows:
```bash
wget https://zenodo.org/records/20820029/files/mifish.tar.gz
tar -xzf mifish.tar.gz
wget https://zenodo.org/records/20820029/files/riaz.tar.gz
tar -xzf riaz.tar.gz
```
Note that we also provide pre-compiled `*.qza` models for use with QIIME2. These can be found in the same Zenodo repository.

#### Running classification with pre-trained models
To classify sequences with a model, whether downloaded as described above or trained yourself, use the following command:
```bash
eplacer run-model --fasta <fasta path> --counts <count matrix> --geoData <geoData path> --confidence <threshold> --model <model path> --maskrate 0
```

#### Training new ePlacer models
Training new ePlacer models is simple. All that is required is an aligned fasta file for the barcode of interest (containing all available reference sequences), a flat taxonomy file, and a reference file for biogeography (currently, ePlacer supports the [OBIS csv download](https://obis.org/data/access/)).

ePlacer also supports custom references for biogeography, formatted as follows:
```text
#Species	Latitude	Longitude
SpeciesLabelA	39.645946	-71.746641
SpeciesLabelB	39.645946	-71.746641
```
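
If your occurrence records come from a source other than OBIS, they can be reshaped into this three-column layout with a few lines of Python. The sketch below is illustrative only: the function name `occurrences_to_reference` is hypothetical, and it assumes a comma-separated input that uses the Darwin Core column names `scientificName`, `decimalLatitude`, and `decimalLongitude`. It also assumes the species labels already match those in your taxonomy file.

```python
import csv
import io


def occurrences_to_reference(src, dst):
    """Write species and coordinates from an occurrence CSV as a tab-separated
    reference table; return the number of records written."""
    reader = csv.DictReader(src)
    writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
    writer.writerow(["#Species", "Latitude", "Longitude"])
    written = 0
    for record in reader:
        name = (record.get("scientificName") or "").strip()
        lat = (record.get("decimalLatitude") or "").strip()
        lon = (record.get("decimalLongitude") or "").strip()
        try:
            # Validate only; the original coordinate text is written unchanged.
            float(lat)
            float(lon)
        except ValueError:
            continue  # skip records without usable coordinates
        if not name:
            continue  # skip records without a species label
        writer.writerow([name, lat, lon])
        written += 1
    return written


if __name__ == "__main__":
    source = io.StringIO(
        "scientificName,decimalLatitude,decimalLongitude\n"
        "SpeciesLabelA,39.645946,-71.746641\n"
        "SpeciesLabelB,,\n"
    )
    target = io.StringIO()
    kept = occurrences_to_reference(source, target)
    print(f"kept {kept} record(s)")
    print(target.getvalue(), end="")
```

Records with missing or non-numeric coordinates are dropped rather than repaired, so it is worth comparing the returned count with the size of the input.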

To run the training, use the following:
```bash
eplacer train-model --fasta <alignment file> --taxa <taxonomy file> \
            --out <output directory> --taxlevel SPECIES \
            --geoData <obis data> --augments <value; several settings should be tested> \
            --maskrate <value; several settings should be tested> --threads 1
```
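
Since several `--augments` and `--maskrate` settings should be tested, one option is to generate a training command per combination and write each model to its own output directory. The sketch below only builds the command lines from the flags shown above; it does not run them. The helper name `training_commands`, the file names, and the example values are all hypothetical, and the values each flag actually accepts should be taken from the ePlacer CLI help.

```python
import itertools
import shlex


def training_commands(alignment, taxonomy, geodata, out_root,
                      augments, maskrates, threads=1):
    """Return one `eplacer train-model` argument list per parameter combination."""
    commands = []
    for augment, maskrate in itertools.product(augments, maskrates):
        # One output directory per combination so that runs do not overwrite each other.
        out_dir = f"{out_root}/augments{augment}_maskrate{maskrate}"
        commands.append([
            "eplacer", "train-model",
            "--fasta", alignment,
            "--taxa", taxonomy,
            "--out", out_dir,
            "--taxlevel", "SPECIES",
            "--geoData", geodata,
            "--augments", str(augment),
            "--maskrate", str(maskrate),
            "--threads", str(threads),
        ])
    return commands


if __name__ == "__main__":
    # Placeholder file names and parameter values, for illustration only.
    for command in training_commands("references.aln.fasta", "taxonomy.tsv",
                                     "obis.csv", "models",
                                     augments=[1, 2], maskrates=[0.0, 0.1]):
        print(shlex.join(command))
```

Each argument list can be passed to `subprocess.run(command, check=True)` to launch the corresponding training run.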

==============================================================

This repository is a scientific product and is not official communication of the National Oceanic and Atmospheric Administration, or the United States Department of Commerce. All NOAA GitHub project code is provided on an ‘as is’ basis and the user assumes responsibility for its use. Any claims against the Department of Commerce or Department of Commerce bureaus stemming from the use of this GitHub project will be governed by all applicable Federal law. Any reference to specific commercial products, processes, or services by service mark, trademark, manufacturer, or otherwise, does not constitute or imply their endorsement, recommendation or favoring by the Department of Commerce. The Department of Commerce seal and logo, or the seal and logo of a DOC bureau, shall not be used in any manner to imply endorsement of any commercial product or activity by DOC or the United States Government.
