{% extends 'base.html' %} {% block stylelink %}"{{ url_for('static', filename='css/documentation_style.css') }}"{% endblock %} {% block script %} "{{url_for('static', filename='js/doc_functions.js')}}" {% endblock %} {% block content %}

General Information

If you'd like to contact us about expanding this set, please email either chg60@pitt.edu or laa89@pitt.edu.

Report an issue.


DEPhT is a new tool for identifying prophages in bacteria, and was developed with a particular interest in being able to rapidly scan hundreds to thousands of genomes and accurately extract complete (likely active) prophages from them.
A detailed manuscript has been submitted to Nucleic Acids Research, but in brief DEPhT works by using genome architecture (rather than homology) to identify genomic regions likely to contain a prophage. Any regions with phage-like architecture (characterized as regions with high gene density and few transcription direction changes) are then further scrutinized using two passes of homology detection. The first pass identifies genes on putative prophages that are homologs of (species/clade/genus-level) conserved bacterial genes, and uses any such genes to disrupt the prophage prediction. The second pass (disabled in the 'fast' runmode) identifies genes on putative prophages that are homologs of conserved, functionally annotated phage genes. Finally, prophage regions that pass the previous filters are subjected to a BLASTN-based attL/attR detection scheme that gives DEPhT better boundary detection than any tool we are aware of.
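To make the architecture idea concrete, here is a minimal Python sketch of the two signals mentioned above: gene density and transcription direction changes, measured over a sliding window of genes. It is not DEPhT's actual implementation. The `Gene` class, the function names, the window size and both cutoffs are assumptions chosen purely for illustration.

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple


@dataclass
class Gene:
    start: int   # 0-based start coordinate on the contig
    end: int     # end coordinate (exclusive)
    strand: int  # +1 for forward, -1 for reverse


def window_features(genes: List[Gene]) -> Tuple[float, int]:
    """Return (coding fraction, number of strand changes) for a run of genes.

    Assumes genes are sorted by start and do not overlap one another.
    """
    span = genes[-1].end - genes[0].start
    coding = sum(g.end - g.start for g in genes)
    changes = sum(1 for a, b in zip(genes, genes[1:]) if a.strand != b.strand)
    return coding / span, changes


def phage_like_windows(genes: List[Gene], size: int = 5,
                       min_density: float = 0.9,
                       max_changes: int = 1) -> Iterator[Tuple[int, int]]:
    """Yield (first, last) gene indices of windows passing both cutoffs.

    The cutoffs are illustrative placeholders, not DEPhT's defaults.
    """
    for i in range(len(genes) - size + 1):
        density, changes = window_features(genes[i:i + size])
        if density >= min_density and changes <= max_changes:
            yield i, i + size - 1
```

A tightly packed run of same-strand genes passes both cutoffs, while a sparse run of genes that alternate strands fails both. That is the intuition behind "phage-like architecture".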


Running DEPhT

Arguments

In order to run DEPhT, you will need to provide two arguments:
  1. One or more genome sequences in either FASTA or Genbank flatfile format
  2. A desired output directory

DEPhT infers the type of each input file from its contents when parsing it, rather than from the file extension. As far as we are aware, this makes DEPhT somewhat unusual among prophage-detection tools, because a single run can take a set of files in mixed formats. FASTA files will be treated as un-annotated, and the sequences parsed from them will be auto-annotated prior to prophage detection. Genbank flatfiles will be treated as annotated genomes, and will therefore bypass the auto-annotation step and run ~20-30 seconds faster than their FASTA counterparts.
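As an illustration of content-based format detection, the sketch below guesses the format from the first non-blank line of a file: FASTA records begin with `>`, and Genbank flatfiles begin with a `LOCUS` line. This is an assumption about how such inference could work, not a description of DEPhT's parser. The function names `sniff_format` and `group_by_format` are hypothetical.

```python
def sniff_format(path):
    """Guess 'fasta' or 'genbank' from the first non-blank line of a file.

    Returns None when the file matches neither convention.
    """
    with open(path, "r") as handle:
        for line in handle:
            if not line.strip():
                continue  # skip leading blank lines
            if line.startswith(">"):
                return "fasta"
            if line.startswith("LOCUS"):
                return "genbank"
            break  # first real line matched neither format
    return None


def group_by_format(paths):
    """Group input paths by sniffed format, ignoring file extensions."""
    groups = {"fasta": [], "genbank": [], None: []}
    for path in paths:
        groups[sniff_format(path)].append(path)
    return groups
```

With this kind of check, a FASTA file named `genome.txt` and a Genbank flatfile named `genome.dat` can be passed together and still be routed correctly.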

If a prophage region is discovered, or if the "Dump data" option is selected, DEPhT will create a directory inside the specified output directory for each of the input sequences. For those sequences that have predicted prophages, DEPhT will write an .html file with a visualization of the discovered prophage region(s). It will also output a FASTA (sequence) file and a Genbank (annotation) file for each extracted prophage sequence. See below for more details on DEPhT's output files.

Other options

What follows is a description of DEPhT's optional arguments. These are described in isolation, but can be mixed and matched using different values to specifically tune the behavior of DEPhT to suit your needs. Default parameters were all set to optimize performance in Mycobacterium genomes.

Output

DEPhT's output consists of three main files:
  1. An .html file with a visualization of the discovered prophage regions
  2. A .csv spreadsheet with the primary data used to discern prophage regions - one file per contig
  3. A .gbk Genbank flatfile with DEPhT's annotation of the inputted sequence - one file per contig

DEPhT's graphical .html output displays a circular map of the input genome and a linear genome map of the phage region(s), both drawn with DnaFeaturesViewer, as well as the coordinates of the discovered regions in a colored table rendered with pretty-html-table.

In each of these genome maps and coordinate tables, prophage and protein-coding sequence features are colored green if they are forward-oriented and red if they are reverse-oriented. In the circular genome map, each prophage feature is labeled with the prophage region name assigned by DEPhT. In the linear genome map(s), protein-coding features are labeled with the phage products identified by DEPhT.
DEPhT's data .csv output contains data for each protein-coding feature in the inputted sequence file.
The columns in this output are the following:



Training New Models

Selection of Training Genomes

This is by far the highest hurdle for training new models. The better the training genomes are selected, the better the model will perform. We highly recommend only training against completely sequenced bacteria and manually annotated phages.
There's an important tradeoff you'll need to make when training models: volume of data versus quality of data. A relatively small dataset (~100 phages and 30-45 bacteria) can yield incredibly high-quality models if the genomes are chosen well and especially if the phage genomes are well-annotated. Assuming all the training data is high-quality, increasing the amount of training data will likely improve the quality of predictions made by DEPhT, with the caveat that larger models will necessarily increase the DEPhT runtime, which will be most noticeable in the fast runmode.

Let's suppose you want to train a new model for Mycobacteria. A good start would be to head to PATRIC and navigate to the Mycobacteriaceae.

Retrieve Bacterial Genomes

In the taxonomy tree, the steps to get here are:
Terrabacteria group >> Actinobacteria >> Actinomycetia >> Corynebacteriales >> Mycobacteriaceae
The red box below shows where to click to get to the home page for the family or genus of interest.
From there, navigate to the "Genomes" tab to see all the available genomes in the chosen taxon. Click "Filters", and a good choice might be to select only those genomes where "Genome Status" is "Complete", and "Reference Genome" is either "Representative" or "Reference", and "Genome Quality" is "Good". Hit "Apply" to apply those filters. You can download FASTA files for these genomes by selecting all the genomes in the table, and clicking the "DWNLD" button.
Click "More Options", and in the popup dialog box, check the box next to "Genomic Sequences in FASTA (*.fna)" before pressing "Download".
Of course you are free to add any additional genomes you'd like to better populate the spectrum of diversity in the genus. In our case, we added several Mycobacterium abscessus strains to fill in the so-called Mycobacterium abscessus complex (MAC).
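If you would rather apply the same three filters to a downloaded metadata table than click through the web interface, a minimal sketch could look like the following. It assumes a CSV export whose column headers match the filter names quoted above ("Genome Status", "Reference Genome", "Genome Quality"). The real export may use different headers, so check them before relying on this. The function names are hypothetical.

```python
import csv


def keep_genome(row):
    """Apply the three suggested filters to one metadata row (a dict)."""
    return (row.get("Genome Status") == "Complete"
            and row.get("Reference Genome") in {"Representative", "Reference"}
            and row.get("Genome Quality") == "Good")


def filter_genomes(handle):
    """Return the rows of a CSV metadata table that pass keep_genome.

    `handle` is any open text file (or file-like object) with a header row.
    """
    return [row for row in csv.DictReader(handle) if keep_genome(row)]
```

Genomes added by hand to broaden diversity, such as the extra Mycobacterium abscessus strains mentioned above, would bypass this filter and simply be appended to the result.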

Check bacteria for Prophages

Ideally, you'll run these genomes through PHASTER or some other prophage prediction tool to get the approximate coordinates of any complete prophages in these strains, and record them in a CSV file that you'll pass to the training module. The coordinates don't have to be perfect, though the better they are, the better the resultant model will perform. This step will reduce the probability that DEPhT treats a prophage found in multiple strains as "conserved bacterial genes", and will also give the model an idea of what integrated prophages are supposed to look like, as opposed to only knowing what extracted phages/prophages look like.
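The reasoning behind this step can be sketched in a few lines of Python: once approximate prophage coordinates are known, any gene overlapping them can be set aside so that it is not counted among the host's conserved genes. This is only an illustration of the idea, not DEPhT's training code. The function names and the use of plain `(start, end)` tuples are assumptions.

```python
def overlaps(gene, region):
    """True if two half-open (start, end) intervals share at least one base."""
    return gene[0] < region[1] and region[0] < gene[1]


def split_genes(genes, prophages):
    """Partition genes into (bacterial, prophage) lists.

    A gene is assigned to the prophage list if it overlaps any of the
    approximate prophage regions; coordinates need not be exact.
    """
    bacterial, prophage = [], []
    for gene in genes:
        if any(overlaps(gene, region) for region in prophages):
            prophage.append(gene)
        else:
            bacterial.append(gene)
    return bacterial, prophage
```

Because the test is a simple overlap, slightly generous coordinates err on the side of excluding a few host genes rather than letting prophage genes through as "conserved bacterial genes".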

Retrieve Phage Genomes

Lastly, you'll need to retrieve functionally annotated phages from Genbank or elsewhere. Like the bacteria, it's important that these phages represent the spectrum of diversity of phages infecting hosts in the genus. Ideally there will also be clusters of at least somewhat-related phages in this dataset.

Training the new model

DEPhT models are composed of four main components, which can be built from curated phage and bacterial sequences. The only required arguments are:
  1. a name for the new model
  2. path to a directory containing functionally annotated phage genomes for the genus of interest
  3. path to a directory containing bacterial genomes for the genus of interest
If one or more of your bacterial genomes has one or more known (or probable) prophage(s) in it, you can provide a CSV file formatted in this way:
[Figure: csv_format]

Training a model consists of several computationally expensive steps, and as such the amount of time it takes to train a model is highly variable, but generally influenced in these ways:

Most new models will likely take somewhere between 15 minutes and an hour to train.


{% endblock %}