What is the PacBio sequencing technology?
=========================================

.. warning::

   This document, or parts of it, might be obsolete. It will be
   reviewed and updated shortly.


PacBio sequencing technology measures DNA polymerization kinetics
during the sequencing process. When the DNA carries a modification
such as methylation, the PacBio sequencer records a change in the
polymerization kinetics. This relationship between modified DNA and the
change in polymerization time has allowed the study of methylation in
many organisms.
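
As a minimal sketch of this idea, the snippet below compares the mean
observed inter-pulse duration (IPD) at one position with an expected
value for unmodified DNA. The function name, the example values and the
use of a single expected IPD are illustrative assumptions; this is not
the model used by PacBio tools.

.. code-block:: python

   from statistics import mean


   def ipd_ratio(observed_ipds, expected_ipd):
       """Return the mean observed IPD divided by the expected IPD."""
       if expected_ipd <= 0:
           raise ValueError("expected_ipd must be positive")
       return mean(observed_ipds) / expected_ipd


   # Hypothetical IPD values (arbitrary units) from three subreads
   # covering the same position.
   ratio = ipd_ratio([2.0, 4.0, 3.0], expected_ipd=1.5)
   print(ratio)

A ratio well above 1 means the polymerase paused longer than expected at
that position, which is the kind of signal associated with modified
bases.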

PacBio protocol
---------------

1. DNA extraction

2. DNA processing

3. DNA sequencing

4. Data output


How is the PacBio sequencer output processed?
---------------------------------------------

The first version of the PacBio sequencer (PacBio RS) produced a single
output file named bas.h5. The following version (PacBio RS II) produced
four output files: three bax.h5 files and one bas.h5 file. The latest
generation of PacBio sequencers produces a single output file with a
.bam extension (`binary alignment
map <https://samtools.github.io/hts-specs/SAMv1.pdf>`__). The
bas.h5/bax.h5 output formats can be converted to .bam format using
bioinformatic tools that can be installed through
`PacBio-Bioconda <https://github.com/PacificBiosciences/pbbioconda>`__,
which offers several tools that are useful during the primary and
secondary analysis of the data generated by the PacBio sequencer.
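
The conversion step could be scripted roughly as follows. This sketch
only builds the command line and does not run it. It assumes that
``bax2bam`` takes the input files as positional arguments and an output
prefix through ``-o``; check ``bax2bam --help`` for your installed
version. The file names are hypothetical.

.. code-block:: python

   import shlex


   def bax2bam_command(bax_files, output_prefix):
       """Build an argument list for converting bax.h5 files to BAM."""
       if not bax_files:
           raise ValueError("at least one bax.h5 file is required")
       # Assumed layout: bax2bam <inputs...> -o <prefix>
       return ["bax2bam", *bax_files, "-o", output_prefix]


   cmd = bax2bam_command(
       ["movie.1.bax.h5", "movie.2.bax.h5", "movie.3.bax.h5"], "movie"
   )
   # Print the command as it would be typed in a shell.
   print(shlex.join(cmd))

The resulting list can be passed to ``subprocess.run(cmd, check=True)``
once the arguments have been verified.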

Primary and Secondary analysis
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

After sequencing, the generated PacBio output is processed in two steps.

* A **primary analysis** is performed by the instrument to obtain
  information about the sequenced DNA, such as IPD (inter-pulse
  duration) values and sequencing quality.
* A **secondary analysis** covers steps such as alignment and circular
  consensus sequence generation, some of which are included in the
  sm-analysis software.

More information can be found in the `PacBio glossary of terms <https://www.pacb.com/wp-content/uploads/2015/09/Pacific-Biosciences-Glossary-of-Terms.pdf>`__.

PacBio Tools
^^^^^^^^^^^^

PacBio provides the following analysis tools:

* **SMRT**: must be installed on a server
* **PacBio Bioconda**: can be installed on a server or on a personal
  computer

`PacBio-Bioconda <https://github.com/PacificBiosciences/pbbioconda>`__
tools are installed in a virtual environment. The package includes
several tools; we describe some of them here:

-  **bax2bam** -
   `bax2bam <https://github.com/pacificbiosciences/bax2bam/>`__ converts
   PacBio files from older formats into a BAM file that PacBio tools
   can handle.
-  **blasr** - The official PacBio aligner, adapted for long sequencing
   reads. We compared it with other aligners such as BWA, Segemehl, and
   pbalign.
   `blasr <https://github.com/pacificbiosciences/blasr/>`__ and pbalign,
   both available in the PacBio Bioconda tools, gave the best mapping.
   We chose blasr for the analyses because it is the only one whose
   output includes the IPD columns needed to detect DNA modifications.
-  **pbmm2** - `pbmm2 <https://github.com/PacificBiosciences/pbmm2/>`__
   is an aligner suggested as a substitute for blasr. In our evaluation
   it was faster at alignment, but the total number of aligned subreads
   did not differ much. Its output was not sorted by molecule, so it has
   been discarded for the time being.
-  **CCS** - This tool generates the Circular Consensus Sequence
   (`CCS <https://ccs.how/>`__) by combining multiple subreads.

Modification detection tools
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

There are several `Base Modification
Tools <https://github.com/PacificBiosciences/Bioinformatics-Training/wiki/Base-Modification-Tools>`__
related to the detection of modified bases in DNA from files generated
by PacBio sequencing.

====================== ====================
Base Modification Tool Programming language
====================== ====================
**R-kinetics**         R
**kineticsTools**      Python
**MotifMaker**         Java
**MotifFinder**        SMRT, R
====================== ====================

kineticsTools
^^^^^^^^^^^^^

For the analyses we have decided to use
`kineticsTools <https://github.com/PacificBiosciences/kineticsTools/blob/master/doc/manual.rst>`__
for two reasons:

1. It is written in Python, the same language used by PacBio Bioconda
   and by sm-analysis.

2. It has a tool called ipdSummary that allows us to predict DNA
   modifications without the need for a control sample.


ipdSummary
^^^^^^^^^^

ipdSummary predicts sequence modifications using a computational
(in-silico) model. It detects modifications at the nucleotide level in
sequences that present m5C, m4C or m6A methylation; m6A is the
modification it detects best.

ipdSummary has its own filters:

-  **Mapping Quality** - By default the minimum mapping quality required
   is 10, which implies that BLASR is 90% confident that the read is
   mapped correctly. However, we also find many subreads that map to
   more than one position, sometimes at a great distance from each
   other.
-  **Number of subreads per molecule** - ipdSummary is effective on
   molecules that have at least 20 mapped subreads.
-  **Length of the subreads** - The PacBio output file can contain
   subreads of different sizes, ranging from fewer than 50 bases to
   thousands of bases.
-  **Multi-mapping** - Some molecules contain subreads with different
   mapping positions, which lowers the confidence of the modification
   predicted at a position. In some cases, multi-mapping occurs in the
   region that spans position 0 of the reference sequence.
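
The link between a mapping quality of 10 and 90% confidence follows the
Phred-scaled convention of the SAM specification, where the mapping
quality is -10 log10 of the probability that the mapping is wrong. The
helper names below are illustrative, and the sketch assumes the aligner
reports qualities on this scale.

.. code-block:: python

   import math


   def mapq_to_confidence(mapq):
       """Convert a Phred-scaled mapping quality to a confidence."""
       return 1.0 - 10.0 ** (-mapq / 10.0)


   def confidence_to_mapq(confidence):
       """Convert a confidence in (0, 1) to a Phred-scaled quality."""
       return -10.0 * math.log10(1.0 - confidence)


   # A mapping quality of 10 corresponds to a 10% error probability.
   print(round(mapq_to_confidence(10), 3))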


sm-analysis
-----------

sm-analysis is based on different tools from
`PacBio-Bioconda <https://github.com/PacificBiosciences/pbbioconda>`__
that allow us to analyze the information coming from PacBio sequencers.
Unlike other approaches, sm-analysis can analyze every PacBio bell (a
DNA molecule with adapters) separately to obtain its methylation status,
its circular consensus sequence, and its start position, end position
and GATC sites (if it has any) relative to the reference sequence.
sm-analysis uses
`ipdSummary <https://github.com/PacificBiosciences/kineticsTools/blob/master/doc/manual.rst>`__,
which is a pre-trained algorithm able to detect m6A methylation using
the Inter-Pulse Duration (IPD) values.
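
The per-molecule bookkeeping can be sketched as follows. This is an
illustration, not the sm-analysis implementation. It assumes subread
names follow the PacBio BAM convention
``{movieName}/{holeNumber}/{qStart}_{qEnd}``; the function names and the
example data are hypothetical.

.. code-block:: python

   from collections import defaultdict


   def parse_subread_name(name):
       """Split a subread name into movie, hole number, start and end."""
       movie, hole, span = name.split("/")
       start, end = span.split("_")
       return movie, int(hole), int(start), int(end)


   def group_by_molecule(names):
       """Group subread intervals by (movie, hole number)."""
       groups = defaultdict(list)
       for name in names:
           movie, hole, start, end = parse_subread_name(name)
           groups[(movie, hole)].append((start, end))
       return dict(groups)


   def gatc_positions(sequence):
       """Return the 0-based start position of every GATC occurrence."""
       seq = sequence.upper()
       positions = []
       index = seq.find("GATC")
       while index != -1:
           positions.append(index)
           index = seq.find("GATC", index + 1)
       return positions


   names = ["m1/7/0_100", "m1/7/150_260", "m1/9/0_80"]
   groups = group_by_molecule(names)
   print(groups[("m1", 7)])
   print(gatc_positions("ttGATCaaGATC"))

Each key of ``groups`` identifies one molecule, so downstream steps can
process the subreads of each molecule independently.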

PacBio chemistries undergo periodic updates, which can create data
compatibility conflicts during processing. To address this, sm-analysis
lets the user choose the chemistry to be used in the analysis.

We have also developed a bam-filter step that includes options to filter
our data. The following options can be used to filter the aligned
subreads.bam file:

+-----------------------------------+-----------------------------------+
| Option                            | Description                       |
+===================================+===================================+
| **-l 50 -q 254 -m 0 16 -R 0.9**   | Minimum subread length of 50      |
|                                   | bases; keep only high-quality     |
|                                   | mappings; keep molecules with 90% |
|                                   | unique mapping                    |
+-----------------------------------+-----------------------------------+
| **-m 0 16**                       | Keeps only uniquely mapped        |
|                                   | subreads                          |
+-----------------------------------+-----------------------------------+
| **-r 20**                         | Only molecules with at least 20   |
|                                   | subreads                          |
+-----------------------------------+-----------------------------------+

\*The filters in the table above apply to a subreads.bam file aligned
with `blasr <https://github.com/pacificbiosciences/blasr/>`__.
When using a different aligner such as
`pbmm2 <https://github.com/PacificBiosciences/pbmm2/>`__, you should
change the parameters for mapped subreads and mapping quality.
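
The effect of these filters can be sketched on plain records, as below.
This is an illustration, not the bam-filter implementation. The
``Subread`` record and the function name are hypothetical, and applying
the per-molecule count after the other filters is an assumption. The
``-R`` ratio is left out because its exact definition is not given here.
In the SAM format, flag 0 is a primary forward-strand alignment and
flag 16 is a primary reverse-strand alignment.

.. code-block:: python

   from collections import Counter
   from dataclasses import dataclass


   @dataclass(frozen=True)
   class Subread:
       molecule: int  # identifier of the molecule the subread comes from
       length: int    # subread length in bases
       mapq: int      # mapping quality reported by the aligner
       flag: int      # SAM flag of the alignment


   def filter_subreads(subreads, min_length=50, min_mapq=254,
                       flags=(0, 16), min_subreads=20):
       """Keep subreads passing the -l, -q, -m and -r style filters."""
       kept = [
           s for s in subreads
           if s.length >= min_length
           and s.mapq >= min_mapq
           and s.flag in flags
       ]
       # Count surviving subreads per molecule, then drop molecules
       # with too few of them.
       counts = Counter(s.molecule for s in kept)
       return [s for s in kept if counts[s.molecule] >= min_subreads]


   reads = [
       Subread(1, 120, 254, 0),
       Subread(1, 300, 254, 16),
       Subread(1, 40, 254, 0),   # too short
       Subread(2, 500, 30, 0),   # mapping quality too low
   ]
   kept = filter_subreads(reads, min_subreads=2)
   print(len(kept))

The example lowers ``min_subreads`` to 2 only to keep the toy data
small; the table above uses 20.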

Pipeline
--------

TBD

Usage
-----

-  Input
-  Execution
-  Output description
