Modules
sm_analysis
This module contains the high level functions necessary to run the ‘Single Molecule Analysis’ on an input BAM file.
- pacbio_data_processing.sm_analysis.add_to_own_output(gffs, own_output_file_name, modification_types)
From a set of .gff files, a csv file (delimiter=",") is saved with the following columns:
mol id: taken from the name of each gff file (e.g. ‘a.b.c.gff’ -> mol id: ‘b’)
modtype: column number 3 (idx: 2) of the gffs (feature type)
GATC position: column number 5 (idx: 4) of the gffs, which corresponds to the ‘end coordinate of the feature’ in the GFF3 standard
score of the feature: column number 6 (idx: 5); floating point (Phred-transformed pvalue that a kinetic deviation exists at this position)
strand: strand of the feature. It can be + or -, with the obvious meanings. It can also be ? (meaning unknown) or . (for non-stranded features)
There are more columns, but they are not fixed in number. They correspond to the values given in the ‘attributes’ column of the gffs (col 9, idx 8). For example, given the following attributes column:
coverage=134;context=TCA...;IPDRatio=3.91;identificationQv=228
we would get the following ‘extra’ columns:
134,TCA...,3.91,228
and this is exactly what happens with the m6A modification type.
All the lines starting with ‘#’ in the gff files are ignored. The format of the gff file is GFF3: https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md
The value of identificationQv is a Phred-transformed probability of having a detection. See eq. (8) in [1].
[1]: “Detection and Identification of Base Modifications with Single Molecule Real-Time Sequencing Data”
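The column mapping described above can be sketched as follows. This is only an illustration, not the code of add_to_own_output: the helper names (mol_id_from_name, gff_line_to_row, gff_text_to_csv) are hypothetical, the gff content is passed in as text rather than read from disk, the strand is assumed to come from column 7 (idx 6) as in the GFF3 standard, and filtering rows by modification_types is an assumption based on the parameter name.

```python
import csv
import io


def mol_id_from_name(gff_name):
    # Hypothetical helper: 'a.b.c.gff' -> 'b' (second dot-separated piece).
    return gff_name.split(".")[1]


def gff_line_to_row(mol_id, line):
    # Split one tab-separated GFF3 feature line into its 9 columns.
    fields = line.rstrip("\n").split("\t")
    modtype = fields[2]   # column 3 (idx 2): feature type
    position = fields[4]  # column 5 (idx 4): end coordinate of the feature
    score = fields[5]     # column 6 (idx 5): score
    strand = fields[6]    # column 7 (idx 6): strand (GFF3 standard)
    # Attributes look like "key=value;key=value"; keep the values, in order.
    extras = [
        item.split("=", 1)[1]
        for item in fields[8].split(";")
        if "=" in item
    ]
    return [mol_id, modtype, position, score, strand, *extras]


def gff_text_to_csv(gff_name, gff_text, modification_types):
    # Assumption: only features whose type is in modification_types are kept.
    out = io.StringIO()
    writer = csv.writer(out, delimiter=",", lineterminator="\n")
    mol_id = mol_id_from_name(gff_name)
    for line in gff_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # blank lines and '#' lines are ignored
        row = gff_line_to_row(mol_id, line)
        if row[1] in modification_types:
            writer.writerow(row)
    return out.getvalue()
```

Because the extra columns are taken from the attributes in the order they appear, their number depends on the modification type, which matches the remark that they are not fixed in number.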
- pacbio_data_processing.sm_analysis.generate_CCS_file(in_bam: Union[pathlib.Path, str], ccs_bam_file: Optional[Union[pathlib.Path, str]])
Idempotent computation of the Circular Consensus Sequence (CCS) version of the passed in in_bam file, done with pacbio_data_processing.ccs.ccs().
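One common way to make a file-producing step idempotent is to skip the work when the output already exists. The sketch below shows that pattern only. Whether generate_CCS_file uses this exact check is not stated here, and run_once and its compute argument are hypothetical names introduced for illustration.

```python
from pathlib import Path


def run_once(output_path, compute):
    """Call compute(output_path) only if output_path does not exist yet.

    Returns True if the computation ran, False if it was skipped.
    """
    output_path = Path(output_path)
    if output_path.exists():
        # Assumption: an existing file means the result is already there.
        return False
    compute(output_path)
    return True
```

Calling run_once twice with the same output path runs the computation at most once, so repeated calls leave the file system in the same state as a single call.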
- pacbio_data_processing.sm_analysis.generate_aligned_CCS_file(ccs_bam_file: Union[pathlib.Path, str], aligned_ccs_bam_file: Optional[Union[pathlib.Path, str]], fasta: Union[pathlib.Path, str], nprocs: Union[int, str], blasr: pacbio_data_processing.blasr.Blasr, variant: str = 'straight') -> Optional[pathlib.Path]
It calls the blasr program to align the ccs_bam_file. If aligned_ccs_bam_file is given, it is used as the output filename. If not given, a name is generated using a prefix based on the variant parameter followed by ccs_bam_file:
variant='straight' (default) -> blasr. prefix
variant='pi-shifted' -> pi-shifted.blasr. prefix
- Returns
The path to the aligned CCS BAM file for the requested variant, if it could be computed, or None if it could not. That is decided by the blasr object. An aligned BAM file cannot be computed by blasr if it is being produced by a parallel computation. (See pacbio_data_processing.blasr.Blasr for details.)
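The default naming rule for the aligned file can be illustrated with a short sketch. It assumes that the prefix is prepended to the file name of ccs_bam_file and that the result stays in the same directory. The names PREFIXES and default_aligned_name are hypothetical and are not part of the module.

```python
from pathlib import Path

# Prefixes listed in the description of the variant parameter.
PREFIXES = {
    "straight": "blasr.",
    "pi-shifted": "pi-shifted.blasr.",
}


def default_aligned_name(ccs_bam_file, variant="straight"):
    # Assumption: prefix + original file name, in the same directory.
    ccs_bam_file = Path(ccs_bam_file)
    return ccs_bam_file.with_name(PREFIXES[variant] + ccs_bam_file.name)
```

Under these assumptions, an input named ccs.m1.bam would map to blasr.ccs.m1.bam for the default variant and to pi-shifted.blasr.ccs.m1.bam for variant='pi-shifted'.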
- pacbio_data_processing.sm_analysis.map_molecules_with_highest_sim_ratio(bam_file_name: Optional[Union[pathlib.Path, str]]) -> dict[int, pacbio_data_processing.bam_utils.Molecule]
Given the path to a BAM file, it returns a dictionary whose keys are mol ids (ints) and whose values are the corresponding Molecules. If multiple lines in the given BAM file share a mol id, only the first line found with the highest similarity ratio (computed from the CIGAR) is chosen: if multiple lines share the molecule ID and the highest similarity ratio (say, 1), ONLY the first one is taken, irrespective of other factors.
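The selection rule (highest similarity ratio wins, and the first line wins on ties) can be sketched independently of BAM parsing. In the sketch below the similarity ratios are assumed to be already computed, each record is a plain (mol_id, sim_ratio, payload) tuple in file order, and first_with_highest_ratio is a hypothetical name. The real function works with Molecule objects instead.

```python
def first_with_highest_ratio(records):
    """Pick, per mol id, the first record with the highest ratio.

    records: iterable of (mol_id, sim_ratio, payload) tuples in file order.
    """
    best = {}
    for mol_id, ratio, payload in records:
        # The strict '>' keeps the earliest record when ratios are equal.
        if mol_id not in best or ratio > best[mol_id][0]:
            best[mol_id] = (ratio, payload)
    return {mol_id: payload for mol_id, (_, payload) in best.items()}
```

Using a strict comparison is what makes ties resolve in favour of the first line found: a later record with the same ratio never replaces the one already stored.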