Modules

sm_analysis

This module contains the high-level functions necessary to run the ‘Single Molecule Analysis’ on an input BAM file.

pacbio_data_processing.sm_analysis.add_to_own_output(gffs, own_output_file_name, modification_types)[source]

From a set of .gff files, a csv file (delimiter=”,”) is saved with the following columns:

  • mol id: taken from the name of each gff file (e.g. ‘a.b.c.gff’ -> mol id: ‘b’)

  • modtype: column number 3 (idx: 2) of the gffs (feature type)

  • GATC position: column number 5 (idx: 4) of the gffs, which corresponds to the ‘end coordinate of the feature’ in the GFF3 standard

  • score of the feature: column number 6 (idx: 5); floating point (Phred-transformed pvalue that a kinetic deviation exists at this position)

  • strand: strand of the feature. It can be +, - with obvious meanings. It can also be ? (meaning unknown) or . (for non stranded features)

There are more columns, but they are not fixed in number. They correspond to the values given in the ‘attributes’ column of the gffs (col 9, idx 8). For example, given the following attributes column:

coverage=134;context=TCA...;IPDRatio=3.91;identificationQv=228

we would get the following ‘extra’ columns:

134,TCA...,3.91,228

and this is exactly what happens with the m6A modification type.

All the lines starting with ‘#’ in the gff files are ignored. The format of the gff file is GFF3: https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md
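The column mapping described above can be sketched as follows. This is only an illustration, not the module's actual implementation: the helper names `mol_id_from_name` and `gff_line_to_row` are hypothetical, and the rule "mol id is the second dot-separated piece of the file name" is an assumption inferred from the ‘a.b.c.gff’ example.

```python
from pathlib import Path


def mol_id_from_name(gff_name):
    """Hypothetical helper: 'a.b.c.gff' -> 'b'.

    Assumption: the mol id is the second dot-separated piece of the name.
    """
    return Path(gff_name).name.split(".")[1]


def gff_line_to_row(mol_id, line):
    """Hypothetical helper: map one GFF3 line to a list of CSV fields.

    Returns None for comment lines (starting with '#') and blank lines.
    """
    if line.startswith("#") or not line.strip():
        return None
    cols = line.rstrip("\n").split("\t")
    # GFF3 columns (0-based): 2=type, 4=end, 5=score, 6=strand, 8=attributes
    modtype, end, score, strand, attributes = (
        cols[2], cols[4], cols[5], cols[6], cols[8]
    )
    # Keep only the values of the 'key=value' pairs, in their original order.
    extra = [pair.split("=", 1)[1] for pair in attributes.split(";") if "=" in pair]
    return [mol_id, modtype, end, score, strand, *extra]


# Made-up example line, for illustration only.
example = (
    "chr1\tkinModCall\tm6A\t100\t100\t228\t+\t.\t"
    "coverage=134;context=TCA;IPDRatio=3.91;identificationQv=228"
)
row = gff_line_to_row(mol_id_from_name("a.b.c.gff"), example)
```

Each resulting row could then be written with `csv.writer(f, delimiter=",")`; because the number of attribute values varies, rows may have different lengths.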

The value of identificationQv is a Phred-transformed probability of having a detection. See eq. (8) in [1].

[1]: “Detection and Identification of Base Modifications with Single Molecule Real-Time Sequencing Data”
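For orientation, the conventional Phred transform maps a probability p to Q = -10·log10(p). The sketch below shows that standard convention only; the function names are hypothetical, and the exact definition used for the score and identificationQv fields is the one given in [1], which may differ in detail.

```python
import math


def phred(p):
    """Standard Phred transform of a probability p in (0, 1]."""
    return -10 * math.log10(p)


def phred_to_prob(q):
    """Inverse transform: Phred value back to a probability."""
    return 10 ** (-q / 10)
```

Under this convention a value of 30 corresponds to a probability of 0.001, and each additional 10 units divides the probability by ten.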

pacbio_data_processing.sm_analysis.generate_CCS_file(in_bam: Union[pathlib.Path, str], ccs_bam_file: Optional[Union[pathlib.Path, str]])[source]

Idempotent computation of the Circular Consensus Sequence (CCS) version of the passed in in_bam file done with pacbio_data_processing.ccs.ccs().

pacbio_data_processing.sm_analysis.generate_aligned_CCS_file(ccs_bam_file: Union[pathlib.Path, str], aligned_ccs_bam_file: Optional[Union[pathlib.Path, str]], fasta: Union[pathlib.Path, str], nprocs: Union[int, str], blasr: pacbio_data_processing.blasr.Blasr, variant: str = 'straight') -> Optional[pathlib.Path][source]

It calls the blasr program to align the ccs_bam_file. If aligned_ccs_bam_file is given, it is used as the output filename. If not given, a name is generated using a prefix based on the variant parameter followed by ccs_bam_file:

  • variant='straight' (default) -> blasr. prefix

  • variant='pi-shifted' -> pi-shifted.blasr. prefix
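The default naming rule above might look like the following sketch. The helper `default_aligned_name` and the `PREFIXES` mapping are hypothetical, and it is an assumption that the prefix is applied to the file name only, leaving the directory unchanged.

```python
from pathlib import Path

# Prefixes taken from the list above, keyed by variant.
PREFIXES = {"straight": "blasr.", "pi-shifted": "pi-shifted.blasr."}


def default_aligned_name(ccs_bam_file, variant="straight"):
    """Hypothetical helper: build the default aligned CCS file name.

    Assumption: the prefix is prepended to the file name, not the full path.
    """
    ccs = Path(ccs_bam_file)
    return ccs.with_name(PREFIXES[variant] + ccs.name)
```

For instance, under these assumptions `default_aligned_name("data/ccs.bam", "pi-shifted")` gives `data/pi-shifted.blasr.ccs.bam`.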

Returns

The path to the aligned CCS BAM file for the requested variant, if it could be computed, or None if it could not. That is decided by the blasr object. An aligned BAM file cannot be computed by blasr if it is being produced by a parallel computation. (See pacbio_data_processing.blasr.Blasr for details.)

pacbio_data_processing.sm_analysis.main_cl()[source]

Entry point for sm-analysis executable.

pacbio_data_processing.sm_analysis.map_molecules_with_highest_sim_ratio(bam_file_name: Optional[Union[pathlib.Path, str]]) -> dict[int, pacbio_data_processing.bam_utils.Molecule][source]

Given the path to a BAM file, it returns a dictionary whose keys are mol ids (ints) and whose values are the corresponding Molecules. If several lines in the BAM file share a mol id, only the line with the highest similarity ratio (computed from the CIGAR) is kept. Ties are resolved by order: if several lines share both the mol id and the highest similarity ratio (say, 1), ONLY the first one found is taken, irrespective of other factors.
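The selection rule can be illustrated with a minimal sketch. It works on plain `(mol_id, sim_ratio, payload)` tuples rather than on real BAM lines or `Molecule` objects, and the function name `pick_best_per_molecule` is hypothetical; how the similarity ratio is derived from the CIGAR is not shown here.

```python
def pick_best_per_molecule(records):
    """Keep, per mol id, the first record with the highest similarity ratio.

    `records` is an iterable of (mol_id, sim_ratio, payload) tuples, a
    stand-in for the lines of a BAM file.
    """
    best = {}
    for mol_id, ratio, payload in records:
        # Strict '>' means a later record with an equal ratio never
        # replaces the one already stored, so the first one wins ties.
        if mol_id not in best or ratio > best[mol_id][0]:
            best[mol_id] = (ratio, payload)
    return {mol_id: payload for mol_id, (ratio, payload) in best.items()}


records = [
    (7, 0.90, "line-1"),
    (7, 1.00, "line-2"),
    (7, 1.00, "line-3"),  # same ratio as line-2, but found later
    (9, 0.50, "line-4"),
]
selected = pick_best_per_molecule(records)
```

Here molecule 7 maps to "line-2": it reaches the highest ratio first, and "line-3" is ignored despite having the same ratio.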