pacbio_data_processing package¶
Subpackages¶
Submodules¶
pacbio_data_processing.bam module¶
- class pacbio_data_processing.bam.BamFile(bam_file_name, mode='r')[source]¶
Bases:
object
Proxy class for _BamFileSamtools and _BamFilePysam. This is a high level class whose only roles are to choose between _ReadableBamFile and _WritableBamFile and to select the underlying implementation to interact with the BAM file:
- _BamFileSamtools: implementation that simply wraps the 'samtools' command line, and
- _BamFilePysam: implementation that uses 'pysam'
pacbio_data_processing.bam_file_filter module¶
This module contains the high level functions necessary to apply some filters to a given input BAM file.
pacbio_data_processing.bam_utils module¶
Some helper functions to manipulate BAM files
- class pacbio_data_processing.bam_utils.CircularDNAPosition(pos: int, ref_len: int = 0)[source]¶
Bases:
object
A type that supports arithmetic with positions in a circular topology.
>>> p = CircularDNAPosition(5, ref_len=9)
The class has a decent repr:
>>> p
CircularDNAPosition(5, ref_len=9)
And we can use it in arithmetic contexts:
>>> p + 1
CircularDNAPosition(6, ref_len=9)
>>> int(p+1)
6
>>> int(p+5)
1
>>> int(20+p)
7
>>> p - 1
CircularDNAPosition(4, ref_len=9)
>>> int(p-6)
8
>>> int(p-16)
7
>>> int(2-p)
6
>>> int(8-p)
3
Also boolean equality is supported:
>>> p == CircularDNAPosition(5, ref_len=9)
True
>>> p == CircularDNAPosition(6, ref_len=9)
False
>>> p == CircularDNAPosition(14, ref_len=9)
True
>>> p == CircularDNAPosition(5, ref_len=8)
False
>>> p == 5
False
But also < is supported:
>>> p < p+1
True
>>> p < p
False
>>> p < p-1
False
Of course two instances cannot be compared if their underlying references are not equally long:
>>> s = CircularDNAPosition(5, ref_len=10)
>>> p < s
Traceback (most recent call last):
    ...
ValueError: cannot compare positions if topologies differ
or if they are not both CircularDNAPosition instances:
>>> s < 6
Traceback (most recent call last):
    ...
TypeError: '<' not supported between instances of 'CircularDNAPosition' and 'int'
The class has a convenience method:
>>> p.as_1base()
6
If the ref_len input parameter is less than or equal to 0, the topology is assumed to be linear:
>>> q = CircularDNAPosition(5, ref_len=-1)
>>> q
CircularDNAPosition(5, ref_len=0)
>>> q + 1001
CircularDNAPosition(1006, ref_len=0)
>>> q - 100
CircularDNAPosition(-95, ref_len=0)
>>> int(10-q)
5
Linear topology is the default behaviour:
>>> r = CircularDNAPosition(5)
>>> r
CircularDNAPosition(5, ref_len=0)
It is possible to use them as indices in slices:
>>> seq = "ABCDEFGHIJ"
>>> seq[r:r+2]
'FG'
And CircularDNAPosition instances can be hashed (so that they can be elements of a set or keys in a dictionary):
>>> positions = {p, q, r}
And, very conveniently, a CircularDNAPosition converts to str as ints do:
>>> str(r) == '5'
True
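The wrap-around results in the doctests above are what the modulo operator gives for a positive reference length. The following is only a minimal sketch of that arithmetic; `wrap` is a hypothetical helper, not part of the package, and it treats a non-positive `ref_len` as a linear topology, as the text describes.

```python
def wrap(pos: int, ref_len: int = 0) -> int:
    """Map an integer position onto a circle of length ``ref_len``.

    Hypothetical helper: a non-positive ``ref_len`` means linear topology,
    in which case the position is returned unchanged.
    """
    if ref_len <= 0:
        return pos
    # For a positive ref_len, Python's % always returns a value in
    # [0, ref_len), so negative positions wrap around as well.
    return pos % ref_len


ref_len = 9
p = 5
results = {
    "p+5": wrap(p + 5, ref_len),
    "20+p": wrap(20 + p, ref_len),
    "p-6": wrap(p - 6, ref_len),
    "p-16": wrap(p - 16, ref_len),
    "2-p": wrap(2 - p, ref_len),
    "linear": wrap(p + 1001, 0),
}
print(results)
```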
- class pacbio_data_processing.bam_utils.Molecule(id: int, src_bam_path: Optional[Union[str, pathlib.Path]] = None, _best_ccs_line: Optional[tuple[bytes]] = None)[source]¶
Bases:
object
Abstraction around a single molecule from a Bam file
- __init__(id: int, src_bam_path: Optional[Union[str, pathlib.Path]] = None, _best_ccs_line: Optional[tuple[bytes]] = None) None ¶
- property ascii_quals: str¶
Ascii qualities of sequencing the molecule. Each symbol refers to one base.
- property cigar: pacbio_data_processing.cigar.Cigar¶
- property dna: str¶
- property end: pacbio_data_processing.bam_utils.CircularDNAPosition¶
Computes the end of a molecule as CircularDNAPosition(start+length of reference), which takes into account the possible circular topology of the reference.
- find_gatc_positions() list[pacbio_data_processing.bam_utils.CircularDNAPosition] [source]¶
The function returns the position of all the GATCs found in the Molecule’s sequence, taking into account the topology of the reference.
The return value is the 0-based index of the GATC motif, i.e., the index of the G in the Python convention.
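One way to picture this is to locate each GATC in the molecule's sequence and translate the index of its G into reference coordinates. The sketch below is an illustration under assumptions, not the package's implementation: `gatc_positions`, its `start` argument (reference position of the molecule's first base) and the plain-`int` return values are all hypothetical.

```python
def gatc_positions(seq: str, start: int, ref_len: int = 0) -> list[int]:
    """Return 0-based reference positions of the G of every GATC in ``seq``.

    Assumptions (not taken from the library): ``seq`` is the molecule's
    sequence, ``start`` is the reference position of its first base, and a
    non-positive ``ref_len`` means a linear reference.
    """
    positions = []
    idx = seq.find("GATC")
    while idx != -1:
        pos = start + idx
        if ref_len > 0:
            pos %= ref_len  # wrap around the origin of a circular reference
        positions.append(pos)
        idx = seq.find("GATC", idx + 1)
    return positions


# A molecule starting 5 bases before the origin of a 100-base circular reference.
print(gatc_positions("AAGATCTTGATC", start=95, ref_len=100))
```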
- id: int¶
- is_crossing_origin(*, ori_pi_shifted=False) bool [source]¶
This method answers the question of whether the molecule crosses the origin, assuming a circular topology of the chromosome. The answer is True if the last base of the molecule is located before the first base. Otherwise the answer is False. It will return False if the molecule starts at the origin; but it will be True if it ends at the origin. There is an optional keyword-only boolean parameter, namely ori_pi_shifted, to indicate that the reference has been shifted by pi radians, or not.
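The rule described here (last base located before the first base) can be sketched with plain integers. This is a hedged illustration only: `crosses_origin` is a hypothetical function, it ignores the `ori_pi_shifted` option, and it assumes the molecule is not longer than the reference.

```python
def crosses_origin(start: int, length: int, ref_len: int) -> bool:
    """True if the last base lies before the first base on a circular reference.

    Hypothetical sketch: ``start`` is the 0-based position of the first base,
    ``length`` the number of bases, ``ref_len`` the length of the reference.
    """
    if ref_len <= 0 or length <= 0:
        return False  # linear topology or empty molecule: nothing to cross
    first = start % ref_len
    last = (start + length - 1) % ref_len
    return last < first


print(crosses_origin(0, 10, 100))   # starts at the origin
print(crosses_origin(95, 6, 100))   # last base sits at the origin
```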
- pi_shift_back() None [source]¶
Method that shifts back the (start, end) positions of the molecule assuming that they were shifted before by pi radians.
- src_bam_path: Optional[Union[str, pathlib.Path]] = None¶
- property start: pacbio_data_processing.bam_utils.CircularDNAPosition¶
Readable/Writable attribute. It was originally only readable but the SingleMoleculeAnalysis class relies on it being writable to make easier the shift back of pi-shifted positions, that are computed from this attribute. The logic is: by default, the value is taken from the _best_ccs_line attribute, until it is modified, in which case the value is simply stored and returned upon request.
- pacbio_data_processing.bam_utils.count_subreads_per_molecule(bam: pacbio_data_processing.bam.BamFile) collections.defaultdict[int, collections.Counter] [source]¶
Given a read-open BamFile instance, it returns a defaultdict with keys being molecule ids (str) and values, a counter with subreads classified by strand. The possible keys of the returned counter are: +, -, ? meaning direct strand, reverse strand and unknown, respectively.
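The shape of the returned object, a `defaultdict` whose values are `Counter` instances keyed by strand, can be sketched as follows. The input format used here, an iterable of `(molecule id, strand)` pairs, is an assumption made for illustration; the real function reads these from a `BamFile`.

```python
from collections import Counter, defaultdict


def count_by_molecule(records):
    """Count subreads per molecule, classified by strand.

    ``records`` is a hypothetical iterable of (mol_id, strand) pairs; any
    strand other than '+' or '-' is counted as unknown ('?').
    """
    counts = defaultdict(Counter)
    for mol_id, strand in records:
        if strand not in ("+", "-"):
            strand = "?"
        counts[mol_id][strand] += 1
    return counts


demo = count_by_molecule([(7, "+"), (7, "-"), (7, "+"), (9, None)])
print(dict(demo))
```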
- pacbio_data_processing.bam_utils.gen_index_single_molecule_bams(molecules: collections.abc.Generator[tuple[int, pacbio_data_processing.bam_utils.Molecule], None, None], program: Union[str, pathlib.Path], skip_if_present: bool = False) collections.abc.Generator[tuple[int, pacbio_data_processing.bam_utils.Molecule], None, None] [source]¶
It generates indices using program (the path to pbindex) in this way:
pbindex blasr.pMA683.subreads.bam
The generator yields the original MoleculeWorkUnit.
Note for developers: Maybe it should check for errors and report them (since we are using an external tool) and not yield the molecule if an error happens.
- pacbio_data_processing.bam_utils.join_gffs(work_units: collections.abc.Generator[tuple[int, pacbio_data_processing.bam_utils.Molecule], None, None], out_file_path: Union[str, pathlib.Path]) collections.abc.Generator[pathlib.Path, None, None] [source]¶
The gff files related to the molecules provided in the input are read and joined in a single file. The individual gff files are yielded back.
Probably this function is useless and should be removed in the future: it only provides a joint gff file that is not a valid gff file and that is never used in the rest of the processing.
- pacbio_data_processing.bam_utils.split_bam_file_in_molecules(in_bam_file: Union[str, pathlib.Path], tempdir: Union[str, pathlib.Path], todo: dict[int, pacbio_data_processing.bam_utils.Molecule]) collections.abc.Generator[tuple[int, pacbio_data_processing.bam_utils.Molecule], None, None] [source]¶
All the individual molecules in the bam file path given, in_bam_file, that are found in todo, will be isolated and stored individually in the directory tempdir. The yielded Molecule instances will have their src_bam_path updated accordingly.
- pacbio_data_processing.bam_utils.subreads_per_molecule(lines: collections.abc.Iterable, header: bytes, file_name_prefix: pathlib.Path, todo: dict[int, pacbio_data_processing.bam_utils.Molecule]) collections.abc.Generator[tuple[int, pacbio_data_processing.bam_utils.Molecule], None, None] [source]¶
This generator yields 2-tuples of (mol-id, Molecule) after having isolated the subreads corresponding to that molecule id from the lines (coming from the iteration over a BamFile instance). Before yielding, a one-molecule BAM file is created.
- pacbio_data_processing.bam_utils.write_one_molecule_bam(buf: collections.abc.Iterable, header: bytes, in_file_name: pathlib.Path, suffix: Any) pathlib.Path [source]¶
Given a sequence of BAM lines, a header, the source name and a suffix, a new bamFile is created containing the data provided and with a suitable name.
pacbio_data_processing.blasr module¶
- class pacbio_data_processing.blasr.Blasr(path: Union[pathlib.Path, str])[source]¶
Bases:
object
An object to interact with the blasr aligner.
- __call__(in_bamfile: Union[pathlib.Path, str], fasta: Union[pathlib.Path, str], out_bamfile: Union[pathlib.Path, str], nprocs: int = 1) Optional[int] [source]¶
It runs the blasr executable, with the given parameters. The return code of the associated process is returned by this method if blasr could run at all, else None is returned.
One case where blasr cannot run is when the sentinel file is there before the blasr process is run.
pacbio_data_processing.ccs module¶
pacbio_data_processing.cigar module¶
This module provides basic ‘re-invented’ functionality to handle Cigars. A Cigar describes the differences between two sequences by providing a series of operations that one has to apply to one sequence to obtain the other one. For instance, given these two sequences:
sequence 1 (e.g. from the reference):
AAGTTCCGCAAATT
and
sequence 2 (e.g. from the aligner):
AAGCTCCCGCAATT
The Cigar that brings us from sequence 1 to sequence 2 is:
3=1X3=1I4=1D2=
where the numbers refer to the amount of letters and the symbols’ meaning can be found in the table below. Therefore the Cigar in the example is a shorthand for:
3 equal bases followed by 1 replacement followed by 3 equal bases followed by 1 insertion followed by 4 equal bases followed by 1 deletion followed by 2 equal bases
symbol | meaning |
---|---|
= | equal |
I | insertion |
D | deletion |
X | replacement |
S | soft clip |
H | hard clip |
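The Cigar from the example can be split into (count, symbol) pairs with a regular expression, and the counts can be checked against the two 14-letter sequences. This is a minimal sketch with hypothetical helper names (`parse_cigar`, `consumed_lengths`), not the `Cigar` class itself; clips (S, H) are parsed but left out of the length check.

```python
import re

CIGAR_ITEM = re.compile(r"(\d+)([=IDXSH])")


def parse_cigar(cigar: str) -> list[tuple[int, str]]:
    """Split a Cigar string into (count, symbol) pairs."""
    return [(int(count), symbol) for count, symbol in CIGAR_ITEM.findall(cigar)]


def consumed_lengths(cigar: str) -> tuple[int, int]:
    """Return the number of letters used from sequence 1 and sequence 2.

    '=' and 'X' use one letter of each sequence, 'D' only a letter of
    sequence 1 and 'I' only a letter of sequence 2.
    """
    items = parse_cigar(cigar)
    len_seq1 = sum(n for n, symbol in items if symbol in "=XD")
    len_seq2 = sum(n for n, symbol in items if symbol in "=XI")
    return len_seq1, len_seq2


example = "3=1X3=1I4=1D2="
print(parse_cigar(example))
print(consumed_lengths(example), len("AAGTTCCGCAAATT"), len("AAGCTCCCGCAATT"))
```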
- class pacbio_data_processing.cigar.Cigar(incigar)[source]¶
Bases:
object
- property diff_ratio¶
difference ratio: 1 means that each base is different; 0 means that all the bases are equal.
- property number_diff_items¶
- property number_diff_types¶
- property number_pb_diffs¶
- property number_pbs¶
- property sim_ratio¶
similarity ratio: 1 means that all the bases are equal; 0 means that each base is different.
This is computed from diff_ratio().
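The text does not spell out how the ratios are computed, so the following is only one plausible reading: every letter covered by a symbol other than `=` counts as different, and the similarity ratio is the complement of the difference ratio. The function names and the treatment of an empty Cigar are assumptions.

```python
import re


def diff_ratio(cigar: str) -> float:
    """Fraction of letters in the Cigar that are not marked as equal ('=')."""
    items = [(int(n), symbol) for n, symbol in re.findall(r"(\d+)([=IDXSH])", cigar)]
    total = sum(n for n, _ in items)
    if total == 0:
        return 0.0  # assumption: an empty Cigar carries no differences
    different = sum(n for n, symbol in items if symbol != "=")
    return different / total


def sim_ratio(cigar: str) -> float:
    """Complement of the difference ratio."""
    return 1 - diff_ratio(cigar)


print(diff_ratio("3=1X3=1I4=1D2="), sim_ratio("3=1X3=1I4=1D2="))
```

With the module's example Cigar, 3 of the 15 letters are differences under this reading.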
pacbio_data_processing.constants module¶
pacbio_data_processing.errors module¶
pacbio_data_processing.filters module¶
- pacbio_data_processing.filters.cleanup_molecules(molecules: collections.abc.Generator[tuple[int, pacbio_data_processing.bam_utils.Molecule], None, None]) collections.abc.Generator[tuple[int, pacbio_data_processing.bam_utils.Molecule], None, None] [source]¶
Generator of MoleculeWorkUnit’s that pass all the standard filters, ie the sequence of filters needed by sm-analysis to select what molecules (and what subreads in those molecules) will be IPD-analyzed.
It is assumed that each file contains subreads corresponding to only ONE molecule (ie, ‘molecules’ is a generator of tuples (mol id, Molecule), with Molecule being related to a single molecule id). [Note for developers: Should we allow multiple molecules per file?]
If there are subreads surviving the filtering process, the bam file is overwritten with the filtered data and the tuple (mol id, Molecule) is yielded. If no subread survives the process, nothing is done (no bam is written, no tuple is yielded).
- pacbio_data_processing.filters.filter_mappings_binary(lines, mappings, *rest)[source]¶
Simply take or reject mappings depending on passed sequence
pacbio_data_processing.ipd module¶
- pacbio_data_processing.ipd.ipd_summary(molecule: tuple[int, pacbio_data_processing.bam_utils.Molecule], fasta: Union[str, pathlib.Path], program: Union[str, pathlib.Path], nprocs: int, mod_types_comma_sep: str, ipd_model: Union[str, pathlib.Path], skip_if_present: bool) tuple[int, pacbio_data_processing.bam_utils.Molecule] [source]¶
Lowest level interface to ipdSummary: all calls to that program are expected to be done through this function. It runs ipdSummary with an input bam file like this:
ipdSummary blasr.pMA683.subreads.bam --reference pMA683.fa --identify m6A --gff blasr.pMA683.subreads.476.bam.gff
As a result of this, a gff file is created. This function sets an attribute in the target Molecule with the path to that file.
Missing features:
skip_if_present
logging
error handling
check output and raise error if != 0
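The command line shown above can be assembled as an argument list suitable for `subprocess.run`. The sketch below only builds the list and uses only the options that appear in the example (`--reference`, `--identify`, `--gff`); the helper name and the rule "gff name = bam name + .gff" are assumptions, and `nprocs` and `ipd_model` are left out because the text does not show how they are passed.

```python
from pathlib import Path


def build_ipd_summary_command(program, bam, fasta, mod_types_comma_sep):
    """Build the ipdSummary argument list for one single-molecule BAM file.

    Hypothetical helper: the gff file name is derived by appending '.gff'
    to the BAM file name.
    """
    bam = Path(bam)
    gff = bam.with_name(bam.name + ".gff")
    return [
        str(program),
        str(bam),
        "--reference", str(fasta),
        "--identify", mod_types_comma_sep,
        "--gff", str(gff),
    ]


cmd = build_ipd_summary_command(
    "ipdSummary", "blasr.pMA683.subreads.476.bam", "pMA683.fa", "m6A"
)
print(" ".join(cmd))
```

Checking the return code of the finished process, one of the missing features listed above, could be delegated to `subprocess.run(cmd, check=True)`.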
- pacbio_data_processing.ipd.multi_ipd_summary(molecules: collections.abc.Generator[tuple[int, pacbio_data_processing.bam_utils.Molecule], None, None], fasta: Union[str, pathlib.Path], program: Union[str, pathlib.Path], num_ipds: int, nprocs_per_ipd: int, modification_types: str, ipd_model: Optional[str] = None, skip_if_present: bool = False) collections.abc.Generator[pathlib.Path, None, None] ¶
Generator that yields gff files as they are produced in parallel. Implementation driven by a pool of threads.
- pacbio_data_processing.ipd.multi_ipd_summary_direct(molecules: collections.abc.Generator[tuple[int, pacbio_data_processing.bam_utils.Molecule], None, None], fasta: Union[str, pathlib.Path], program: Union[str, pathlib.Path], num_ipds: int, nprocs_per_ipd: int, modification_types: str, ipd_model: Optional[str] = None, skip_if_present: bool = False) collections.abc.Generator[pathlib.Path, None, None] [source]¶
Generator that yields gff files as they are produced. Serial implementation (one file produced after the other).
- pacbio_data_processing.ipd.multi_ipd_summary_threads(molecules: collections.abc.Generator[tuple[int, pacbio_data_processing.bam_utils.Molecule], None, None], fasta: Union[str, pathlib.Path], program: Union[str, pathlib.Path], num_ipds: int, nprocs_per_ipd: int, modification_types: str, ipd_model: Optional[str] = None, skip_if_present: bool = False) collections.abc.Generator[pathlib.Path, None, None] [source]¶
Generator that yields gff files as they are produced in parallel. Implementation driven by a pool of threads.
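A generator that yields results as a thread pool finishes them can be written with `concurrent.futures`. This is a generic sketch of the pattern, not the package's code: `yield_as_completed` and its `work` callable are hypothetical stand-ins for the per-molecule `ipdSummary` call.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def yield_as_completed(work, items, max_workers):
    """Run ``work(item)`` in a pool of threads and yield results as they finish.

    Note that all items are submitted up front, so an input generator is
    consumed eagerly; results come back in completion order, not input order.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(work, item) for item in items]
        for future in as_completed(futures):
            yield future.result()


squares = sorted(yield_as_completed(lambda x: x * x, range(5), max_workers=2))
print(squares)
```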
pacbio_data_processing.logs module¶
pacbio_data_processing.parameters module¶
- class pacbio_data_processing.parameters.BamFilteringParameters(cl_input)[source]¶
Bases:
pacbio_data_processing.parameters.ParametersBase
- property filter_mappings¶
- property limit_mappings¶
- property min_relative_mapping_ratio¶
- property out_bam_file¶
pacbio_data_processing.plots module¶
- pacbio_data_processing.plots.make_barsplot(dataframe: pandas.core.frame.DataFrame, plot_title: str, filename: Union[pathlib.Path, str]) None [source]¶
- pacbio_data_processing.plots.make_continuous_rolled_data(data: dict[typing.NewType.<locals>.new_type, typing.NewType.<locals>.new_type], window: int) pandas.core.frame.DataFrame [source]¶
Auxiliary function used by make_rolling_history to produce a dataframe with the rolling average of the input data. The resulting dataframe starts at the min input position and ends at the max input position. Holes in the input data are set to 0.
- pacbio_data_processing.plots.make_histogram(dataframe: pandas.core.frame.DataFrame, plot_title: str, filename: Union[pathlib.Path, str], legend: bool = True) None [source]¶
pacbio_data_processing.sam module¶
pacbio_data_processing.sentinel module¶
- class pacbio_data_processing.sentinel.Sentinel(checkpoint: pathlib.Path)[source]¶
Bases:
object
This class creates objects that are expected to be used as context managers. At __enter__ a sentinel file is created. At __exit__ the sentinel file is removed. If the file is there before entering the context, or is not there when the context is exited, an exception is raised.
- _anti_aging()[source]¶
Method that updates the modification time of the sentinel file every SLEEP_SECONDS seconds. This is part of the mechanism to ensure that the sentinel does not get fooled by an abandoned leftover sentinel file.
- property is_file_too_old¶
Property that answers the question: is the sentinel file too old to be taken as an active sentinel file, or not?
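The enter/exit contract described for `Sentinel` can be sketched as a small context manager. This is a simplified illustration under assumptions: `SimpleSentinel` is a hypothetical class, the exception type (`RuntimeError`) is a guess, and the anti-aging mechanism and the "too old" check are not reproduced.

```python
import tempfile
from pathlib import Path


class SimpleSentinel:
    """Create a marker file on entry and remove it on exit."""

    def __init__(self, checkpoint):
        self.checkpoint = Path(checkpoint)

    def __enter__(self):
        try:
            # exist_ok=False makes the creation fail if the file is already there
            self.checkpoint.touch(exist_ok=False)
        except FileExistsError:
            raise RuntimeError(f"sentinel already present: {self.checkpoint}") from None
        return self

    def __exit__(self, exc_type, exc, tb):
        try:
            self.checkpoint.unlink()
        except FileNotFoundError:
            raise RuntimeError(f"sentinel vanished: {self.checkpoint}") from None
        return False  # do not swallow exceptions raised inside the block


with tempfile.TemporaryDirectory() as tmp:
    marker = Path(tmp) / "job.sentinel"
    with SimpleSentinel(marker):
        exists_inside = marker.exists()
    exists_after = marker.exists()
print(exists_inside, exists_after)
```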
pacbio_data_processing.sm_analysis module¶
This module contains the high level functions necessary to run the ‘Single Molecule Analysis’ on an input BAM file.
- class pacbio_data_processing.sm_analysis.MethylationReport(detections_csv, molecules, modification_types, filtered_bam_statistics=None)[source]¶
Bases:
object
- PRELOG = '[methylation report]'¶
- property modification_types¶
- class pacbio_data_processing.sm_analysis.SingleMoleculeAnalysis(parameters)[source]¶
Bases:
object
- property CCS_bam_file¶
It produces a Circular Consensus Sequence (CCS) version of the input BAM file and returns its name. It uses generate_CCS_file() to generate the file.
- __call__()[source]¶
Main entry point to perform a single molecule analysis: this method triggers the analysis.
- _align_input_if_no_candidate_found(inbam: pacbio_data_processing.bam.BamFile, variant: str = 'straight') Optional[str] [source]¶
[Internal method] Auxiliary method used by _ensure_input_bam_aligned. It first checks whether the aligned file is there. If a plausible candidate is not found, the input BAM is aligned (straight or π-shifted, depending on the variant and using the proper reference). If a candidate is found, its computation is skipped.
If the aligner cannot be run (i.e. calling the aligner returns None), None is returned, meaning that the aligner was not called. This can happen when the aligner finds a sentinel file indicating that the computation is work in progress. (See pacbio_data_processing.blasr.Blasr.__call__() for more details.)
- Returns
the aligned input bam file, if it is there, or None if it could not be computed (yet).
- _create_references()[source]¶
[Internal method] DNA reference sequences are created here. The ‘true’ reference must exist as fasta beforehand, with its index. A π-shifted reference is created from the original one. Its index is also made.
This method sets two attributes which are both mappings with two keys (‘straight’ and ‘pi-shifted’) and values as follows:
- reference: the values are DNASeq objects
- fasta: the values are Path objects
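For a circular sequence, a shift by π radians can be read as a rotation by half the sequence length. The sketch below follows that reading, which is an assumption (including the rounding for odd lengths); `pi_shift` and `shift_back` are hypothetical helpers that work on plain strings and integers rather than on DNASeq objects.

```python
def pi_shift(seq: str) -> str:
    """Rotate a circular sequence by half its length."""
    half = len(seq) // 2
    return seq[half:] + seq[:half]


def shift_back(pos: int, ref_len: int) -> int:
    """Map a 0-based position in the rotated sequence to the original one."""
    return (pos + ref_len // 2) % ref_len


original = "ABCDEFGHI"
shifted = pi_shift(original)
print(shifted)
print(all(shifted[i] == original[shift_back(i, len(original))] for i in range(len(original))))
```

A molecule that spans the origin of the original reference sits in the interior of the rotated one, which is what makes aligning against both variants useful.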
- _disable_pi_shifted_analysis() None [source]¶
[Internal method] If the pi-shifted analysis cannot be carried out, it is disabled with this method.
- _ensure_input_bam_aligned() None [source]¶
[Internal method] Main check point for aligned input bam files: this method calls whatever is necessary to ensure that the input bam is aligned, which means: normal (straight) alignment and π-shifted alignment.
Warning! If the input is aligned, the method tries to find a pi-shifted aligned BAM based on whether
1. a file with a suitable filename is found, and
2. it is aligned.
- _exists_pi_shifted_variant_from_aligned_input() bool [source]¶
[Internal method] It checks that the expected pi-shifted aligned file exists and is an aligned BAM file.
- property partition: pacbio_data_processing.utils.Partition¶
The target Partition of the input BAM file that must be processed by the current analysis, according to the input provided by the user.
- property workdir: tempfile.TemporaryDirectory¶
This attribute returns the necessary temporary working directory on demand and it ensures that only one temporary dir is created by caching.
- pacbio_data_processing.sm_analysis.add_to_own_output(gffs, own_output_file_name, modification_types)[source]¶
From a set of .gff files, a csv file (delimiter=”,”) is saved with the following columns:
mol id: taken from each gff file name (e.g. ‘a.b.c.gff’ -> mol id: ‘b’)
modtype: column number 3 (idx: 2) of the gffs (feature type)
GATC position: column number 5 (idx: 4) of the gffs which corresponds to the ‘end coordinate of the feature’ in the GFF3 standard
score of the feature: column number 6 (idx: 5); floating point (Phred-transformed pvalue that a kinetic deviation exists at this position)
strand: strand of the feature. It can be +, - with obvious meanings. It can also be ? (meaning unknown) or . (for non stranded features)
There are more columns, but they are not fixed in number. They correspond to the values given in the ‘attributes’ column of the gffs (col 9, idx 8). For example, given the following attributes column:
coverage=134;context=TCA...;IPDRatio=3.91;identificationQv=228
we would get the following ‘extra’ columns:
134,TCA...,3.91,228
and this is exactly what happens with the m6A modification type.
All the lines starting by ‘#’ in the gff files are ignored. The format of the gff file is GFF3: https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md
The value of identificationQV is a phred transformed probability of having a detection. See eq. (8) in [1]
[1]: “Detection and Identification of Base Modifications with Single Molecule Real-Time Sequencing Data”
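The column mapping described above can be sketched for a single GFF3 line. This is an illustration only: `gff_line_to_row` is a hypothetical helper, the sample line is made up apart from its attributes column (copied from the example above), the strand is read from column 7 as the GFF3 standard defines, and the molecule id is taken as the third dot-separated field from the end of the file name, which is an inference from the file names shown in this documentation.

```python
from typing import Optional


def gff_line_to_row(gff_file_name: str, line: str) -> Optional[list[str]]:
    """Turn one GFF3 line into a list of csv fields, or None for comments."""
    if line.startswith("#"):
        return None  # lines starting with '#' are ignored
    mol_id = gff_file_name.split(".")[-3]  # 'a.b.c.gff' -> 'b'
    cols = line.rstrip("\n").split("\t")
    # keep only the values of the key=value pairs in the attributes column
    attributes = [item.split("=", 1)[1] for item in cols[8].split(";") if "=" in item]
    return [mol_id, cols[2], cols[4], cols[5], cols[6], *attributes]


sample = "ref\tkinModCall\tm6A\t100\t100\t35\t+\t.\tcoverage=134;context=TCA...;IPDRatio=3.91;identificationQv=228"
row = gff_line_to_row("blasr.pMA683.subreads.476.bam.gff", sample)
print(",".join(row))
```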
- pacbio_data_processing.sm_analysis.generate_CCS_file(in_bam: Union[pathlib.Path, str], ccs_bam_file: Optional[Union[pathlib.Path, str]])[source]¶
Idempotent computation of the Circular Consensus Sequence (CCS) version of the passed in in_bam file done with pacbio_data_processing.ccs.ccs().
- pacbio_data_processing.sm_analysis.generate_aligned_CCS_file(ccs_bam_file: Union[pathlib.Path, str], aligned_ccs_bam_file: Optional[Union[pathlib.Path, str]], fasta: Union[pathlib.Path, str], nprocs: Union[int, str], blasr: pacbio_data_processing.blasr.Blasr, variant: str = 'straight') Optional[pathlib.Path] [source]¶
It calls the blasr program to align the ccs_bam_file. If aligned_ccs_bam_file is given, it is used as the output filename. If not given, a name is generated using a prefix based on the variant parameter followed by ccs_bam_file:
- variant='straight' (default) -> blasr. prefix
- variant='pi-shifted' -> pi-shifted.blasr. prefix
- Returns
The path to the aligned CCS BAM file for the requested variant, if it could be computed, or None if it could not. That is decided by the blasr object. An aligned BAM file cannot be computed by blasr if it is being produced by a parallel computation. (See pacbio_data_processing.blasr.Blasr for details.)
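The default naming rule can be sketched as a lookup from variant to prefix. The helper below is hypothetical; in particular, it assumes that the prefix is prepended to the file name and that the result stays in the same directory as the CCS file.

```python
from pathlib import Path

# Prefixes as listed in the description above.
PREFIXES = {"straight": "blasr.", "pi-shifted": "pi-shifted.blasr."}


def default_aligned_name(ccs_bam_file, variant: str = "straight") -> Path:
    """Derive the aligned file name from the CCS file name and the variant."""
    ccs = Path(ccs_bam_file)
    return ccs.with_name(PREFIXES[variant] + ccs.name)


print(default_aligned_name("data/ccs.input.bam"))
print(default_aligned_name("data/ccs.input.bam", variant="pi-shifted"))
```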
- pacbio_data_processing.sm_analysis.map_molecules_with_highest_sim_ratio(bam_file_name: Optional[Union[pathlib.Path, str]]) dict[int, pacbio_data_processing.bam_utils.Molecule] [source]¶
Given the path to a bam file, it returns a dictionary, whose keys are mol ids (ints) and the values are the corresponding Molecules. If multiple lines in the given BAM file share the mol id, only the first line found with the highest similarity ratio (computed from the cigar) is chosen: if multiple lines share the molecule ID and the highest similarity ratio (say, 1), ONLY the first one is taken, irrespective of other factors.
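The "first line with the highest similarity ratio wins" rule amounts to a strict greater-than comparison while scanning the lines in order. A minimal sketch, assuming the lines have already been reduced to hypothetical `(mol_id, sim_ratio, payload)` triples:

```python
def best_per_molecule(lines):
    """Keep, per molecule id, the first payload with the highest similarity ratio."""
    best = {}
    for mol_id, sim_ratio, payload in lines:
        # Strict '>' means a later line with an equal ratio never replaces
        # the one found first.
        if mol_id not in best or sim_ratio > best[mol_id][0]:
            best[mol_id] = (sim_ratio, payload)
    return {mol_id: payload for mol_id, (_, payload) in best.items()}


lines = [(1, 0.9, "a"), (1, 1.0, "b"), (1, 1.0, "c"), (2, 0.5, "d")]
print(best_per_molecule(lines))
```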
pacbio_data_processing.sm_analysis_gui module¶
pacbio_data_processing.split_bam module¶
pacbio_data_processing.summary module¶
- class pacbio_data_processing.summary.GATCCoverageBarsPlot(name=None)[source]¶
Bases:
pacbio_data_processing.summary.BarsPlotAttribute
- data_definition = {'GATCs NOT in BAM file (%)': ('perc_all_gatcs_not_identified_in_bam',), 'GATCs NOT in methylation report (%)': ('perc_all_gatcs_not_in_meth',), 'GATCs in BAM file (%)': ('perc_all_gatcs_identified_in_bam',), 'GATCs in methylation report (%)': ('perc_all_gatcs_in_meth',)}¶
- dependency_names = ('aligned_ccs_bam_files', 'methylation_report')¶
- index_labels = ('Percentage',)¶
- title = 'GATCs in BAM file and Methylation report'¶
- class pacbio_data_processing.summary.MethTypeBarsPlot(name=None)[source]¶
Bases:
pacbio_data_processing.summary.BarsPlotAttribute
- data_definition = {'Fully methylated (%)': ('fully_methylated_gatcs_wrt_meth',), 'Fully unmethylated (%)': ('fully_unmethylated_gatcs_wrt_meth',), 'Hemi-methylated in + strand (%)': ('hemi_plus_methylated_gatcs_wrt_meth',), 'Hemi-methylated in - strand (%)': ('hemi_minus_methylated_gatcs_wrt_meth',)}¶
- dependency_names = ('methylation_report',)¶
- index_labels = ('Percentage',)¶
- title = 'Methylation types in methylation report'¶
- class pacbio_data_processing.summary.MoleculeLenHistogram(name=None)[source]¶
Bases:
pacbio_data_processing.summary.HistoryPlotAttribute
- column_name = 'len(molecule)'¶
- data_name = 'length'¶
- dependency_name = 'methylation_report'¶
- labels = ('Initial subreads', 'Analyzed molecules')¶
- legend = True¶
- title = 'Initial subreads and analyzed molecule length histogram'¶
- class pacbio_data_processing.summary.MoleculeTypeBarsPlot(name=None)[source]¶
Bases:
pacbio_data_processing.summary.BarsPlotAttribute
- data_definition = {'Filtered out': ('perc_filtered_out_mols', 'perc_filtered_out_subreads'), 'In Methylation report with GATC': ('perc_mols_in_meth_report_with_gatcs', 'perc_subreads_in_meth_report_with_gatcs'), 'In Methylation report without GATC': ('perc_mols_in_meth_report_without_gatcs', 'perc_subreads_in_meth_report_without_gatcs'), 'Mismatch discards': ('perc_mols_dna_mismatches', 'perc_subreads_dna_mismatches'), 'Used in aligned CCS': ('perc_mols_used_in_aligned_ccs', 'perc_subreads_used_in_aligned_ccs')}¶
- dependency_names = ('mols_used_in_aligned_ccs', 'mols_dna_mismatches', 'filtered_out_mols', 'methylation_report')¶
- index_labels = ('Number of molecules (%)', 'Number of subreads (%)')¶
- title = 'Processed molecules and subreads'¶
- class pacbio_data_processing.summary.PercAttribute(total_attr, pref='perc_', suf='_wrt_meth', name=None)[source]¶
Bases:
pacbio_data_processing.summary.ROAttribute
From a given attribute in a SummaryReport instance s, the percentage is computed (wrt the value s.total_attr) and returned as str.
- class pacbio_data_processing.summary.PositionCoverageBarsPlot(name=None)[source]¶
Bases:
pacbio_data_processing.summary.BarsPlotAttribute
- data_definition = {'Positions NOT covered by molecules in BAM file (%)': ('perc_all_positions_not_in_bam',), 'Positions NOT covered by molecules in methylation report (%)': ('perc_all_positions_not_in_meth',), 'Positions covered by molecules in BAM file (%)': ('perc_all_positions_in_bam',), 'Positions covered by molecules in methylation report (%)': ('perc_all_positions_in_meth',)}¶
- dependency_names = ('aligned_ccs_bam_files', 'methylation_report')¶
- index_labels = ('Percentage',)¶
- title = 'Position coverage in BAM file and Methylation report'¶
- class pacbio_data_processing.summary.PositionCoverageHistory(name=None)[source]¶
Bases:
pacbio_data_processing.summary.HistoryPlotAttribute
- dependency_name = 'methylation_report'¶
- labels = ('Positions',)¶
- legend = False¶
- len_column_name = 'len(molecule)'¶
- start_column_name = 'start of molecule'¶
- title = 'Sequencing positions covered by analyzed molecules'¶
- class pacbio_data_processing.summary.SimpleAttribute(name=None)[source]¶
Bases:
object
The base class of all other descriptor managed attributes of SummaryReport. It is a wrapper around the _data dictionary of the instance owning this attribute.
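A descriptor that wraps a `_data` dictionary on the owning instance can be sketched as below. This shows the general Python descriptor pattern only; `DataBackedAttribute` and `Report` are hypothetical names, and the real class may differ in how it resolves names and handles missing values.

```python
class DataBackedAttribute:
    """Descriptor that stores and reads its value in ``instance._data``."""

    def __init__(self, name=None):
        self.name = name

    def __set_name__(self, owner, name):
        # Fall back to the attribute name used in the class body.
        if self.name is None:
            self.name = name

    def __get__(self, instance, owner=None):
        if instance is None:
            return self  # accessed on the class itself
        return instance._data[self.name]

    def __set__(self, instance, value):
        instance._data[self.name] = value


class Report:
    input_bam = DataBackedAttribute()

    def __init__(self):
        self._data = {}


report = Report()
report.input_bam = "example.bam"
print(report.input_bam, report._data)
```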
- class pacbio_data_processing.summary.SummaryReport(bam_path, dnaseq)[source]¶
Bases:
collections.abc.Mapping
Final summary report generated by sm-analysis initially intended for humans.
This class has been crafted to carefully control its attributes. Data can be fed into the class by setting some attributes. That process triggers the generation of other attributes, that are typically read-only.
After instantiating the class with the path to the input BAM and the dna sequence of the reference (instance of DNASeq), one must set some attributes to be able to save the summary report:
s = SummaryReport(bam_path, dnaseq)
s.methylation_report = path_to_meth_report
s.raw_detections = path_to_raw_detections_file
s.gff_result = path_to_gff_result_file
s.mols_dna_mismatches = {20, 49, ...} # set of ints
s.filtered_out_mols = {22, 493, ...} # set of ints
s.mols_used_in_aligned_ccs = {3, 67, ...} # set of ints
s.aligned_ccs_bam_files = {
    'straight': aligned_ccs_path,
    'pi-shifted': pi_shifted_aligned_ccs_path
}
At this point all the necessary data is there and the report can be created:
s.save('summary_whatever.html')
- aligned_ccs_bam_files¶
- all_gatcs_identified_in_bam¶
- all_gatcs_in_meth¶
- all_gatcs_not_identified_in_bam¶
- all_gatcs_not_in_meth¶
- all_positions_in_bam¶
- all_positions_in_meth¶
- all_positions_not_in_bam¶
- all_positions_not_in_meth¶
- property as_html¶
- body_md5sum¶
- filtered_out_mols¶
- filtered_out_subreads¶
- full_md5sum¶
- fully_methylated_gatcs¶
- fully_methylated_gatcs_wrt_meth¶
From a given attribute in a SummaryReport instance s, the percentage is computed (wrt the value s.total_attr) and returned as str.
- fully_unmethylated_gatcs¶
- fully_unmethylated_gatcs_wrt_meth¶
From a given attribute in a SummaryReport instance s, the percentage is computed (wrt the value s.total_attr) and returned as str.
- gatc_coverage_bars¶
- gff_result¶
The base class of all other descriptor managed attributes of SummaryReport. It is a wrapper around the _data dictionary of the instance owning this attribute.
- hemi_methylated_gatcs¶
- hemi_methylated_gatcs_wrt_meth¶
From a given attribute in a SummaryReport instance s, the percentage is computed (wrt the value s.total_attr) and returned as str.
- hemi_minus_methylated_gatcs¶
- hemi_minus_methylated_gatcs_wrt_meth¶
From a given attribute in a SummaryReport instance s, the percentage is computed (wrt the value s.total_attr) and returned as str.
- hemi_plus_methylated_gatcs¶
- hemi_plus_methylated_gatcs_wrt_meth¶
From a given attribute in a SummaryReport instance s, the percentage is computed (wrt the value s.total_attr) and returned as str.
- input_bam¶
- input_bam_size¶
- input_reference¶
- max_possible_methylations¶
- meth_type_bars¶
- methylation_report¶
- molecule_len_histogram¶
- molecule_type_bars¶
- mols_dna_mismatches¶
- mols_in_meth_report¶
- mols_in_meth_report_with_gatcs¶
- mols_in_meth_report_without_gatcs¶
- mols_ini¶
- mols_used_in_aligned_ccs¶
- perc_all_gatcs_identified_in_bam¶
From a given attribute in a SummaryReport instance s, the percentage is computed (wrt the value s.total_attr) and returned as str.
- perc_all_gatcs_in_meth¶
From a given attribute in a SummaryReport instance s, the percentage is computed (wrt the value s.total_attr) and returned as str.
- perc_all_gatcs_not_identified_in_bam¶
From a given attribute in a SummaryReport instance s, the percentage is computed (wrt the value s.total_attr) and returned as str.
- perc_all_gatcs_not_in_meth¶
From a given attribute in a SummaryReport instance s, the percentage is computed (wrt the value s.total_attr) and returned as str.
- perc_all_positions_in_bam¶
From a given attribute in a SummaryReport instance s, the percentage is computed (wrt the value s.total_attr) and returned as str.
- perc_all_positions_in_meth¶
From a given attribute in a SummaryReport instance s, the percentage is computed (wrt the value s.total_attr) and returned as str.
- perc_all_positions_not_in_bam¶
From a given attribute in a SummaryReport instance s, the percentage is computed (wrt the value s.total_attr) and returned as str.
- perc_all_positions_not_in_meth¶
From a given attribute in a SummaryReport instance s, the percentage is computed (wrt the value s.total_attr) and returned as str.
- perc_filtered_out_mols¶
From a given attribute in a SummaryReport instance s, the percentage is computed (wrt the value s.total_attr) and returned as str.
- perc_filtered_out_subreads¶
From a given attribute in a SummaryReport instance s, the percentage is computed (wrt the value s.total_attr) and returned as str.
- perc_mols_dna_mismatches¶
From a given attribute in a SummaryReport instance s, the percentage is computed (wrt the value s.total_attr) and returned as str.
- perc_mols_in_meth_report¶
From a given attribute in a SummaryReport instance s, the percentage is computed (wrt the value s.total_attr) and returned as str.
- perc_mols_in_meth_report_with_gatcs¶
From a given attribute in a SummaryReport instance s, the percentage is computed (wrt the value s.total_attr) and returned as str.
- perc_mols_in_meth_report_without_gatcs¶
From a given attribute in a SummaryReport instance s, the percentage is computed (wrt the value s.total_attr) and returned as str.
- perc_mols_used_in_aligned_ccs¶
From a given attribute in a SummaryReport instance s, the percentage is computed (wrt the value s.total_attr) and returned as str.
- perc_subreads_dna_mismatches¶
From a given attribute in a SummaryReport instance s, the percentage is computed (wrt the value s.total_attr) and returned as str.
- perc_subreads_in_meth_report¶
From a given attribute in a SummaryReport instance s, the percentage is computed (wrt the value s.total_attr) and returned as str.
- perc_subreads_in_meth_report_with_gatcs¶
From a given attribute in a SummaryReport instance s, the percentage is computed (wrt the value s.total_attr) and returned as str.
- perc_subreads_in_meth_report_without_gatcs¶
From a given attribute in a SummaryReport instance s, the percentage is computed (wrt the value s.total_attr) and returned as str.
- perc_subreads_used_in_aligned_ccs¶
From a given attribute in a SummaryReport instance s, the percentage is computed (wrt the value s.total_attr) and returned as str.
- position_coverage_bars¶
- position_coverage_history¶
- raw_detections¶
The base class of all other descriptor-managed attributes of SummaryReport. It is a wrapper around the _data dictionary of the instance owning this attribute.
- ready_to_go(*attrs)[source]¶
Method used to check whether some attributes are already usable (in other words, whether they have already been set).
- reference_base_pairs¶
- reference_md5sum¶
- reference_name¶
- subreads_dna_mismatches¶
- subreads_in_meth_report¶
- subreads_in_meth_report_with_gatcs¶
- subreads_in_meth_report_without_gatcs¶
- subreads_ini¶
- subreads_used_in_aligned_ccs¶
- switch_on(attribute)[source]¶
Method used by descriptors to inform the instance of SummaryReport that some computed attributes needed by the plots are already computed and usable.
- total_gatcs_in_ref¶
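As a rough illustration of the perc_* attributes listed above, the following sketch shows how a descriptor could compute a percentage with respect to a total and return it as a str. The class names Percentage and Report, the pairing of filtered_out_subreads with subreads_ini, and the two-decimal formatting are assumptions made for the example; they are not taken from the package's implementation.

```python
class Percentage:
    """Hypothetical descriptor: share of `attr` in `total_attr`, as a str."""

    def __init__(self, attr, total_attr):
        self.attr = attr
        self.total_attr = total_attr

    def __get__(self, instance, owner=None):
        if instance is None:
            # Accessed on the class itself: return the descriptor.
            return self
        value = getattr(instance, self.attr)
        total = getattr(instance, self.total_attr)
        # The formatting (two decimals) is an assumption of this sketch.
        return f"{100 * value / total:.2f}"


class Report:
    """Hypothetical stand-in for SummaryReport with two plain attributes."""

    perc_filtered_out_subreads = Percentage(
        "filtered_out_subreads", "subreads_ini"
    )

    def __init__(self, filtered_out_subreads, subreads_ini):
        self.filtered_out_subreads = filtered_out_subreads
        self.subreads_ini = subreads_ini
```

With this sketch, Report(25, 200).perc_filtered_out_subreads evaluates to the string '12.50'.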
pacbio_data_processing.templates module¶
pacbio_data_processing.types module¶
pacbio_data_processing.utils module¶
- class pacbio_data_processing.utils.DNASeq(raw_seq, name: str = '', description: str = '')[source]¶
Bases:
object
Wrapper around ‘Bio.Seq.Seq’.
- classmethod from_fasta(fasta_name: str)[source]¶
Returns a DNASeq from the first DNA sequence stored in the fasta named ‘fasta_name’.
- property md5sum: str¶
Returns, as a string, the hexdigest of the MD5 checksum of the upper-case version of the sequence.
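The md5sum property described above can be approximated with the standard library alone. This is a minimal sketch that works on a plain str instead of a DNASeq; the function name md5_of_upper and the ASCII encoding are assumptions of the example.

```python
import hashlib


def md5_of_upper(sequence: str) -> str:
    """Hexdigest of the MD5 checksum of the upper-cased sequence."""
    # Upper-casing first makes 'gatc' and 'GATC' yield the same checksum.
    return hashlib.md5(sequence.upper().encode("ascii")).hexdigest()
```

Normalizing the case before hashing means that soft-masked (lower-case) and upper-case copies of the same reference share one checksum.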
- class pacbio_data_processing.utils.Partition(partition_specification: Optional[Tuple[int, int]], bamfile: pacbio_data_processing.bam.BamFile)[source]¶
Bases:
object
A Partition is a class that helps answer the following question: assuming that we are interested in processing a fraction of a BamFile, does the molecule ID mol_id belong to that fraction or not? A prior implementation consisted in storing all the molecule IDs in the BamFile for a given partition in a set, and the answer was obtained by querying whether a molecule ID belongs to the set. That former implementation is not enough for the case of multiple alignment processes for the same raw BamFile (e.g., when a combined analysis of the so-called ‘straight’ and ‘pi-shifted’ variants is performed). In that case the partition is decided with one file, and all molecule IDs belonging to the non-empty intersection with the other file must be unambiguously accommodated in a certain partition. This class has been designed to solve that problem.
- __init__(partition_specification: Optional[Tuple[int, int]], bamfile: pacbio_data_processing.bam.BamFile)[source]¶
- pacbio_data_processing.utils.combine_scores(scores)[source]¶
>>> combine_scores([10])
10.0
>>> q = combine_scores([10, 12, 14])
>>> print(round(q, 6))
7.204355
>>> q = combine_scores([30, 20, 100, 92])
>>> print(round(q, 6))
19.590023
>>> q_500 = combine_scores([30, 20, 500])
>>> q_no_500 = combine_scores([30, 20])
>>> q_500 == q_no_500
True
>>> combine_scores([200, 300, 500])
200.0
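The doctest values above are consistent with treating the scores as Phred-scaled qualities and computing the quality of "all of them are correct". The sketch below reproduces those values under that assumption. The name combine_phred_scores is hypothetical, and the fallback to the smallest score when the error probability underflows to zero is a guess that matches the last doctest; the library may handle that case differently.

```python
import math


def combine_phred_scores(scores):
    """Combine Phred-scaled scores (illustrative sketch, not the library code)."""
    p_all_correct = 1.0
    for q in scores:
        # Probability that this single call is correct.
        p_all_correct *= 1.0 - 10.0 ** (-q / 10.0)
    p_error = 1.0 - p_all_correct
    if p_error <= 0.0:
        # Assumption: when floating point cannot resolve the error
        # probability, fall back to the smallest input score.
        return float(min(scores))
    return -10.0 * math.log10(p_error)
```

Very large scores contribute a factor that is exactly 1.0 in double precision, which is why adding a score of 500 to [30, 20] leaves the result unchanged.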
- pacbio_data_processing.utils.find_gatc_positions(seq: str, offset: int = 0) → set[int][source]¶
Convenience function that computes the positions of all GATCs found in the given sequence. The values are relative to the offset.
>>> find_gatc_positions('AAAGAGAGATCGCGCGATC') == {7, 15}
True
>>> find_gatc_positions('AAAGAGAGTCGCGCCATC')
set()
>>> find_gatc_positions('AAAGAGAGATCGgaTcCGCGATC') == {7, 12, 19}
True
>>> s = find_gatc_positions('AAAGAGAGATCGgaTcCGCGATC', offset=23)
>>> s == {30, 35, 42}
True
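The doctests above show that positions are 0-based, that matching ignores case, and that the offset is simply added to each position. A minimal sketch with the same observable behaviour, using a regular expression, could look as follows. The name gatc_positions is hypothetical and the actual implementation may differ.

```python
import re


def gatc_positions(seq: str, offset: int = 0) -> set[int]:
    """0-based start positions of every GATC in seq, shifted by offset."""
    # GATC cannot overlap with itself, so non-overlapping matching
    # with finditer finds every occurrence.
    return {
        match.start() + offset
        for match in re.finditer("GATC", seq, flags=re.IGNORECASE)
    }
```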
- pacbio_data_processing.utils.pishift_back_positions_in_gff(gff_path: Union[str, pathlib.Path]) → None[source]¶
The function parses the input GFF file (assumed to be a valid GFF3 file) and shifts back the positions found in it (4th and 5th columns of lines not starting with #). It is assumed that the positions in the input file (gff_path) refer to a pi-shifted origin. To undo the shift, the length of each sequence is read from the GFF3 directives (lines starting with ##), in particular from the ##sequence-region pragmas. This function can handle the case of multiple sequences.
Warning! The function overwrites the input gff_path.
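The procedure described above can be sketched on in-memory text, without overwriting any file. The helper name unshift_gff_text is hypothetical, and the sketch assumes that every ##sequence-region pragma precedes the feature lines of its sequence and that the unshift is a rotation by half the sequence length (rounded down). Features that span the origin after unshifting are not treated specially here.

```python
def unshift_gff_text(gff_text: str) -> str:
    """Return GFF3 text with columns 4 and 5 shifted back (illustrative)."""
    lengths = {}  # seqid -> sequence length, from ##sequence-region pragmas
    out = []
    for line in gff_text.splitlines():
        if line.startswith("##sequence-region"):
            # Pragma format: ##sequence-region seqid start end
            _, seqid, start, end = line.split()[:4]
            lengths[seqid] = int(end) - int(start) + 1
        elif line and not line.startswith("#"):
            cols = line.split("\t")
            nbp = lengths[cols[0]]
            for i in (3, 4):  # 4th and 5th columns: start and end
                cols[i] = str((int(cols[i]) - 1 + nbp // 2) % nbp + 1)
            line = "\t".join(cols)
        out.append(line)
    return "\n".join(out) + "\n"
```

Keeping one length per seqid is what lets a single pass handle several sequences in the same file.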
- pacbio_data_processing.utils.shift_me_back(pos: int, nbp: int) → int[source]¶
Unshifts a given position taking into account that it has been previously shifted by half of the number of base pairs. It takes into account the possibility of having a sequence with an odd length.
@params:
pos - 1-based position of a base pair to unshift
nbp - number of base pairs in the reference
@returns:
unshifted position
Some examples:
>>> shift_me_back(3, 10)
8
>>> shift_me_back(1, 20)
11
>>> shift_me_back(3, 7)
6
>>> shift_me_back(4, 7)
7
>>> shift_me_back(5, 7)
1
>>> shift_me_back(7, 7)
3
>>> shift_me_back(1, 7)
4
To understand the operation of this function, consider the following example. Take a sequence of 7 base pairs whose indices in the reference, in natural order, are
1 2 3 4 5 6 7
After being pi-shifted, the base pairs in the sequence are reordered and the indices become (former indices in parentheses):
1’(=4) 2’(=5) 3’(=6) 4’(=7) 5’(=1) 6’(=2) 7’(=3)
The current function accepts primed indices and transforms them to the unprimed indices, i.e., the positions returned refer to the original reference.
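One closed-form expression that reproduces every example above is a rotation by nbp // 2 in 1-based modular arithmetic. The sketch below is only a formula consistent with those examples; the name unshift_position is hypothetical and the library's own code may be written differently.

```python
def unshift_position(pos: int, nbp: int) -> int:
    """Map a 1-based pi-shifted position back to the original reference."""
    # Go to 0-based, rotate by half the length (rounded down), wrap
    # around the circular reference, and return to 1-based.
    return (pos - 1 + nbp // 2) % nbp + 1
```

For nbp=7 the rotation is by 3, so the primed indices 1’ to 7’ map to 4, 5, 6, 7, 1, 2, 3, in agreement with the table above.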
Module contents¶
Top-level package for PacBio data processing.