bamnostic package¶
bamnostic is an OS-agnostic, Pure Python BAM file parser.
The purpose of bamnostic is to open up BAM query capabilities to all current OS environments. Special care was taken to allow bamnostic to run in all versions of Python 2.7 onward. Furthermore, bamnostic does not require any packages outside of the standard library. Therefore, bamnostic will run on all stable versions of PyPy.
Note
SAM and CRAM support is not yet implemented
The three main classes of bamnostic are:
bamnostic.AlignmentFile
: the BAM file handlerbamnostic.AlignedSegment
: an aligned read object interfacebamnostic.bai.Bai
: if the BAM file has an associated index file (preferred), this is the file handler for it.
Note
Within the scope of personal research, reading BAM files is the only fully supported IO. The skeleton for writing BAM files is present, just not connected.
This code is part of the bamnostic distribution and governed by its license. Please see the LICENSE file that should have been included as part of this package.
Core Module (bamnostic.core
)¶
- class bamnostic.core.Cigar(op_code, n_op, op_id, op_name)¶
Bases:
tuple
namedtuple
for handling CIGAR data- Parameters
op_code (int) – CIGAR operation index
n_op (int) – number of operations for a given op_code
op_id (str) – the string representation of the CIGAR representation (‘MIDNSHP=XB’)
op_name (str) – Longer string name for operation
- property n_op¶
Alias for field number 1
- property op_code¶
Alias for field number 0
- property op_id¶
Alias for field number 2
- property op_name¶
Alias for field number 3
- bamnostic.core.offset_qual(qual_string)[source]¶
Offsets the ASCII-encoded quality string to represent PHRED score.
Every base that is in the alignment is assigned a Phred score. A Phred score (\(Q\)) is defined as \(Q=-10\log_{10}P\), where \(P\) is the base-calling error probability. Phred quality scores tend range from 10 to 60. These qualities are then offset by 33 and ASCII-encoded for readability and storage.
- Parameters
qual_string (
str
orbytes
) – Phred quality scores without offset- Returns
ASCII-encoded Phred scores offest by adding 33 to base score.
- Return type
(str)
Examples
>>> qual_score = '' >>> ''.join(offset_qual(qual_score)) '<<<7<<<;<<<<<<<<8;;<7;4<;<;;;;;94<;'
- class bamnostic.core.AlignmentFile(filepath_or_object, mode='rb', max_cache=128, index_filename=None, filename=None, check_header=False, check_sq=True, reference_filename=None, filepath_index=None, require_index=False, duplicate_filehandle=None, ignore_truncation=False, compresslevel=6, ignore_overwrite=False, copy_header=None, header=b'', reference_names=None, reference_lengths=None)[source]¶
Bases:
bamnostic.bam.BamReader
,bamnostic.bam.BamWriter
Wrapper to allow drop in replacement for BAM functionality in a
pysam
-like API.- Parameters
filepath_or_object (str |
file
) – the path or file object of the BAM filemode (str) – Mode for reading. BAM files are binary by nature (default: ‘rb’).
max_cache (int) – number of desired LRU cache size, preferably a multiple of 2 (default: 128).
index_filename (str) – path to index file (BAI) if it is named differently than the BAM file (default: None).
filename (str |
file
) – synonym for filepath_or_objectcheck_header (bool) – Obsolete method maintained for backwards compatibility (default: False)
check_sq (bool) – Inspect BAM file for @SQ entries within the header
reference_filename (str) – Not implemented. Maintained for backwards compatibility
filepath_index (str) – synonym for index_filename
require_index (bool) – require the presence of an index file or raise (default: False)
duplicate_filehandle (bool) – Not implemented. Raises warning if True.
ignore_truncation (bool) – Whether or not to allow trucated file processing (default: False).
- class bamnostic.core.AlignedSegment(_io)[source]¶
Bases:
object
Main class for handling reads within the BAM
- property query_alignment_end¶
One past the final index position of the aligned portion of the read
query_alignment_* all refer to the portion of the read that was aligned, and therefore exclude clipping, but include insertions.
- property query_alignment_start¶
The starting index of the aligned portion of the read
query_alignment_* all refer to the portion of the read that was aligned, and therefore exclude clipping, but include insertions.
- property query_name¶
Alias for the read name
- property query_alignment_sequence¶
The sequence of the aligned portion of the read
query_alignment_* all refer to the portion of the read that was aligned, and therefore exclude clipping, but include insertions.
- property query_alignment_length¶
The length of the aligned portion of the read
query_alignment_* all refer to the portion of the read that was aligned, and therefore exclude clipping, but include insertions.
- property reference_start¶
The starting index of the aligned portion of the read
reference_ here means the portion of the read that was aligned according to the reference. Therefore, does not include insertions and clipping
- property reference_end¶
One past the final index position of the aligned portion of the read
reference_ here means the portion of the read that was aligned according to the reference. Therefore, does not include insertions and clipping
- property reference_length¶
The length of the aligned portion of the read
reference_ here means the portion of the read that was aligned according to the reference. Therefore, does not include insertions and clipping
- property query_sequence¶
The sequence of the read
query_ here means the query as seen in the BAM file, and therefore includes clipping
- property query_length¶
The length of sequence of the read
query_ here means the query as seen in the BAM file, and therefore includes clipping
- property next_reference_id¶
Return the reference id of mate read
- property next_reference_name¶
Return the reference name of mate read
- property next_reference_start¶
Return the reference name of mate read
- property is_duplicate¶
If the read is set as PCR or optical duplicate according to read flag
- property is_paired¶
If the read is set as paired according to read flag
- property is_proper_pair¶
If each segment properly aligned according to the aligner
- property is_qcfail¶
If the read is set as QC fail according to read flag
- property is_read1¶
If the read is set as the first read of mate pair according to read flag (implies the read is mated)
- property is_read2¶
If the read is set as the second read of mate pair according to read flag (implies the read is mated)
- property is_reverse¶
If the read is set as reversed according to read flag
- property is_secondary¶
If the read is set as secondary alignment according to read flag
- property is_supplementary¶
If the read is set as a supplementary alignment according to read flag
- property is_unmapped¶
If the read is set as unmapped according to read flag
- property mate_is_reverse¶
Check the read flag to see if the mat is reverse strand (implies the read is mated)
- property mate_is_unmapped¶
Check the read flag to see if mate is unmapped (implies the read is mated)
- property mapping_quality¶
The mapping quality (MAPQ) of the read
- get_tag(tag, with_value_type=False)[source]¶
Gets the value associated with a given tag key.
- Parameters
tag (str) – the tag of interest
with_value_type (bool) – return what kind of value the tag
- Returns
the value associated with a given tag or the value and type of value (as seen in BAM format)
- get_tags(with_value_type=False)[source]¶
Returns all the tags for a given read
- Parameters
with_value_type (bool) – return the tag value type (as defined by BAM format)
- Returns
list of tag tuples (with or without tag value type)
- Return type
f_tags(
list
)
- get_cigar_stats()[source]¶
Gets the counts of each CIGAR operation in the read and number of nucleotides related to those given operations.
- Returns
list of CIGAR operation counts nt_counts (
list
): list of nucleotide counts for each operation- Return type
op_blocks (
list
)
- ref_gen()[source]¶
Recreates the reference sequence associated with the given segment.
Uses the CIGAR string and MD tag to recreate the reference sequence associated with the aligned segment. This is done without the need for looking up the reference genome.
- Returns
generated reference sequence
- Return type
(str)
- Raises
KeyError – if read does not contain MD tag
- to_bam()[source]¶
Writes the alignment record to a BAM file
- Parameters
bam_file (string or
bamnostic.bam.BamWriter
) – BAM file path or open bam file in a write mode
- bamnostic.core.unpack_int32(buffer, /)¶
Return a tuple containing unpacked values.
Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.
See help(struct) for more on format strings.
- bamnostic.core.unpack_int32L(buffer, /)¶
Return a tuple containing unpacked values.
Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.
See help(struct) for more on format strings.
Bgzip BAM Format (BGZF) file handlers (bamnostic.bgzf
)¶
- bamnostic.bgzf.unpack_gzip_header(buffer, /)¶
Return a tuple containing unpacked values.
Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.
See help(struct) for more on format strings.
- bamnostic.bgzf.unpack_subfields(buffer, /)¶
Return a tuple containing unpacked values.
Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.
See help(struct) for more on format strings.
- bamnostic.bgzf.unpack_gzip_integrity(buffer, /)¶
Return a tuple containing unpacked values.
Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.
See help(struct) for more on format strings.
- bamnostic.bgzf.unpack_bgzf_metaheader(buffer, /)¶
Return a tuple containing unpacked values.
Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.
See help(struct) for more on format strings.
- bamnostic.bgzf.get_block(handle, offset=0)[source]¶
Pulls out entire GZIP block
Used primarily for copying the header block of a BAM file. However, it can be used to copy any BGZF block within a BAM file that starts at the given offset.
Note
Does not progress file cursor position.
- Parameters
handle (
file
) – open BAM fileoffset (int) – offset of BGZF block (default: 0)
- Returns
Complete BGZF block
- Raises
ValueError – if the BGZF block header is malformed
Example
>>> bam_header = get_block(bamnostic.example_bam) >>> try: ... bam_header.startswith(b'\x1f\x8b\x08\x04') ... except SyntaxError: ... bam_header.startswith('\x1f\x8b\x08\x04') True
- class bamnostic.bgzf.BgzfReader(filepath_or_object, mode='rb', max_cache=None, filename=None, duplicate_filehandle=None, ignore_truncation=False)[source]¶
Bases:
object
The BAM reader. Heavily modified from Peter Cock’s BgzfReader.
- seek(virtual_offset)[source]¶
Seek to a 64-bit unsigned BGZF virtual offset.
A virtual offset is a composite number made up of the compressed offset (coffset) position of the start position of the BGZF block that the position originates within, and the uncompressed offset (uoffset) within the deflated BGZF block where the position starts. The virtual offset is defined as
virtual_offset = coffset << 16 | uoffset
- Parameters
virtual_offset (int) – 64-bit unsigned composite byte offset
- Returns
an echo of the new position
- Return type
virtual_offset (int)
- Raises
ValueError – if within block offset is more than block size
AssertionError – if the start position is not the block start position
Example
>>> bam = bamnostic.AlignmentFile(bamnostic.example_bam, 'rb') >>> bam.seek(10) 10
>>> bam.seek(bamnostic.utils.make_virtual_offset(0, 42)) Traceback (most recent call last): ... ValueError: Within offset 42 but block size only 38
- read(size=- 1)[source]¶
Read method for the BGZF module.
- Parameters
size (int) – the number of bytes to read from file. Advances the cursor.
- Returns
byte string of length size
- Return type
data (
bytes
)- Raises
NotImplementedError – if the user tries to read the whole file
AssertionError – if read does not return any data
- readline()[source]¶
Read a single line for the BGZF file.
Binary operations do not support readline(). Code is commented out for posterity sake
- class bamnostic.bgzf.BgzfWriter(filepath_or_object, mode='wb', compresslevel=6)[source]¶
Bases:
object
Define a BGZFWriter object.
- close()[source]¶
Flush data, write 28 bytes BGZF EOF marker, and close BGZF file.
samtools will look for a magic EOF marker, just a 28 byte empty BGZF block, and if it is missing warns the BAM file may be truncated. In addition to samtools writing this block, so too does bgzip - so this implementation does too.
- bamnostic.bgzf.unpack_int32(buffer, /)¶
Return a tuple containing unpacked values.
Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.
See help(struct) for more on format strings.
- bamnostic.bgzf.unpack_int32L(buffer, /)¶
Return a tuple containing unpacked values.
Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.
See help(struct) for more on format strings.
Binary Alignment Map (BAM) file handlers (bamnostic.bam
)¶
- class bamnostic.bam.BAMheader(_io)[source]¶
Bases:
object
Parse and store the BAM file header
The BAM header is the plain text and byte-encoded metadata of a given BAM file. Information stored in the header are the number, length, and name of the reference sequences that reads were aligned to; version of software used; read group identifiers; etc. The BAM format also stipulates that the first block of any BAM file should be reserved just for the BAM header block.
- _io¶
opened BAM file object
- Type
file
- SAMheader¶
parsed dictionary of the SAM header
- Type
dict
- n_refs¶
number of references
- Type
int
- refs¶
reference names and lengths listed in the BAM header
- Type
dict
- class bamnostic.bam.BamReader(filepath_or_object, mode='rb', max_cache=128, index_filename=None, filename=None, check_header=False, check_sq=True, reference_filename=None, filepath_index=None, require_index=False, duplicate_filehandle=None, ignore_truncation=False)[source]¶
Bases:
bamnostic.bgzf.BgzfReader
The BAM reader. Heavily modified from Peter Cock’s BgzfReader.
- header¶
representation of header data (if present)
- lengths¶
lengths of references listed in header
- Type
list
ofint
- nocoordinate¶
number of reads that have no coordinates
- Type
int
- nreferences¶
number of references in header
- Type
int
- ref2tid¶
refernce names and refID dictionary
- Type
dict
ofstr
,int
- references¶
names of references listed in header
- Type
list
ofstr
- text¶
SAM header (if present)
- Type
str
- unmapped¶
number of unmapped reads
- Type
int
Note
This implementation is likely to change. While the API was meant to mirror pysam, it makes sense to include the pysam-like API in an extension that will wrap the core reader. This would be a major refactor, and therefore will not happen any time soon (30 May 2018).
- check_index(index_filename=None, req_idx=False)[source]¶
Checks to make sure index file is available. If not, it disables random access.
- Parameters
index_filename (str) – path to index file (BAI) if it does not fit naming convention (default: None).
req_idx (bool) – Raise error if index file is not present (default: False).
- Returns
True if index is present, else False
- Return type
(bool)
- Raises
IOError – If the index file is closed or index could not be opened
- Warns
UserWarning – If index could not be loaded. Random access is disabled.
Examples
>>> bam = bamnostic.AlignmentFile(bamnostic.example_bam) >>> bam.check_index(bamnostic.example_bam + '.bai') True
>>> bam.check_index('not_a_file.bai') False
- property nocoordinate¶
Get the number of reads without coordiantes according to the statistics recorded in the index.
- Returns
sum of reads without coordinates
- Return type
(int)
- property mapped¶
Get the number of mapped reads according to the statistics recorded in the index.
- Returns
sum of mapped reads
- Return type
(int)
- property unmapped¶
Get the number of unmapped reads without coordiantes and without coordinate.
- Returns
sum of unmapped reads and reads without coordinates
- Return type
(int)
- property references¶
Get all references used in alignment.
- Returns
reference names
- Return type
(
tuple
ofstr
)
- property nreferences¶
Get the number of references listed in alignment
- Returns
count of references
- Return type
(int)
- property lengths¶
Get all reference lengths used in alignment.
- Returns
reference lengths
- Return type
(
tuple
ofint
)
- has_index()[source]¶
Checks if file has index and it is open
- Returns
True if present and opened, else False
- Return type
bool
- fetch(contig=None, start=None, stop=None, region=None, tid=None, until_eof=False, multiple_iterators=False, reference=None, end=None)[source]¶
Creates a generator that returns all reads within the given region. (inclusive, exclusive)
- Parameters
contig (str) – name of reference/contig
start (int) – start position of region of interest (0-based)
stop (int) – stop position of region of interest (0-based)
region (str) – SAM region formatted string. Accepts tab-delimited values as well
tid (int) – the refID or target id of a reference/contig
until_eof (bool) – iterate until end of file
mutiple_iterators (bool) – allow multiple iterators over region. Not Implemented. Notice: each iterator will open up a new view into the BAM file, so overhead will apply.
reference (str) – synonym for contig
end (str) – synonym for stop
- Yields
reads over the region of interest if any
- Raises
ValueError – if the genomic coordinates are out of range or invalid
KeyError – Reference is not found in header
Note
SAM region formatted strings take on the following form: ‘chr1:100000-200000’
- Usage:
AlignmentFile.fetch(contig='chr1', start=1, stop= 1000) AlignmentFile.fetch('chr1', 1, 1000) AlignmentFile.fetch('chr1:1-1000') AlignmentFile.fetch('chr1', 1) AlignmentFile.fetch('chr1')
Examples
>>> bam = bamnostic.AlignmentFile(bamnostic.example_bam, 'rb') >>> next(bam.fetch('chr1', 1, 100)) EAS56_57:6:190:289:82 ... MF:C:192
>>> next(bam.fetch('chr10', 1, 10)) Traceback (most recent call last): ... KeyError: 'chr10 was not found in the file header'
>>> next(bam.fetch('chr1', 1700, 1701)) Traceback (most recent call last): ... ValueError: Genomic region out of bounds.
>>> next(bam.fetch('chr1', 100, 10)) Traceback (most recent call last): ... AssertionError: Malformed region: start should be <= stop, you entered 100, 10
- count(contig=None, start=None, stop=None, region=None, until_eof=False, tid=None, read_callback='nofilter', reference=None, end=None)[source]¶
Count the number of reads in the given region
Note: this counts the number of reads that overlap the given region.
Can potentially make use of a filter for the reads (or custom function that returns True or False for each read).
- Parameters
contig (str) – the reference name (Default: None)
reference (str) – synonym for contig (Default: None)
start (int) – 0-based inclusive start position (Default: None)
stop (int) – 0-based exclusive start position (Default: None)
end (int) – Synonym for stop (Default: None)
region (str) – SAM-style region format. Example: ‘chr1:10000-50000’ (Default: None)
until_eof (bool) – count number of reads from start to end of file Note, this can potentially be an expensive operation. (Default: False)
read_callback (str|function) –
select (or create) a filter of which reads to count. Built-in filters:
all: skips reads that contain the following flags:
0x4 (4): read unmapped
0x100 (256): not primary alignment
0x200 (512): QC Fail
0x400 (1024): PCR or optical duplicate
nofilter: uses all reads (Default)
- The user can also supply a custom function that
returns boolean objects for each read
- Returns
count of reads in the given region that meet parameters
- Return type
(int)
- Raises
ValueError – if genomic coordinates are out of range or invalid or random access is disabled
RuntimeError – if read_callback is not properly set
KeyError – Reference is not found in header
AssertionError – if genomic region is malformed
Notes
SAM region formatted strings take on the following form: ‘chr1:100000-200000’
- Usage:
AlignmentFile.count(contig=’chr1’, start=1, stop= 1000) AlignmentFile.count(‘chr1’, 1, 1000) AlignmentFile.count(‘chr1:1-1000’) AlignmentFile.count(‘chr1’, 1) AlignmentFile.count(‘chr1’)
Example
>>> bam = bamnostic.AlignmentFile(bamnostic.example_bam, 'rb') >>> bam.count('chr1', 1, 100) 2
>>> bam.count('chr1', 1, 100, read_callback='all') 1
>>> bam.count('chr10', 1, 10) Traceback (most recent call last): ... KeyError: 'chr10 was not found in the file header'
>>> bam.count('chr1', 1700, 1701) Traceback (most recent call last): ... ValueError: Genomic region out of bounds.
>>> bam.count('chr1', 100, 10) Traceback (most recent call last): ... AssertionError: Malformed region: start should be <= stop, you entered 100, 10
- count_coverage(contig=None, start=None, stop=None, region=None, quality_threshold=15, read_callback='all', reference=None, end=None, base_quality_threshold=0)[source]¶
Counts the coverage of each base supported by a read the given interval.
Given an interval (inclusive, exclusive), this method pulls each read that overlaps with the region. To ensure that the read truly overlaps with the region, the CIGAR string is required. These reads can further be filtered out by their flags, MAPQ qualities, or custom filtering function. Using the CIGAR string, the aligned portion of the read is traversed and the presence of each nucleotide base is tallied into respective arrays. Additionally, the user can choose to filter the counted bases based on its base quality score that is stored in the quality string.
- Parameters
contig (str) – the reference name (Default: None)
reference (str) – synonym for contig (Default: None)
start (int) – 0-based inclusive start position (Default: None)
stop (int) – 0-based exclusive start position (Default: None)
end (int) – Synonym for stop (Default: None)
region (str) – SAM-style region format. Example: ‘chr1:10000-50000’ (Default: None)
quality_threshold (int) – MAPQ quality threshold (Default: 15)
base_quality_threshold (int) – base quality score threshold (Default: 0)
read_callback (str|function) –
select (or create) a filter of which reads to count. Built-in filters:
- all: skips reads that contain the following flags:
0x4 (4): read unmapped
0x100 (256): not primary alignment
0x200 (512): QC Fail
0x400 (1024): PCR or optical duplicate
nofilter: uses all reads (Default)
The user can also supply a custom function that returns boolean objects for each read
- Returns
Four arrays in the order of A, C, G, T
- Return type
(
array.array
)- Raises
ValueError – if genomic coordinates are out of range or invalid, random access is disabled, or nucleotide base is unrecognized
RuntimeError – if read_callback is not properly set
KeyError – Reference is not found in header
AssertionError – if genomic region is malformed
Notes
SAM region formatted strings take on the following form: ‘chr1:100-200’
- Usage:
AlignmentFile.count_coverage(contig=’chr1’, start=1, stop= 1000) AlignmentFile.count_coverage(‘chr1’, 1, 1000) AlignmentFile.count_coverage(‘chr1:1-1000’) AlignmentFile.count_coverage(‘chr1’, 1) AlignmentFile.count_coverage(‘chr1’)
Examples
>>> bam = bamnostic.AlignmentFile(bamnostic.example_bam, 'rb') >>> for arr in bam.count_coverage('chr1', 100, 150): ... print("array('{}', {})".format(arr.typecode, list(map(int, arr.tolist())))) array('L', [0, 0, 0, 0, ..., 0, 0, 0, 0, 0]) array('L', [0, 0, 0, 0, ..., 15, 0, 14, 0, 0]) array('L', [1, 1, 2, 2, ..., 0, 0, 0, 0, 14]) array('L', [0, 0, 0, 0, ..., 0, 15, 0, 14, 0])
>>> for arr in bam.count_coverage('chr1', 100, 150, quality_threshold=20, base_quality_threshold=25): ... print("array('{}', {})".format(arr.typecode, list(map(int, arr.tolist())))) array('L', [0, 0, 0, 0, ..., 0, 0, 0, 0, 0]) array('L', [0, 0, 0, 0, ..., 13, 0, 14, 0, 0]) array('L', [1, 1, 2, 2, ..., 0, 0, 0, 0, 13]) array('L', [0, 0, 0, 0, ..., 0, 14, 0, 13, 0])
- get_index_stats()[source]¶
Inspects the index file (BAI) for alignment statistics.
Every BAM index file contains metrics regarding the alignment process for the given BAM file. The stored data are the number of mapped and unmapped reads for a given reference. Unmapped reads are paired end reads where only one part is mapped. Additionally, index files also contain the number of unplaced unmapped reads. This is stored within the nocoordinate instance attribute (if present).
- Returns
list of tuples for each reference in the order seen in the header. Each tuple contains the number of mapped reads, unmapped reads, and the sum of both.
- Return type
idx_stats (
list
oftuple
)- Raises
AssertionError – if the index file is not available
Example
>>> bam = bamnostic.AlignmentFile(bamnostic.example_bam, 'rb') >>> bam.get_index_stats() [(1446, 18, 1464), (1789, 17, 1806)]
>>> bam_no_bai = bamnostic.AlignmentFile(bamnostic.example_bam, 'rb', index_filename='not_a_file.bai') >>> bam_no_bai.get_index_stats() Traceback (most recent call last): ... AssertionError: No index available
- is_valid_tid(tid)[source]¶
Return True if TID/RefID is valid.
- Returns
True if TID/refID is valid, else False
Example
>>> bam = bamnostic.AlignmentFile(bamnostic.example_bam, 'rb') >>> bam.is_valid_tid(0) True
>>> bam.is_valid_tid(10) # because there are only 2 in this file False
- get_reference_name(tid)[source]¶
Convert TID/refID to reference name.
The TID/refID is the position a reference sequence is seen within the header file of the BAM file. The references are sorted by ASCII order. Therefore, for a Homo sapien aligned to GRCh38, ‘chr10’ comes before ‘chr1’ in the header. Therefore, ‘chr10’ would have the TID/refID of 0, not ‘chr1’.
- Parameters
tid (int) – TID/refID of desired reference/contig
- Returns
String representation of chromosome if valid, else None
- Raises
KeyError – if TID/refID is not valid
Examples
>>> bam = bamnostic.AlignmentFile(bamnostic.example_bam, 'rb') >>> bam.get_reference_name(0) 'chr1'
>>> bam.get_reference_name(10) Traceback (most recent call last): ... KeyError: '10 is not a valid TID/refID for this file.'
- get_tid(reference)[source]¶
Convert reference/contig name to refID/TID.
The TID/refID is the position a reference sequence is seen within the header file of the BAM file. The references are sorted by ASCII order. Therefore, for a Homo sapien aligned to GRCh38, ‘chr10’ comes before ‘chr1’ in the header. Therefore, ‘chr10’ would have the TID/refID of 0, not ‘chr1’.
- Parameters
reference (str) – reference/contig name
- Returns
the TID/refID of desired reference/contig
- Return type
(int)
- Raises
KeyError – if reference name not found file header
Example
>>> bam = bamnostic.AlignmentFile(bamnostic.example_bam, 'rb') >>> bam.get_tid('chr1') 0
>>> bam.get_tid('chr10') Traceback (most recent call last): ... KeyError: 'chr10 was not found in the file header'
- mate(AlignedSegment)[source]¶
Gets the mate to a given AlignedSegment.
Note
Slow, when compared to the C-API. Not meant for high-throughput analysis.
Does not advance current iterator position.
- Parameters
AlignedSegment (
bamnostic.AlignedSegment
) – a bamnostic AlignedSegment read with a mate- Returns
if read has a valid mate, else None
- Return type
(
bamnostic.AlignedSegment
)- Raises
ValueError – if AlignedSegment is unpaired
- head(n=5, multiple_iterators=False)[source]¶
List out the first n reads of the file.
This method is primarily used when doing an initial exploration of the data. Whether or not multiple_iterators is used, cursor position within the file will not change.
Note
Using multiple_interators opens a new file object of the same file currently in use and, thus, impacts the memory footprint of your analysis.
- Parameters
n (int) – number of aligned reads to print (default: 5)
mutliple_iterators (bool) – Whether to use current file object or create a new one (default: False).
- Returns
list of n reads from the front of the BAM file
- Return type
head_reads (
list
ofAlignedSegment
)
Example
>>> bam = bamnostic.AlignmentFile(bamnostic.example_bam, 'rb') >>> bam.head(n=5, multiple_iterators = False)[0] EAS56_57:6:190:289:82 ... MF:C:192
>>> bam.head(n = 5)[1] EAS56_57:6:190:289:82 ... UQ:C:0
- class bamnostic.bam.BamWriter(filepath_or_object, mode='xb', compresslevel=6, ignore_overwrite=False, copy_header=None, header=b'', reference_names=None, reference_lengths=None)[source]¶
Bases:
bamnostic.bgzf.BgzfWriter
- write_header(copy_header=None, header=b'', reference_names=None, reference_lengths=None)[source]¶
Modified code from Peter Cock’s [BamWriter](https://github.com/peterjc/biopython/blob/SamBam2015/Bio/Sequencing/SamBam/__init__.py#L1714)
Writes the header into a BAM file
- Parameters
copy_header (filepath_or_object) – copy the header from another BAM file
header (bytes) – SAM-style header
reference_names (
list
ofstr
) – list of all reference names used in filereference_lengths (
list
ofint
) – list of all reference lengths used in file
- bamnostic.bam.unpack_int32(buffer, /)¶
Return a tuple containing unpacked values.
Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.
See help(struct) for more on format strings.
- bamnostic.bam.unpack_int32L(buffer, /)¶
Return a tuple containing unpacked values.
Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.
See help(struct) for more on format strings.
BAM Index (BAI) file handler (bamnostic.bai
)¶
BAI file parser
Copyright © 2018, Marcus D. Sherman
This code is part of the bamnostic distribution and governed by its license. Please see the LICENSE file that should have been included as part of this package.
The Binary Alignment Map (BAM) format (defined at https://samtools.github.io/hts-specs/SAMv1.pdf) allows for indexing. When the user invokes a tool to index a given BAM file, a BAM index (BAI) file is created. Generally speaking, a BAI contains the all the virtual offsets of clusters of reads that are associated with specific subsections of the BAM file. When the BAI is produced, random access to the BAM is available.
This script is for parsing the binary encoded BAI file, inherently making it human readable. Furthermore, it allows subsections of the BAI to be loaded into memory, reducing the memory footprint and speeding up queries within a small number of references. Lastly, by parsing it as such, random access queries directly into the associated BAM file is available to other tools within bamnostic
- bamnostic.bai.unpack_chunk(buffer, /)¶
Return a tuple containing unpacked values.
Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.
See help(struct) for more on format strings.
- bamnostic.bai.unpack_intervals(buffer, /)¶
Return a tuple containing unpacked values.
Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.
See help(struct) for more on format strings.
- bamnostic.bai.unpack_bid_nchunk(buffer, /)¶
Return a tuple containing unpacked values.
Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.
See help(struct) for more on format strings.
- bamnostic.bai.unpack_unmapped(buffer, /)¶
Return a tuple containing unpacked values.
Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.
See help(struct) for more on format strings.
- class bamnostic.bai.RefIdx(start_offset, end_offset, n_bins)¶
Bases:
tuple
- property end_offset¶
Alias for field number 1
- property n_bins¶
Alias for field number 2
- property start_offset¶
Alias for field number 0
- bamnostic.bai.reg2bin(rbeg, rend)[source]¶
Finds the largest superset bin of region. Numeric values taken from hts-specs
- Parameters
rbeg (int) – inclusive beginning position of region
rend (int) – exclusive end position of region
- Returns
distinct bin ID for largest superset bin of region
- Return type
(int)
- bamnostic.bai.reg2bins(rbeg, rend)[source]¶
Generates bin ids which overlap the specified region.
- Parameters
rbeg (int) – inclusive beginning position of region
rend (int) – exclusive end position of region
- Yields
(int) – bin IDs for overlapping bins of region
- Raises
AssertionError (Exception) – if the range is malformed or invalid
- class bamnostic.bai.Bai(filename)[source]¶
Bases:
object
This class defines the bam index file object and its interface.
The purpose of this class is the binary parsing of the bam index file (BAI) associated with a given bam file. When queried, the Bai object identifies the bins of data that overlap the requested region and directs which parts of the bam file contain it.
Virtual offsets are processed using the following method: Beginning of compressed block = coffset = virtual offset >> 16 Position within uncompressed block = uoffset = virtual offset ^ (coffset << 16)
- _io¶
opened BAI file object
- Type
fileObject
- _LINEAR_INDEX_WINDOW¶
constant of the linear interval window size
- Type
int
- _UNMAP_BIN¶
constant for bin ID of unmapped read stats
- Type
int
- magic¶
first 4 bytes of file. Must be equal to b’BAI’
- Type
bytes
- n_refs¶
number of references in BAI
- Type
int
- unmapped¶
dictionary of the unmapped read stats by each reference
- Type
dict
- current_ref¶
dictionary of the current reference loaded into memory. It contains the a dictionary of bin IDs and their respective chunks, and a list of linear intervals.
- Type
None|dict
- ref_indices¶
dictionary of reference ids and their start/stop offsets within the BAI file
- Type
dict
- n_no_coord¶
if present in BAI, is the number of reads that have no coordinates
- Type
None|int
- _last_pos¶
used for indexing, the byte position of the file head.
- Type
int
- get_chunks(n_chunks)[source]¶
Simple generator for unpacking chunk data
Chunks are defined as groupings of reads within a BAM file that share the same bin. A Chunk object in the context of this function is a namedtuple that contains the virtual offsets for the beginning and end of each of these chunks.
Note: a special case of a chunk is in any Bin labeled as 37450. These bins always contain 2 chunks that provide the statistics of the number of reads that are unmapped to that reference.
- Parameters
n_chunks (int) – number of chunks to be unpacked from stream
- Returns
- a list of Chunk objects with the attributes of
chunks[i] are .voffset_beg and voffset_end
- Return type
chunks (list)
- get_ints(n_int)[source]¶
Unpacks n_int number of interval virtual offsets from stream
A linear interval is defined as a 16384 bp window along a given reference. The value stored in the BAI is the virtual offset of the first read within that given interval. This virtual offset is the byte offset (coffset) of the start of the BGZF block that contains the beginning of the read and the byte offset (uoffset) within the uncompressed data of the residing BGZF block to that first read.
Note: a caveat to using linear interval with long reads: A long read can span multiple linear intervals. As such, the current encoding could potentially shift the expected region of interest to the left more than expected.
- Parameters
n_int (int) – number of intervals to unpack
- Returns
list of virtual offsets for n_int number of linear intervals
- Return type
intervals (list)
- get_bins(n_bins, ref_id=None, idx=False)[source]¶
Simple function that iteratively unpacks the data of a number (n_bin) of bins.
As the function discovers the number of chunks needed for a given bin, it deletages work to self.get_chunks(n_chunks). A bin is comprised of 2 parts: 1) the distinct bin ID (within a given reference). If no reads are associated with that bin, it is left out of the indexing process, and therefore not represented in the BAI file. Furthermore, while each bin has a bin ID, the bin IDs are only distinct within a given reference. This means that 2 or more references can have the same bin IDs. These bin IDs are also not in any order as they are essentially a hash dump. Lastly, the only reserved bin ID is 37450. This bin relates to 2 chunks that contain the number of unmapped and mapped reads for a given reference. 2) the chunk(s) of reads that are assigned to a given bin.
As a secondary feature, this function will also quickly seek over regions for the purposes of documenting the start and stop byte offsets of a given reference block within the file. This is invoked by setting idx=True
- Parameters
n_int (int) – number of bins to be unpacked from stream
- Returns
- None if just indexing the index file or a dictionary
of bin_id: chunks pairs
- Return type
bins (None | dict)
- Raises
AssertionError (Exception) – if bin 37450 does not contain 2 chunks exactly
- get_ref(ref_id=None, idx=False)[source]¶
Interatively unpacks all the bins, linear intervals, and chunks for a given reference
A reference is comprised of 2 things: 1) a series of bins that reference chunks of aligned reads that are grouped within that bin. 2) a series of virtual offsets of the first read of a 16384 bp window along the given reference.
This function also serves to “index” the BAI file such that, if it is invoked by setting ids=True, will do a single pass through the BAI file and saving the start and stop offsets of each of the references. This is used for minimizing the memory footprint of storing the BAI in memory. When queried against, the appropriate reference block will be loaded. Because of this constant loading, functools.lru_cache was applied to cache recently used reference blocks to speed up computation. It is assumed that when querying is done, most users are looking and just a few references at a time.
- Parameters
ref_id (None|int) – used for random access or indexing the BAI
idx (bool) – Flag for setting whether or not to run an index of the BAI
- Returns
namedtuple containing the byte offsets of the reference start, stop, and number of bins or Ref: namedtuple containing a dictionary of bins and list of linear intervals
- Return type
- Raises
AssertionError (Exception) – if, when random access is used, the current reference offset does not match indexed reference offset.
- query(ref_id, start, stop=- 1)[source]¶
Main query function for determining seek offset to BAM section that AlignedRead objects from specified region start
- Parameters
ref (int) – which reference/chromosome TID
start (int) – left most bp position of region (zero-based)
stop (int) – right most bp position of region (zero-based)
- Returns
- the voffset_beg of the first chunk given the chunk’s voffset_end
is greater than the voffset of the linear index that overlaps the region of interest’s start offset
- Return type
(int)
- seek(offset=None, whence=0)[source]¶
Simple seek function for binary files
- Parameters
offset (None|int) – byte offset from whence to move the file head to.
whence (int) – 0 := from start of file, 1:= from current position, 2:= from end of file
- Returns
new byte position of file head
- Return type
(int)
- Raises
ValueError (Exception) – if the offset is not an integer or is not provided
- bamnostic.bai.unpack_int32(buffer, /)¶
Return a tuple containing unpacked values.
Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.
See help(struct) for more on format strings.
- bamnostic.bai.unpack_int32L(buffer, /)¶
Return a tuple containing unpacked values.
Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.
See help(struct) for more on format strings.
Coordinate-Sorted Index (CSI) file handler (bamnostic.csi
)¶
CSI file parser
Copyright © 2018, Marcus D. Sherman
This code is part of the bamnostic distribution and governed by its license. Please see the LICENSE file that should have been included as part of this package.
The Binary Alignment Map (CSI) format (defined at http://samtools.github.io/hts-specs/CSIv1.pdf) allows for indexing. When the user invokes a tool to index a given BAM file, a CSI index file can be created. Generally speaking, a CSI contains the all the virtual offsets of clusters of reads that are associated with specific subsections of the BAM file. When the CSI is produced, random access to the BAM is available.
This script is for parsing the binary encoded CSI file, inherently making it human readable. Furthermore, it allows subsections of the CSI to be loaded into memory, reducing the memory footprint and speeding up queries within a small number of references. Lastly, by parsing it as such, random access queries directly into the associated BAM file is available to other tools within bamnostic
- bamnostic.csi.unpack_chunk(buffer, /)¶
Return a tuple containing unpacked values.
Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.
See help(struct) for more on format strings.
- bamnostic.csi.unpack_bid_loffset_nchunk(buffer, /)¶
Return a tuple containing unpacked values.
Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.
See help(struct) for more on format strings.
- bamnostic.csi.unpack_unmapped(buffer, /)¶
Return a tuple containing unpacked values.
Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.
See help(struct) for more on format strings.
- bamnostic.csi.reg2bin_csi(rbeg, rend, min_shift=14, depth=5)[source]¶
Finds the largest superset bin of region. Numeric values taken from hts-specs
- Parameters
rbeg (int) – inclusive beginning position of region
rend (int) – exclusive end position of region
- Returns
distinct bin ID for largest superset bin of region
- Return type
(int)
- bamnostic.csi.reg2bins_csi(rbeg, rend, min_shift=14, depth=5)[source]¶
Generates bin ids which overlap the specified region.
- Parameters
rbeg (int) – inclusive beginning position of region
rend (int) – exclusive end position of region
- Yields
(int) – bin IDs for overlapping bins of region
- Raises
AssertionError (Exception) – if the range is malformed or invalid
- class bamnostic.csi.Csi(filename)[source]¶
Bases:
bamnostic.bai.Bai
This class defines the bam CSI index file object and its interface.
The purpose of this class is the binary parsing of the bam index file (CSI) associated with a given bam file. When queried, the Csi object identifies the bins of data that overlap the requested region and directs which parts of the bam file contain it.
Virtual offsets are processed using the following method: Beginning of compressed block = coffset = virtual offset >> 16 Position within uncompressed block = uoffset = virtual offset ^ (coffset << 16)
- _io¶
opened CSI file object
- Type
fileObject
- _UNMAP_BIN¶
constant for bin ID of unmapped read stats
- Type
int
- magic¶
first 4 bytes of file. Must be equal to b’CSI’
- Type
bytes
- n_refs¶
number of references in CSI
- Type
int
- unmapped¶
dictionary of the unmapped read stats by each reference
- Type
dict
- current_ref¶
dictionary of the current reference loaded into memory. It contains the a dictionary of bin IDs and their respective chunks, and a list of linear intervals.
- Type
None|dict
- ref_indices¶
dictionary of reference ids and their start/stop offsets within theCSI file
- Type
dict
- n_no_coord¶
if present inCSI, is the number of reads that have no coordinates
- Type
None|int
- _last_pos¶
used for indexing, the byte position of the file head.
- Type
int
- get_chunks(n_chunks)[source]¶
Simple generator for unpacking chunk data
Chunks are defined as groupings of reads within a BAM file that share the same bin. A Chunk object in the context of this function is a namedtuple that contains the virtual offsets for the beginning and end of each of these chunks.
Note: a special case of a chunk is in any Bin labeled as 37450. These bins always contain 2 chunks that provide the statistics of the number of reads that are unmapped to that reference.
- Parameters
n_chunks (int) – number of chunks to be unpacked from stream
- Returns
- a list of Chunk objects with the attributes of
chunks[i] are .voffset_beg and voffset_end
- Return type
chunks (list)
- get_bins(n_bins, ref_id=None, idx=False)[source]¶
Simple function that iteratively unpacks the data of a number (n_bin) of bins.
As the function discovers the number of chunks needed for a given bin, it deletages work to self.get_chunks(n_chunks). A bin is comprised of 2 parts: 1) the distinct bin ID (within a given reference). If no reads are associated with that bin, it is left out of the indexing process, and therefore not represented in theCSI file. Furthermore, while each bin has a bin ID, the bin IDs are only distinct within a given reference. This means that 2 or more references can have the same bin IDs. These bin IDs are also not in any order as they are essentially a hash dump. Lastly, the only reserved bin ID is 37450. This bin relates to 2 chunks that contain the number of unmapped and mapped reads for a given reference. 2) the chunk(s) of reads that are assigned to a given bin.
As a secondary feature, this function will also quickly seek over regions for the purposes of documenting the start and stop byte offsets of a given reference block within the file. This is invoked by setting idx=True
- Parameters
n_int (int) – number of bins to be unpacked from stream
- Returns
- None if just indexing the index file or a dictionary
of bin_id: chunks pairs
- Return type
bins (None | dict)
- Raises
AssertionError (Exception) – if bin 37450 does not contain 2 chunks exactly
- get_ref(ref_id=None, idx=False)[source]¶
Iteratively unpacks all the bins, linear intervals, and chunks for a given reference
A reference is comprised of 2 things: 1) a series of bins that reference chunks of aligned reads that are grouped within that bin. 2) a series of virtual offsets of the first read of a 16384 bp window along the given reference.
This function also serves to “index” theCSI file such that, if it is invoked by setting ids=True, will do a single pass through theCSI file and saving the start and stop offsets of each of the references. This is used for minimizing the memory footprint of storing theCSI in memory. When queried against, the appropriate reference block will be loaded. Because of this constant loading, functools.lru_cache was applied to cache recently used reference blocks to speed up computation. It is assumed that when querying is done, most users are looking and just a few references at a time.
- Parameters
ref_id (None|int) – used for random access or indexing theCSI
idx (bool) – Flag for setting whether or not to run an index of theCSI
- Returns
namedtuple containing the byte offsets of the reference start, stop, and number of bins o Ref: namedtuple containing a dictionary of bins and list of linear intervals
- Return type
- Raises
AssertionError (Exception) – if, when random access is used, the current reference offset does not match indexed reference offset.
- query(ref_id, start, stop=- 1)[source]¶
Main query function for determining seek offset to BAM section that AlignedRead objects from specified region start
- Parameters
ref (int) – which reference/chromosome TID
start (int) – left most bp position of region (zero-based)
stop (int) – right most bp position of region (zero-based)
- Returns
- the voffset_beg of the first chunk given the chunk’s voffset_end
is greater than the voffset of the linear index that overlaps the region of interest’s start offset
- Return type
(int)
- bamnostic.csi.unpack_int32(buffer, /)¶
Return a tuple containing unpacked values.
Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.
See help(struct) for more on format strings.
- bamnostic.csi.unpack_int32L(buffer, /)¶
Return a tuple containing unpacked values.
Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.
See help(struct) for more on format strings.
BAMnostic Utilities (bamnostic.utils
)¶
- bamnostic.utils.format_warnings(message, category, filename, lineno, file=None, line=None)[source]¶
Sets STDOUT warnings
- Parameters
message – the unformatted warning message being reported
category (str) – the level of warning (handled by warnings module)
filename (str) – filename for logging purposes (defaults to STDOUT)
lineno (int) – where the error occurred.
- Returns
formatted warning string
- bamnostic.utils.unpack_int32(buffer, /)¶
Return a tuple containing unpacked values.
Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.
See help(struct) for more on format strings.
- bamnostic.utils.unpack_int32L(buffer, /)¶
Return a tuple containing unpacked values.
Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.
See help(struct) for more on format strings.
- class bamnostic.utils.Roi(contig=None, start=None, stop=None, tid=None)[source]¶
Bases:
object
Small __slots__ class for region of interest parsing
- bamnostic.utils.flag_decode(flag_code)[source]¶
Simple read alignment flag decoder
Every read within a BAM file ought to have an associated flag code. Theses flags are used for read filtering and QC. The flags are described below. Additionally, they can be found here
Any given read’s flag is determined by the or (|) operand of all appropriate bit flags.
- Parameters
flag_code (int) – either a standalone integer/bit flag or the read object itself
- Returns
list of flag and flag description tuples.
- Return type
(
list
oftuple
)- Raises
ValueError – if provided flag is not a valid entry
Example
If a flag is 516 it is comprised of flag 4 and flag 512
>>> flag_decode(516) [(4, 'read unmapped'), (512, 'QC fail')]
Int
Bit
Description
1
0x1
Template having multiple segments in sequencing
2
0x2
Each segment properly aligned according to the aligner
4
0x4
Segment unmapped
8
0x8
Next segment in the template unmapped
16
0x10
SEQ being reverse complemented
32
0x20
SEQ of the next segment in the template being reverse complemented
64
0x40
The first segment in the template
128
0x80
The last segment in the template
256
0x100
Secondary alignment
512
0x200
Not passing filters, such as platform/vendor quality controls
1024
0x400
PCR or optical duplicate
2048
0x800
Supplementary alignment
- bamnostic.utils.parse_region(contig=None, start=None, stop=None, region=None, tid=None, reference=None, end=None, until_eof=False)[source]¶
Parses region information from all user set parameters.
The main goal of this function is to handle the many different ways a user can put in genomic region data. One way is through keyword arguments. This is the most straight forward. However, due to Pysam’s API, there are multiple synonyms for reference/contig or stop/end. Additionally, if the user knows the refID/TID of the reference they are interested in, they can input that way as well.
The second form a submission can take is through positional arguments. Just like keyword, but ordered such that it makes up a genomic region of interest.
Note
Positional arguments make utilizing the tid parameter difficult since it is the 5th argument of the function signature.
The third form a submission can take is through using a SAM-formatted string. An example SAM-formatted string looks like ‘chr1:10-100’. As many users also copy & paste directly from tab-delimited files, such as BED files, a SAM-formatted string can take the form of ‘chr1\t10\t100’ where ‘\t’ indicates a tab space.
Lastly, the until_eof switch allows users to take all items from their desired start position (be it the whole reference or a specific spot on the reference). Setting this to True (default: False) will pull all reads to the end of the reference or file, whichever is first.
- Parameters
contig (str) – name of reference/contig
start (int) – start position of region of interest (0-based)
stop (int) – stop position of region of interest (0-based)
region (str) – SAM region formatted string. Accepts tab-delimited values as well
tid (int) – the refID or target id of a reference/contig
until_eof (bool) – iterate until end of file (default: False)
- Returns
region of interest formatted as an object with named attributes.
- Return type
query (
bamnostic.utils.Roi
)- Raises
ValueError – if two synonym keywords are set, but contradict each other
Examples
# Keyword-based >>> parse_region(contig = 'chr1', start = 10, stop = 100) Roi(contig: chr1, start: 10, stop: 100) # Using `tid` instead of `contig` >>> parse_region(tid = 0, start = 10, stop = 100) Roi(tid: 0, start: 10, stop: 100) # Positional arguments >>> parse_region('chr1', 10, 100) Roi(contig: chr1, start: 10, stop: 100) # SAM-formatted string (keyword) >>> parse_region(region = 'chr1:10-100') Roi(contig: chr1, start: 10, stop: 100) # SAM-formatted string (positional) >>> parse_region('chr1:10-100') Roi(contig: chr1, start: 10, stop: 100) # Tab-delimited region string >>> parse_region('chr1\t10\t100') Roi(contig: chr1, start: 10, stop: 100) # Contradictory synonyms >>> parse_region(reference='chr1', contig='chr10', start=10, stop = 100) Traceback (most recent call last): ... ValueError: either contig or reference must be set, not both
- bamnostic.utils.unpack(fmt, _io, is_array=False)[source]¶
Utility function for unpacking binary data from file object or byte stream.
The only difference between this method and struct.unpack is that unpack dynamically determines the size needed to read in based on the format string. Additionally, it can process a file object or byte stream and implement a read or slice (respectively). Mainly, this is a quality of life function.
- Parameters
fmt (str) – the string format of the binary data to be unpacked
_io – built-in binary format reader (default: io.BufferedRandom)
- Returns
unpacked contents from _io based on fmt string
- bamnostic.utils.make_virtual_offset(block_start_offset, within_block_offset)[source]¶
Compute a BGZF virtual offset from block start and within block offsets.
The BAM indexing scheme records read positions using a 64 bit ‘virtual offset’, comprising in C terms:
block_start_offset << 16 | within_block_offset
Here block_start_offset is the file offset of the BGZF block start (unsigned integer using up to 64-16 = 48 bits), and within_block_offset within the (decompressed) block (unsigned 16 bit integer).
>>> make_virtual_offset(0, 0) 0
>>> make_virtual_offset(0, 1) 1
>>> make_virtual_offset(0, 2**16 - 1) 65535
>>> make_virtual_offset(0, 2**16) Traceback (most recent call last): ... ValueError: Require 0 <= within_block_offset < 2**16, got 65536
- bamnostic.utils.split_virtual_offset(virtual_offset)[source]¶
Divides a 64-bit BGZF virtual offset into block start & within block offsets.
>>> (100000, 0) == split_virtual_offset(6553600000) True
>>> (100000, 10) == split_virtual_offset(6553600010) True
- class bamnostic.utils.LruDict(*args, **kwargs)[source]¶
Bases:
collections.OrderedDict
Simple least recently used (LRU) based dictionary that caches a given number of items.
- get(key)[source]¶
Basic getter that renews LRU status upon inspection
- Parameters
key (str) – immutable dictionary key
- bamnostic.utils.parse_cigar(cigar_str)[source]¶
Parses a CIGAR string and turns it into a list of tuples
- Parameters
cigar_str (str) – the CIGAR string as shown in SAM entry
- Returns
list of tuples of CIGAR operations (by id) and number of operations
- Return type
cigar_array (list)
- Raises
ValueError – if CIGAR operation is invalid
Examples
>>> parse_cigar('3M1I3M1D5M') [(('BAM_CMATCH', 0), 3), ..., (('BAM_CMATCH', 0), 5)]
- bamnostic.utils.check_cigar_arg(cigar)[source]¶
Checks to make sure CIGAR arugment is valid.
- Parameters
argument (str or
list
) – CIGAR string (pre-formatted or raw)- Returns
CIGAR re-formatted as a list
- Return type
(
list
)- Raises
ValueError – if CIGAR is not a string or pre-formatted list
- bamnostic.utils.cigar_changes(seq, cigar)[source]¶
- Recreates the reference sequence to the extent that the CIGAR string can
represent.
- Parameters
seq (str) – aligned segment sequence
cigar (list) – list of tuples of cigar operations (by id) and number of operations
- Returns
a version of the aligned segment’s reference sequence given the changes reflected in the cigar string
- Return type
cigar_formatted_ref (str)
- Raises
ValueError – if CIGAR operation is invalid
Examples
>>> cigar_changes('ACTAGAATGGCT', '3M1I3M1D5M') 'ACTGAATGGCT'
>>> cigar_changes('ACTAGAATGGCT', '3V1I3M1D5M') Traceback (most recent call last): ... ValueError: Invalid CIGAR operation (V).
- bamnostic.utils.md_changes(seq, md_tag)[source]¶
Recreates the reference sequence of a given alignment to the extent that the MD tag can represent.
Note
Used in conjunction with cigar_changes to recreate the complete reference sequence
- Parameters
seq (str) – aligned segment sequence
md_tag (str) – MD tag for associated sequence
- Returns
a version of the aligned segment’s reference sequence given the changes reflected in the MD tag
- Return type
ref_seq (str)
- Raises
ValueError – if MD tag is None
Example
>>> md_changes('CTTATATTGGCCTT', '3C4AT4') 'CTTCTATTATCCTT'
- bamnostic.utils.ref_gen(seq, cigar_string, md_tag)[source]¶
Recreates the reference sequence associated with the given segment.
Uses the CIGAR string and MD tag to recreate the reference sequence associated with the aligned segment. This is done without the need for looking up the reference genome. Example reads, MD tags, and CIGAR strings taken from David Tang’s Blog.
- Returns
generated reference sequence
- Return type
(str)
- Raises
KeyError – if read does not contain MD tag
Examples
# Only mismatches >>> seq = ‘CGATACGGGGACATCCGGCCTGCTCCTTCTCACATG’ >>> cigar = ‘36M’ >>> md = ‘1A0C0C0C1T0C0T27’ >>> ref_gen(seq, cigar, md) ‘CACCCCTCTGACATCCGGCCTGCTCCTTCTCACATG’
# Insertions and mismatches >>> seq = ‘GAGACGGGGTGACATCCGGCCTGCTCCTTCTCACAT’ >>> cigar = ‘6M1I29M’ >>> md = ‘0C1C0C1C0T0C27’ >>> ref_gen(seq, cigar, md) ‘CACCCCTCTGACATCCGGCCTGCTCCTTCTCACAT’
# Deletion and mismatches >>> seq = ‘AGTGATGGGGGGGTTCCAGGTGGAGACGAGGACTCC’ >>> cigar = ‘9M9D27M’ >>> md = ‘2G0A5^ATGATGTCA27’ >>> ref_gen(seq, cigar, md) ‘AGGAATGGGATGATGTCAGGGGTTCCAGGTGGAGACGAGGACTCC’
# Insertion, deletion, and mismatches >>> seq = ‘AGTGATGGGAGGATGTCTCGTCTGTGAGTTACAGCA’ >>> cigar = ‘2M1I7M6D26M’ >>> md = ‘3C3T1^GCTCAG26’ >>> ref_gen(seq, cigar, md) ‘AGGCTGGTAGCTCAGGGATGTCTCGTCTGTGAGTTACAGCA’
- bamnostic.utils.cigar_alignment(seq=None, cigar=None, start_pos=0, qualities=None, base_qual_thresh=0, query=False)[source]¶
Use the CIGAR to filter out all unaligned data bases
Any clipping results in the removal of those bases. If an insertion is seen in the CIGAR, those bases are removed from the sequence. If a deletion is seen in the CIGAR, those bases are padded with a period (‘.’) symbol.
- Parameters
seq (str) – string sequence of the aligned segment.
cigar (str) – the cigar string or cigartuple of the aligned segment.
start_pos (int) – the first aligned position of the read
qualities (
array.array
) – base quality array from read
- Yields
(
tuple
ofstr
andint
) – nucleotide base and index position of that base relative to reference
Example
>>> seq = 'AGTGATGGGAGGATGTCTCGTCTGTGAGTTACAGCA' >>> cigar = '2M1I7M6D26M' >>> start_position = 95 >>> c_a = cigar_alignment(seq, cigar, start_position) >>> next(c_a) ('A', 95)