bamnostic package

bamnostic is an OS-agnostic, Pure Python BAM file parser.

The purpose of bamnostic is to open up BAM query capabilities to all current OS environments. Special care was taken to allow bamnostic to run in all versions of Python 2.7 onward. Furthermore, bamnostic does not require any packages outside of the standard library. Therefore, bamnostic will run on all stable versions of PyPy.

Note

SAM and CRAM support is not yet implemented

The three main classes of bamnostic are:

  1. bamnostic.AlignmentFile: the BAM file handler

  2. bamnostic.AlignedSegment: an aligned read object interface

  3. bamnostic.bai.Bai: if the BAM file has an associated index file (preferred), this is the file handler for it.

Note

Within the scope of personal research, reading BAM files is the only fully supported IO. The skeleton for writing BAM files is present, just not connected.

This code is part of the bamnostic distribution and governed by its license. Please see the LICENSE file that should have been included as part of this package.

Core Module (bamnostic.core)

class bamnostic.core.Cigar(op_code, n_op, op_id, op_name)

Bases: tuple

namedtuple for handling CIGAR data

Parameters
  • op_code (int) – CIGAR operation index

  • n_op (int) – number of operations for a given op_code

  • op_id (str) – the string representation of the CIGAR representation (‘MIDNSHP=XB’)

  • op_name (str) – Longer string name for operation

property n_op

Alias for field number 1

property op_code

Alias for field number 0

property op_id

Alias for field number 2

property op_name

Alias for field number 3

bamnostic.core.offset_qual(qual_string)[source]

Offsets the ASCII-encoded quality string to represent PHRED score.

Every base that is in the alignment is assigned a Phred score. A Phred score (\(Q\)) is defined as \(Q=-10\log_{10}P\), where \(P\) is the base-calling error probability. Phred quality scores tend range from 10 to 60. These qualities are then offset by 33 and ASCII-encoded for readability and storage.

Parameters

qual_string (str or bytes) – Phred quality scores without offset

Returns

ASCII-encoded Phred scores offest by adding 33 to base score.

Return type

(str)

Examples

>>> qual_score = ''
>>> ''.join(offset_qual(qual_score))
'<<<7<<<;<<<<<<<<8;;<7;4<;<;;;;;94<;'
class bamnostic.core.AlignmentFile(filepath_or_object, mode='rb', max_cache=128, index_filename=None, filename=None, check_header=False, check_sq=True, reference_filename=None, filepath_index=None, require_index=False, duplicate_filehandle=None, ignore_truncation=False, compresslevel=6, ignore_overwrite=False, copy_header=None, header=b'', reference_names=None, reference_lengths=None)[source]

Bases: bamnostic.bam.BamReader, bamnostic.bam.BamWriter

Wrapper to allow drop in replacement for BAM functionality in a pysam-like API.

Parameters
  • filepath_or_object (str | file) – the path or file object of the BAM file

  • mode (str) – Mode for reading. BAM files are binary by nature (default: ‘rb’).

  • max_cache (int) – number of desired LRU cache size, preferably a multiple of 2 (default: 128).

  • index_filename (str) – path to index file (BAI) if it is named differently than the BAM file (default: None).

  • filename (str | file) – synonym for filepath_or_object

  • check_header (bool) – Obsolete method maintained for backwards compatibility (default: False)

  • check_sq (bool) – Inspect BAM file for @SQ entries within the header

  • reference_filename (str) – Not implemented. Maintained for backwards compatibility

  • filepath_index (str) – synonym for index_filename

  • require_index (bool) – require the presence of an index file or raise (default: False)

  • duplicate_filehandle (bool) – Not implemented. Raises warning if True.

  • ignore_truncation (bool) – Whether or not to allow trucated file processing (default: False).

class bamnostic.core.AlignedSegment(_io)[source]

Bases: object

Main class for handling reads within the BAM

property query_alignment_end

One past the final index position of the aligned portion of the read

query_alignment_* all refer to the portion of the read that was aligned, and therefore exclude clipping, but include insertions.

property query_alignment_start

The starting index of the aligned portion of the read

query_alignment_* all refer to the portion of the read that was aligned, and therefore exclude clipping, but include insertions.

property query_name

Alias for the read name

property query_alignment_sequence

The sequence of the aligned portion of the read

query_alignment_* all refer to the portion of the read that was aligned, and therefore exclude clipping, but include insertions.

property query_alignment_length

The length of the aligned portion of the read

query_alignment_* all refer to the portion of the read that was aligned, and therefore exclude clipping, but include insertions.

property reference_start

The starting index of the aligned portion of the read

reference_ here means the portion of the read that was aligned according to the reference. Therefore, does not include insertions and clipping

property reference_end

One past the final index position of the aligned portion of the read

reference_ here means the portion of the read that was aligned according to the reference. Therefore, does not include insertions and clipping

property reference_length

The length of the aligned portion of the read

reference_ here means the portion of the read that was aligned according to the reference. Therefore, does not include insertions and clipping

property query_sequence

The sequence of the read

query_ here means the query as seen in the BAM file, and therefore includes clipping

property query_length

The length of sequence of the read

query_ here means the query as seen in the BAM file, and therefore includes clipping

property next_reference_id

Return the reference id of mate read

property next_reference_name

Return the reference name of mate read

property next_reference_start

Return the reference name of mate read

property is_duplicate

If the read is set as PCR or optical duplicate according to read flag

property is_paired

If the read is set as paired according to read flag

property is_proper_pair

If each segment properly aligned according to the aligner

property is_qcfail

If the read is set as QC fail according to read flag

property is_read1

If the read is set as the first read of mate pair according to read flag (implies the read is mated)

property is_read2

If the read is set as the second read of mate pair according to read flag (implies the read is mated)

property is_reverse

If the read is set as reversed according to read flag

property is_secondary

If the read is set as secondary alignment according to read flag

property is_supplementary

If the read is set as a supplementary alignment according to read flag

property is_unmapped

If the read is set as unmapped according to read flag

property mate_is_reverse

Check the read flag to see if the mat is reverse strand (implies the read is mated)

property mate_is_unmapped

Check the read flag to see if mate is unmapped (implies the read is mated)

property mapping_quality

The mapping quality (MAPQ) of the read

get_tag(tag, with_value_type=False)[source]

Gets the value associated with a given tag key.

Parameters
  • tag (str) – the tag of interest

  • with_value_type (bool) – return what kind of value the tag

Returns

the value associated with a given tag or the value and type of value (as seen in BAM format)

get_tags(with_value_type=False)[source]

Returns all the tags for a given read

Parameters

with_value_type (bool) – return the tag value type (as defined by BAM format)

Returns

list of tag tuples (with or without tag value type)

Return type

f_tags(list)

get_cigar_stats()[source]

Gets the counts of each CIGAR operation in the read and number of nucleotides related to those given operations.

Returns

list of CIGAR operation counts nt_counts (list): list of nucleotide counts for each operation

Return type

op_blocks (list)

ref_gen()[source]

Recreates the reference sequence associated with the given segment.

Uses the CIGAR string and MD tag to recreate the reference sequence associated with the aligned segment. This is done without the need for looking up the reference genome.

Returns

generated reference sequence

Return type

(str)

Raises

KeyError – if read does not contain MD tag

to_bam()[source]

Writes the alignment record to a BAM file

Parameters

bam_file (string or bamnostic.bam.BamWriter) – BAM file path or open bam file in a write mode

bamnostic.core.unpack_int32(buffer, /)

Return a tuple containing unpacked values.

Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.

See help(struct) for more on format strings.

bamnostic.core.unpack_int32L(buffer, /)

Return a tuple containing unpacked values.

Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.

See help(struct) for more on format strings.

Bgzip BAM Format (BGZF) file handlers (bamnostic.bgzf)

bamnostic.bgzf.unpack_gzip_header(buffer, /)

Return a tuple containing unpacked values.

Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.

See help(struct) for more on format strings.

bamnostic.bgzf.unpack_subfields(buffer, /)

Return a tuple containing unpacked values.

Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.

See help(struct) for more on format strings.

bamnostic.bgzf.unpack_gzip_integrity(buffer, /)

Return a tuple containing unpacked values.

Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.

See help(struct) for more on format strings.

bamnostic.bgzf.unpack_bgzf_metaheader(buffer, /)

Return a tuple containing unpacked values.

Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.

See help(struct) for more on format strings.

bamnostic.bgzf.get_block(handle, offset=0)[source]

Pulls out entire GZIP block

Used primarily for copying the header block of a BAM file. However, it can be used to copy any BGZF block within a BAM file that starts at the given offset.

Note

Does not progress file cursor position.

Parameters
  • handle (file) – open BAM file

  • offset (int) – offset of BGZF block (default: 0)

Returns

Complete BGZF block

Raises

ValueError – if the BGZF block header is malformed

Example

>>> bam_header = get_block(bamnostic.example_bam)
>>> try:
...     bam_header.startswith(b'\x1f\x8b\x08\x04')
... except SyntaxError:
...     bam_header.startswith('\x1f\x8b\x08\x04')
True
class bamnostic.bgzf.BgzfReader(filepath_or_object, mode='rb', max_cache=None, filename=None, duplicate_filehandle=None, ignore_truncation=False)[source]

Bases: object

The BAM reader. Heavily modified from Peter Cock’s BgzfReader.

tell()[source]

Return a 64-bit unsigned BGZF virtual offset.

seek(virtual_offset)[source]

Seek to a 64-bit unsigned BGZF virtual offset.

A virtual offset is a composite number made up of the compressed offset (coffset) position of the start position of the BGZF block that the position originates within, and the uncompressed offset (uoffset) within the deflated BGZF block where the position starts. The virtual offset is defined as

virtual_offset = coffset << 16 | uoffset

Parameters

virtual_offset (int) – 64-bit unsigned composite byte offset

Returns

an echo of the new position

Return type

virtual_offset (int)

Raises
  • ValueError – if within block offset is more than block size

  • AssertionError – if the start position is not the block start position

Example

>>> bam = bamnostic.AlignmentFile(bamnostic.example_bam, 'rb')
>>> bam.seek(10)
10
>>> bam.seek(bamnostic.utils.make_virtual_offset(0, 42))
Traceback (most recent call last):
    ...
ValueError: Within offset 42 but block size only 38
read(size=- 1)[source]

Read method for the BGZF module.

Parameters

size (int) – the number of bytes to read from file. Advances the cursor.

Returns

byte string of length size

Return type

data (bytes)

Raises
  • NotImplementedError – if the user tries to read the whole file

  • AssertionError – if read does not return any data

readline()[source]

Read a single line for the BGZF file.

Binary operations do not support readline(). Code is commented out for posterity sake

close()[source]

Close BGZF file.

seekable()[source]

Return True indicating the BGZF supports random access.

Note

Modified from original Bio.BgzfReader: checks to see if BAM file has associated index file (BAI)

isatty()[source]

Return True if connected to a TTY device.

fileno()[source]

Return integer file descriptor.

class bamnostic.bgzf.BgzfWriter(filepath_or_object, mode='wb', compresslevel=6)[source]

Bases: object

Define a BGZFWriter object.

write(data)[source]

Write method for the class.

flush()[source]

Flush data explicitly.

close()[source]

Flush data, write 28 bytes BGZF EOF marker, and close BGZF file.

samtools will look for a magic EOF marker, just a 28 byte empty BGZF block, and if it is missing warns the BAM file may be truncated. In addition to samtools writing this block, so too does bgzip - so this implementation does too.

tell()[source]

Return a BGZF 64-bit virtual offset.

seekable()[source]

Return True indicating the BGZF supports random access.

isatty()[source]

Return True if connected to a TTY device.

fileno()[source]

Return integer file descriptor.

bamnostic.bgzf.unpack_int32(buffer, /)

Return a tuple containing unpacked values.

Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.

See help(struct) for more on format strings.

bamnostic.bgzf.unpack_int32L(buffer, /)

Return a tuple containing unpacked values.

Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.

See help(struct) for more on format strings.

Binary Alignment Map (BAM) file handlers (bamnostic.bam)

class bamnostic.bam.BAMheader(_io)[source]

Bases: object

Parse and store the BAM file header

The BAM header is the plain text and byte-encoded metadata of a given BAM file. Information stored in the header are the number, length, and name of the reference sequences that reads were aligned to; version of software used; read group identifiers; etc. The BAM format also stipulates that the first block of any BAM file should be reserved just for the BAM header block.

_io

opened BAM file object

Type

file

SAMheader

parsed dictionary of the SAM header

Type

dict

n_refs

number of references

Type

int

refs

reference names and lengths listed in the BAM header

Type

dict

to_header()[source]

Allows the user to directly copy the header of another BAM file

Returns

packed byte code of entire header BGZF block

Return type

(bytesarray)

class bamnostic.bam.BamReader(filepath_or_object, mode='rb', max_cache=128, index_filename=None, filename=None, check_header=False, check_sq=True, reference_filename=None, filepath_index=None, require_index=False, duplicate_filehandle=None, ignore_truncation=False)[source]

Bases: bamnostic.bgzf.BgzfReader

The BAM reader. Heavily modified from Peter Cock’s BgzfReader.

header

representation of header data (if present)

lengths

lengths of references listed in header

Type

list of int

nocoordinate

number of reads that have no coordinates

Type

int

nreferences

number of references in header

Type

int

ref2tid

refernce names and refID dictionary

Type

dict of str, int

references

names of references listed in header

Type

list of str

text

SAM header (if present)

Type

str

unmapped

number of unmapped reads

Type

int

Note

This implementation is likely to change. While the API was meant to mirror pysam, it makes sense to include the pysam-like API in an extension that will wrap the core reader. This would be a major refactor, and therefore will not happen any time soon (30 May 2018).

check_index(index_filename=None, req_idx=False)[source]

Checks to make sure index file is available. If not, it disables random access.

Parameters
  • index_filename (str) – path to index file (BAI) if it does not fit naming convention (default: None).

  • req_idx (bool) – Raise error if index file is not present (default: False).

Returns

True if index is present, else False

Return type

(bool)

Raises

IOError – If the index file is closed or index could not be opened

Warns

UserWarning – If index could not be loaded. Random access is disabled.

Examples

>>> bam = bamnostic.AlignmentFile(bamnostic.example_bam)
>>> bam.check_index(bamnostic.example_bam + '.bai')
True
>>> bam.check_index('not_a_file.bai')
False
property nocoordinate

Get the number of reads without coordiantes according to the statistics recorded in the index.

Returns

sum of reads without coordinates

Return type

(int)

property mapped

Get the number of mapped reads according to the statistics recorded in the index.

Returns

sum of mapped reads

Return type

(int)

property unmapped

Get the number of unmapped reads without coordiantes and without coordinate.

Returns

sum of unmapped reads and reads without coordinates

Return type

(int)

property references

Get all references used in alignment.

Returns

reference names

Return type

(tuple of str)

property nreferences

Get the number of references listed in alignment

Returns

count of references

Return type

(int)

property lengths

Get all reference lengths used in alignment.

Returns

reference lengths

Return type

(tuple of int)

has_index()[source]

Checks if file has index and it is open

Returns

True if present and opened, else False

Return type

bool

fetch(contig=None, start=None, stop=None, region=None, tid=None, until_eof=False, multiple_iterators=False, reference=None, end=None)[source]

Creates a generator that returns all reads within the given region. (inclusive, exclusive)

Parameters
  • contig (str) – name of reference/contig

  • start (int) – start position of region of interest (0-based)

  • stop (int) – stop position of region of interest (0-based)

  • region (str) – SAM region formatted string. Accepts tab-delimited values as well

  • tid (int) – the refID or target id of a reference/contig

  • until_eof (bool) – iterate until end of file

  • mutiple_iterators (bool) – allow multiple iterators over region. Not Implemented. Notice: each iterator will open up a new view into the BAM file, so overhead will apply.

  • reference (str) – synonym for contig

  • end (str) – synonym for stop

Yields

reads over the region of interest if any

Raises
  • ValueError – if the genomic coordinates are out of range or invalid

  • KeyError – Reference is not found in header

Note

SAM region formatted strings take on the following form: ‘chr1:100000-200000’

Usage:
AlignmentFile.fetch(contig='chr1', start=1, stop= 1000)
AlignmentFile.fetch('chr1', 1, 1000)
AlignmentFile.fetch('chr1:1-1000')
AlignmentFile.fetch('chr1', 1)
AlignmentFile.fetch('chr1')

Examples

>>> bam = bamnostic.AlignmentFile(bamnostic.example_bam, 'rb')
>>> next(bam.fetch('chr1', 1, 100)) 
EAS56_57:6:190:289:82 ... MF:C:192
>>> next(bam.fetch('chr10', 1, 10))
Traceback (most recent call last):
    ...
KeyError: 'chr10 was not found in the file header'
>>> next(bam.fetch('chr1', 1700, 1701))
Traceback (most recent call last):
    ...
ValueError: Genomic region out of bounds.
>>> next(bam.fetch('chr1', 100, 10))
Traceback (most recent call last):
    ...
AssertionError: Malformed region: start should be <= stop, you entered 100, 10
count(contig=None, start=None, stop=None, region=None, until_eof=False, tid=None, read_callback='nofilter', reference=None, end=None)[source]

Count the number of reads in the given region

Note: this counts the number of reads that overlap the given region.

Can potentially make use of a filter for the reads (or custom function that returns True or False for each read).

Parameters
  • contig (str) – the reference name (Default: None)

  • reference (str) – synonym for contig (Default: None)

  • start (int) – 0-based inclusive start position (Default: None)

  • stop (int) – 0-based exclusive start position (Default: None)

  • end (int) – Synonym for stop (Default: None)

  • region (str) – SAM-style region format. Example: ‘chr1:10000-50000’ (Default: None)

  • until_eof (bool) – count number of reads from start to end of file Note, this can potentially be an expensive operation. (Default: False)

  • read_callback (str|function) –

    select (or create) a filter of which reads to count. Built-in filters:

    • all: skips reads that contain the following flags:

      • 0x4 (4): read unmapped

      • 0x100 (256): not primary alignment

      • 0x200 (512): QC Fail

      • 0x400 (1024): PCR or optical duplicate

    • nofilter: uses all reads (Default)

    • The user can also supply a custom function that

      returns boolean objects for each read

Returns

count of reads in the given region that meet parameters

Return type

(int)

Raises
  • ValueError – if genomic coordinates are out of range or invalid or random access is disabled

  • RuntimeError – if read_callback is not properly set

  • KeyError – Reference is not found in header

  • AssertionError – if genomic region is malformed

Notes

SAM region formatted strings take on the following form: ‘chr1:100000-200000’

Usage:

AlignmentFile.count(contig=’chr1’, start=1, stop= 1000) AlignmentFile.count(‘chr1’, 1, 1000) AlignmentFile.count(‘chr1:1-1000’) AlignmentFile.count(‘chr1’, 1) AlignmentFile.count(‘chr1’)

Example

>>> bam = bamnostic.AlignmentFile(bamnostic.example_bam, 'rb')
>>> bam.count('chr1', 1, 100)
2
>>> bam.count('chr1', 1, 100, read_callback='all')
1
>>> bam.count('chr10', 1, 10)
Traceback (most recent call last):
    ...
KeyError: 'chr10 was not found in the file header'
>>> bam.count('chr1', 1700, 1701)
Traceback (most recent call last):
    ...
ValueError: Genomic region out of bounds.
>>> bam.count('chr1', 100, 10)
Traceback (most recent call last):
    ...
AssertionError: Malformed region: start should be <= stop, you entered 100, 10
count_coverage(contig=None, start=None, stop=None, region=None, quality_threshold=15, read_callback='all', reference=None, end=None, base_quality_threshold=0)[source]

Counts the coverage of each base supported by a read the given interval.

Given an interval (inclusive, exclusive), this method pulls each read that overlaps with the region. To ensure that the read truly overlaps with the region, the CIGAR string is required. These reads can further be filtered out by their flags, MAPQ qualities, or custom filtering function. Using the CIGAR string, the aligned portion of the read is traversed and the presence of each nucleotide base is tallied into respective arrays. Additionally, the user can choose to filter the counted bases based on its base quality score that is stored in the quality string.

Parameters
  • contig (str) – the reference name (Default: None)

  • reference (str) – synonym for contig (Default: None)

  • start (int) – 0-based inclusive start position (Default: None)

  • stop (int) – 0-based exclusive start position (Default: None)

  • end (int) – Synonym for stop (Default: None)

  • region (str) – SAM-style region format. Example: ‘chr1:10000-50000’ (Default: None)

  • quality_threshold (int) – MAPQ quality threshold (Default: 15)

  • base_quality_threshold (int) – base quality score threshold (Default: 0)

  • read_callback (str|function) –

    select (or create) a filter of which reads to count. Built-in filters:

    • all: skips reads that contain the following flags:
      • 0x4 (4): read unmapped

      • 0x100 (256): not primary alignment

      • 0x200 (512): QC Fail

      • 0x400 (1024): PCR or optical duplicate

    • nofilter: uses all reads (Default)

    • The user can also supply a custom function that returns boolean objects for each read

Returns

Four arrays in the order of A, C, G, T

Return type

(array.array)

Raises
  • ValueError – if genomic coordinates are out of range or invalid, random access is disabled, or nucleotide base is unrecognized

  • RuntimeError – if read_callback is not properly set

  • KeyError – Reference is not found in header

  • AssertionError – if genomic region is malformed

Notes

SAM region formatted strings take on the following form: ‘chr1:100-200’

Usage:

AlignmentFile.count_coverage(contig=’chr1’, start=1, stop= 1000) AlignmentFile.count_coverage(‘chr1’, 1, 1000) AlignmentFile.count_coverage(‘chr1:1-1000’) AlignmentFile.count_coverage(‘chr1’, 1) AlignmentFile.count_coverage(‘chr1’)

Examples

>>> bam = bamnostic.AlignmentFile(bamnostic.example_bam, 'rb')
>>> for arr in bam.count_coverage('chr1', 100, 150): 
...     print("array('{}', {})".format(arr.typecode, list(map(int, arr.tolist()))))
array('L', [0, 0, 0, 0, ..., 0, 0, 0, 0, 0])
array('L', [0, 0, 0, 0, ..., 15, 0, 14, 0, 0])
array('L', [1, 1, 2, 2, ..., 0, 0, 0, 0, 14])
array('L', [0, 0, 0, 0, ..., 0, 15, 0, 14, 0])
>>> for arr in bam.count_coverage('chr1', 100, 150, quality_threshold=20, base_quality_threshold=25): 
...     print("array('{}', {})".format(arr.typecode, list(map(int, arr.tolist()))))
array('L', [0, 0, 0, 0, ..., 0, 0, 0, 0, 0])
array('L', [0, 0, 0, 0, ..., 13, 0, 14, 0, 0])
array('L', [1, 1, 2, 2, ..., 0, 0, 0, 0, 13])
array('L', [0, 0, 0, 0, ..., 0, 14, 0, 13, 0])
get_index_stats()[source]

Inspects the index file (BAI) for alignment statistics.

Every BAM index file contains metrics regarding the alignment process for the given BAM file. The stored data are the number of mapped and unmapped reads for a given reference. Unmapped reads are paired end reads where only one part is mapped. Additionally, index files also contain the number of unplaced unmapped reads. This is stored within the nocoordinate instance attribute (if present).

Returns

list of tuples for each reference in the order seen in the header. Each tuple contains the number of mapped reads, unmapped reads, and the sum of both.

Return type

idx_stats (list of tuple)

Raises

AssertionError – if the index file is not available

Example

>>> bam = bamnostic.AlignmentFile(bamnostic.example_bam, 'rb')
>>> bam.get_index_stats()
[(1446, 18, 1464), (1789, 17, 1806)]
>>> bam_no_bai = bamnostic.AlignmentFile(bamnostic.example_bam, 'rb', index_filename='not_a_file.bai')
>>> bam_no_bai.get_index_stats()
Traceback (most recent call last):
    ...
AssertionError: No index available
is_valid_tid(tid)[source]

Return True if TID/RefID is valid.

Returns

True if TID/refID is valid, else False

Example

>>> bam = bamnostic.AlignmentFile(bamnostic.example_bam, 'rb')
>>> bam.is_valid_tid(0)
True
>>> bam.is_valid_tid(10) # because there are only 2 in this file
False
get_reference_name(tid)[source]

Convert TID/refID to reference name.

The TID/refID is the position a reference sequence is seen within the header file of the BAM file. The references are sorted by ASCII order. Therefore, for a Homo sapien aligned to GRCh38, ‘chr10’ comes before ‘chr1’ in the header. Therefore, ‘chr10’ would have the TID/refID of 0, not ‘chr1’.

Parameters

tid (int) – TID/refID of desired reference/contig

Returns

String representation of chromosome if valid, else None

Raises

KeyError – if TID/refID is not valid

Examples

>>> bam = bamnostic.AlignmentFile(bamnostic.example_bam, 'rb')
>>> bam.get_reference_name(0) 
'chr1'
>>> bam.get_reference_name(10)
Traceback (most recent call last):
    ...
KeyError: '10 is not a valid TID/refID for this file.'
get_tid(reference)[source]

Convert reference/contig name to refID/TID.

The TID/refID is the position a reference sequence is seen within the header file of the BAM file. The references are sorted by ASCII order. Therefore, for a Homo sapien aligned to GRCh38, ‘chr10’ comes before ‘chr1’ in the header. Therefore, ‘chr10’ would have the TID/refID of 0, not ‘chr1’.

Parameters

reference (str) – reference/contig name

Returns

the TID/refID of desired reference/contig

Return type

(int)

Raises

KeyError – if reference name not found file header

Example

>>> bam = bamnostic.AlignmentFile(bamnostic.example_bam, 'rb')
>>> bam.get_tid('chr1')
0
>>> bam.get_tid('chr10')
Traceback (most recent call last):
    ...
KeyError: 'chr10 was not found in the file header'
mate(AlignedSegment)[source]

Gets the mate to a given AlignedSegment.

Note

Slow, when compared to the C-API. Not meant for high-throughput analysis.

Does not advance current iterator position.

Parameters

AlignedSegment (bamnostic.AlignedSegment) – a bamnostic AlignedSegment read with a mate

Returns

if read has a valid mate, else None

Return type

(bamnostic.AlignedSegment)

Raises

ValueError – if AlignedSegment is unpaired

head(n=5, multiple_iterators=False)[source]

List out the first n reads of the file.

This method is primarily used when doing an initial exploration of the data. Whether or not multiple_iterators is used, cursor position within the file will not change.

Note

Using multiple_interators opens a new file object of the same file currently in use and, thus, impacts the memory footprint of your analysis.

Parameters
  • n (int) – number of aligned reads to print (default: 5)

  • mutliple_iterators (bool) – Whether to use current file object or create a new one (default: False).

Returns

list of n reads from the front of the BAM file

Return type

head_reads (list of AlignedSegment)

Example

>>> bam = bamnostic.AlignmentFile(bamnostic.example_bam, 'rb')
>>> bam.head(n=5, multiple_iterators = False)[0] 
EAS56_57:6:190:289:82       ...     MF:C:192
>>> bam.head(n = 5)[1] 
EAS56_57:6:190:289:82       ...     UQ:C:0
next()[source]

Return the next line.

seekable()[source]

Return True indicating the BGZF supports random access.

Note

Modified from original Bio.BgzfReader: checks to see if BAM file has associated index file (BAI)

class bamnostic.bam.BamWriter(filepath_or_object, mode='xb', compresslevel=6, ignore_overwrite=False, copy_header=None, header=b'', reference_names=None, reference_lengths=None)[source]

Bases: bamnostic.bgzf.BgzfWriter

write_header(copy_header=None, header=b'', reference_names=None, reference_lengths=None)[source]

Modified code from Peter Cock’s [BamWriter](https://github.com/peterjc/biopython/blob/SamBam2015/Bio/Sequencing/SamBam/__init__.py#L1714)

Writes the header into a BAM file

Parameters
  • copy_header (filepath_or_object) – copy the header from another BAM file

  • header (bytes) – SAM-style header

  • reference_names (list of str) – list of all reference names used in file

  • reference_lengths (list of int) – list of all reference lengths used in file

write(data)[source]

Write method for the class.

bamnostic.bam.unpack_int32(buffer, /)

Return a tuple containing unpacked values.

Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.

See help(struct) for more on format strings.

bamnostic.bam.unpack_int32L(buffer, /)

Return a tuple containing unpacked values.

Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.

See help(struct) for more on format strings.

BAM Index (BAI) file handler (bamnostic.bai)

BAI file parser

Copyright © 2018, Marcus D. Sherman

This code is part of the bamnostic distribution and governed by its license. Please see the LICENSE file that should have been included as part of this package.

The Binary Alignment Map (BAM) format (defined at https://samtools.github.io/hts-specs/SAMv1.pdf) allows for indexing. When the user invokes a tool to index a given BAM file, a BAM index (BAI) file is created. Generally speaking, a BAI contains the all the virtual offsets of clusters of reads that are associated with specific subsections of the BAM file. When the BAI is produced, random access to the BAM is available.

This script is for parsing the binary encoded BAI file, inherently making it human readable. Furthermore, it allows subsections of the BAI to be loaded into memory, reducing the memory footprint and speeding up queries within a small number of references. Lastly, by parsing it as such, random access queries directly into the associated BAM file is available to other tools within bamnostic

bamnostic.bai.unpack_chunk(buffer, /)

Return a tuple containing unpacked values.

Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.

See help(struct) for more on format strings.

bamnostic.bai.unpack_intervals(buffer, /)

Return a tuple containing unpacked values.

Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.

See help(struct) for more on format strings.

bamnostic.bai.unpack_bid_nchunk(buffer, /)

Return a tuple containing unpacked values.

Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.

See help(struct) for more on format strings.

bamnostic.bai.unpack_unmapped(buffer, /)

Return a tuple containing unpacked values.

Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.

See help(struct) for more on format strings.

class bamnostic.bai.RefIdx(start_offset, end_offset, n_bins)

Bases: tuple

property end_offset

Alias for field number 1

property n_bins

Alias for field number 2

property start_offset

Alias for field number 0

bamnostic.bai.reg2bin(rbeg, rend)[source]

Finds the largest superset bin of region. Numeric values taken from hts-specs

Parameters
  • rbeg (int) – inclusive beginning position of region

  • rend (int) – exclusive end position of region

Returns

distinct bin ID for largest superset bin of region

Return type

(int)

bamnostic.bai.reg2bins(rbeg, rend)[source]

Generates bin ids which overlap the specified region.

Parameters
  • rbeg (int) – inclusive beginning position of region

  • rend (int) – exclusive end position of region

Yields

(int) – bin IDs for overlapping bins of region

Raises

AssertionError (Exception) – if the range is malformed or invalid

class bamnostic.bai.Bai(filename)[source]

Bases: object

This class defines the bam index file object and its interface.

The purpose of this class is the binary parsing of the bam index file (BAI) associated with a given bam file. When queried, the Bai object identifies the bins of data that overlap the requested region and directs which parts of the bam file contain it.

Virtual offsets are processed using the following method: Beginning of compressed block = coffset = virtual offset >> 16 Position within uncompressed block = uoffset = virtual offset ^ (coffset << 16)

_io

opened BAI file object

Type

fileObject

_LINEAR_INDEX_WINDOW

constant of the linear interval window size

Type

int

_UNMAP_BIN

constant for bin ID of unmapped read stats

Type

int

magic

first 4 bytes of file. Must be equal to b’BAI’

Type

bytes

n_refs

number of references in BAI

Type

int

unmapped

dictionary of the unmapped read stats by each reference

Type

dict

current_ref

dictionary of the current reference loaded into memory. It contains the a dictionary of bin IDs and their respective chunks, and a list of linear intervals.

Type

None|dict

ref_indices

dictionary of reference ids and their start/stop offsets within the BAI file

Type

dict

n_no_coord

if present in BAI, is the number of reads that have no coordinates

Type

None|int

_last_pos

used for indexing, the byte position of the file head.

Type

int

get_chunks(n_chunks)[source]

Simple generator for unpacking chunk data

Chunks are defined as groupings of reads within a BAM file that share the same bin. A Chunk object in the context of this function is a namedtuple that contains the virtual offsets for the beginning and end of each of these chunks.

Note: a special case of a chunk is in any Bin labeled as 37450. These bins always contain 2 chunks that provide the statistics of the number of reads that are unmapped to that reference.

Parameters

n_chunks (int) – number of chunks to be unpacked from stream

Returns

a list of Chunk objects with the attributes of

chunks[i] are .voffset_beg and voffset_end

Return type

chunks (list)

get_ints(n_int)[source]

Unpacks n_int number of interval virtual offsets from stream

A linear interval is defined as a 16384 bp window along a given reference. The value stored in the BAI is the virtual offset of the first read within that given interval. This virtual offset is the byte offset (coffset) of the start of the BGZF block that contains the beginning of the read and the byte offset (uoffset) within the uncompressed data of the residing BGZF block to that first read.

Note: a caveat to using linear interval with long reads: A long read can span multiple linear intervals. As such, the current encoding could potentially shift the expected region of interest to the left more than expected.

Parameters

n_int (int) – number of intervals to unpack

Returns

list of virtual offsets for n_int number of linear intervals

Return type

intervals (list)

get_bins(n_bins, ref_id=None, idx=False)[source]

Simple function that iteratively unpacks the data of a number (n_bin) of bins.

As the function discovers the number of chunks needed for a given bin, it deletages work to self.get_chunks(n_chunks). A bin is comprised of 2 parts: 1) the distinct bin ID (within a given reference). If no reads are associated with that bin, it is left out of the indexing process, and therefore not represented in the BAI file. Furthermore, while each bin has a bin ID, the bin IDs are only distinct within a given reference. This means that 2 or more references can have the same bin IDs. These bin IDs are also not in any order as they are essentially a hash dump. Lastly, the only reserved bin ID is 37450. This bin relates to 2 chunks that contain the number of unmapped and mapped reads for a given reference. 2) the chunk(s) of reads that are assigned to a given bin.

As a secondary feature, this function will also quickly seek over regions for the purposes of documenting the start and stop byte offsets of a given reference block within the file. This is invoked by setting idx=True

Parameters

n_int (int) – number of bins to be unpacked from stream

Returns

None if just indexing the index file or a dictionary

of bin_id: chunks pairs

Return type

bins (None | dict)

Raises

AssertionError (Exception) – if bin 37450 does not contain 2 chunks exactly

get_ref(ref_id=None, idx=False)[source]

Interatively unpacks all the bins, linear intervals, and chunks for a given reference

A reference is comprised of 2 things: 1) a series of bins that reference chunks of aligned reads that are grouped within that bin. 2) a series of virtual offsets of the first read of a 16384 bp window along the given reference.

This function also serves to “index” the BAI file such that, if it is invoked by setting ids=True, will do a single pass through the BAI file and saving the start and stop offsets of each of the references. This is used for minimizing the memory footprint of storing the BAI in memory. When queried against, the appropriate reference block will be loaded. Because of this constant loading, functools.lru_cache was applied to cache recently used reference blocks to speed up computation. It is assumed that when querying is done, most users are looking and just a few references at a time.

Parameters
  • ref_id (None|int) – used for random access or indexing the BAI

  • idx (bool) – Flag for setting whether or not to run an index of the BAI

Returns

namedtuple containing the byte offsets of the reference start, stop, and number of bins or Ref: namedtuple containing a dictionary of bins and list of linear intervals

Return type

RefIdx

Raises

AssertionError (Exception) – if, when random access is used, the current reference offset does not match indexed reference offset.

query(ref_id, start, stop=- 1)[source]

Main query function for determining seek offset to BAM section that AlignedRead objects from specified region start

Parameters
  • ref (int) – which reference/chromosome TID

  • start (int) – left most bp position of region (zero-based)

  • stop (int) – right most bp position of region (zero-based)

Returns

the voffset_beg of the first chunk given the chunk’s voffset_end

is greater than the voffset of the linear index that overlaps the region of interest’s start offset

Return type

(int)

seek(offset=None, whence=0)[source]

Simple seek function for binary files

Parameters
  • offset (None|int) – byte offset from whence to move the file head to.

  • whence (int) – 0 := from start of file, 1:= from current position, 2:= from end of file

Returns

new byte position of file head

Return type

(int)

Raises

ValueError (Exception) – if the offset is not an integer or is not provided

read(size=- 1)[source]

Simple read function for binary files

Parameters

size (int) – number of bytes to read in (default: -1 –whole file)

Returns

the number of bytes read from file

Return type

(bytes)

tell()[source]

Simple tell function for reporting byte position of file head

Returns

byte position of file head

Return type

(int)

bamnostic.bai.unpack_int32(buffer, /)

Return a tuple containing unpacked values.

Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.

See help(struct) for more on format strings.

bamnostic.bai.unpack_int32L(buffer, /)

Return a tuple containing unpacked values.

Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.

See help(struct) for more on format strings.

Coordinate-Sorted Index (CSI) file handler (bamnostic.csi)

CSI file parser

Copyright © 2018, Marcus D. Sherman

This code is part of the bamnostic distribution and governed by its license. Please see the LICENSE file that should have been included as part of this package.

The Binary Alignment Map (CSI) format (defined at http://samtools.github.io/hts-specs/CSIv1.pdf) allows for indexing. When the user invokes a tool to index a given BAM file, a CSI index file can be created. Generally speaking, a CSI contains the all the virtual offsets of clusters of reads that are associated with specific subsections of the BAM file. When the CSI is produced, random access to the BAM is available.

This script is for parsing the binary encoded CSI file, inherently making it human readable. Furthermore, it allows subsections of the CSI to be loaded into memory, reducing the memory footprint and speeding up queries within a small number of references. Lastly, by parsing it as such, random access queries directly into the associated BAM file is available to other tools within bamnostic

bamnostic.csi.unpack_chunk(buffer, /)

Return a tuple containing unpacked values.

Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.

See help(struct) for more on format strings.

bamnostic.csi.unpack_bid_loffset_nchunk(buffer, /)

Return a tuple containing unpacked values.

Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.

See help(struct) for more on format strings.

bamnostic.csi.unpack_unmapped(buffer, /)

Return a tuple containing unpacked values.

Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.

See help(struct) for more on format strings.

bamnostic.csi.reg2bin_csi(rbeg, rend, min_shift=14, depth=5)[source]

Finds the largest superset bin of region. Numeric values taken from hts-specs

Parameters
  • rbeg (int) – inclusive beginning position of region

  • rend (int) – exclusive end position of region

Returns

distinct bin ID for largest superset bin of region

Return type

(int)

bamnostic.csi.reg2bins_csi(rbeg, rend, min_shift=14, depth=5)[source]

Generates bin ids which overlap the specified region.

Parameters
  • rbeg (int) – inclusive beginning position of region

  • rend (int) – exclusive end position of region

Yields

(int) – bin IDs for overlapping bins of region

Raises

AssertionError (Exception) – if the range is malformed or invalid

class bamnostic.csi.Csi(filename)[source]

Bases: bamnostic.bai.Bai

This class defines the bam CSI index file object and its interface.

The purpose of this class is the binary parsing of the bam index file (CSI) associated with a given bam file. When queried, the Csi object identifies the bins of data that overlap the requested region and directs which parts of the bam file contain it.

Virtual offsets are processed using the following method: Beginning of compressed block = coffset = virtual offset >> 16 Position within uncompressed block = uoffset = virtual offset ^ (coffset << 16)

_io

opened CSI file object

Type

fileObject

_UNMAP_BIN

constant for bin ID of unmapped read stats

Type

int

magic

first 4 bytes of file. Must be equal to b’CSI’

Type

bytes

n_refs

number of references in CSI

Type

int

unmapped

dictionary of the unmapped read stats by each reference

Type

dict

current_ref

dictionary of the current reference loaded into memory. It contains the a dictionary of bin IDs and their respective chunks, and a list of linear intervals.

Type

None|dict

ref_indices

dictionary of reference ids and their start/stop offsets within theCSI file

Type

dict

n_no_coord

if present inCSI, is the number of reads that have no coordinates

Type

None|int

_last_pos

used for indexing, the byte position of the file head.

Type

int

get_chunks(n_chunks)[source]

Simple generator for unpacking chunk data

Chunks are defined as groupings of reads within a BAM file that share the same bin. A Chunk object in the context of this function is a namedtuple that contains the virtual offsets for the beginning and end of each of these chunks.

Note: a special case of a chunk is in any Bin labeled as 37450. These bins always contain 2 chunks that provide the statistics of the number of reads that are unmapped to that reference.

Parameters

n_chunks (int) – number of chunks to be unpacked from stream

Returns

a list of Chunk objects with the attributes of

chunks[i] are .voffset_beg and voffset_end

Return type

chunks (list)

get_bins(n_bins, ref_id=None, idx=False)[source]

Simple function that iteratively unpacks the data of a number (n_bin) of bins.

As the function discovers the number of chunks needed for a given bin, it deletages work to self.get_chunks(n_chunks). A bin is comprised of 2 parts: 1) the distinct bin ID (within a given reference). If no reads are associated with that bin, it is left out of the indexing process, and therefore not represented in theCSI file. Furthermore, while each bin has a bin ID, the bin IDs are only distinct within a given reference. This means that 2 or more references can have the same bin IDs. These bin IDs are also not in any order as they are essentially a hash dump. Lastly, the only reserved bin ID is 37450. This bin relates to 2 chunks that contain the number of unmapped and mapped reads for a given reference. 2) the chunk(s) of reads that are assigned to a given bin.

As a secondary feature, this function will also quickly seek over regions for the purposes of documenting the start and stop byte offsets of a given reference block within the file. This is invoked by setting idx=True

Parameters

n_int (int) – number of bins to be unpacked from stream

Returns

None if just indexing the index file or a dictionary

of bin_id: chunks pairs

Return type

bins (None | dict)

Raises

AssertionError (Exception) – if bin 37450 does not contain 2 chunks exactly

get_ref(ref_id=None, idx=False)[source]

Iteratively unpacks all the bins, linear intervals, and chunks for a given reference

A reference is comprised of 2 things: 1) a series of bins that reference chunks of aligned reads that are grouped within that bin. 2) a series of virtual offsets of the first read of a 16384 bp window along the given reference.

This function also serves to “index” theCSI file such that, if it is invoked by setting ids=True, will do a single pass through theCSI file and saving the start and stop offsets of each of the references. This is used for minimizing the memory footprint of storing theCSI in memory. When queried against, the appropriate reference block will be loaded. Because of this constant loading, functools.lru_cache was applied to cache recently used reference blocks to speed up computation. It is assumed that when querying is done, most users are looking and just a few references at a time.

Parameters
  • ref_id (None|int) – used for random access or indexing theCSI

  • idx (bool) – Flag for setting whether or not to run an index of theCSI

Returns

namedtuple containing the byte offsets of the reference start, stop, and number of bins o Ref: namedtuple containing a dictionary of bins and list of linear intervals

Return type

RefIdx

Raises

AssertionError (Exception) – if, when random access is used, the current reference offset does not match indexed reference offset.

query(ref_id, start, stop=- 1)[source]

Main query function for determining seek offset to BAM section that AlignedRead objects from specified region start

Parameters
  • ref (int) – which reference/chromosome TID

  • start (int) – left most bp position of region (zero-based)

  • stop (int) – right most bp position of region (zero-based)

Returns

the voffset_beg of the first chunk given the chunk’s voffset_end

is greater than the voffset of the linear index that overlaps the region of interest’s start offset

Return type

(int)

bamnostic.csi.unpack_int32(buffer, /)

Return a tuple containing unpacked values.

Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.

See help(struct) for more on format strings.

bamnostic.csi.unpack_int32L(buffer, /)

Return a tuple containing unpacked values.

Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.

See help(struct) for more on format strings.

BAMnostic Utilities (bamnostic.utils)

bamnostic.utils.format_warnings(message, category, filename, lineno, file=None, line=None)[source]

Sets STDOUT warnings

Parameters
  • message – the unformatted warning message being reported

  • category (str) – the level of warning (handled by warnings module)

  • filename (str) – filename for logging purposes (defaults to STDOUT)

  • lineno (int) – where the error occurred.

Returns

formatted warning string

bamnostic.utils.unpack_int32(buffer, /)

Return a tuple containing unpacked values.

Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.

See help(struct) for more on format strings.

bamnostic.utils.unpack_int32L(buffer, /)

Return a tuple containing unpacked values.

Unpack according to the format string Struct.format. The buffer’s size in bytes must be Struct.size.

See help(struct) for more on format strings.

class bamnostic.utils.Roi(contig=None, start=None, stop=None, tid=None)[source]

Bases: object

Small __slots__ class for region of interest parsing

bamnostic.utils.flag_decode(flag_code)[source]

Simple read alignment flag decoder

Every read within a BAM file ought to have an associated flag code. Theses flags are used for read filtering and QC. The flags are described below. Additionally, they can be found here

Any given read’s flag is determined by the or (|) operand of all appropriate bit flags.

Parameters

flag_code (int) – either a standalone integer/bit flag or the read object itself

Returns

list of flag and flag description tuples.

Return type

(list of tuple)

Raises

ValueError – if provided flag is not a valid entry

Example

If a flag is 516 it is comprised of flag 4 and flag 512

>>> flag_decode(516)
[(4, 'read unmapped'), (512, 'QC fail')]

Int

Bit

Description

1

0x1

Template having multiple segments in sequencing

2

0x2

Each segment properly aligned according to the aligner

4

0x4

Segment unmapped

8

0x8

Next segment in the template unmapped

16

0x10

SEQ being reverse complemented

32

0x20

SEQ of the next segment in the template being reverse complemented

64

0x40

The first segment in the template

128

0x80

The last segment in the template

256

0x100

Secondary alignment

512

0x200

Not passing filters, such as platform/vendor quality controls

1024

0x400

PCR or optical duplicate

2048

0x800

Supplementary alignment

bamnostic.utils.yes_no(message)[source]

Simple prompt parser

bamnostic.utils.parse_region(contig=None, start=None, stop=None, region=None, tid=None, reference=None, end=None, until_eof=False)[source]

Parses region information from all user set parameters.

The main goal of this function is to handle the many different ways a user can put in genomic region data. One way is through keyword arguments. This is the most straight forward. However, due to Pysam’s API, there are multiple synonyms for reference/contig or stop/end. Additionally, if the user knows the refID/TID of the reference they are interested in, they can input that way as well.

The second form a submission can take is through positional arguments. Just like keyword, but ordered such that it makes up a genomic region of interest.

Note

Positional arguments make utilizing the tid parameter difficult since it is the 5th argument of the function signature.

The third form a submission can take is through using a SAM-formatted string. An example SAM-formatted string looks like ‘chr1:10-100’. As many users also copy & paste directly from tab-delimited files, such as BED files, a SAM-formatted string can take the form of ‘chr1\t10\t100’ where ‘\t’ indicates a tab space.

Lastly, the until_eof switch allows users to take all items from their desired start position (be it the whole reference or a specific spot on the reference). Setting this to True (default: False) will pull all reads to the end of the reference or file, whichever is first.

Parameters
  • contig (str) – name of reference/contig

  • start (int) – start position of region of interest (0-based)

  • stop (int) – stop position of region of interest (0-based)

  • region (str) – SAM region formatted string. Accepts tab-delimited values as well

  • tid (int) – the refID or target id of a reference/contig

  • until_eof (bool) – iterate until end of file (default: False)

Returns

region of interest formatted as an object with named attributes.

Return type

query (bamnostic.utils.Roi)

Raises

ValueError – if two synonym keywords are set, but contradict each other

Examples

# Keyword-based
>>> parse_region(contig = 'chr1', start = 10, stop = 100)
Roi(contig: chr1, start: 10, stop: 100)

# Using `tid` instead of `contig`
>>> parse_region(tid = 0, start = 10, stop = 100)
Roi(tid: 0, start: 10, stop: 100)

# Positional arguments
>>> parse_region('chr1', 10, 100)
Roi(contig: chr1, start: 10, stop: 100)

# SAM-formatted string (keyword)
>>> parse_region(region = 'chr1:10-100')
Roi(contig: chr1, start: 10, stop: 100)

# SAM-formatted string (positional)
>>> parse_region('chr1:10-100')
Roi(contig: chr1, start: 10, stop: 100)

# Tab-delimited region string
>>> parse_region('chr1\t10\t100')
Roi(contig: chr1, start: 10, stop: 100)

# Contradictory synonyms
>>> parse_region(reference='chr1', contig='chr10', start=10, stop = 100)
Traceback (most recent call last):
...
ValueError: either contig or reference must be set, not both
bamnostic.utils.unpack(fmt, _io, is_array=False)[source]

Utility function for unpacking binary data from file object or byte stream.

The only difference between this method and struct.unpack is that unpack dynamically determines the size needed to read in based on the format string. Additionally, it can process a file object or byte stream and implement a read or slice (respectively). Mainly, this is a quality of life function.

Parameters
  • fmt (str) – the string format of the binary data to be unpacked

  • _io – built-in binary format reader (default: io.BufferedRandom)

Returns

unpacked contents from _io based on fmt string

bamnostic.utils.make_virtual_offset(block_start_offset, within_block_offset)[source]

Compute a BGZF virtual offset from block start and within block offsets.

The BAM indexing scheme records read positions using a 64 bit ‘virtual offset’, comprising in C terms:

block_start_offset << 16 | within_block_offset

Here block_start_offset is the file offset of the BGZF block start (unsigned integer using up to 64-16 = 48 bits), and within_block_offset within the (decompressed) block (unsigned 16 bit integer).

>>> make_virtual_offset(0, 0)
0
>>> make_virtual_offset(0, 1)
1
>>> make_virtual_offset(0, 2**16 - 1)
65535
>>> make_virtual_offset(0, 2**16)
Traceback (most recent call last):
...
ValueError: Require 0 <= within_block_offset < 2**16, got 65536
bamnostic.utils.split_virtual_offset(virtual_offset)[source]

Divides a 64-bit BGZF virtual offset into block start & within block offsets.

>>> (100000, 0) == split_virtual_offset(6553600000)
True
>>> (100000, 10) == split_virtual_offset(6553600010)
True
class bamnostic.utils.LruDict(*args, **kwargs)[source]

Bases: collections.OrderedDict

Simple least recently used (LRU) based dictionary that caches a given number of items.

get(key)[source]

Basic getter that renews LRU status upon inspection

Parameters

key (str) – immutable dictionary key

update(others)[source]

Same as a regular dict.update, however, since pypy’s dict.update doesn’t go through dict.__setitem__, this is used to ensure it does

Parameters

others (iterable) – a dictionary or iterable containing key/value pairs

cull()[source]

Main utility function for pruning the LruDict

If the length of the LruDict is more than max_cache, it removes the LRU item

bamnostic.utils.parse_cigar(cigar_str)[source]

Parses a CIGAR string and turns it into a list of tuples

Parameters

cigar_str (str) – the CIGAR string as shown in SAM entry

Returns

list of tuples of CIGAR operations (by id) and number of operations

Return type

cigar_array (list)

Raises

ValueError – if CIGAR operation is invalid

Examples

>>> parse_cigar('3M1I3M1D5M') 
[(('BAM_CMATCH', 0), 3), ..., (('BAM_CMATCH', 0), 5)]
bamnostic.utils.check_cigar_arg(cigar)[source]

Checks to make sure CIGAR arugment is valid.

Parameters

argument (str or list) – CIGAR string (pre-formatted or raw)

Returns

CIGAR re-formatted as a list

Return type

(list)

Raises

ValueError – if CIGAR is not a string or pre-formatted list

bamnostic.utils.cigar_changes(seq, cigar)[source]
Recreates the reference sequence to the extent that the CIGAR string can

represent.

Parameters
  • seq (str) – aligned segment sequence

  • cigar (list) – list of tuples of cigar operations (by id) and number of operations

Returns

a version of the aligned segment’s reference sequence given the changes reflected in the cigar string

Return type

cigar_formatted_ref (str)

Raises

ValueError – if CIGAR operation is invalid

Examples

>>> cigar_changes('ACTAGAATGGCT', '3M1I3M1D5M')
'ACTGAATGGCT'
>>> cigar_changes('ACTAGAATGGCT', '3V1I3M1D5M')
Traceback (most recent call last):
    ...
ValueError: Invalid CIGAR operation (V).
bamnostic.utils.md_changes(seq, md_tag)[source]

Recreates the reference sequence of a given alignment to the extent that the MD tag can represent.

Note

Used in conjunction with cigar_changes to recreate the complete reference sequence

Parameters
  • seq (str) – aligned segment sequence

  • md_tag (str) – MD tag for associated sequence

Returns

a version of the aligned segment’s reference sequence given the changes reflected in the MD tag

Return type

ref_seq (str)

Raises

ValueError – if MD tag is None

Example

>>> md_changes('CTTATATTGGCCTT', '3C4AT4')
'CTTCTATTATCCTT'
bamnostic.utils.ref_gen(seq, cigar_string, md_tag)[source]

Recreates the reference sequence associated with the given segment.

Uses the CIGAR string and MD tag to recreate the reference sequence associated with the aligned segment. This is done without the need for looking up the reference genome. Example reads, MD tags, and CIGAR strings taken from David Tang’s Blog.

Returns

generated reference sequence

Return type

(str)

Raises

KeyError – if read does not contain MD tag

Examples

# Only mismatches >>> seq = ‘CGATACGGGGACATCCGGCCTGCTCCTTCTCACATG’ >>> cigar = ‘36M’ >>> md = ‘1A0C0C0C1T0C0T27’ >>> ref_gen(seq, cigar, md) ‘CACCCCTCTGACATCCGGCCTGCTCCTTCTCACATG’

# Insertions and mismatches >>> seq = ‘GAGACGGGGTGACATCCGGCCTGCTCCTTCTCACAT’ >>> cigar = ‘6M1I29M’ >>> md = ‘0C1C0C1C0T0C27’ >>> ref_gen(seq, cigar, md) ‘CACCCCTCTGACATCCGGCCTGCTCCTTCTCACAT’

# Deletion and mismatches >>> seq = ‘AGTGATGGGGGGGTTCCAGGTGGAGACGAGGACTCC’ >>> cigar = ‘9M9D27M’ >>> md = ‘2G0A5^ATGATGTCA27’ >>> ref_gen(seq, cigar, md) ‘AGGAATGGGATGATGTCAGGGGTTCCAGGTGGAGACGAGGACTCC’

# Insertion, deletion, and mismatches >>> seq = ‘AGTGATGGGAGGATGTCTCGTCTGTGAGTTACAGCA’ >>> cigar = ‘2M1I7M6D26M’ >>> md = ‘3C3T1^GCTCAG26’ >>> ref_gen(seq, cigar, md) ‘AGGCTGGTAGCTCAGGGATGTCTCGTCTGTGAGTTACAGCA’

bamnostic.utils.cigar_alignment(seq=None, cigar=None, start_pos=0, qualities=None, base_qual_thresh=0, query=False)[source]

Use the CIGAR to filter out all unaligned data bases

Any clipping results in the removal of those bases. If an insertion is seen in the CIGAR, those bases are removed from the sequence. If a deletion is seen in the CIGAR, those bases are padded with a period (‘.’) symbol.

Parameters
  • seq (str) – string sequence of the aligned segment.

  • cigar (str) – the cigar string or cigartuple of the aligned segment.

  • start_pos (int) – the first aligned position of the read

  • qualities (array.array) – base quality array from read

Yields

(tuple of str and int) – nucleotide base and index position of that base relative to reference

Example

>>> seq = 'AGTGATGGGAGGATGTCTCGTCTGTGAGTTACAGCA'
>>> cigar = '2M1I7M6D26M'
>>> start_position = 95
>>> c_a = cigar_alignment(seq, cigar, start_position)
>>> next(c_a)
('A', 95)