cogent3.core.alignment.ArrayAlignment
- class ArrayAlignment(*args, **kwargs)
Holds a dense array representing a multiple sequence alignment.
An Alignment is _often_, but not necessarily, an array of chars. You may need a different data type if the alphabet has a large number of symbols. For example, codons on an ungapped DNA alphabet have 4*4*4=64 entries, so they fit in a standard char data type. Tripeptides on the 20-letter ungapped protein alphabet have 20*20*20=8000 entries, so they can _not_ fit in a char: values wrap around, i.e. you silently get a wrong value for any item whose index is greater than the maximum value of the type (255 for uint8). In this case you would need to use uint16, which can hold 65536 distinct values. DO NOT USE SIGNED DATA TYPES FOR YOUR ALIGNMENT ARRAY UNLESS YOU LOVE MISERY AND HARD-TO-DEBUG PROBLEMS.
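The wrap-around described above can be reproduced with NumPy. This is a minimal sketch, assuming NumPy is available; the variable names are illustrative and are not part of cogent3.

```python
import numpy as np

n_codons = 4 ** 3        # 64 codons on ungapped DNA
n_tripeptides = 20 ** 3  # 8000 tripeptides on ungapped protein

# Largest index each unsigned type can store.
uint8_max = np.iinfo(np.uint8).max    # 255
uint16_max = np.iinfo(np.uint16).max  # 65535

# Every codon index fits in uint8, but tripeptide indices do not.
codons_fit_uint8 = n_codons - 1 <= uint8_max
tripeptides_fit_uint8 = n_tripeptides - 1 <= uint8_max
tripeptides_fit_uint16 = n_tripeptides - 1 <= uint16_max

# Casting down silently wraps modulo 256: index 300 becomes 44.
indices = np.arange(n_tripeptides)
wrapped = indices.astype(np.uint8)
preserved = indices.astype(np.uint16)

# A signed type is worse: index 200 becomes -56, which NumPy would
# then treat as "count from the end" if used to look up a symbol.
signed = np.arange(256).astype(np.int8)

print(wrapped[300], preserved[300], signed[200])  # 44 300 -56
```

No error is raised by the cast, which is why a too-small or signed data type tends to surface later as a hard-to-trace wrong symbol.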
Implementation: aln[i] returns position i in the alignment.
aln.positions[i] returns the same as aln[i] – usually, users think of this as a ‘column’, because alignment editors such as Clustal typically display each sequence as a row so a position that cuts across sequences is a column.
aln.seqs[i] returns a sequence, or ‘row’ of the alignment in standard terminology.
WARNING: aln.seqs and aln.positions are different views of the same array, so if you change one you will change the other. This will no longer be true if you assign to seqs or positions directly, so don’t do it. If you want to change the data in the whole array, always assign to a slice so that both views update: aln.seqs[:] = x instead of aln.seqs = x. If you get the two views out of sync, you will get all sorts of exceptions. No validation is performed on aln.seqs and aln.positions for performance reasons, so this can really get you into trouble.
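The shared-view behaviour can be illustrated with plain NumPy arrays. This is a sketch only: the `seqs` and `positions` variables below are stand-ins built with a transpose, assumed here for illustration, and are not the ArrayAlignment attributes themselves.

```python
import numpy as np

# Hypothetical layout: rows are sequences, columns are positions.
seqs = np.array([[0, 1, 2],
                 [3, 0, 1]], dtype=np.uint8)
positions = seqs.T  # a transposed view, no data is copied

assert np.shares_memory(seqs, positions)

# Writing through one view is visible through the other.
positions[0, 1] = 9
assert seqs[1, 0] == 9

# Slice assignment overwrites the shared buffer, so both stay in sync.
new_data = np.array([[1, 1, 1],
                     [2, 2, 2]], dtype=np.uint8)
seqs[:] = new_data
assert (positions == new_data.T).all()

# Rebinding the name points it at a different array; the old view
# still shows the old buffer and the two are now out of sync.
seqs = np.zeros((2, 3), dtype=np.uint8)
assert not np.shares_memory(seqs, positions)
assert positions[2, 1] == 2
```

The last two assertions show the failure mode the warning describes: after rebinding, nothing ties the two arrays together any more.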
Alignments are intended to be immutable, though this is not enforced. If you change the data after the alignment is created, all sorts of bad things might happen.
Class properties:
- alphabet: should be an Alphabet object. Must provide a mapping between items (possibly, but not necessarily, characters) in the alignment and the indices of those items in the resulting Alignment object.
- SequenceType: constructor to use when building sequences. Default: Sequence
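The item-to-index mapping that an alphabet must provide can be sketched in plain Python. The `build_motif_index` and `encode` helpers below are hypothetical illustrations, not the cogent3 Alphabet API, and the `TCAG` symbol order is an assumption.

```python
from itertools import product


def build_motif_index(symbols, motif_length):
    """Map every motif of the given length to a unique integer index."""
    motifs = ("".join(p) for p in product(symbols, repeat=motif_length))
    return {motif: i for i, motif in enumerate(motifs)}


def encode(seq, motif_index, motif_length):
    """Convert a sequence into a list of non-overlapping motif indices.

    Assumes len(seq) is a multiple of motif_length.
    """
    return [
        motif_index[seq[i:i + motif_length]]
        for i in range(0, len(seq), motif_length)
    ]


codon_index = build_motif_index("TCAG", 3)
encoded = encode("ATGTTT", codon_index, 3)
print(len(codon_index), encoded)  # 64 [35, 0]
```

The number of keys in the mapping is what determines the smallest unsigned data type that can hold the alignment array.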
Creating a new array will always result in a new object unless you use the force_same_object=True parameter.
WARNING: Rebinding the names attribute in an ArrayAlignment is not recommended because not all methods will use the updated name order. This is because the original sequence and name order are used to produce data structures that are cached for efficiency, and are not updated if you change the names attribute.
WARNING: ArrayAlignment strips off info objects from sequences that have them, primarily for efficiency.
Attributes
- annotation_db
- named_seqs
- num_seqs: Returns the number of sequences in the alignment.
- positions: Override superclass positions to return positions as symbols.
- seqs
Methods
- add_from_ref_aln(ref_aln[, before_name, ...]): Insert sequence(s) to self based on their alignment to a reference sequence.
- add_seqs(other[, before_name, after_name]): Returns new object of class self with sequences from other added.
- alignment_quality([app_name]): Computes the alignment quality using the indicated app
- apply_pssm([pssm, path, background, ...]): scores sequences using the specified pssm
- coevolution([method, segments, drawable, ...]): performs pairwise coevolution measurement
- copy(): Returns deep copy of self.
- count_gaps_per_pos([include_ambiguity]): return counts of gaps per position as a DictArray
- count_gaps_per_seq([induced_by, unique, ...]): return counts of gaps per sequence as a DictArray
- counts([motif_length, include_ambiguity, ...]): counts of motifs
- counts_per_pos([motif_length, ...]): return DictArray of counts per position
- counts_per_seq([motif_length, ...]): counts of non-overlapping motifs per sequence
- deepcopy([sliced]): Returns deep copy of self.
- degap(**kwargs): Returns copy in which sequences have no gaps.
- distance_matrix([calc, show_progress, ...]): Returns pairwise distances between sequences.
- dotplot([name1, name2, window, threshold, ...]): make a dotplot between specified sequences.
- entropy_per_pos([motif_length, ...]): returns Shannon entropy per position
- entropy_per_seq([motif_length, ...]): returns the Shannon entropy per sequence
- filtered(predicate[, motif_length, ...]): The alignment positions where predicate(column) is true.
- get_ambiguous_positions(): Returns dict of seq:{position:char} for ambiguous chars.
- get_degapped_relative_to(name): Remove all columns with gaps in sequence with given name.
- get_gap_array([include_ambiguity]): returns bool array with gap state True, False otherwise
- get_gapped_seq(seq_name[, recode_gaps]): Return a gapped Sequence object for the specified seqname.
- get_identical_sets([mask_degen]): returns sets of names for sequences that are identical
- get_lengths([include_ambiguity, allow_gap]): returns {name: seq length, ...}
- get_motif_probs([alphabet, ...]): Return a dictionary of motif probs, calculated as the averaged frequency across sequences.
- get_position_indices(f[, native, negate]): Returns list of column indices for which f(col) is True.
- get_seq(seqname): Return a sequence object for the specified seqname.
- get_seq_indices(f[, negate]): Returns list of keys of seqs where f(row) is True.
- get_similar(target[, min_similarity, ...]): Returns new Alignment containing sequences similar to target.
- get_sub_alignment([seqs, pos, negate_seqs, ...]): Returns subalignment of specified sequences and positions.
- get_translation([gc, incomplete_ok, ...]): translate from nucleic acid to protein
- has_terminal_stop([gc, strict]): Returns True if any sequence has a terminal stop codon.
- information_plot([width, height, window, ...]): plot information per position
- is_ragged(): Returns True if alignment has sequences of different lengths.
- iter_positions([pos_order]): Iterates over positions in the alignment, in order.
- iter_selected([seq_order, pos_order]): Iterates over elements in the alignment.
- iter_seqs([seq_order]): Iterates over values (sequences) in the alignment, in order.
- iupac_consensus([alphabet, allow_gaps]): Returns string containing IUPAC consensus sequence of the alignment.
- majority_consensus(): Returns list containing most frequent item at each position.
- matching_ref(ref_name, gap_fraction, gap_run): Returns new alignment with seqs well aligned with a reference.
- no_degenerates([motif_length, allow_gap]): returns new alignment without degenerate characters
- omit_bad_seqs([quantile]): Returns new alignment without sequences with a number of uniquely introduced gaps exceeding quantile
- omit_gap_pos([allowed_gap_frac, motif_length]): Returns new alignment where all cols (motifs) have <= allowed_gap_frac gaps.
- omit_gap_runs([allowed_run]): Returns new alignment where all seqs have runs of gaps <= allowed_run.
- omit_gap_seqs([allowed_gap_frac]): Returns new alignment with seqs that have <= allowed_gap_frac.
- pad_seqs([pad_length]): Returns copy in which sequences are padded to same length.
- probs_per_pos([motif_length, ...]): returns MotifFreqsArray per position
- probs_per_seq([motif_length, ...]): return MotifFreqsArray per sequence
- quick_tree([calc, bootstrap, drop_invalid, ...]): Returns a phylogenetic tree built from pairwise distances between sequences.
- rc(): Returns the reverse complement alignment
- rename_seqs(renamer): returns new instance with sequences renamed
- replace_seqs(seqs[, aa_to_codon]): Returns new alignment with same shape but with data taken from seqs.
- reverse_complement(): Returns the reverse complement alignment.
- sample([n, with_replacement, motif_length, ...]): Returns random sample of positions from self.
- seqlogo([width, height, wrap, vspace, colours]): returns Drawable sequence logo using mutual information
- set_repr_policy([num_seqs, num_pos, ...]): specify policy for repr(self)
- sliding_windows(window, step[, start, end]): Generator yielding new alignments of given length and interval.
- strand_symmetry([motif_length]): returns dict of strand symmetry test results per seq
- take_positions(cols[, negate]): Returns new Alignment containing only specified positions.
- take_positions_if(f[, negate]): Returns new Alignment containing cols where f(col) is True.
- take_seqs(seqs[, negate]): Returns new Alignment containing only specified seqs.
- take_seqs_if(f[, negate]): Returns new Alignment containing seqs where f(row) is True.
- to_dict(): Returns the alignment as dict of names -> strings.
- to_dna(): returns copy of self as an alignment of DNA moltype seqs
- to_fasta(): Return alignment in Fasta format
- to_html([name_order, wrap, limit, ref_name, ...]): returns html with embedded styles for sequence colouring
- to_json(): returns json formatted string
- to_moltype(moltype): returns copy of self with moltype seqs
- to_nexus(seq_type[, wrap]): Return alignment in NEXUS format and mapping to sequence ids
- to_phylip(): Return alignment in PHYLIP format and mapping to sequence ids
- to_pretty([name_order, wrap]): returns a string representation of the alignment in pretty print format
- to_protein(): returns copy of self as an alignment of PROTEIN moltype seqs
- to_rich_dict(): returns detailed content including info and moltype attributes
- to_rna(): returns copy of self as an alignment of RNA moltype seqs
- to_type([array_align, moltype, alphabet]): returns alignment of type indicated by array_align
- trim_stop_codons([gc, strict]): Removes any terminal stop codons from the sequences
- variable_positions([include_gap_motif]): Return a list of variable position indexes.
- with_modified_termini(): Changes the termini to include termini char instead of gapmotif.
- write([filename, format]): Write the alignment to a file, preserving order of sequences.
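To make the column-wise summaries in the method list concrete, here is a plain-Python sketch of what a majority consensus and a gap-fraction filter compute. It works on a dict of names to strings (the shape `to_dict` is described as returning) and uses hypothetical helper names. It is not how cogent3 implements `majority_consensus` or `omit_gap_pos`, and treating only `-` as a gap is an assumption.

```python
from collections import Counter


def majority_consensus_sketch(aln):
    """Most frequent item in each column of a names -> strings dict."""
    columns = zip(*aln.values())
    return [Counter(col).most_common(1)[0][0] for col in columns]


def omit_gap_pos_sketch(aln, allowed_gap_frac):
    """Keep columns whose fraction of '-' is <= allowed_gap_frac."""
    columns = list(zip(*aln.values()))
    keep = [
        i for i, col in enumerate(columns)
        if col.count("-") / len(col) <= allowed_gap_frac
    ]
    return {name: "".join(seq[i] for i in keep) for name, seq in aln.items()}


aln = {"s1": "ACG-T", "s2": "ACGAT", "s3": "AT--T"}
consensus = majority_consensus_sketch(aln)
filtered = omit_gap_pos_sketch(aln, allowed_gap_frac=0.5)
print(consensus)  # ['A', 'C', 'G', '-', 'T']
print(filtered)   # {'s1': 'ACGT', 's2': 'ACGT', 's3': 'AT-T'}
```

Both helpers return new objects and leave the input untouched, mirroring the "Returns new alignment" wording used throughout the method summaries.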