snpio.analysis package
Submodules
snpio.analysis.genotype_encoder module
- class snpio.analysis.genotype_encoder.GenotypeEncoder(genotype_data)[source]
Bases:
object
Encode genotypes to various formats suitable for machine learning.
This class provides methods to encode genotypes to various formats suitable for machine learning, including 012, one-hot, and integer encodings, as well as the inverse operations.
Example
>>> # Import necessary modules
>>> from snpio import VCFReader, GenotypeEncoder
>>>
>>> # Initialize VCFReader and GenotypeEncoder objects
>>> gd = VCFReader(filename="my_vcf.vcf", popmapfile="my_popmap.txt")
>>> ge = GenotypeEncoder(gd)
>>>
>>> # Encode genotypes to 012, one-hot, and integer formats
>>> # (these are properties, so they are accessed without parentheses)
>>> gt_012 = ge.genotypes_012
>>> gt_onehot = ge.genotypes_onehot
>>> gt_int = ge.genotypes_int
>>>
>>> # Inverse operations
>>> ge.genotypes_012 = gt_012
>>> ge.genotypes_onehot = gt_onehot
>>> ge.genotypes_int = gt_int
- plot_format
Plot format for the data.
- Type:
str
- prefix
Prefix for the output directory.
- Type:
str
- verbose
If True, display verbose output.
- Type:
bool
- snp_data
List of lists of SNPs.
- Type:
List[List[str]]
- samples
List of sample IDs.
- Type:
List[str]
- filetype
File type of the data.
- Type:
str
- missing_vals
List of missing values.
- Type:
List[str]
- replace_vals
List of values to replace missing values with.
- Type:
List[str]
- convert_012(snps)[source]
Encode IUPAC nucleotides as 0 (reference), 1 (heterozygous), and 2 (alternate) alleles.
- Parameters:
snps (List[List[str]]) – 2D list of genotypes of shape (n_samples, n_sites).
- Returns:
Encoded 012 genotypes.
- Return type:
List[List[int]]
Warning
Monomorphic sites are detected and encoded as 0 (reference).
Non-biallelic sites are detected and forced to be bi-allelic.
Sites with all missing data are detected and excluded from the alignment.
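The rules above can be illustrated with a small, self-contained sketch. This is not the snpio implementation: the function names, the choice of the most frequent allele as the reference, and the use of -9 for missing calls are assumptions made for illustration only.

```python
from collections import Counter

# IUPAC codes expanded to their two constituent alleles.
IUPAC_TO_ALLELES = {
    "A": ("A", "A"), "C": ("C", "C"), "G": ("G", "G"), "T": ("T", "T"),
    "R": ("A", "G"), "Y": ("C", "T"), "S": ("C", "G"),
    "W": ("A", "T"), "K": ("G", "T"), "M": ("A", "C"),
}
MISSING = -9  # assumed missing-data code (hypothetical for this sketch)


def encode_site_012(site):
    """Encode one site (one IUPAC code per sample) as 0/1/2, or None if all missing."""
    counts = Counter()
    for code in site:
        counts.update(IUPAC_TO_ALLELES.get(code.upper(), ()))
    if not counts:
        return None  # all-missing site: the caller excludes it
    # Assumption: the most frequent allele is the reference (ties broken alphabetically).
    ref = sorted(counts, key=lambda allele: (-counts[allele], allele))[0]
    encoded = []
    for code in site:
        alleles = IUPAC_TO_ALLELES.get(code.upper())
        if alleles is None:
            encoded.append(MISSING)
        else:
            # Any non-reference allele counts as "alternate", which collapses
            # multi-allelic sites to two states.
            encoded.append(sum(allele != ref for allele in alleles))
    return encoded


def encode_012(snps):
    """Encode a (n_samples, n_sites) list of lists, dropping all-missing sites."""
    columns = [encode_site_012(list(column)) for column in zip(*snps)]
    kept = [column for column in columns if column is not None]
    return [list(row) for row in zip(*kept)]


example = encode_012([["A", "N", "C"], ["R", "N", "C"], ["G", "N", "C"]])
```

In this sketch the second site is dropped because every call is missing, and the third site is monomorphic, so it is encoded as all zeros.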
- convert_int_iupac(snp_data, encodings_dict=None)[source]
Convert input data to integer-encoded format (0-9) based on IUPAC codes.
This method converts input data to integer-encoded format (0-9) based on IUPAC codes. The integer encoding is as follows: A=0, T=1, G=2, C=3, W=4, R=5, M=6, K=7, Y=8, S=9, N=-9.
- Parameters:
snp_data (numpy.ndarray of shape (n_samples, n_SNPs) or List[List[int]]) – Input 012-encoded data.
encodings_dict (Dict[str, int] or None) – Encodings to convert structure to phylip format.
- Returns:
Integer-encoded data.
- Return type:
numpy.ndarray
Note
If the data file type is “phylip” or “vcf” and encodings_dict is not provided, default encodings based on IUPAC codes are used.
If the data file type is “structure” and encodings_dict is not provided, default encodings for alleles are used.
Otherwise, if encodings_dict is provided, it will be used for conversion.
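As a minimal sketch of the mapping quoted above, the following applies the documented integer codes to a 2D list of IUPAC characters. The function name and the treatment of unrecognized characters as missing (-9) are assumptions for illustration, not the library's behavior.

```python
# Mapping quoted from the method description; N (missing) maps to -9.
IUPAC_TO_INT = {
    "A": 0, "T": 1, "G": 2, "C": 3, "W": 4,
    "R": 5, "M": 6, "K": 7, "Y": 8, "S": 9, "N": -9,
}


def encode_int_iupac(snps, encodings=None):
    """Map a 2D list of IUPAC codes to integers using a custom or default table."""
    table = IUPAC_TO_INT if encodings is None else encodings
    # Assumption: anything not in the table is treated as missing (-9).
    return [[table.get(code.upper(), -9) for code in row] for row in snps]


encoded = encode_int_iupac([["A", "R", "N"], ["t", "C", "-"]])
```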
- convert_onehot(snp_data, encodings_dict=None)[source]
Convert input data to one-hot encoded format.
- Parameters:
snp_data (Union[np.ndarray, List[List[int]]]) – Input 012-encoded data of shape (n_samples, n_SNPs).
encodings_dict (Optional[Dict[str, int]]) – Encodings to convert structure to phylip format. Defaults to None.
- Returns:
One-hot encoded data.
- Return type:
np.ndarray
Note
If the data file type is “phylip” and encodings_dict is not provided, default encodings for nucleotides are used.
If the data file type is “structure”, “structure1row”, or “structure2row” and encodings_dict is not provided, default encodings for alleles are used.
Otherwise, if encodings_dict is provided, it will be used for conversion.
- decode_012(X, write_output=True, is_nuc=False)[source]
Decode 012-encoded or 0-9 integer-encoded imputed data to STRUCTURE or PHYLIP format.
This method decodes 012-encoded or 0-9 integer-encoded imputed data to IUPAC format. The decoded data can be saved to a file or returned as a DataFrame.
- Parameters:
X (pandas.DataFrame, numpy.ndarray, or List[List[int]]) – Imputed data to decode, encoded as 012 or 0-9 integers.
write_output (bool, optional) – If True, save the decoded output to a file. If False, return the decoded data as a DataFrame. Defaults to True.
is_nuc (bool, optional) – Whether the encoding is based on nucleotides instead of 012. Defaults to False.
- Returns:
If write_output is True, returns the filename where the imputed data was written. If write_output is False, returns the decoded data as a DataFrame.
- Return type:
str or pandas.DataFrame
Todo
Check if VAE still uses IUPAC encodings.
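Decoding 012 values back to IUPAC characters requires knowing the reference and alternate allele at each site. The sketch below shows that idea for a single site. The function name and the explicit `ref` and `alt` arguments are hypothetical, and any value other than 0, 1, or 2 is assumed to be missing.

```python
# Heterozygote IUPAC codes keyed by the unordered pair of alleles.
HET_CODES = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}


def decode_site_012(codes, ref, alt):
    """Decode 0/1/2 calls for one site into IUPAC characters."""
    # A monomorphic site (ref == alt) has no heterozygote code; fall back to N.
    het = HET_CODES.get(frozenset((ref, alt)), "N")
    lookup = {0: ref, 1: het, 2: alt}
    # Assumption: any other value (for example -9) marks missing data.
    return [lookup.get(code, "N") for code in codes]


decoded = decode_site_012([0, 1, 2, -9], ref="A", alt="G")
```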
- property genotypes_012: List[List[int]] | ndarray | DataFrame
Encoded 012 genotypes as a 2D list, numpy array, or pandas DataFrame.
This method encodes genotypes as 0 (reference), 1 (heterozygous), and 2 (alternate) alleles. The encoded genotypes are returned as a 2D list, numpy array, or pandas DataFrame.
- Returns:
encoded 012 genotypes.
- Return type:
List[List[int]], np.ndarray, or pd.DataFrame
Example
>>> gd = VCFReader(
...     filename="snpio/example_data/vcf_files/phylogen_subset14K_sorted.vcf.gz",
...     popmapfile="snpio/example_data/popmaps/phylogen_nomx.popmap",
...     force_popmap=True,
...     chunk_size=5000,
...     verbose=False,
... )
>>> ge = GenotypeEncoder(gd)
>>> gt012 = ge.genotypes_012
>>> print(gt012)
[[0, 1, 2], [0, 1, 2], [0, 1, 2]]
- property genotypes_int: ndarray
Integer-encoded (0-9 including IUPAC characters) snps format.
Integer-encoded genotypes are returned as a 2D numpy array of shape (n_samples, n_sites). The integer encoding is as follows: A=0, T=1, G=2, C=3, W=4, R=5, M=6, K=7, Y=8, S=9, N=-9. Missing values are encoded as -9.
- Returns:
2D array of shape (n_samples, n_sites), integer-encoded from 0-9 with IUPAC characters.
- Return type:
numpy.ndarray
- property genotypes_onehot: ndarray
One-hot encoded snps format of shape (n_samples, n_loci, 4).
One-hot encoded genotypes are returned as a 3D numpy array of shape (n_samples, n_loci, 4). The one-hot encoding is as follows: A=[1, 0, 0, 0], T=[0, 1, 0, 0], G=[0, 0, 1, 0], C=[0, 0, 0, 1]. Missing values are encoded as [0, 0, 0, 0]. The one-hot encoding is based on the IUPAC ambiguity codes. Heterozygous sites are encoded as 0.5 for each allele.
- Returns:
One-hot encoded numpy array of shape (n_samples, n_loci, 4).
- Return type:
numpy.ndarray
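The encoding described above can be sketched in plain Python as follows. The column order A, T, G, C and the 0.5 weights for heterozygotes come from the description. The function names and the use of nested lists instead of a numpy array are assumptions for illustration.

```python
BASES = ("A", "T", "G", "C")  # column order stated in the description

# Alleles represented by each IUPAC code; codes not listed are treated as missing.
IUPAC_TO_ALLELES = {
    "A": "A", "T": "T", "G": "G", "C": "C",
    "W": "AT", "R": "AG", "M": "AC", "K": "GT", "Y": "CT", "S": "CG",
}


def onehot_code(code):
    """Return the length-4 vector for one IUPAC code."""
    alleles = IUPAC_TO_ALLELES.get(code.upper(), "")
    if not alleles:
        return [0.0, 0.0, 0.0, 0.0]  # missing data
    weight = 1.0 / len(alleles)  # 1.0 for homozygotes, 0.5 for heterozygotes
    return [weight if base in alleles else 0.0 for base in BASES]


def encode_onehot(snps):
    """Encode a (n_samples, n_loci) list of lists to shape (n_samples, n_loci, 4)."""
    return [[onehot_code(code) for code in row] for row in snps]


onehot = encode_onehot([["A", "R"], ["N", "c"]])
```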
- inverse_int_iupac(int_encoded_data, encodings_dict=None)[source]
Convert integer-encoded data back to original format.
This method converts integer-encoded data back to the original format based on IUPAC codes. The integer encoding is as follows: A=0, T=1, G=2, C=3, W=4, R=5, M=6, K=7, Y=8, S=9, N=-9.
- Parameters:
int_encoded_data (numpy.ndarray of shape (n_samples, n_SNPs) or List[List[int]]) – Input integer-encoded data.
encodings_dict (Dict[str, int] or None) – Encodings to convert from integer encoding to original format.
- Returns:
Original format data.
- Return type:
numpy.ndarray
Note
If the data file type is “phylip” or “vcf” and encodings_dict is not provided, default encodings based on IUPAC codes are used.
If the data file type is “structure” and encodings_dict is not provided, default encodings for alleles are used.
Otherwise, if encodings_dict is provided, it will be used for conversion.
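A minimal sketch of the inverse mapping, using the integer codes quoted above. The function name is hypothetical, and values outside the table are assumed to decode to N.

```python
# Inverse of the documented mapping A=0, T=1, G=2, C=3, W=4, R=5, M=6, K=7, Y=8, S=9, N=-9.
INT_TO_IUPAC = {
    0: "A", 1: "T", 2: "G", 3: "C", 4: "W",
    5: "R", 6: "M", 7: "K", 8: "Y", 9: "S", -9: "N",
}


def decode_int_iupac(int_data, encodings=None):
    """Map a 2D list of integer codes back to characters."""
    if encodings is None:
        table = INT_TO_IUPAC
    else:
        # A custom table maps character -> integer, so invert it for decoding.
        table = {value: key for key, value in encodings.items()}
    # Assumption: integers missing from the table decode to N.
    return [[table.get(int(value), "N") for value in row] for row in int_data]


decoded_int = decode_int_iupac([[0, 5, -9], [1, 3, 42]])
```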
- inverse_onehot(onehot_data, encodings_dict=None)[source]
Convert one-hot encoded data back to original format.
- Parameters:
onehot_data (Union[np.ndarray, List[List[float]]]) – Input one-hot encoded data of shape (n_samples, n_SNPs).
encodings_dict (Optional[Dict[str, List[float]]]) – Encodings to convert from one-hot encoding to original format. Defaults to None.
- Returns:
Original format data.
- Return type:
np.ndarray
Note
If the data file type is “phylip” or “vcf” and encodings_dict is not provided, default encodings based on IUPAC codes are used.
If the data file type is “structure” and encodings_dict is not provided, default encodings for alleles are used.
Otherwise, if encodings_dict is provided, it will be used for conversion.
If the input data is a numpy array, it will be converted to a list of lists before decoding.
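The reverse lookup can be sketched by keying a dictionary on the one-hot vectors themselves. The column order A, T, G, C follows the genotypes_onehot description. The function name and the fallback to N for unrecognized vectors are assumptions for illustration.

```python
# One-hot vectors (column order A, T, G, C) mapped back to IUPAC codes.
ONEHOT_TO_IUPAC = {
    (1.0, 0.0, 0.0, 0.0): "A", (0.0, 1.0, 0.0, 0.0): "T",
    (0.0, 0.0, 1.0, 0.0): "G", (0.0, 0.0, 0.0, 1.0): "C",
    (0.5, 0.5, 0.0, 0.0): "W", (0.5, 0.0, 0.5, 0.0): "R",
    (0.5, 0.0, 0.0, 0.5): "M", (0.0, 0.5, 0.5, 0.0): "K",
    (0.0, 0.5, 0.0, 0.5): "Y", (0.0, 0.0, 0.5, 0.5): "S",
    (0.0, 0.0, 0.0, 0.0): "N",
}


def decode_onehot(onehot_data):
    """Decode nested lists of shape (n_samples, n_loci, 4) to IUPAC codes."""
    return [
        # Assumption: vectors that match no known code decode to N.
        [ONEHOT_TO_IUPAC.get(tuple(float(x) for x in vector), "N") for vector in row]
        for row in onehot_data
    ]


decoded_onehot = decode_onehot([[[1, 0, 0, 0], [0.5, 0, 0.5, 0]], [[0, 0, 0, 0], [0, 0, 0, 1]]])
```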
snpio.analysis.tree_parser module
- class snpio.analysis.tree_parser.TreeParser(genotype_data, treefile, qmatrix=None, siterates=None, verbose=False, debug=False)[source]
Bases:
GenotypeData
TreeParser class for reading and manipulating phylogenetic trees.
This class provides methods for reading, writing, and manipulating phylogenetic trees. The TreeParser class inherits from the GenotypeData class and provides additional functionality for working with phylogenetic trees. The TreeParser class can read phylogenetic trees from Newick or NEXUS format files, calculate basic statistics for the tree, extract subtrees, prune the tree, reroot the tree, and calculate pairwise distance matrices.
Example
>>> tp = TreeParser(
...     genotype_data=gd_filt,
...     treefile="snpio/example_data/trees/test.tre",
...     qmatrix="snpio/example_data/trees/test.iqtree",
...     siterates="snpio/example_data/trees/test14K.rate",
...     verbose=True,
...     debug=False,
... )
>>>
>>> tree = tp.read_tree()
>>> print(tp.tree_stats())
>>> tp.reroot_tree("~EA")
>>> print(tp.get_distance_matrix())
>>> print(tp.qmat)
>>> print(tp.site_rates)
>>> subtree = tp.get_subtree("~EA")
>>> pruned_tree = tp.prune_tree("~ON")
>>> print(tp.write_tree(subtree, save_path=None))
>>> print(tp.write_tree(pruned_tree, save_path=None))
- genotype_data
GenotypeData object containing the SNP data.
- Type:
GenotypeData
- treefile
Path to the phylogenetic tree file.
- Type:
str
- qmatrix
Path to the Q matrix file.
- Type:
str
- siterates
Path to the site rates file.
- Type:
str
- verbose
Whether to display verbose output.
- Type:
bool
- debug
Whether to display debug output.
- Type:
bool
- get_distance_matrix()[source]
Calculate the pairwise distance matrix between all tips in the tree.
This method computes the pairwise distance matrix between all nodes and tips in the phylogenetic tree. The distance matrix is returned as a pandas DataFrame object.
- Returns:
Pairwise distance matrix as a pandas DataFrame.
- Return type:
pd.DataFrame
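To make the idea of a pairwise tip distance concrete, the sketch below sums branch lengths along the path between two tips through their most recent common ancestor. It does not use toytree or snpio. The tree is a hypothetical child-to-parent mapping with made-up branch lengths, and the result is a nested dictionary, not a DataFrame.

```python
def root_path(node, parents):
    """Return {ancestor: distance from node}, including the node itself at 0.0."""
    distances = {node: 0.0}
    total = 0.0
    while node in parents:
        parent, length = parents[node]
        total += length
        distances[parent] = total
        node = parent
    return distances


def tip_distance_matrix(parents):
    """Compute patristic distances between all tips of a rooted tree."""
    internal = {parent for parent, _ in parents.values()}
    tips = sorted(node for node in parents if node not in internal)
    paths = {tip: root_path(tip, parents) for tip in tips}
    matrix = {}
    for a in tips:
        matrix[a] = {}
        for b in tips:
            # With non-negative branch lengths, the shared ancestor that
            # minimizes the summed distance is the most recent common ancestor.
            matrix[a][b] = min(
                paths[a][node] + paths[b][node] for node in paths[a] if node in paths[b]
            )
    return matrix


# Hypothetical tree ((A:1.0,B:2.0)X:0.5,C:3.0)root; stored as child -> (parent, length).
parents = {"A": ("X", 1.0), "B": ("X", 2.0), "X": ("root", 0.5), "C": ("root", 3.0)}
distances = tip_distance_matrix(parents)
```

A nested dictionary of this shape can be passed to `pandas.DataFrame` to obtain a labeled square table.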
- get_subtree(regex)[source]
Get a subtree rooted at a specified node or tip.
This method extracts a subtree from the phylogenetic tree rooted at the specified node or tip. The subtree is returned as a toytree object. The regex argument can be a regular expression to match the node or tip name. Regular expressions can be prefixed with ‘~’ to indicate taxa to keep.
- Parameters:
regex (str) – Regular expression to match the node or tip name. Regular expressions can be prefixed with ‘~’ to indicate taxa to keep.
- Returns:
The subtree rooted at the specified node.
- Return type:
toytree.tree
- load_tree_from_string(newick_str)[source]
Load a phylogenetic tree from a Newick string.
This method loads a phylogenetic tree from a Newick string and returns it as a toytree object.
- Parameters:
newick_str (str) – The Newick string representing the tree.
- Returns:
The loaded tree object.
- Return type:
toytree.tree
- prune_tree(taxa)[source]
Prune the tree by removing a set of taxa (leaf nodes).
This method prunes the tree by removing a set of taxa (leaf nodes) from the tree. The taxa argument can be a list of taxa names to remove from the tree or a regular expression to match the node or tip name. Regular expressions can be prefixed with ‘~’ to indicate taxa to keep.
- Parameters:
taxa (Union[List[str], str]) – List of taxa names to remove from the tree or a regular expression to match the node or tip name. Regular expressions can be prefixed with ‘~’ to indicate taxa to keep.
- Returns:
The pruned tree object.
- Return type:
toytree.tree
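The two ways of naming taxa (an explicit list, or a ‘~’-prefixed regular expression) can be sketched as a simple name-matching helper. This only illustrates the selection step, under the assumption that ‘~’ marks the rest of the string as a regular expression. The function name and tip labels are hypothetical, and no tree is modified.

```python
import re


def match_taxa(tip_names, query):
    """Return the tip names selected by a list of names or a '~'-prefixed regex."""
    if isinstance(query, str) and query.startswith("~"):
        # Assumption: everything after '~' is a regular expression.
        pattern = re.compile(query[1:])
        return [name for name in tip_names if pattern.search(name)]
    wanted = [query] if isinstance(query, str) else list(query)
    return [name for name in tip_names if name in wanted]


tips = ["EA_01", "EA_02", "ON_01", "GU_01"]  # made-up tip labels
by_regex = match_taxa(tips, "~EA")
by_list = match_taxa(tips, ["ON_01", "GU_01"])
```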
- property qmat: DataFrame
Get the Q matrix for the corresponding phylogenetic tree.
This method reads the Q matrix from a file and returns it as a pandas DataFrame object. The Q matrix file can be in either comma-separated or whitespace-separated format. The Q matrix should be a square matrix with columns and index in the order A, C, G, T.
- Returns:
The Q-matrix as a pandas DataFrame.
- Return type:
pandas.DataFrame
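As a sketch of the file layout described above, the following parses a bare 4x4 matrix that is either comma-separated or whitespace-separated and labels it in the order A, C, G, T. The function name, the assumption that the file holds only numeric rows (plus optional `#` comments), and the example values are all hypothetical.

```python
import re

NUCLEOTIDES = ("A", "C", "G", "T")  # row and column order stated in the description


def parse_qmatrix(text):
    """Parse a 4x4 Q matrix from comma- or whitespace-separated text."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments (assumed convention)
        fields = [field for field in re.split(r"[,\s]+", line) if field]
        rows.append([float(field) for field in fields])
    if len(rows) != 4 or any(len(row) != 4 for row in rows):
        raise ValueError("expected a square 4x4 matrix")
    return {nuc: dict(zip(NUCLEOTIDES, row)) for nuc, row in zip(NUCLEOTIDES, rows)}


# Made-up values for illustration only.
comma_text = "-1.0,0.3,0.4,0.3\n0.3,-1.0,0.3,0.4\n0.4,0.3,-1.0,0.3\n0.3,0.4,0.3,-1.0\n"
space_text = comma_text.replace(",", "  ")
qmat_from_commas = parse_qmatrix(comma_text)
qmat_from_spaces = parse_qmatrix(space_text)
```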
- read_tree()[source]
Read Newick or NEXUS-style phylogenetic tree into toytree object.
This method reads a phylogenetic tree from a file and returns it as a toytree object. The tree file can be in Newick or NEXUS format. If the tree file is not found or is unreadable, an exception is raised.
- Returns:
The input tree as a toytree object.
- Return type:
toytree.tree
- Raises:
FileNotFoundError – If the tree file is not found.
PermissionError – If the tree file exists but is not readable.
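The two error conditions can be distinguished before parsing with standard-library checks. This is a sketch of one way to do so and is not taken from the snpio source. The helper name and error messages are hypothetical.

```python
import os
import tempfile
from pathlib import Path


def check_tree_file(path):
    """Raise the documented exceptions if the tree file is missing or unreadable."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"Tree file not found: {path}")
    if not os.access(path, os.R_OK):
        raise PermissionError(f"Tree file is not readable: {path}")
    return path


# An existing file passes the check.
with tempfile.NamedTemporaryFile(suffix=".tre") as handle:
    checked = check_tree_file(handle.name)

# A missing file raises FileNotFoundError.
try:
    check_tree_file("no_such_tree.tre")
except FileNotFoundError as err:
    message = str(err)
```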
- reroot_tree(node)[source]
Reroot the tree at a specific node or tip.
This method reroots the tree at a specific node or tip, changing the root of the tree to the specified node. The rerooted tree is returned as a toytree object.
- Parameters:
node (Union[int, str]) – Index of the node or tip where the tree should be rerooted, a regex string to match the node or tip name prefixed by “~”, or a list of node or tip names.
- Returns:
The rerooted tree.
- Return type:
toytree.tree
- property site_rates: DataFrame
Get site rate data for phylogenetic tree.
This method reads the site-specific substitution rates from a file and returns them as a pandas DataFrame. The site rates file should either contain the site rates in a single column, with each rate on a separate line, or in a table format with the rates in the ‘Rate’ column, as output by IQ-TREE. For example:
    0.0000
    0.0000
    0.0000
    0.0000
    0.0000
OR:
    # Any comment lines can be included here.
    Site Rate Cat C_rate
    1 0.0000 1 0.0000
    2 0.0000 1 0.0000
    3 0.0000 1 0.0000
    4 0.0000 1 0.0000
    5 0.0000 1 0.0000
- Returns:
Site rates for the phylogenetic tree.
- Return type:
pd.DataFrame
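The two accepted layouts can be handled with one small parser, sketched below in plain Python. The function name is hypothetical, the rate values are made up, and the result is a plain list of floats, not a DataFrame.

```python
def parse_site_rates(text):
    """Parse site rates from a single-column file or a table with a 'Rate' column."""
    rows = [
        line.split()
        for line in text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")  # skip blanks and comments
    ]
    if not rows:
        return []
    header = rows[0]
    if "Rate" in header:
        # Table layout: take the 'Rate' column from every data row.
        column = header.index("Rate")
        return [float(fields[column]) for fields in rows[1:]]
    # Single-column layout: one rate per line.
    return [float(fields[0]) for fields in rows]


# Made-up values for illustration only.
single_column = "0.5\n1.5\n0.25\n"
table = "# comment line\nSite Rate Cat C_rate\n1 0.5 1 0.4\n2 1.5 2 1.6\n3 0.25 1 0.4\n"
rates_from_column = parse_site_rates(single_column)
rates_from_table = parse_site_rates(table)
```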
- property tree
Get newick tree from provided path.
This method reads the phylogenetic tree from the provided tree file path and returns it as a toytree object. If the tree file path is not provided, an exception is raised.
- Returns:
The toytree tree object.
- Return type:
toytree.tree
- tree_stats()[source]
Calculate basic statistics for the phylogenetic tree.
- Returns:
Dictionary containing tree statistics such as the number of tips, number of nodes, and total tree height.
- Return type:
Dict[str, Any]
- write_tree(tree, save_path=None, nexus=False)[source]
Write the phylogenetic tree to a file.
This method saves the phylogenetic tree to a file in Newick or NEXUS format. If the save_path argument is not provided, the tree is returned as a string representation.
- Parameters:
tree (toytree.tree) – The tree object to save.
save_path (str, optional) – Path to save the tree file. If not provided (left as None), then a string representation of the tree is returned. Defaults to None.
nexus (bool, optional) – Whether to save the tree in NEXUS format. If False, then Newick format is used. Defaults to False.
- Returns:
The string representation of the tree if save_path is None. Otherwise, None is returned.
- Return type:
Optional[str]
- Raises:
TypeError – If the input tree is not a toytree object.