snpio.read_input package
Submodules
snpio.read_input.genotype_data module
- class snpio.read_input.genotype_data.GenotypeData(filename=None, filetype=None, popmapfile=None, force_popmap=False, exclude_pops=None, include_pops=None, plot_format='png', plot_fontsize=18, plot_dpi=300, plot_despine=True, show_plots=False, prefix='snpio', verbose=False, loci_indices=None, sample_indices=None, chunk_size=1000, logger=None, debug=False)[source]
Bases:
BaseGenotypeData
A class for handling and analyzing genotype data.
The GenotypeData class is intended as a parent class for the file reader classes, such as VCFReader, StructureReader, and PhylipReader. It provides common methods and attributes for handling genotype data, such as reading population maps, subsetting data, and generating missingness reports.
Note
- GenotypeData handles the following characters as missing data:
‘N’
‘-’
‘?’
‘.’
If using PHYLIP or STRUCTURE formats, all sites will be forced to be biallelic. If multiple alleles are needed, you must use a VCF file.
- inputs
GenotypeData keyword arguments as a dictionary.
- Type:
dict
- num_snps
Number of SNPs in the dataset.
- Type:
int
- num_inds
Number of individuals in the dataset.
- Type:
int
- populations
Population IDs.
- Type:
List[Union[str, int]]
- popmap
Dictionary object with SampleIDs as keys and popIDs as values.
- Type:
dict
- popmap_inverse
Inverse dictionary of popmap, where popIDs are keys and lists of sampleIDs are values.
- Type:
dict or None
- samples
Sample IDs in input order.
- Type:
List[str]
- snpsdict
Dictionary with SampleIDs as keys and lists of genotypes as values.
- Type:
dict or None
- snp_data
Genotype data as a 2D list.
- Type:
List[List[str]]
- loci_indices
Boolean array marking the loci (columns) retained in the filtered alignment.
- Type:
np.ndarray
- sample_indices
Boolean array marking the samples (rows) retained in the alignment.
- Type:
np.ndarray
- ref
List of reference alleles of length num_snps.
- Type:
List[str]
- alt
List of alternate alleles of length num_snps.
- Type:
List[str]
- iupac_mapping
Mapping of allele tuples to IUPAC codes.
- Type:
dict
- reverse_iupac_mapping
Mapping of IUPAC codes to allele tuples.
- Type:
dict
- missing_vals
List of missing value characters.
- Type:
List[str]
- replace_vals
List of missing value replacements.
- Type:
List[pd.NA]
- logger
Logger object.
- Type:
logging.Logger
- debug
If True, display debug messages.
- Type:
bool
- plot_kwargs
Plotting keyword arguments.
- Type:
dict
- supported_filetypes
List of supported filetypes.
- Type:
List[str]
- kwargs
GenotypeData keyword arguments.
- Type:
dict
- chunk_size
Chunk size for reading in large files.
- Type:
int
- plot_format
Format to save report plots.
- Type:
str
- plot_fontsize
Font size for plots.
- Type:
int
- plot_dpi
Resolution in dots per inch for plots.
- Type:
int
- plot_despine
If True, remove the top and right spines from plots.
- Type:
bool
- show_plots
If True, display plots in the console.
- Type:
bool
- prefix
Prefix to use for output directory.
- Type:
str
- verbose
If True, display verbose output.
- Type:
bool
- read_popmap()[source]
Read population map from file to map samples to populations.
- subset_with_popmap()[source]
Subset popmap and samples based on population criteria.
- write_popmap()[source]
Write the population map to a file.
- missingness_reports()[source]
Generate missingness reports and plots.
- _make_snpsdict()[source]
Make a dictionary with SampleIDs as keys and a list of SNPs associated with the sample as the values.
- _genotype_to_iupac()[source]
Convert a genotype string to its corresponding IUPAC code.
- _iupac_to_genotype()[source]
Convert an IUPAC code to its corresponding genotype string.
- calc_missing()[source]
Calculate missing value statistics based on a DataFrame.
- copy()[source]
Create a deep copy of the GenotypeData object.
- _report2file()[source]
Write a DataFrame to a CSV file.
- get_reverse_iupac_mapping()[source]
Create a reverse mapping from IUPAC codes to allele tuples.
Example
>>> gd = GenotypeData(filename="data.vcf", filetype="vcf", popmapfile="popmap.txt")
>>> print(gd.snp_data)
[['A', 'C', 'G', 'T'], ['A', 'C', 'G', 'T'], ['A', 'C', 'G', 'T']]
>>> print(gd.num_snps)
4
>>> print(gd.num_inds)
3
>>> print(gd.populations)
['pop1', 'pop2', 'pop2']
>>> print(gd.popmap)
{'sample1': 'pop1', 'sample2': 'pop2', 'sample3': 'pop2'}
>>> print(gd.samples)
['sample1', 'sample2', 'sample3']
- property alt: List[str]
Get list of alternate alleles of length num_snps.
- Returns:
List of alternate alleles of length num_snps.
- Return type:
List[str]
- calc_missing(df, use_pops=True)[source]
Calculate missing value statistics based on a DataFrame.
- Parameters:
df (pd.DataFrame) – Input DataFrame containing genotype data.
use_pops (bool, optional) – If True, calculate statistics per population. Defaults to True.
- Returns:
A tuple of missing value statistics:
loc (pd.Series): Missing value proportions per locus.
ind (pd.Series): Missing value proportions per individual.
poploc (Optional[pd.DataFrame]): Missing value proportions per population and locus. Only returned if use_pops=True.
poptot (Optional[pd.Series]): Missing value proportions per population. Only returned if use_pops=True.
indpop (Optional[pd.DataFrame]): Missing value proportions per individual and population. Only returned if use_pops=True.
- Return type:
Tuple[pd.Series, pd.Series, Optional[pd.DataFrame], Optional[pd.Series], Optional[pd.DataFrame]]
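The statistics described above can be illustrated with a small standard-library sketch. It is not the library's implementation: the helper name `missing_proportions` is hypothetical, it works on plain lists rather than a DataFrame, and it returns only three of the five values (per locus, per individual, per population).

```python
from collections import defaultdict

# Missing-data characters listed in the GenotypeData class note.
MISSING = {"N", "-", "?", "."}


def missing_proportions(snp_data, samples, popmap=None):
    """Hypothetical helper: missing proportions for a samples x loci matrix."""
    n_inds = len(snp_data)
    n_loci = len(snp_data[0])

    # Proportion of missing calls in each column (locus).
    loc = [sum(row[j] in MISSING for row in snp_data) / n_inds for j in range(n_loci)]
    # Proportion of missing calls in each row (individual).
    ind = {s: sum(g in MISSING for g in row) / n_loci for s, row in zip(samples, snp_data)}

    if popmap is None:
        return loc, ind, None

    # Pool all genotype calls by population, then take the missing fraction.
    calls = defaultdict(list)
    for s, row in zip(samples, snp_data):
        calls[popmap[s]].extend(row)
    poptot = {pop: sum(g in MISSING for g in gts) / len(gts) for pop, gts in calls.items()}
    return loc, ind, poptot


snp_data = [["A", "N", "G", "T"], ["A", "C", "-", "T"], ["?", "C", "G", "."]]
samples = ["sample1", "sample2", "sample3"]
popmap = {"sample1": "pop1", "sample2": "pop2", "sample3": "pop2"}

loc, ind, poptot = missing_proportions(snp_data, samples, popmap)
print(ind)     # {'sample1': 0.25, 'sample2': 0.25, 'sample3': 0.5}
print(poptot)  # {'pop1': 0.25, 'pop2': 0.375}
```

Passing `popmap=None` mirrors the documented `use_pops=False` case, in which the population-level values are not computed.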
- copy()[source]
Create a deep copy of the GenotypeData or VCFReader object.
- Returns:
A new object with the same attributes as the original.
- Return type:
GenotypeData or VCFReader
- get_reverse_iupac_mapping()[source]
Creates a reverse mapping from IUPAC codes to allele tuples.
- Returns:
Mapping of IUPAC codes to allele tuples
- Return type:
Dict[str, Tuple[str, str]]
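As a rough illustration of the two mappings, the sketch below builds a small allele-tuple-to-IUPAC table and inverts it. The names `IUPAC_MAPPING` and `reverse_mapping` and the exact table contents are assumptions for illustration; the library's own dictionaries may hold more entries or order the alleles differently.

```python
# Hypothetical subset of an allele-tuple -> IUPAC-code table, using the
# standard IUPAC nucleotide ambiguity codes. Not necessarily SNPio's table.
IUPAC_MAPPING = {
    ("A", "A"): "A",
    ("C", "C"): "C",
    ("G", "G"): "G",
    ("T", "T"): "T",
    ("A", "G"): "R",
    ("C", "T"): "Y",
    ("C", "G"): "S",
    ("A", "T"): "W",
    ("G", "T"): "K",
    ("A", "C"): "M",
}


def reverse_mapping(mapping):
    """Invert the table so that IUPAC codes map back to allele tuples."""
    return {code: alleles for alleles, code in mapping.items()}


REVERSE_IUPAC = reverse_mapping(IUPAC_MAPPING)
print(REVERSE_IUPAC["R"])  # ('A', 'G')
```

The inversion is only well defined because each code appears once in the forward table.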
- property inputs: Dict[str, Any]
Get GenotypeData keyword arguments as a dictionary.
- Returns:
GenotypeData keyword arguments as a dictionary.
- Return type:
Dict[str, Any]
- property loci_indices: ndarray
Boolean array for retained loci in filtered alignment.
- Returns:
Boolean array of loci indices, with True for retained loci and False for excluded loci.
- Return type:
np.ndarray
- Raises:
TypeError – If the loci_indices attribute is not a numpy.ndarray or list.
TypeError – If the loci_indices attribute does not have a boolean dtype.
- missingness_reports(prefix=None, zoom=True, bar_color='gray', heatmap_palette='magma')[source]
Generate missingness reports and plots.
The function will write several comma-delimited report files:
individual_missingness.csv: Missing proportions per individual.
locus_missingness.csv: Missing proportions per locus.
population_missingness.csv: Missing proportions per population (only generated if popmapfile was passed to GenotypeData).
population_locus_missingness.csv: Table of per-population and per-locus missing data proportions.
A file missingness.<plot_format> will also be saved. It contains the following subplots:
Barplot with per-individual missing data proportions.
Barplot with per-locus missing data proportions.
Barplot with per-population missing data proportions (only if popmapfile was passed to GenotypeData).
Heatmap showing per-population + per-locus missing data proportions (only if popmapfile was passed to GenotypeData).
Stacked barplot showing missing data proportions per individual.
Stacked barplot showing missing data proportions per population (only if popmapfile was passed to GenotypeData).
If popmapfile was not passed to GenotypeData, then the subplots and report files that require populations are not included.
- Parameters:
prefix (str, optional) – Output file prefix for the missingness report. Defaults to None.
zoom (bool, optional) – If True, zoom in to the missing proportion range on some of the plots. If False, the plot range is fixed at [0, 1]. Defaults to True.
bar_color (str, optional) – Color of the bars on the non-stacked bar plots. Can be any color supported by matplotlib. See the matplotlib.colors documentation. Defaults to ‘gray’.
heatmap_palette (str, optional) – Color palette for the heatmap plot. Defaults to ‘magma’.
- Return type:
None
- property num_inds: int
Number of individuals (samples) in dataset.
- Returns:
Number of individuals (samples) in input data.
- Return type:
int
- property num_snps: int
Number of SNPs (loci) in the dataset.
- Returns:
Number of SNPs (loci) per individual.
- Return type:
int
- property popmap: Dict[str, str]
Dictionary object with SampleIDs as keys and popIDs as values.
- Returns:
Dictionary with SampleIDs as keys and popIDs as values
- Return type:
Dict[str, str]
- property popmap_inverse: Dict[str, List[str]]
Inverse popmap dictionary with populationIDs as keys and lists of sampleIDs as values.
- Returns:
Inverse dictionary of popmap, where popIDs are keys and lists of sampleIDs are values.
- Return type:
Dict[str, List[str]]
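The relationship between popmap and popmap_inverse can be sketched in a few lines. The function name `invert_popmap` is hypothetical and the sample data is borrowed from the class example; the library may build the inverse differently.

```python
from collections import defaultdict


def invert_popmap(popmap):
    """Group SampleIDs by population, preserving the popmap's sample order."""
    inverse = defaultdict(list)
    for sample, pop in popmap.items():
        inverse[pop].append(sample)
    return dict(inverse)


popmap = {"sample1": "pop1", "sample2": "pop2", "sample3": "pop2"}
popmap_inverse = invert_popmap(popmap)
print(popmap_inverse)  # {'pop1': ['sample1'], 'pop2': ['sample2', 'sample3']}
```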
- property populations: List[str | int]
Population IDs as a list of strings or integers.
- Returns:
Population IDs.
- Return type:
List[Union[str, int]]
- read_popmap()[source]
Read population map from file to map samples to populations.
Makes use of the ReadPopmap class to read in the popmap file and validate the samples against the alignment.
- Return type:
None
- Sets the following attributes:
samples
populations
popmap
popmap_inverse
sample_indices
- property ref: List[str]
Get list of reference alleles of length num_snps.
- Returns:
List of reference alleles of length num_snps.
- Return type:
List[str]
- property sample_indices: ndarray
Boolean array marking the samples (rows) retained in the alignment.
- Returns:
Boolean array of sample indices, with True for retained samples and False for excluded samples.
- Return type:
np.ndarray
- Raises:
TypeError – If the sample_indices attribute is not a numpy.ndarray or list.
TypeError – If the sample_indices attribute does not have a boolean dtype.
- property samples: List[str]
List of sample IDs in input order.
- Returns:
Sample IDs in input order.
- Return type:
List[str]
- set_alignment(snp_data, samples, sample_indices, loci_indices)[source]
Set the alignment data and sample IDs.
- Parameters:
snp_data (np.ndarray) – 2D array of genotype data.
samples (List[str]) – List of sample IDs.
sample_indices (np.ndarray) – Boolean array of sample indices.
loci_indices (np.ndarray) – Boolean array of locus indices.
- Return type:
None
Note
This method is used to set the alignment data and sample IDs after filtering.
- The method updates the following attributes:
snp_data
samples
populations
popmap
popmap_inverse
sample_indices
loci_indices
num_inds
num_snps
prefix
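To make the role of the two boolean arrays concrete, here is a minimal sketch of how row and column masks select a filtered alignment. It uses plain lists instead of NumPy arrays, and `apply_masks` is a hypothetical helper, not part of the library.

```python
def apply_masks(snp_data, samples, sample_mask, loci_mask):
    """Keep rows where sample_mask is True and columns where loci_mask is True."""
    kept_samples = [s for s, keep in zip(samples, sample_mask) if keep]
    kept_rows = [row for row, keep in zip(snp_data, sample_mask) if keep]
    filtered = [[g for g, keep in zip(row, loci_mask) if keep] for row in kept_rows]
    return filtered, kept_samples


snp_data = [["A", "C", "G", "T"], ["A", "N", "G", "T"], ["R", "C", "G", "Y"]]
samples = ["sample1", "sample2", "sample3"]
sample_mask = [True, False, True]     # drop sample2
loci_mask = [True, True, False, True]  # drop the third locus

filtered, kept = apply_masks(snp_data, samples, sample_mask, loci_mask)
num_inds, num_snps = len(filtered), len(filtered[0])
print(filtered)            # [['A', 'C', 'T'], ['R', 'C', 'Y']]
print(num_inds, num_snps)  # 2 3
```

With NumPy arrays the same selection would typically be written with boolean indexing on rows and then on columns.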
- property snp_data: ndarray
Get the genotypes as a 2D array of shape (n_samples, n_loci).
- Returns:
2D array of IUPAC encoded genotype data.
- Return type:
np.ndarray
- Raises:
TypeError – If the snp_data attribute is not a numpy.ndarray, pandas.DataFrame, or list.
- property snpsdict: Dict[str, List[str]]
Dictionary with Sample IDs as keys and lists of genotypes as values.
- Returns:
Dictionary with sample IDs as keys and lists of genotypes as values.
- Return type:
Dict[str, List[str]]
- subset_with_popmap(my_popmap, samples, force, include_pops=None, exclude_pops=None, return_indices=False)[source]
Subset popmap and samples based on population criteria.
- Parameters:
my_popmap (ReadPopmap) – ReadPopmap instance.
samples (List[str]) – List of sample IDs.
force (bool) – If True, force the subsetting. If False, raise an error if the samples don’t align.
include_pops (Optional[List[str]]) – List of populations to include. If provided, only samples belonging to these populations will be retained.
exclude_pops (Optional[List[str]]) – List of populations to exclude. If provided, samples belonging to these populations will be excluded.
return_indices (bool, optional) – If True, return the indices for samples. Defaults to False.
- Returns:
Boolean array of sample_indices if return_indices is True. Otherwise, None.
- Return type:
Optional[np.ndarray]
- write_popmap(filename)[source]
Write the population map to a file.
- Parameters:
filename (str) – Output file path.
- Raises:
AttributeError – If the samples attribute is not defined.
AttributeError – If the popmap attribute is not defined.
- Return type:
None
snpio.read_input.popmap_file module
- class snpio.read_input.popmap_file.ReadPopmap(filename, logger, verbose=False)[source]
Bases:
object
Class to read and parse a population map file.
The population map file should contain two comma- or whitespace-delimited columns, one holding the SampleIDs and the other the associated populationIDs. If the file has no header line, the column order must be sampleIDs and then populationIDs.
Alternatively, the header line should contain exactly one of the accepted sampleID column names (‘sampleid’ or ‘sampleids’) and exactly one of the accepted populationID column names (‘populationid’, ‘populationids’, ‘popid’, or ‘popids’).
The population map file should not contain any duplicate SampleIDs.
Example
Example population map file format:
sampleID,populationID
Sample1,Population1
Sample2,Population1
Sample3,Population2
Sample4,Population2
>>> from snpio.read_input.popmap_file import ReadPopmap
>>> pm = ReadPopmap("popmap.txt", logger, verbose=True)
>>> pm.get_pop_counts(genotype_data)
>>> pm.validate_popmap(samples, force=True)
>>> pm.subset_popmap(samples, include=["Population1"], exclude=None)
>>> pm.write_popmap("subset_popmap.txt")
>>> print(pm.popmap)
{'Sample1': 'Population1', 'Sample2': 'Population1'}
>>> print(pm.popmap_flipped)
{'Population1': ['Sample1', 'Sample2']}
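The file-format rules above (two columns, comma or whitespace delimiter, optional header, no duplicate SampleIDs) can be sketched as a small stand-alone parser. This is only an illustration under those stated rules: `parse_popmap` is a hypothetical function, and the library's own delimiter and header inference may work differently.

```python
import io

# Accepted header names, compared case-insensitively (as listed above).
SAMPLE_COLS = {"sampleid", "sampleids"}
POP_COLS = {"populationid", "populationids", "popid", "popids"}


def parse_popmap(handle):
    """Hypothetical parser: return {SampleID: populationID} from a text handle."""
    rows = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        # Comma-delimited if a comma is present, otherwise split on whitespace.
        fields = [f.strip() for f in line.split(",")] if "," in line else line.split()
        if len(fields) != 2:
            raise ValueError(f"expected two columns, got: {line!r}")
        rows.append(fields)
    if not rows:
        raise ValueError("population map is empty")

    sample_col = 0  # default column order when there is no header
    header = [f.lower() for f in rows[0]]
    if set(header) & (SAMPLE_COLS | POP_COLS):
        if len(set(header) & SAMPLE_COLS) != 1 or len(set(header) & POP_COLS) != 1:
            raise ValueError("header must name one sample and one population column")
        sample_col = 0 if header[0] in SAMPLE_COLS else 1
        rows = rows[1:]
    pop_col = 1 - sample_col

    popmap = {}
    for fields in rows:
        sample, pop = fields[sample_col], fields[pop_col]
        if sample in popmap:
            raise ValueError(f"duplicate SampleID: {sample}")
        popmap[sample] = pop
    return popmap


text = "sampleID,populationID\nSample1,Population1\nSample2,Population1\nSample3,Population2\n"
popmap = parse_popmap(io.StringIO(text))
print(popmap)
# {'Sample1': 'Population1', 'Sample2': 'Population1', 'Sample3': 'Population2'}
```

A header-less, whitespace-delimited file goes through the same function, with the first column taken as the SampleID.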
- filename
Filename for the population map.
- Type:
str
- verbose
Verbosity setting (True or False). If True, enables verbose output. If False, suppresses verbose output.
- Type:
bool
- _popdict
Dictionary with SampleIDs as keys and the corresponding population ID as values.
- Type:
Dict[str, str]
- _sample_indices
Boolean array representing the subset samples.
- Type:
np.ndarray
- logger
Logger object.
- Type:
logging.Logger
- read_popmap()[source]
Read a population map file from disk into a dictionary object.
- write_popmap()[source]
Write the population map dictionary to a file.
- get_pop_counts()[source]
Print out unique population IDs and their counts.
- validate_popmap()[source]
Validate that all alignment sample IDs are present in the population map.
- subset_popmap()[source]
Subset the population map based on inclusion and exclusion criteria.
- _infer_delimiter()[source]
Infer the delimiter of a given file.
- _infer_header()[source]
Infer whether the file has a header.
- _is_numeric()[source]
Check if a string can be converted to a float.
- _validate_pop_subset_lists()[source]
Validates the elements in the given list to ensure they are all of type str.
- _flip_dictionary()[source]
Flip the keys and values of a dictionary.
- get_pop_counts(genotype_data)[source]
Print out unique population IDs and their counts.
Prints the unique population IDs along with their respective counts. It also generates a plot of the population counts.
- Parameters:
genotype_data (GenotypeData) – GenotypeData object containing the alignment data.
- Return type:
None
- property popmap: Dict[str, str]
Get the population dictionary.
- Returns:
Dictionary with SampleIDs as keys and the corresponding population ID as values.
- Return type:
Dict[str, str]
- property popmap_flipped: Dict[str, List[str]]
Associate unique populations with lists of SampleIDs.
- Returns:
Dictionary with unique populations as keys and lists of associated SampleIDs as values.
- Return type:
Dict[str, List[str]]
- read_popmap()[source]
Read a population map file from disk into a dictionary object.
The dictionary will have SampleIDs as keys and the associated population IDs as values. The population map file should contain two comma- or whitespace-delimited columns, one holding the SampleIDs and the other the associated populationIDs. If the file has no header line, the column order must be sampleIDs and then populationIDs. Alternatively, the header line should contain exactly one of the accepted sampleID column names (‘sampleid’ or ‘sampleids’) and exactly one of the accepted populationID column names (‘populationid’, ‘populationids’, ‘popid’, or ‘popids’). The population map file should not contain any duplicate SampleIDs.
- Raises:
FileNotFoundError – Raises an exception if the population map file is not found on disk.
ValueError – Raises an exception if the population map file is empty or if the data cannot be correctly loaded from the file.
AssertionError – Raises an exception if the population map file is empty or if the data cannot be correctly loaded from the file.
- Return type:
None
Note
This method will be executed upon initialization of the ReadPopmap object.
- property sample_indices: ndarray
Get the indices of the subset samples from the population map as a boolean array.
- Returns:
Boolean array representing the subset samples.
- Return type:
np.ndarray
- subset_popmap(samples, include, exclude)[source]
Subset the population map based on inclusion and exclusion criteria.
Subsets the population map by including only the specified populations (include) and excluding the specified populations (exclude).
- Parameters:
samples (List[str]) – List of samples from alignment.
include (List[str] or None) – List of populations to include in the subset of the population map.
exclude (List[str] or None) – List of populations to exclude from the subset of the population map.
- Raises:
ValueError – Raises an exception if populations are present in both include and exclude lists.
TypeError – Raises an exception if include or exclude arguments are not lists.
ValueError – Raises an exception if the population map is empty after subsetting.
- Return type:
None
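The inclusion and exclusion logic, together with the three documented error cases, might look like the following sketch. It operates on a plain dictionary and returns the subset rather than updating an object in place; the stand-alone function shown here is an assumption for illustration, not the method's actual code.

```python
def subset_popmap(popmap, include=None, exclude=None):
    """Return the popmap entries that pass the include/exclude filters."""
    for name, pops in (("include", include), ("exclude", exclude)):
        if pops is not None and not isinstance(pops, list):
            raise TypeError(f"{name} must be a list or None")
    if include and exclude and set(include) & set(exclude):
        raise ValueError("populations present in both include and exclude")

    subset = {
        sample: pop
        for sample, pop in popmap.items()
        if (include is None or pop in include) and (exclude is None or pop not in exclude)
    }
    if not subset:
        raise ValueError("population map is empty after subsetting")
    return subset


popmap = {
    "Sample1": "Population1",
    "Sample2": "Population1",
    "Sample3": "Population2",
    "Sample4": "Population2",
}
subset = subset_popmap(popmap, include=["Population1"])
print(subset)  # {'Sample1': 'Population1', 'Sample2': 'Population1'}
```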
- validate_popmap(samples, force=False)[source]
Validate that all alignment sample IDs are present in the population map.
- Parameters:
samples (List[str]) – List of SampleIDs present in the alignment, to be validated against the population map.
force (bool, optional) – If True, return a subset dictionary without the keys that weren’t found. If False, return a boolean indicating whether all keys were found. Defaults to False.
- Returns:
If force is False, returns True if all alignment samples are present in the population map and all population map samples are present in the alignment. Returns False otherwise. If force is True, returns a subset of the population map containing only the samples present in the alignment.
- Return type:
Union[bool, Dict[str, str]]
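The two return modes described above can be sketched as follows. The function below is a hypothetical stand-alone version that takes the popmap as an explicit argument; it follows the documented behavior but is not the library's code.

```python
def validate_popmap(popmap, samples, force=False):
    """Compare popmap SampleIDs with alignment SampleIDs.

    force=False: True only if both sets of SampleIDs are identical.
    force=True: the popmap restricted to samples found in the alignment.
    """
    if force:
        return {s: popmap[s] for s in samples if s in popmap}
    return set(popmap) == set(samples)


popmap = {"Sample1": "Population1", "Sample2": "Population1", "Sample3": "Population2"}
alignment_samples = ["Sample1", "Sample3"]

strict = validate_popmap(popmap, alignment_samples)
forced = validate_popmap(popmap, alignment_samples, force=True)
print(strict)  # False
print(forced)  # {'Sample1': 'Population1', 'Sample3': 'Population2'}
```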
- write_popmap(output_file)[source]
Write the population map dictionary to a file.
Writes the population map dictionary, where SampleIDs are keys and the associated population ID are values, to the specified output file.
- Parameters:
output_file (str) – The filename of the output file to write the population map.
- Return type:
None