snpio.read_input package

Submodules

snpio.read_input.genotype_data module

class snpio.read_input.genotype_data.GenotypeData(filename=None, filetype=None, popmapfile=None, force_popmap=False, exclude_pops=None, include_pops=None, plot_format='png', plot_fontsize=18, plot_dpi=300, plot_despine=True, show_plots=False, prefix='snpio', verbose=False, loci_indices=None, sample_indices=None, chunk_size=1000, logger=None, debug=False)[source]

Bases: BaseGenotypeData

A class for handling and analyzing genotype data.

The GenotypeData class is intended as a parent class for the file reader classes, such as VCFReader, StructureReader, and PhylipReader. It provides common methods and attributes for handling genotype data, such as reading population maps, subsetting data, and generating missingness reports.

Note

GenotypeData handles the following characters as missing data:
  • ‘N’

  • ‘-’

  • ‘?’

  • ‘.’

If using PHYLIP or STRUCTURE formats, all sites will be forced to be biallelic. If multiple alleles are needed, you must use a VCF file.
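
As a rough illustration of the note above, the sketch below treats the four listed characters as missing data and counts them in one sample's genotype row. The names `MISSING_CHARS` and `count_missing` are hypothetical and are not part of snpio.

```python
# Hypothetical helper, not part of snpio: the four characters listed above.
MISSING_CHARS = frozenset({"N", "-", "?", "."})


def count_missing(genotypes):
    """Return how many genotype calls in a row are missing-data characters."""
    return sum(1 for call in genotypes if call in MISSING_CHARS)


row = ["A", "N", "G", "-", "T", "?", "."]
n_missing = count_missing(row)
# Report the count and the proportion of missing calls for this row.
print(n_missing, n_missing / len(row))
```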

inputs

GenotypeData keyword arguments as a dictionary.

Type:

dict

num_snps

Number of SNPs in the dataset.

Type:

int

num_inds

Number of individuals in the dataset.

Type:

int

populations

Population IDs.

Type:

List[Union[str, int]]

popmap

Dictionary object with SampleIDs as keys and popIDs as values.

Type:

dict

popmap_inverse

Inverse dictionary of popmap, where popIDs are keys and lists of sampleIDs are values.

Type:

dict or None

samples

Sample IDs in input order.

Type:

List[str]

snpsdict

Dictionary with SampleIDs as keys and lists of genotypes as values.

Type:

dict or None

snp_data

Genotype data as a 2D list.

Type:

List[List[str]]

loci_indices

Column indices for retained loci in filtered alignment.

Type:

List[int]

sample_indices

Row indices for retained samples in the alignment.

Type:

List[int]

ref

List of reference alleles of length num_snps.

Type:

List[str]

alt

List of alternate alleles of length num_snps.

Type:

List[str]

iupac_mapping

Mapping of allele tuples to IUPAC codes.

Type:

dict

reverse_iupac_mapping

Mapping of IUPAC codes to allele tuples.

Type:

dict

missing_vals

List of missing value characters.

Type:

List[str]

replace_vals

List of missing value replacements.

Type:

List[pd.NA]

logger

Logger object.

Type:

logging.Logger

debug

If True, display debug messages.

Type:

bool

plot_kwargs

Plotting keyword arguments.

Type:

dict

supported_filetypes

List of supported filetypes.

Type:

List[str]

kwargs

GenotypeData keyword arguments.

Type:

dict

chunk_size

Chunk size for reading in large files.

Type:

int

plot_format

Format to save report plots.

Type:

str

plot_fontsize

Font size for plots.

Type:

int

plot_dpi

Resolution in dots per inch for plots.

Type:

int

plot_despine

If True, remove the top and right spines from plots.

Type:

bool

show_plots

If True, display plots in the console.

Type:

bool

prefix

Prefix to use for output directory.

Type:

str

verbose

If True, display verbose output.

Type:

bool

read_popmap()[source]

Read population map from file to map samples to populations.

subset_with_popmap()[source]

Subset popmap and samples based on population criteria.

write_popmap()[source]

Write the population map to a file.

missingness_reports()[source]

Generate missingness reports and plots.

_make_snpsdict()[source]

Make a dictionary with SampleIDs as keys and a list of SNPs associated with the sample as the values.

_genotype_to_iupac()[source]

Convert a genotype string to its corresponding IUPAC code.

_iupac_to_genotype()[source]

Convert an IUPAC code to its corresponding genotype string.

calc_missing()[source]

Calculate missing value statistics based on a DataFrame.

copy()[source]

Create a deep copy of the GenotypeData object.

_report2file()[source]

Write a DataFrame to a CSV file.

get_reverse_iupac_mapping()[source]

Create a reverse mapping from IUPAC codes to allele tuples.

Example

>>> gd = GenotypeData(filename="data.vcf", filetype="vcf", popmapfile="popmap.txt")
>>> print(gd.snp_data)
[['A', 'C', 'G', 'T'], ['A', 'C', 'G', 'T'], ['A', 'C', 'G', 'T']]
>>> print(gd.num_snps)
4
>>> print(gd.num_inds)
3
>>> print(gd.populations)
['pop1', 'pop2', 'pop2']
>>> print(gd.popmap)
{'sample1': 'pop1', 'sample2': 'pop2', 'sample3': 'pop2'}
>>> print(gd.samples)
['sample1', 'sample2', 'sample3']

property alt: List[str]

Get list of alternate alleles of length num_snps.

Returns:

List of alternate alleles of length num_snps.

Return type:

List[str]

calc_missing(df, use_pops=True)[source]

Calculate missing value statistics based on a DataFrame.

Parameters:
  • df (pd.DataFrame) – Input DataFrame containing genotype data.

  • use_pops (bool, optional) – If True, calculate statistics per population. Defaults to True.

Returns:

A tuple of missing value statistics:

  • loc (pd.Series): Missing value proportions per locus.

  • ind (pd.Series): Missing value proportions per individual.

  • poploc (Optional[pd.DataFrame]): Missing value proportions per population and locus. Only returned if use_pops=True.

  • poptot (Optional[pd.Series]): Missing value proportions per population. Only returned if use_pops=True.

  • indpop (Optional[pd.DataFrame]): Missing value proportions per individual and population. Only returned if use_pops=True.

Return type:

Tuple[pd.Series, pd.Series, Optional[pd.DataFrame], Optional[pd.Series], Optional[pd.DataFrame]]
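
To make the returned quantities more concrete, here is a minimal pure-Python sketch of per-locus, per-individual, and per-population missing proportions. It is not the library's implementation (which works on a pandas DataFrame); the function name `missing_stats` and the toy data are assumptions made for illustration.

```python
MISSING = {"N", "-", "?", "."}  # characters documented as missing data


def missing_stats(snp_rows, samples, popmap=None):
    """Return (per_locus, per_individual, per_population) missing proportions."""
    n_inds = len(snp_rows)
    n_loci = len(snp_rows[0])
    # Boolean table: True where a genotype call is missing.
    is_missing = [[call in MISSING for call in row] for row in snp_rows]

    per_locus = [sum(row[j] for row in is_missing) / n_inds for j in range(n_loci)]
    per_individual = {s: sum(row) / n_loci for s, row in zip(samples, is_missing)}

    per_population = None
    if popmap is not None:
        totals = {}  # population ID -> [missing calls, total calls]
        for s, row in zip(samples, is_missing):
            counts = totals.setdefault(popmap[s], [0, 0])
            counts[0] += sum(row)
            counts[1] += n_loci
        per_population = {pop: m / t for pop, (m, t) in totals.items()}
    return per_locus, per_individual, per_population


rows = [["A", "N", "G", "T"], ["A", "C", "-", "T"], ["N", "N", "G", "T"]]
samples = ["sample1", "sample2", "sample3"]
popmap = {"sample1": "pop1", "sample2": "pop2", "sample3": "pop2"}
loc, ind, poptot = missing_stats(rows, samples, popmap)
print(loc, ind, poptot)
```

Passing `popmap=None` in this sketch mirrors the documented `use_pops=False` case in spirit: the population-level result is simply `None`.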

copy()[source]

Create a deep copy of the GenotypeData or VCFReader object.

Returns:

A new object with the same attributes as the original.

Return type:

GenotypeData or VCFReader

get_reverse_iupac_mapping()[source]

Creates a reverse mapping from IUPAC codes to allele tuples.

Returns:

Mapping of IUPAC codes to allele tuples

Return type:

Dict[str, Tuple[str, str]]
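
The idea of reversing the mapping can be sketched as follows. The table below uses the standard IUPAC nucleotide ambiguity codes for unordered allele pairs; it is written out for illustration only, and the actual contents of snpio's iupac_mapping attribute may differ. `IUPAC_MAPPING` and `reverse_mapping` are hypothetical names.

```python
from typing import Dict, Tuple

# Standard IUPAC codes for homozygous and heterozygous allele pairs.
# Illustrative table; snpio's own mapping may contain other entries.
IUPAC_MAPPING: Dict[Tuple[str, str], str] = {
    ("A", "A"): "A",
    ("C", "C"): "C",
    ("G", "G"): "G",
    ("T", "T"): "T",
    ("A", "G"): "R",
    ("C", "T"): "Y",
    ("C", "G"): "S",
    ("A", "T"): "W",
    ("G", "T"): "K",
    ("A", "C"): "M",
}


def reverse_mapping(mapping: Dict[Tuple[str, str], str]) -> Dict[str, Tuple[str, str]]:
    """Swap keys and values so that IUPAC codes map back to allele tuples."""
    return {code: alleles for alleles, code in mapping.items()}


reverse = reverse_mapping(IUPAC_MAPPING)
print(reverse["R"])
```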

property inputs: Dict[str, Any]

Get GenotypeData keyword arguments as a dictionary.

Returns:

GenotypeData keyword arguments as a dictionary.

Return type:

Dict[str, Any]

property loci_indices: ndarray

Boolean array for retained loci in filtered alignment.

Returns:

Boolean array of loci indices, with True for retained loci and False for excluded loci.

Return type:

np.ndarray

Raises:
  • TypeError – If the loci_indices attribute is not a numpy.ndarray or list.

  • TypeError – If the loci_indices attribute does not have dtype bool.
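
A boolean mask of this kind selects columns (loci) of the alignment, and the same idea applies to rows (samples). The sketch below uses plain lists and `itertools.compress` rather than NumPy arrays; `apply_masks` is a hypothetical helper, not an snpio function.

```python
from itertools import compress


def apply_masks(snp_rows, sample_mask, loci_mask):
    """Keep rows where sample_mask is True and columns where loci_mask is True."""
    kept_rows = compress(snp_rows, sample_mask)
    return [list(compress(row, loci_mask)) for row in kept_rows]


alignment = [
    ["A", "C", "G", "T"],
    ["A", "N", "G", "T"],
    ["R", "C", "G", "Y"],
]
sample_mask = [True, False, True]      # drop the second sample
loci_mask = [True, True, False, True]  # drop the third locus
filtered = apply_masks(alignment, sample_mask, loci_mask)
print(filtered)
```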

missingness_reports(prefix=None, zoom=True, bar_color='gray', heatmap_palette='magma')[source]

Generate missingness reports and plots.

The function will write several comma-delimited report files:

  1. individual_missingness.csv: Missing proportions per individual.

  2. locus_missingness.csv: Missing proportions per locus.

  3. population_missingness.csv: Missing proportions per population (only generated if popmapfile was passed to GenotypeData).

  4. population_locus_missingness.csv: Table of per-population and per-locus missing data proportions.

A file missingness.<plot_format> will also be saved. It contains the following subplots:

  1. Barplot with per-individual missing data proportions.

  2. Barplot with per-locus missing data proportions.

  3. Barplot with per-population missing data proportions (only if popmapfile was passed to GenotypeData).

  4. Heatmap showing per-population + per-locus missing data proportions (only if popmapfile was passed to GenotypeData).

  5. Stacked barplot showing missing data proportions per individual.

  6. Stacked barplot showing missing data proportions per population (only if popmapfile was passed to GenotypeData).

If popmapfile was not passed to GenotypeData, then the subplots and report files that require populations are not included.

Parameters:
  • prefix (str, optional) – Output file prefix for the missingness report. Defaults to None.

  • zoom (bool, optional) – If True, zoom in to the missing proportion range on some of the plots. If False, the plot range is fixed at [0, 1]. Defaults to True.

  • bar_color (str, optional) – Color of the bars on the non-stacked bar plots. Can be any color supported by matplotlib. See the matplotlib.colors documentation. Defaults to ‘gray’.

  • heatmap_palette (str, optional) – Color palette for the heatmap plot. Defaults to ‘magma’.

Return type:

None

property num_inds: int

Number of individuals (samples) in dataset.

Returns:

Number of individuals (samples) in input data.

Return type:

int

property num_snps: int

Number of SNPs (loci) in the dataset.

Returns:

Number of SNPs (loci) per individual.

Return type:

int

property popmap: Dict[str, str]

Dictionary object with SampleIDs as keys and popIDs as values.

Returns:

Dictionary with SampleIDs as keys and popIDs as values

Return type:

Dict[str, str]

property popmap_inverse: Dict[str, List[str]]

Inverse popmap dictionary with populationIDs as keys and lists of sampleIDs as values.

Returns:

Inverse dictionary of popmap, where popIDs are keys and lists of sampleIDs are values.

Return type:

Dict[str, List[str]]
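
The relationship between popmap and popmap_inverse can be sketched in a few lines. This is an illustration of the data structure only; `invert_popmap` is a hypothetical name and not the library's code.

```python
from collections import defaultdict


def invert_popmap(popmap):
    """Group sample IDs by population ID, preserving input order."""
    inverse = defaultdict(list)
    for sample_id, pop_id in popmap.items():
        inverse[pop_id].append(sample_id)
    return dict(inverse)


popmap = {"sample1": "pop1", "sample2": "pop2", "sample3": "pop2"}
inverse = invert_popmap(popmap)
print(inverse)
```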

property populations: List[str | int]

Population IDs as a list of strings or integers.

Returns:

Population IDs.

Return type:

List[Union[str, int]]

read_popmap()[source]

Read population map from file to map samples to populations.

Makes use of the ReadPopmap class to read in the popmap file and validate the samples against the alignment.

Return type:

None

Sets the following attributes:
  • samples

  • populations

  • popmap

  • popmap_inverse

  • sample_indices

property ref: List[str]

Get list of reference alleles of length num_snps.

Returns:

List of reference alleles of length num_snps.

Return type:

List[str]

property sample_indices: ndarray

Row indices for retained samples in the alignment.

Returns:

Boolean array of sample indices, with True for retained samples and False for excluded samples.

Return type:

np.ndarray

Raises:
  • TypeError – If the sample_indices attribute is not a numpy.ndarray or list.

  • TypeError – If the sample_indices attribute does not have dtype bool.

property samples: List[str]

List of sample IDs in input order.

Returns:

Sample IDs in input order.

Return type:

List[str]

set_alignment(snp_data, samples, sample_indices, loci_indices)[source]

Set the alignment data and sample IDs.

Parameters:
  • snp_data (np.ndarray) – 2D array of genotype data.

  • samples (List[str]) – List of sample IDs.

  • sample_indices (np.ndarray) – Boolean array of sample indices.

  • loci_indices (np.ndarray) – Boolean array of locus indices.

Return type:

None

Note

This method is used to set the alignment data and sample IDs after filtering.

The method updates the following attributes:
  • snp_data

  • samples

  • populations

  • popmap

  • popmap_inverse

  • sample_indices

  • loci_indices

  • num_inds

  • num_snps

  • prefix

property snp_data: ndarray

Get the genotypes as a 2D array of shape (n_samples, n_loci).

Returns:

2D array of IUPAC encoded genotype data.

Return type:

np.ndarray

Raises:

TypeError – If the snp_data attribute is not a numpy.ndarray, pandas.DataFrame, or list.

property snpsdict: Dict[str, List[str]]

Dictionary with Sample IDs as keys and lists of genotypes as values.

Returns:

Dictionary with sample IDs as keys and lists of genotypes as values.

Return type:

Dict[str, List[str]]

subset_with_popmap(my_popmap, samples, force, include_pops=None, exclude_pops=None, return_indices=False)[source]

Subset popmap and samples based on population criteria.

Parameters:
  • my_popmap (ReadPopmap) – ReadPopmap instance.

  • samples (List[str]) – List of sample IDs.

  • force (bool) – If True, force the subsetting. If False, raise an error if the samples don’t align.

  • include_pops (Optional[List[str]]) – List of populations to include. If provided, only samples belonging to these populations will be retained.

  • exclude_pops (Optional[List[str]]) – List of populations to exclude. If provided, samples belonging to these populations will be excluded.

  • return_indices (bool, optional) – If True, return the indices for samples. Defaults to False.

Returns:

Boolean array of sample_indices if return_indices is True. Otherwise, None.

Return type:

Optional[np.ndarray]
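
One way to picture the boolean array returned when return_indices is True is the sketch below. It builds a per-sample mask from include and exclude lists using plain Python lists; `sample_mask` is a hypothetical helper, and the treatment of samples absent from the popmap (they are dropped) is an assumption.

```python
def sample_mask(samples, popmap, include_pops=None, exclude_pops=None):
    """Return one boolean per sample: True if the sample is retained."""
    include = set(include_pops) if include_pops is not None else None
    exclude = set(exclude_pops or [])
    mask = []
    for sample_id in samples:
        pop_id = popmap.get(sample_id)
        keep = (
            pop_id is not None  # assumption: unmapped samples are dropped
            and (include is None or pop_id in include)
            and pop_id not in exclude
        )
        mask.append(keep)
    return mask


samples = ["sample1", "sample2", "sample3", "sample4"]
popmap = {"sample1": "pop1", "sample2": "pop2", "sample3": "pop2"}
mask = sample_mask(samples, popmap, exclude_pops=["pop1"])
print(mask)
```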

write_popmap(filename)[source]

Write the population map to a file.

Parameters:

filename (str) – Output file path.

Raises:
  • AttributeError – If the samples attribute is not defined.

  • AttributeError – If the popmap attribute is not defined.

Return type:

None

snpio.read_input.popmap_file module

class snpio.read_input.popmap_file.ReadPopmap(filename, logger, verbose=False)[source]

Bases: object

Class to read and parse a population map file.

The population map file should contain two comma- or whitespace-delimited columns, one holding the SampleIDs and the other the associated populationIDs. The header line is optional. If there is no header line, the column order must be sampleIDs and then populationIDs.

If there is a header line, it must contain exactly one of the accepted sampleID column names (‘sampleid’ or ‘sampleids’) and exactly one of the accepted populationID column names (‘populationid’, ‘populationids’, ‘popid’, or ‘popids’).

The population map file should not contain any duplicate SampleIDs.

Example

Example population map file format:

    sampleID,populationID
    Sample1,Population1
    Sample2,Population1
    Sample3,Population2
    Sample4,Population2

>>> from snpio.read_input.popmap_file import ReadPopmap
>>> pm = ReadPopmap("popmap.txt", logger, verbose=True)
>>> pm.get_pop_counts(genotype_data)
>>> pm.validate_popmap(samples, force=True)
>>> pm.subset_popmap(samples, include=["Population1"], exclude=None)
>>> pm.write_popmap("subset_popmap.txt")
>>> print(pm.popmap)
{'Sample1': 'Population1', 'Sample2': 'Population1'}
>>> print(pm.popmap_flipped)
{'Population1': ['Sample1', 'Sample2']}

filename

Filename for the population map.

Type:

str

verbose

Verbosity setting (True or False). If True, enables verbose output. If False, suppresses verbose output.

Type:

bool

_popdict

Dictionary with SampleIDs as keys and the corresponding population ID as values.

Type:

Dict[str, str]

_sample_indices

Boolean array representing the subset samples.

Type:

np.ndarray

logger

Logger object.

Type:

logging.Logger

read_popmap()[source]

Read a population map file from disk into a dictionary object.

write_popmap()[source]

Write the population map dictionary to a file.

get_pop_counts()[source]

Print out unique population IDs and their counts.

validate_popmap()[source]

Validate that all alignment sample IDs are present in the population map.

subset_popmap()[source]

Subset the population map based on inclusion and exclusion criteria.

_infer_delimiter()[source]

Infer the delimiter of a given file.

_infer_header()[source]

Infer whether the file has a header.

_is_numeric()[source]

Check if a string can be converted to a float.

_validate_pop_subset_lists()[source]

Validates the elements in the given list to ensure they are all of type str.

_flip_dictionary()[source]

Flip the keys and values of a dictionary.

get_pop_counts(genotype_data)[source]

Print out unique population IDs and their counts.

Prints the unique population IDs along with their respective counts. It also generates a plot of the population counts.

Parameters:

genotype_data (GenotypeData) – GenotypeData object containing the alignment data.

Return type:

None
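
The counting step can be approximated with `collections.Counter`, as in the sketch below. It only covers the printed counts, not the plot, and it is not the library's implementation.

```python
from collections import Counter

popmap = {
    "Sample1": "Population1",
    "Sample2": "Population1",
    "Sample3": "Population2",
    "Sample4": "Population2",
}

# Count how many samples belong to each population ID.
counts = Counter(popmap.values())
for pop_id, n_samples in sorted(counts.items()):
    print(f"{pop_id}: {n_samples}")
```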

property popmap: Dict[str, str]

Get the population dictionary.

Returns:

Dictionary with SampleIDs as keys and the corresponding population ID as values.

Return type:

Dict[str, str]

property popmap_flipped: Dict[str, List[str]]

Associate unique populations with lists of SampleIDs.

Returns:

Dictionary with unique populations as keys and lists of associated SampleIDs as values.

Return type:

Dict[str, List[str]]

read_popmap()[source]

Read a population map file from disk into a dictionary object.

The dictionary will have SampleIDs as keys and the associated population ID as the values. The population map file should contain two comma- or whitespace-delimited columns, one holding the SampleIDs and the other the associated populationIDs. The header line is optional. If there is no header line, the column order must be sampleIDs and then populationIDs. If there is a header line, it must contain exactly one of the accepted sampleID column names (‘sampleid’ or ‘sampleids’) and exactly one of the accepted populationID column names (‘populationid’, ‘populationids’, ‘popid’, or ‘popids’). The population map file should not contain any duplicate SampleIDs.

Raises:
  • FileNotFoundError – Raises an exception if the population map file is not found on disk.

  • ValueError – Raises an exception if the population map file is empty or if the data cannot be correctly loaded from the file.

  • AssertionError – Raises an exception if the population map file is empty or if the data cannot be correctly loaded from the file.

Return type:

None
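
The file-format rules described for this method can be sketched as a small stand-alone parser. This is a simplified illustration, not the library's code: `parse_popmap` is a hypothetical function, it reads from a string rather than a file, and it assumes header names are matched case-insensitively.

```python
import re

SAMPLE_HEADERS = {"sampleid", "sampleids"}
POP_HEADERS = {"populationid", "populationids", "popid", "popids"}


def parse_popmap(text):
    """Parse popmap text into a {sampleID: populationID} dictionary."""
    # Split each non-empty line on commas and/or whitespace.
    rows = [re.split(r"[,\s]+", line.strip()) for line in text.splitlines() if line.strip()]
    if not rows:
        raise ValueError("Population map is empty.")
    if any(len(row) != 2 for row in rows):
        raise ValueError("Each line must contain exactly two columns.")

    sample_col, pop_col = 0, 1  # default order when there is no header
    header = [field.lower() for field in rows[0]]  # assumption: case-insensitive
    if set(header) & (SAMPLE_HEADERS | POP_HEADERS):
        sample_hits = [i for i, name in enumerate(header) if name in SAMPLE_HEADERS]
        pop_hits = [i for i, name in enumerate(header) if name in POP_HEADERS]
        if len(sample_hits) != 1 or len(pop_hits) != 1:
            raise ValueError("Header needs one sampleID and one populationID column.")
        sample_col, pop_col = sample_hits[0], pop_hits[0]
        rows = rows[1:]

    popmap = {}
    for row in rows:
        sample_id = row[sample_col]
        if sample_id in popmap:
            raise ValueError(f"Duplicate SampleID: {sample_id}")
        popmap[sample_id] = row[pop_col]
    if not popmap:
        raise ValueError("Population map contains no samples.")
    return popmap


text = "sampleID,populationID\nSample1,Population1\nSample2,Population1\n"
print(parse_popmap(text))
```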

Note

This method will be executed upon initialization of the ReadPopmap object.

property sample_indices: ndarray

Get the indices of the subset samples from the population map as a boolean array.

Returns:

Boolean array representing the subset samples.

Return type:

np.ndarray

subset_popmap(samples, include, exclude)[source]

Subset the population map based on inclusion and exclusion criteria.

Subsets the population map by including only the specified populations (include) and excluding the specified populations (exclude).

Parameters:
  • samples (List[str]) – List of samples from alignment.

  • include (List[str] or None) – List of populations to include in the subset of the population map.

  • exclude (List[str] or None) – List of populations to exclude from the subset of the population map.

Raises:
  • ValueError – Raises an exception if populations are present in both include and exclude lists.

  • TypeError – Raises an exception if include or exclude arguments are not lists.

  • ValueError – Raises an exception if the population map is empty after subsetting.

Return type:

None
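
The documented checks and the include/exclude logic can be sketched as a pure function. Unlike the method described above, which returns None, this hypothetical `subset_by_population` helper returns the subset dictionary so the result is easy to inspect; it is an illustration under that assumption, not the library's code.

```python
def subset_by_population(popmap, include=None, exclude=None):
    """Return the entries of popmap that pass the include/exclude filters."""
    for name, value in (("include", include), ("exclude", exclude)):
        if value is not None and not isinstance(value, list):
            raise TypeError(f"{name} must be a list or None.")

    include_set = set(include or [])
    exclude_set = set(exclude or [])
    overlap = include_set & exclude_set
    if overlap:
        raise ValueError(f"Populations in both include and exclude: {sorted(overlap)}")

    subset = {
        sample_id: pop_id
        for sample_id, pop_id in popmap.items()
        if (include is None or pop_id in include_set) and pop_id not in exclude_set
    }
    if not subset:
        raise ValueError("Population map is empty after subsetting.")
    return subset


popmap = {
    "Sample1": "Population1",
    "Sample2": "Population1",
    "Sample3": "Population2",
}
print(subset_by_population(popmap, include=["Population1"]))
```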

validate_popmap(samples, force=False)[source]

Validate that all alignment sample IDs are present in the population map.

Parameters:
  • samples (List[str]) – List of SampleIDs present in the alignment, to be validated against the population map.

  • force (bool, optional) – If True, return a subset dictionary without the keys that weren’t found. If False, return a boolean indicating whether all keys were found. Defaults to False.

Returns:

If force is False, returns True if all alignment samples are present in the population map and all population map samples are present in the alignment. Returns False otherwise. If force is True, returns a subset of the population map containing only the samples present in the alignment.

Return type:

Union[bool, Dict[str, str]]
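
The two return modes described above can be sketched as follows. `validate` is a hypothetical stand-alone function that takes the popmap dictionary explicitly; it illustrates the documented behavior and is not the library's implementation.

```python
def validate(popmap, samples, force=False):
    """Compare alignment samples with popmap keys.

    With force=False, return True only if both sets of IDs match exactly.
    With force=True, return the popmap restricted to alignment samples.
    """
    sample_set = set(samples)
    if not force:
        return sample_set == set(popmap)
    return {s: p for s, p in popmap.items() if s in sample_set}


popmap = {"Sample1": "Population1", "Sample2": "Population1", "Sample3": "Population2"}
alignment_samples = ["Sample1", "Sample2"]

print(validate(popmap, alignment_samples))
print(validate(popmap, alignment_samples, force=True))
```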

write_popmap(output_file)[source]

Write the population map dictionary to a file.

Writes the population map dictionary, where SampleIDs are keys and the associated population ID are values, to the specified output file.

Parameters:

output_file (str) – The filename of the output file to write the population map.

Return type:

None

Module contents