snpio.filtering package

Submodules

snpio.filtering.nremover2 module

class snpio.filtering.nremover2.NRemover2(genotype_data)

Bases: object

A class for filtering alignments based on various criteria.

The class can filter out sequences (samples) and loci (columns) that exceed a missing data threshold (per-column and per-population) or that fall below a minor allele frequency or minor allele count threshold. It can also filter out monomorphic sites, singletons, and loci with more than two alleles. Finally, it can remove all but one linked locus, thin out loci within a specified distance of each other, and randomly subset loci in the SNP dataset. The class provides a flexible and extensible framework for filtering genetic data alignments based on user-defined criteria. It can be used to clean up SNP datasets, remove low-quality loci, and prepare data for downstream analyses.

Note

NRemover2 handles the following characters as missing data:
  • ‘N’

  • ‘-’

  • ‘?’

  • ‘.’

Thus, it treats gaps as missing data. Please keep this in mind when using NRemover2.
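
The effect of treating all four characters as missing can be sketched in plain Python. This is an illustration only, not SNPio's implementation: the function name missing_proportion_per_locus and the toy alignment are hypothetical.

```python
# Illustrative sketch only; SNPio's internal implementation may differ.
MISSING_CHARS = {"N", "-", "?", "."}  # the characters listed in the note


def missing_proportion_per_locus(alignment):
    """Return the fraction of missing calls in each column (locus).

    `alignment` is a list of rows (samples), each a list of single-character calls.
    """
    n_samples = len(alignment)
    # zip(*alignment) transposes samples x loci into loci x samples.
    return [
        sum(call in MISSING_CHARS for call in column) / n_samples
        for column in zip(*alignment)
    ]


# Hypothetical toy alignment: 4 samples (rows) by 3 loci (columns).
toy_alignment = [
    ["A", "N", "G"],
    ["A", "-", "G"],
    ["T", "C", "?"],
    ["A", "C", "."],
]
proportions = missing_proportion_per_locus(toy_alignment)
```

In this toy data the gap ('-') in the second locus counts toward missingness exactly as 'N' does, so a gap-rich alignment may lose more loci than expected.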

The class is designed to be used with the GenotypeData class, which contains the genetic data alignment, population map, and populations (if a popmap is provided).

The filtering methods are controlled by either a threshold or a boolean value. The boolean exclude_heterozygous parameter determines whether heterozygous genotypes are considered in the filtering logic. The following methods use it:

  • filter_monomorphic

  • filter_singletons

  • filter_biallelic

If exclude_heterozygous is set to True, the filtering methods will exclude heterozygous genotypes from the filtering logic. If set to False (default), heterozygous genotypes will be included in the filtering logic.
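
One plausible reading of exclude_heterozygous is sketched below for a monomorphic-site check. It assumes heterozygotes are encoded as two-base IUPAC ambiguity codes; the helper names alleles_at_locus and is_monomorphic are hypothetical, and SNPio's actual logic may differ.

```python
# Illustrative sketch only; assumes IUPAC-encoded heterozygotes.
MISSING_CHARS = {"N", "-", "?", "."}
# Standard IUPAC two-base ambiguity codes.
IUPAC_HETS = {"R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC"}


def alleles_at_locus(column, exclude_heterozygous=False):
    """Collect the distinct alleles observed in one column."""
    alleles = set()
    for call in column:
        if call in MISSING_CHARS:
            continue  # missing calls carry no allele information
        if call in IUPAC_HETS:
            if not exclude_heterozygous:
                alleles.update(IUPAC_HETS[call])  # both alleles of the heterozygote
            continue
        alleles.add(call)
    return alleles


def is_monomorphic(column, exclude_heterozygous=False):
    """A locus is monomorphic here if at most one allele is observed."""
    return len(alleles_at_locus(column, exclude_heterozygous)) <= 1


# Hypothetical column: three 'A' homozygotes-or-het plus one missing call.
column = ["A", "A", "R", "N"]
with_hets = is_monomorphic(column, exclude_heterozygous=False)
without_hets = is_monomorphic(column, exclude_heterozygous=True)
```

Under these assumptions, the single heterozygote ('R', A/G) is the only evidence of a second allele, so the locus counts as polymorphic when heterozygotes are included and as monomorphic when they are excluded.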

The class can be used to search for optimal filtering thresholds by plotting the proportion of missing data against the filtering thresholds. The search_thresholds() method can be used to search across various combinations of filtering thresholds and plot the results.
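
The idea of searching across combinations of thresholds can be sketched with itertools.product. The per-locus statistics and the function retained_per_combination are hypothetical stand-ins; this does not reproduce what search_thresholds() computes or plots.

```python
# Illustrative sketch only; not the search_thresholds() implementation.
from itertools import product

# Hypothetical per-locus statistics: (missing proportion, minor allele frequency).
locus_stats = [(0.05, 0.30), (0.20, 0.02), (0.45, 0.12), (0.60, 0.25)]


def retained_per_combination(stats, missing_thresholds, maf_thresholds):
    """Count the loci that pass each (max missing, min MAF) combination."""
    results = {}
    for max_missing, min_maf in product(missing_thresholds, maf_thresholds):
        results[(max_missing, min_maf)] = sum(
            missing <= max_missing and maf >= min_maf for missing, maf in stats
        )
    return results


grid = retained_per_combination(locus_stats, [0.25, 0.5], [0.01, 0.05])
```

Each key in grid is one threshold combination and each value is the number of loci retained, which is the kind of summary a threshold plot would be built from.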

The class can also be used to thin out loci within a specified distance of each other using the thin_loci method.
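
Distance-based thinning can be sketched as a greedy pass over sorted positions. The rule below (keep a locus only if it is farther than min_distance from the last kept locus on the same chromosome) is an assumption for illustration; thin_loci may use a different rule, and the names here are hypothetical.

```python
# Illustrative sketch only; thin_loci's actual rule may differ.
def thin_by_distance(loci, min_distance):
    """Return a keep-mask for `loci`, a list of (chrom, pos) sorted by chrom and pos."""
    keep = []
    last_kept = {}  # chromosome -> position of the most recently kept locus
    for chrom, pos in loci:
        previous = last_kept.get(chrom)
        if previous is None or pos - previous > min_distance:
            keep.append(True)
            last_kept[chrom] = pos
        else:
            keep.append(False)  # within min_distance of a kept locus
    return keep


# Hypothetical loci on two chromosomes.
loci = [("chr1", 100), ("chr1", 150), ("chr1", 400), ("chr2", 120), ("chr2", 130)]
mask = thin_by_distance(loci, min_distance=100)
```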

The class can be used to randomly subset the loci (columns) in the SNP dataset using the random_subset_loci method.
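
Random subsetting of columns amounts to sampling locus indices without replacement. A minimal sketch, assuming hypothetical size and seed parameters that are not taken from the random_subset_loci signature:

```python
# Illustrative sketch only; parameter names are hypothetical.
import random


def random_locus_subset(n_loci, size, seed=None):
    """Return `size` distinct locus indices in ascending order."""
    rng = random.Random(seed)  # a local generator keeps the draw reproducible
    return sorted(rng.sample(range(n_loci), size))


subset = random_locus_subset(n_loci=100, size=10, seed=42)
```

Sorting the sampled indices preserves the original column order in the subset, which matters when locus order carries positional meaning.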

The class can be used to filter out linked loci, identified from the VCF file CHROM field, using the filter_linked method.

The class can be used to plot a Sankey diagram showing the number of loci removed at each filtering step using the plot_sankey_filtering_report method.

The class can be used to print a summary of the filtering results using the print_filtering_report method.

The class can be used to filter out monomorphic sites using the filter_monomorphic method.

The class can be used to filter out loci (columns) where the only variant is a singleton using the filter_singletons method.

The class can be used to filter out loci (columns) that have more than 2 alleles using the filter_biallelic method.

The class can be used to filter out loci (columns) where the minor allele frequency is below the threshold using the filter_maf method.

The class can be used to filter out loci (columns) where the minor allele count is below the threshold using the filter_mac method.
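
Minor allele count (MAC) and minor allele frequency (MAF) can be sketched for a single column as follows. The sketch assumes diploid data in which a homozygous call contributes two allele copies and an IUPAC-encoded heterozygote contributes one copy of each allele, and it takes the second most common allele as the minor allele. These are assumptions, not a description of how filter_maf or filter_mac count alleles.

```python
# Illustrative sketch only; assumes diploid, IUPAC-encoded genotypes.
from collections import Counter

MISSING_CHARS = {"N", "-", "?", "."}
IUPAC_HETS = {"R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC"}


def allele_counts(column):
    """Count allele copies in one column, skipping missing calls."""
    counts = Counter()
    for call in column:
        if call in MISSING_CHARS:
            continue
        if call in IUPAC_HETS:
            counts.update(IUPAC_HETS[call])  # one copy of each allele
        else:
            counts[call] += 2  # homozygote: two copies of the same allele
    return counts


def minor_allele_count(column):
    """Copies of the second most common allele (0 if fewer than two alleles)."""
    ranked = allele_counts(column).most_common(2)
    return ranked[1][1] if len(ranked) == 2 else 0


def minor_allele_frequency(column):
    """MAC divided by the total number of non-missing allele copies."""
    total = sum(allele_counts(column).values())
    return minor_allele_count(column) / total if total else 0.0


# Hypothetical column: A has 7 copies, G has 3 copies, one call is missing.
column = ["A", "A", "A", "R", "G", "N"]
mac = minor_allele_count(column)
maf = minor_allele_frequency(column)
```

With these counts, a MAC threshold of 2 would keep the locus, while a MAF threshold above 0.3 would remove it.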

The class can be used to filter out loci (columns) from the alignment that have more than a given proportion of missing data using the filter_missing method.

The class can be used to filter out sequences from the alignment that have more than a given proportion of missing data using the filter_missing_sample method.

The class can be used to filter out loci (columns) from the alignment that have more than a given proportion of missing data in a specific population using the filter_missing_pop method.
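
Per-population filtering differs from global filtering because a locus can pass the global threshold while being mostly missing in one population. The sketch below assumes a locus is dropped when any population exceeds the threshold; the function name and the pop_to_rows mapping (population name to row indices) are hypothetical.

```python
# Illustrative sketch only; filter_missing_pop's exact rule may differ.
MISSING_CHARS = {"N", "-", "?", "."}


def loci_passing_pop_missing(alignment, pop_to_rows, threshold):
    """Return a keep-mask: False where any population exceeds `threshold` missing."""
    n_loci = len(alignment[0])
    keep = [True] * n_loci
    for rows in pop_to_rows.values():
        for j in range(n_loci):
            n_missing = sum(alignment[i][j] in MISSING_CHARS for i in rows)
            if n_missing / len(rows) > threshold:
                keep[j] = False
    return keep


# Hypothetical toy data: 4 samples, 3 loci, 2 populations.
toy_alignment = [
    ["A", "N", "G"],  # popA
    ["A", "N", "G"],  # popA
    ["T", "C", "?"],  # popB
    ["A", "C", "G"],  # popB
]
pop_to_rows = {"popA": [0, 1], "popB": [2, 3]}
mask = loci_passing_pop_missing(toy_alignment, pop_to_rows, threshold=0.5)
```

Here the second locus is missing in exactly half of all samples, so it would pass a global threshold of 0.5, but it is entirely missing in popA and is therefore dropped.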

Example

>>> from snpio import VCFReader
>>> from snpio.filtering.nremover2 import NRemover2
>>>
>>> # Specify the genetic data from a VCF file
>>> vcf_file = "snpio/example_data/vcf_files/phylogen_subset14K_sorted.vcf.gz"
>>>
>>> # Specify the population map file
>>> popmap_file = "snpio/example_data/popmaps/phylogen_nomx.popmap"
>>>
>>> # Read the genetic data from the VCF file
>>>
>>> gd = VCFReader(filename=vcf_file, popmapfile=popmap_file)
>>>
>>> # Initialize the NRemover2 class with the GenotypeData instance
>>> nrm = NRemover2(gd)
>>>
>>> # Filter samples and loci.
>>> nrm.filter_missing_sample(0.75).filter_missing(0.75).filter_missing_pop(0.75).filter_mac(2).filter_monomorphic(exclude_heterozygous=False).filter_singletons(exclude_heterozygous=False).filter_biallelic(exclude_heterozygous=False).resolve()
>>>
>>> # Plot the Sankey diagram showing the number of loci removed at each filtering step.
>>> nrm.plot_sankey_filtering_report()
>>>
>>> # Run a threshold search and plot the results.
>>> nrm.search_thresholds(thresholds=[0.1, 0.2, 0.3, 0.4, 0.5], maf_thresholds=[0.01, 0.05, 0.1], mac_thresholds=[2, 3, 4, 5], filter_order=["filter_missing_sample", "filter_missing", "filter_missing_pop", "filter_maf", "filter_mac", "filter_monomorphic", "filter_singletons", "filter_biallelic"])
genotype_data

An instance of the GenotypeData class.

Type:

GenotypeData

filtering_helper

An instance of the FilteringHelper class.

Type:

FilteringHelper

filtering_methods

An instance of the FilteringMethods class.

Type:

FilteringMethods

df_sample_list

A list of DataFrames containing filtering results for samples.

Type:

List[pd.DataFrame]

df_global_list

A list of DataFrames containing global filtering results.

Type:

List[pd.DataFrame]

_chain_active

A boolean flag indicating whether a filtering chain is active.

Type:

bool

_chain_resolved

A boolean flag indicating whether a filtering chain has been resolved.

Type:

bool

_search_mode

A boolean flag indicating whether the search mode is active.

Type:

bool

debug

A boolean flag indicating whether to enable debug mode.

Type:

bool

verbose

A boolean flag indicating whether to enable verbose mode.

Type:

bool

logger

An instance of the Logger class for logging messages.

Type:

Logger

alignment

The input alignment to filter.

Type:

np.ndarray

populations

The population for each sequence in the alignment.

Type:

List[str]

samples

The sample IDs for each sequence in the alignment.

Type:

List[str]

prefix

The prefix for the output files.

Type:

str

popmap

A dictionary mapping sample IDs to population names.

Type:

Dict[str, Union[str, int]]

popmap_inverse

A dictionary mapping population names to lists of sample IDs.

Type:

Dict[Union[str, int], List[str]]

sample_indices

A boolean array indicating which samples to keep.

Type:

np.ndarray

loci_indices

A boolean array indicating which loci to keep.

Type:

np.ndarray

loci_removed_per_step

A dictionary tracking the number of loci removed at each filtering step.

Type:

Dict[str, Tuple[int, int]]

samples_removed_per_step

A dictionary tracking the number of samples removed at each filtering step.

Type:

Dict[str, Tuple[int, int]]

kept_per_step

A dictionary tracking the number of loci or samples kept at each filtering step.

Type:

Dict[str, Tuple[int, float]]

step_index

The current step index in the filtering process.

Type:

int

current_threshold

The current threshold for missing data.

Type:

float

original_loci_count

The original number of loci in the alignment.

Type:

int

original_sample_count

The original number of samples in the alignment.

Type:

int

original_loci_indices

A boolean array indicating the original loci indices.

Type:

np.ndarray

original_sample_indices

A boolean array indicating the original sample indices.

Type:

np.ndarray

filter_missing()

Filters out loci (columns) from the alignment that have more than a given proportion of missing data.

filter_missing_pop()

Filters out loci (columns) from the alignment that have more than a given proportion of missing data in a specific population.

filter_missing_sample()

Filters out samples from the alignment that have more than a given proportion of missing data.

filter_maf()

Filters out loci (columns) where the minor allele frequency is below the threshold.

filter_mac()

Filters out loci (columns) where the minor allele count is below the threshold.

filter_monomorphic()

Filters out monomorphic sites.

filter_singletons()

Filters out loci (columns) where the only variant is a singleton.

filter_biallelic()

Filters out loci (columns) that have more than 2 alleles.

filter_linked()

Filters out linked loci using the VCF file CHROM field.

thin_loci()

Thin out loci within a specified distance of each other.

random_subset_loci()

Randomly subset the loci (columns) in the SNP dataset.

search_thresholds()

Plots the proportion of missing data against the filtering thresholds.

plot_sankey_filtering_report()

Makes a Sankey plot showing the number of loci removed at each filtering step.

print_filtering_report()

Prints a summary of the filtering results.

resolve()

Finalizes the method chain and returns the updated GenotypeData instance.

__repr__()

Returns a string representation of the NRemover2 instance.

__str__()

Returns a string representation of the NRemover2 instance.

__getattr__()

Custom attribute access method that handles delegating calls to FilteringMethods or FilteringHelper.
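
The delegation pattern described for __getattr__ can be sketched with toy classes. Facade, Methods, and Helper are hypothetical stand-ins for NRemover2, FilteringMethods, and FilteringHelper; the real classes are not reproduced here.

```python
# Illustrative sketch only; toy stand-ins, not SNPio's classes.
class Methods:
    def filter_step(self):
        return "filtered"


class Helper:
    def describe(self):
        return "helper"


class Facade:
    def __init__(self):
        self._delegates = (Methods(), Helper())

    def __getattr__(self, name):
        # __getattr__ is only called when normal attribute lookup fails.
        # Reading from __dict__ avoids infinite recursion if _delegates is unset.
        for delegate in self.__dict__.get("_delegates", ()):
            if hasattr(delegate, name):
                return getattr(delegate, name)
        raise AttributeError(
            f"{type(self).__name__!r} object has no attribute {name!r}"
        )


facade = Facade()
step_result = facade.filter_step()  # resolved on Methods
helper_result = facade.describe()   # resolved on Helper
```

Because delegates are searched in order, a name defined on both would resolve to the first delegate; unknown names still raise AttributeError as usual.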

property loci_indices: ndarray

Gets the current loci_indices.

Returns:

The boolean array indicating which loci to keep.

Return type:

np.ndarray

plot_sankey_filtering_report()

Plots a Sankey diagram showing the number of loci removed at each filtering step.

The diagram shows, for each filtering step, the number of loci removed, the proportion of loci removed relative to the total number of loci in the alignment, and the number of loci retained. It is written to the output directory.

Return type:

None

Returns:

None

Raises:

RuntimeError – If the filtering chain has not been resolved.

Note

The Sankey diagram is generated using the Plotting class.

The Sankey diagram is saved as PNG and HTML files in the output directory.

The Sankey diagram makes it easier to understand the impact of each filtering step on the alignment and to identify the most effective filtering steps.

Example

To plot the Sankey diagram showing the number of loci removed at each filtering step, use the following code:

>>> from snpio import VCFReader
>>> from snpio.filtering.nremover2 import NRemover2
>>> gd = VCFReader(filename="snpio/example_data/vcf_files/phylogen_subset14K_sorted.vcf.gz", popmapfile="snpio/example_data/popmaps/phylogen_nomx.popmap")
>>> nrm = NRemover2(gd)
>>> nrm.filter_missing_sample(0.75).filter_missing(0.75).filter_missing_pop(0.75).filter_mac(2).filter_monomorphic(exclude_heterozygous=False).filter_singletons(exclude_heterozygous=False).filter_biallelic(exclude_heterozygous=False).resolve()
>>> nrm.plot_sankey_filtering_report()
>>> # The Sankey diagram will be saved in the output directory.
propagate_chain()

Propagates the filtering chain to the next step, marking the chain as active.

Raises:

RuntimeError – If the filtering chain has not been resolved.

Return type:

None

Note

  • This method is used to propagate the filtering chain to the next step in the filtering process.

  • It marks the chain as active, allowing further filtering steps to be applied.

  • The chain must be resolved before starting a new chain.

  • The chain is resolved by calling the resolve() method.

resolve()

Resolve the method chain and finalize the filtering process.

This method applies the selected filters to the alignment, updates the alignment, sample indices, and loci indices based on the filtering results, and resets the chain active flag.

Returns:

The updated GenotypeData instance after filtering has been applied.

Return type:

GenotypeData

Note

  • This method is used to finalize the method chain by applying the selected filters to the alignment.

  • It updates the alignment, sample indices, and loci indices based on the filtering results.

  • It also resets the chain active flag and returns the updated GenotypeData instance after filtering has been applied.
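
The chain-then-resolve pattern can be sketched with a toy class. ToyChain and its methods are hypothetical and operate on a plain list of numbers; the sketch only mirrors the ideas of returning self from each filter, tracking chain flags, and refusing to report before resolve() has been called.

```python
# Illustrative sketch only; a toy analogue, not NRemover2 itself.
class ToyChain:
    def __init__(self, values):
        self.values = list(values)
        self._keep = [True] * len(self.values)
        self._chain_active = False
        self._chain_resolved = False

    def _start_step(self):
        self._chain_active = True
        self._chain_resolved = False

    def drop_below(self, minimum):
        self._start_step()
        self._keep = [k and v >= minimum for k, v in zip(self._keep, self.values)]
        return self  # returning self is what makes chaining possible

    def drop_above(self, maximum):
        self._start_step()
        self._keep = [k and v <= maximum for k, v in zip(self._keep, self.values)]
        return self

    def resolve(self):
        # Apply the accumulated mask, then reset the chain state.
        self.values = [v for v, k in zip(self.values, self._keep) if k]
        self._keep = [True] * len(self.values)
        self._chain_active = False
        self._chain_resolved = True
        return self.values

    def report(self):
        if not self._chain_resolved:
            raise RuntimeError("Call resolve() before requesting a report.")
        return len(self.values)


chain = ToyChain([1, 5, 10, 20])
result = chain.drop_below(5).drop_above(10).resolve()
n_kept = chain.report()
```

Deferring the actual update to resolve() means each step only narrows a mask, so intermediate steps stay cheap and the underlying data changes once per chain.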

property sample_indices: ndarray

Gets the current sample_indices.

Returns:

The boolean array indicating which samples to keep.

Return type:

np.ndarray

property search_mode: bool

Gets the current search mode status.

Returns:

The status of search mode.

Return type:

bool

Module contents