snpio.io package

Submodules

snpio.io.phylip_reader module

class snpio.io.phylip_reader.PhylipReader(filename=None, popmapfile=None, force_popmap=False, exclude_pops=None, include_pops=None, plot_format='png', prefix='snpio', verbose=False, debug=False)

Bases: GenotypeData

Class to read and write PHYLIP files.

PHYLIP is a simple text format for multiple sequence alignments. The first line of a PHYLIP file gives the number of samples and the number of loci. Each subsequent line holds a sample ID followed by that sample's sequence data, typically a string of nucleotide or amino acid characters.

Example

>>> from snpio import PhylipReader
>>>
>>> genotype_data = PhylipReader(filename="example.phy", popmapfile="example.popmap", verbose=True)
>>>
>>> genotype_data.snp_data
array([["A", "T", "T", "A"], ["C", "G", "G", "C"], ["A", "T", "T", "A"]], dtype="<U1")
>>>
>>> genotype_data.samples
["Sample1", "Sample2", "Sample3", "Sample4"]
>>>
>>> genotype_data.populations
["Pop1", "Pop1", "Pop2", "Pop2"]
>>>
>>> genotype_data.num_snps
3
>>>
>>> genotype_data.num_inds
4
>>>
>>> genotype_data.popmap
{"Sample1": "Pop1", "Sample2": "Pop1", "Sample3": "Pop2", "Sample4": "Pop2"}
>>>
>>> genotype_data.popmap_inverse
{"Pop1": ["Sample1", "Sample2"], "Pop2": ["Sample3", "Sample4"]}
>>>
>>> genotype_data.ref
["A", "C", "A"]
>>>
>>> genotype_data.alt
["T", "G", "T"]
>>>
>>> genotype_data.missingness_reports()
>>>
>>> genotype_data.run_pca()
>>>
>>> genotype_data.write_phylip("output.phy")

filename

Name of the PHYLIP file.

Type:

str

popmapfile

Name of the population map file.

Type:

str

force_popmap

If True, the population map file is required.

Type:

bool

exclude_pops

List of populations to exclude.

Type:

List[str]

include_pops

List of populations to include.

Type:

List[str]

plot_format

Format for saving plots. Default is ‘png’.

Type:

str

prefix

Prefix for output files.

Type:

str

verbose

If True, status updates are printed.

Type:

bool

samples

List of sample IDs.

Type:

List[str]

snp_data

List of SNP data.

Type:

List[List[str]]

num_inds

Number of individuals.

Type:

int

num_snps

Number of SNPs.

Type:

int

logger

Logger instance.

Type:

Logger

debug

If True, debug messages are printed.

Type:

bool

property alt: List[str]

List of alternate alleles.

property alt2: List[List[str]]

List of second alternate alleles.

load_aln()

Load the PHYLIP file and populate SNP data, samples, and alleles.

This method reads the PHYLIP file and populates the SNP data, samples, and alleles. The first line of the file gives the number of samples and the number of loci; each subsequent line holds a sample ID followed by its sequence data.

Raises:
  • AlignmentFileNotFoundError – If the PHYLIP file is not found.

  • AlignmentFormatError – If the PHYLIP file has an invalid format.

Return type:

None
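
The file layout described above can be illustrated with a small standalone parser. This is only a sketch of the format, not the code PhylipReader runs: the function name parse_phylip is hypothetical, it handles only the simple one-record-per-line layout, and it raises a plain ValueError where the library would raise its own AlignmentFormatError.

```python
from typing import List, Tuple


def parse_phylip(text: str) -> Tuple[List[str], List[str]]:
    """Parse one-record-per-line PHYLIP text into (sample IDs, sequences).

    Hypothetical helper for illustration; not part of snpio.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        raise ValueError("empty PHYLIP input")

    # Header line: number of samples, then number of loci.
    header = lines[0].split()
    if len(header) != 2:
        raise ValueError("header must contain exactly two integers")
    n_samples, n_loci = int(header[0]), int(header[1])

    samples: List[str] = []
    seqs: List[str] = []
    for ln in lines[1:]:
        # Each record: sample ID, whitespace, then the sequence data.
        parts = ln.split(None, 1)
        if len(parts) != 2:
            raise ValueError(f"malformed record: {ln!r}")
        sample_id = parts[0]
        seq = "".join(parts[1].split())  # drop any spacing inside the sequence
        if len(seq) != n_loci:
            raise ValueError(f"{sample_id}: expected {n_loci} loci, got {len(seq)}")
        samples.append(sample_id)
        seqs.append(seq)

    if len(samples) != n_samples:
        raise ValueError(f"expected {n_samples} samples, got {len(samples)}")
    return samples, seqs


example = """3 4
Sample1 ATTA
Sample2 CGGC
Sample3 ATTA
"""
samples, seqs = parse_phylip(example)
```

Checking the record count and sequence lengths against the header is what lets a reader reject a truncated or misaligned file early.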

property ref: List[str]

List of reference alleles.

write_phylip(output_file, genotype_data=None, snp_data=None, samples=None)

Write the stored alignment as a PHYLIP file.

The first line of the output file contains the number of samples and the number of loci. Each subsequent line contains the sample ID followed by the sequence data.

Parameters:
  • output_file (str) – Name of the output PHYLIP file.

  • genotype_data (GenotypeData, optional) – GenotypeData instance.

  • snp_data (List[List[str]], optional) – SNP data. Must be provided if genotype_data is None.

  • samples (List[str], optional) – List of sample IDs. Must be provided if snp_data is not None.

Raises:
  • TypeError – If genotype_data and snp_data are both provided.

  • TypeError – If samples are not provided when snp_data is provided.

  • ValueError – If samples and snp_data are not the same length.

Return type:

None

Note

If genotype_data is provided, the snp_data and samples are loaded from the GenotypeData instance.

If snp_data is provided, the samples must also be provided.

If genotype_data is not provided, the snp_data and samples must be provided.

The sequence data must have the same length for each sample.

The PHYLIP file must have the correct number of samples and loci.
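
The constraints in the note above (matching sample and row counts, equal sequence lengths, a header that agrees with the data) can be sketched as a small standalone writer. The function write_phylip_text is a hypothetical name used only for illustration; it writes to any text handle and uses a tab between ID and sequence as an assumed separator. It is not the library's write_phylip.

```python
import io
from typing import List, TextIO


def write_phylip_text(
    handle: TextIO, samples: List[str], snp_data: List[List[str]]
) -> None:
    """Write samples and per-sample genotype rows as PHYLIP text.

    Hypothetical helper for illustration; not part of snpio.
    """
    if len(samples) != len(snp_data):
        raise ValueError("samples and snp_data must have the same length")

    # Every sample must cover the same number of loci.
    lengths = {len(row) for row in snp_data}
    if len(lengths) > 1:
        raise ValueError("all sequences must have the same length")
    n_loci = lengths.pop() if lengths else 0

    # Header: number of samples, then number of loci.
    handle.write(f"{len(samples)} {n_loci}\n")
    for sample_id, row in zip(samples, snp_data):
        handle.write(f"{sample_id}\t{''.join(row)}\n")


buf = io.StringIO()
write_phylip_text(
    buf,
    ["Sample1", "Sample2"],
    [["A", "T", "T"], ["C", "G", "N"]],
)
phylip_text = buf.getvalue()
```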

snpio.io.structure_reader module

class snpio.io.structure_reader.StructureReader(filename=None, popmapfile=None, has_popids=False, force_popmap=False, exclude_pops=None, include_pops=None, plot_format='png', prefix='snpio', verbose=False, debug=False)

Bases: GenotypeData

This class reads STRUCTURE files and stores the SNP data, sample IDs, and populations.

StructureReader is a subclass of GenotypeData and inherits its attributes and methods.

Example

>>> from snpio import StructureReader
>>>
>>> genotype_data = StructureReader(filename="data.structure", popmapfile="example.popmap", verbose=True)
>>>
>>> genotype_data.snp_data
array([["A", "T", "T", "A"], ["C", "G", "G", "C"], ["A", "T", "T", "A"]], dtype="<U1")
>>>
>>> genotype_data.samples
["Sample1", "Sample2", "Sample3", "Sample4"]
>>>
>>> genotype_data.populations
["Pop1", "Pop1", "Pop2", "Pop2"]
>>>
>>> genotype_data.num_snps
3
>>>
>>> genotype_data.num_inds
4
>>>
>>> genotype_data.popmap
{"Sample1": "Pop1", "Sample2": "Pop1", "Sample3": "Pop2", "Sample4": "Pop2"}
>>>
>>> genotype_data.popmap_inverse
{"Pop1": ["Sample1", "Sample2"], "Pop2": ["Sample3", "Sample4"]}
>>>
>>> genotype_data.ref
["A", "C", "A"]
>>>
>>> genotype_data.alt
["T", "G", "T"]
>>>
>>> genotype_data.missingness_reports()
>>>
>>> genotype_data.run_pca()
>>>
>>> genotype_data.write_structure("output.str")

logger

Logger object.

Type:

LoggerManager

verbose

If True, status updates are printed.

Type:

bool

debug

If True, debug messages are printed.

Type:

bool

_has_popids

If True, the STRUCTURE file includes population IDs.

Type:

bool

_onerow

If True, the STRUCTURE file is in one-row format.

Type:

bool

load_aln()

Load the STRUCTURE file and populate SNP data, samples, and populations.

This method reads the STRUCTURE file and populates the SNP data, sample IDs, and populations. It also sets the number of SNPs and individuals.

Raises:
  • AlignmentNotFoundError – If the STRUCTURE file is not found.

  • AlignmentFormatError – If the STRUCTURE file has an invalid format.

Return type:

None

Note

This method should be called after initializing the StructureReader object.

The SNP data, sample IDs, and populations are stored in the snp_data, samples, and populations attributes, respectively.

The number of SNPs and individuals are stored in the num_snps and num_inds attributes, respectively.

The STRUCTURE file can be written to a new file using the write_structure method.

The SNP data can be accessed in the IUPAC format using the snp_data attribute.
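
To make the relationship between a two-row STRUCTURE file and IUPAC-coded SNP data concrete, here is a minimal standalone sketch. It is not the StructureReader implementation. It assumes whitespace-separated columns, the common integer allele coding 1=A, 2=C, 3=G, 4=T with any other value (such as -9) treated as missing, and it collapses a genotype with one missing allele to "N". The names parse_two_row, ALLELE_CODES, and IUPAC are hypothetical.

```python
from typing import List, Tuple

# Assumed integer coding for alleles; real files may use a different scheme.
ALLELE_CODES = {"1": "A", "2": "C", "3": "G", "4": "T"}

# Unordered allele pair -> IUPAC nucleotide code.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AC"): "M", frozenset("AG"): "R", frozenset("AT"): "W",
    frozenset("CG"): "S", frozenset("CT"): "Y", frozenset("GT"): "K",
}


def parse_two_row(
    lines: List[str], has_popids: bool = False
) -> Tuple[List[str], List[str], List[List[str]]]:
    """Return (samples, populations, snp_data) from two-row STRUCTURE lines."""
    rows = [ln.split() for ln in lines if ln.strip()]
    if len(rows) % 2:
        raise ValueError("two-row format needs an even number of rows")

    first = 2 if has_popids else 1  # index of the first genotype column
    samples: List[str] = []
    pops: List[str] = []
    snp_data: List[List[str]] = []
    for top, bottom in zip(rows[0::2], rows[1::2]):
        if top[0] != bottom[0] or len(top) != len(bottom):
            raise ValueError(f"rows for {top[0]!r} do not pair up")
        samples.append(top[0])
        if has_popids:
            pops.append(top[1])
        calls = []
        for a, b in zip(top[first:], bottom[first:]):
            if a not in ALLELE_CODES or b not in ALLELE_CODES:
                calls.append("N")  # missing or unrecognized allele
            else:
                pair = frozenset((ALLELE_CODES[a], ALLELE_CODES[b]))
                calls.append(IUPAC[pair])
        snp_data.append(calls)
    return samples, pops, snp_data


lines = [
    "Sample1 1 1 2 -9",
    "Sample1 1 4 2 -9",
    "Sample2 2 1 3 4",
    "Sample2 2 1 3 4",
]
samples, pops, snp_data = parse_two_row(lines, has_popids=True)
```

In a one-row file the two alleles of each locus sit side by side on a single line, so the same pairing logic would apply to adjacent columns instead of adjacent rows.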

write_structure(output_file, genotype_data=None, snp_data=None, samples=None, verbose=False)

Write the stored alignment as a STRUCTURE file.

This method writes the stored alignment as a STRUCTURE file. If genotype_data is provided, the SNP data and sample IDs are extracted from it. Otherwise, the SNP data and sample IDs must be provided.

Parameters:
  • output_file (str) – Name of the output STRUCTURE file.

  • genotype_data (GenotypeData, optional) – GenotypeData instance.

  • snp_data (List[List[str]], optional) – SNP data in IUPAC format. Must be provided if genotype_data is None.

  • samples (List[str], optional) – List of sample IDs. Must be provided if snp_data is provided.

  • verbose (bool, optional) – If True, status updates are printed.

Raises:
  • TypeError – If genotype_data and snp_data are both provided.

  • TypeError – If samples are not provided when snp_data is provided.

Return type:

None

snpio.io.vcf_reader module

class snpio.io.vcf_reader.VCFReader(filename=None, popmapfile=None, chunk_size=1000, force_popmap=False, exclude_pops=None, include_pops=None, plot_format='png', plot_fontsize=18, plot_dpi=300, plot_despine=True, show_plots=False, prefix='snpio', verbose=False, sample_indices=None, loci_indices=None, debug=False)

Bases: GenotypeData

A class to read VCF files into GenotypeData objects and write GenotypeData objects to VCF files.

This class inherits from GenotypeData and provides methods to read VCF files and extract the necessary attributes.

Example

>>> from snpio import VCFReader
>>>
>>> genotype_data = VCFReader(filename="example.vcf", popmapfile="popmap.txt", verbose=True)
>>> genotype_data.snp_data
array([["A", "T", "T", "A"], ["A", "T", "T", "A"], ["A", "T", "T", "A"]], dtype="<U1")
>>>
>>> genotype_data.samples
["sample1", "sample2", "sample3", "sample4"]
>>>
>>> genotype_data.num_inds
4
>>>
>>> genotype_data.num_snps
3
>>>
>>> genotype_data.populations
["pop1", "pop1", "pop2", "pop2"]
>>>
>>> genotype_data.popmap
{"sample1": "pop1", "sample2": "pop1", "sample3": "pop2", "sample4": "pop2"}
>>>
>>> genotype_data.popmap_inverse
{"pop1": ["sample1", "sample2"], "pop2": ["sample3", "sample4"]}
>>>
>>> genotype_data.loci_indices
array([True, True, True], dtype=bool)
>>>
>>> genotype_data.sample_indices
array([True, True, True, True], dtype=bool)
>>>
>>> genotype_data.ref
["A", "A", "A"]
>>>
>>> genotype_data.alt
["T", "T", "T"]
>>>
>>> genotype_data.missingness_reports()
>>>
>>> genotype_data.run_pca()
>>>
>>> genotype_data.write_vcf("output.vcf")

filename

The name of the VCF file to read.

Type:

Optional[str]

popmapfile

The name of the population map file to read.

Type:

Optional[str]

chunk_size

The size of the chunks to read from the VCF file.

Type:

int

force_popmap

Whether to force the use of the population map file.

Type:

bool

exclude_pops

The populations to exclude.

Type:

Optional[List[str]]

include_pops

The populations to include.

Type:

Optional[List[str]]

plot_format

The format to save the plots in.

Type:

str

plot_fontsize

The font size for the plots.

Type:

int

plot_dpi

The DPI for the plots.

Type:

int

plot_despine

Whether to remove the spines from the plots.

Type:

bool

show_plots

Whether to show the plots.

Type:

bool

prefix

The prefix to use for the output files.

Type:

str

verbose

Whether to print verbose output.

Type:

bool

sample_indices

The indices of the samples to read.

Type:

np.ndarray

loci_indices

The indices of the loci to read.

Type:

np.ndarray

debug

Whether to enable debug mode.

Type:

bool

num_records

The number of records in the VCF file.

Type:

int

filetype

The type of the file.

Type:

str

vcf_header

The VCF header.

Type:

Optional[pysam.libcbcf.VariantHeader]

info_fields

The VCF info fields.

Type:

Optional[List[str]]

resource_data

A dictionary to store resource data.

Type:

dict

logger

The logger object.

Type:

logging.Logger

snp_data

The SNP data.

Type:

np.ndarray

samples

The sample names.

Type:

np.ndarray

Note

The VCF file is bgzipped, sorted, and indexed using Tabix to ensure efficient reading, if necessary.

The VCF file is read in chunks to avoid memory issues.

The VCF attributes are extracted and stored in an HDF5 file for efficient access.

The genotype data is transformed into IUPAC codes for efficient storage and processing.

get_vcf_attributes(vcf, chunk_size=1000)

Extracts the VCF attributes in chunks of chunk_size records and stores them in an HDF5 file.

Parameters:
  • vcf (pysam.VariantFile) – The VCF file object to extract attributes from.

  • chunk_size (int, optional) – The size of the chunks to process. Defaults to 1000.

Returns:

The path to the HDF5 file containing the VCF attributes, the SNP data, and the sample names.

Return type:

Tuple[str, np.ndarray, np.ndarray]
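
The chunked processing mentioned above follows a general pattern that can be sketched without any VCF library. In this illustration iter_chunks is a hypothetical helper and range(2500) stands in for an iterator of VCF records; how get_vcf_attributes batches records internally is not shown in this reference.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_chunks(records: Iterable[T], chunk_size: int = 1000) -> Iterator[List[T]]:
    """Yield successive lists of at most chunk_size records.

    Hypothetical helper for illustration; not part of snpio.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be a positive integer")
    it = iter(records)
    while True:
        # Pull the next batch; only one batch is held in memory at a time.
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


# range(2500) stands in for 2500 VCF records.
chunk_lengths = [len(chunk) for chunk in iter_chunks(range(2500), chunk_size=1000)]
```

A larger chunk_size means fewer batches but more records in memory at once, which is the trade-off the chunk_size parameter exposes.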

load_aln()

Loads the alignment from the VCF file into the VCFReader object.

This method ensures that the input VCF file is bgzipped, sorted, and indexed using Tabix. It then reads the VCF file and extracts the necessary attributes. The VCF attributes are stored in an HDF5 file for efficient access. The genotype data is transformed into IUPAC codes for efficient storage and processing.

Return type:

None

transform_gt(gt, ref, alts)

Transforms genotype tuples into their IUPAC codes or corresponding strings.

Parameters:
  • gt (np.ndarray) – The genotype tuples to transform.

  • ref (str) – The reference allele.

  • alts (List[str]) – The alternate alleles.

Returns:

The transformed genotype tuples.

Return type:

np.ndarray
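
As an illustration of the idea behind this transformation, the sketch below maps a single diploid genotype tuple to an IUPAC code. It is not the transform_gt implementation, which operates on arrays. It assumes allele indices where 0 is the reference and 1 and up select from the alternates, that None marks a missing call, and that heterozygotes with no single-letter code (for example, indels) fall back to "N". The names gt_to_iupac and IUPAC_HET are hypothetical.

```python
from typing import Optional, Sequence, Tuple

# Heterozygous nucleotide pairs and their IUPAC ambiguity codes.
IUPAC_HET = {
    frozenset("AC"): "M", frozenset("AG"): "R", frozenset("AT"): "W",
    frozenset("CG"): "S", frozenset("CT"): "Y", frozenset("GT"): "K",
}


def gt_to_iupac(
    gt: Tuple[Optional[int], Optional[int]], ref: str, alts: Sequence[str]
) -> str:
    """Convert one diploid genotype tuple to an IUPAC code.

    Hypothetical helper for illustration; not part of snpio.
    """
    if any(idx is None for idx in gt):
        return "N"  # missing call
    alleles = [ref, *alts]  # index 0 is the reference allele
    a, b = (alleles[idx] for idx in gt)
    if a == b:
        return a  # homozygous: the allele string itself
    return IUPAC_HET.get(frozenset((a, b)), "N")


genotypes = [(0, 0), (0, 1), (1, 1), (None, None)]
calls = [gt_to_iupac(gt, ref="A", alts=["T"]) for gt in genotypes]
```

Using frozenset keys makes the lookup order-independent, so (0, 1) and (1, 0) produce the same code.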

update_vcf_attributes(snp_data, sample_indices, loci_indices, samples)

Updates the VCF attributes with new data in chunks.

Parameters:
  • snp_data (np.ndarray) – The SNP data to update the VCF attributes with.

  • sample_indices (np.ndarray) – The indices of the samples to update.

  • loci_indices (np.ndarray) – The indices of the loci to update.

  • samples (np.ndarray) – The sample names to update.

Raises:

FileNotFoundError – If the VCF attributes file is not found.

Return type:

None

validate_input_vcf(filepath)

Validates the input VCF file to ensure it meets required criteria.

Parameters:

filepath (Path) – The path to the bgzipped and sorted VCF file.

Raises:

ValueError – If the VCF file does not meet validation criteria.

Return type:

None

validate_output_vcf(filepath)

Validates the output VCF file to ensure it was written correctly.

Parameters:

filepath (Path) – The path to the bgzipped and indexed output VCF file.

Raises:

ValueError – If the VCF file does not meet validation criteria.

Return type:

None

property vcf_attributes_fn: str

The path to the HDF5 file containing the VCF attributes.

Returns:

The path to the HDF5 file containing the VCF attributes.

Return type:

str

write_vcf(output_filename, hdf5_file_path=None, chunk_size=1000)

Writes the GenotypeData object data to a VCF file in chunks.

This method writes the VCF data, bgzips the output file, indexes it with Tabix, and validates the output.

Parameters:
  • output_filename (str) – The name of the output VCF file to write.

  • hdf5_file_path (str, optional) – The path to the HDF5 file containing the VCF attributes. Defaults to None.

  • chunk_size (int, optional) – The size of the chunks to read from the HDF5 file. Defaults to 1000.

Returns:

The current instance of VCFReader.

Return type:

VCFReader

Module contents