snpio.io package
Submodules
snpio.io.phylip_reader module
- class snpio.io.phylip_reader.PhylipReader(filename=None, popmapfile=None, force_popmap=False, exclude_pops=None, include_pops=None, plot_format='png', prefix='snpio', verbose=False, debug=False)
Bases:
GenotypeData
Class to read and write PHYLIP files.
This class provides methods to read and write PHYLIP files. The PHYLIP format is a simple text format for representing multiple sequence alignments. The first line of a PHYLIP file contains the number of samples and the number of loci. Each subsequent line contains the sample ID followed by the sequence data. The sequence data can be in any format, but it is typically a string of nucleotides or amino acids.
Example
>>> from snpio import PhylipReader
>>>
>>> genotype_data = PhylipReader(filename="example.phy", popmapfile="example.popmap", verbose=True)
>>>
>>> genotype_data.snp_data
array([["A", "T", "T", "A"], ["C", "G", "G", "C"], ["A", "T", "T", "A"]], dtype="<U1")
>>>
>>> genotype_data.samples
["Sample1", "Sample2", "Sample3", "Sample4"]
>>>
>>> genotype_data.populations
["Pop1", "Pop1", "Pop2", "Pop2"]
>>>
>>> genotype_data.num_snps
3
>>>
>>> genotype_data.num_inds
4
>>>
>>> genotype_data.popmap
{"Sample1": "Pop1", "Sample2": "Pop1", "Sample3": "Pop2", "Sample4": "Pop2"}
>>>
>>> genotype_data.popmap_inverse
{"Pop1": ["Sample1", "Sample2"], "Pop2": ["Sample3", "Sample4"]}
>>>
>>> genotype_data.ref
["A", "C", "A"]
>>>
>>> genotype_data.alt
["T", "G", "T"]
>>>
>>> genotype_data.missingness_reports()
>>>
>>> genotype_data.run_pca()
>>>
>>> genotype_data.write_phylip("output.phy")
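The popmap and popmap_inverse values in the example are two views of the same sample-to-population mapping. The following is a minimal sketch of that relationship, not SNPio's implementation: the helper names filter_popmap and invert_popmap are hypothetical, and the way include_pops and exclude_pops are combined here is an assumption.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def filter_popmap(
    popmap: Dict[str, str],
    include_pops: Optional[List[str]] = None,
    exclude_pops: Optional[List[str]] = None,
) -> Dict[str, str]:
    """Keep samples whose population passes both lists (assumed semantics)."""
    kept = {}
    for sample, pop in popmap.items():
        if include_pops is not None and pop not in include_pops:
            continue
        if exclude_pops is not None and pop in exclude_pops:
            continue
        kept[sample] = pop
    return kept


def invert_popmap(popmap: Dict[str, str]) -> Dict[str, List[str]]:
    """Group sample IDs by population, preserving sample order."""
    inverse = defaultdict(list)
    for sample, pop in popmap.items():
        inverse[pop].append(sample)
    return dict(inverse)


popmap = {"Sample1": "Pop1", "Sample2": "Pop1", "Sample3": "Pop2", "Sample4": "Pop2"}
popmap_inverse = invert_popmap(popmap)
only_pop1 = filter_popmap(popmap, exclude_pops=["Pop2"])
```

Here popmap_inverse matches the dictionary shown in the example, and only_pop1 keeps just Sample1 and Sample2.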
- filename
Name of the PHYLIP file.
- Type:
str
- popmapfile
Name of the population map file.
- Type:
str
- force_popmap
If True, the population map file is required.
- Type:
bool
- exclude_pops
List of populations to exclude.
- Type:
List[str]
- include_pops
List of populations to include.
- Type:
List[str]
- plot_format
Format for saving plots. Default is ‘png’.
- Type:
str
- prefix
Prefix for output files.
- Type:
str
- verbose
If True, status updates are printed.
- Type:
bool
- samples
List of sample IDs.
- Type:
List[str]
- snp_data
List of SNP data.
- Type:
List[List[str]]
- num_inds
Number of individuals.
- Type:
int
- num_snps
Number of SNPs.
- Type:
int
- logger
Logger instance.
- Type:
Logger
- debug
If True, debug messages are printed.
- Type:
bool
- property alt: List[str]
List of alternate alleles.
- property alt2: List[List[str]]
List of second alternate alleles.
- load_aln()
Load the PHYLIP file and populate SNP data, samples, and alleles.
This method reads the PHYLIP file and populates the SNP data, samples, and alleles. The PHYLIP format is a simple text format for representing multiple sequence alignments. The first line of a PHYLIP file contains the number of samples and the number of loci. Each subsequent line contains the sample ID followed by the sequence data. The sequence data can be in any format, but it is typically a string of nucleotides or amino acids.
- Raises:
AlignmentFileNotFoundError – If the PHYLIP file is not found.
AlignmentFormatError – If the PHYLIP file has an invalid format.
- Return type:
None
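The file layout described above (a header with the sample and locus counts, then one line per sample) can be sketched as a small parser. This is an illustration under stated assumptions, not the code behind load_aln: it assumes the relaxed, sequential PHYLIP layout where the ID and the sequence are separated by whitespace, the function name parse_phylip is hypothetical, and ValueError stands in for AlignmentFormatError.

```python
from typing import List, Tuple


def parse_phylip(text: str) -> Tuple[List[str], List[str]]:
    """Parse sequential PHYLIP text into (sample IDs, sequences)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        raise ValueError("empty PHYLIP input")

    # Header: number of samples, then number of loci.
    header = lines[0].split()
    if len(header) != 2 or not all(tok.isdigit() for tok in header):
        raise ValueError("header must hold two integers: samples and loci")
    n_samples, n_loci = int(header[0]), int(header[1])

    samples: List[str] = []
    seqs: List[str] = []
    for ln in lines[1:]:
        parts = ln.split()
        if len(parts) != 2:
            raise ValueError(f"expected 'ID sequence', got: {ln!r}")
        sample_id, seq = parts
        if len(seq) != n_loci:
            raise ValueError(f"{sample_id}: expected {n_loci} sites, got {len(seq)}")
        samples.append(sample_id)
        seqs.append(seq)

    if len(samples) != n_samples:
        raise ValueError(f"header declares {n_samples} samples, found {len(samples)}")
    return samples, seqs


example = "3 4\nSample1 ATTA\nSample2 CGGC\nSample3 ATTA\n"
samples, seqs = parse_phylip(example)
```

Checking both counts against the header catches truncated files and ragged sequences before any downstream processing.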
- property ref: List[str]
List of reference alleles.
- write_phylip(output_file, genotype_data=None, snp_data=None, samples=None)
Write the stored alignment as a PHYLIP file.
This method writes the stored alignment as a PHYLIP file. The PHYLIP format is a simple text format for representing multiple sequence alignments. The first line of a PHYLIP file contains the number of samples and the number of loci. Each subsequent line contains the sample ID followed by the sequence data.
- Parameters:
output_file (str) – Name of the output PHYLIP file.
genotype_data (GenotypeData, optional) – GenotypeData instance.
snp_data (List[List[str]], optional) – SNP data. Must be provided if genotype_data is None.
samples (List[str], optional) – List of sample IDs. Must be provided if snp_data is not None.
- Raises:
TypeError – If genotype_data and snp_data are both provided.
TypeError – If samples are not provided when snp_data is provided.
ValueError – If samples and snp_data are not the same length.
- Return type:
None
Note
If genotype_data is provided, the snp_data and samples are loaded from the GenotypeData instance.
If snp_data is provided, the samples must also be provided.
If genotype_data is not provided, the snp_data and samples must be provided.
The sequence data must have the same length for each sample.
The PHYLIP file must have the correct number of samples and loci.
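The argument rules in the note above can be sketched as follows. This is a hypothetical illustration rather than SNPio's code: resolve_alignment and format_phylip are invented names, a SimpleNamespace stands in for a GenotypeData instance, snp_data is assumed to hold one row per sample, and raising TypeError when neither input is given is an assumption.

```python
from types import SimpleNamespace
from typing import List, Tuple


def resolve_alignment(genotype_data=None, snp_data=None, samples=None) -> Tuple[List[List[str]], List[str]]:
    """Pick the SNP data and sample IDs according to the documented rules."""
    if genotype_data is not None and snp_data is not None:
        raise TypeError("provide genotype_data or snp_data, not both")
    if genotype_data is not None:
        return list(genotype_data.snp_data), list(genotype_data.samples)
    if snp_data is None:
        # Assumption: with no genotype_data, snp_data is mandatory.
        raise TypeError("snp_data is required when genotype_data is None")
    if samples is None:
        raise TypeError("samples must be provided together with snp_data")
    if len(samples) != len(snp_data):
        raise ValueError("samples and snp_data must have the same length")
    return snp_data, samples


def format_phylip(snp_data: List[List[str]], samples: List[str]) -> str:
    """Render the header line followed by one 'ID sequence' line per sample."""
    lengths = {len(row) for row in snp_data}
    if len(lengths) > 1:
        raise ValueError("all sequences must have the same length")
    n_loci = lengths.pop() if lengths else 0
    lines = [f"{len(samples)} {n_loci}"]
    lines += [f"{s} {''.join(row)}" for s, row in zip(samples, snp_data)]
    return "\n".join(lines) + "\n"


gd = SimpleNamespace(
    snp_data=[["A", "T", "T", "A"], ["C", "G", "G", "C"]],
    samples=["Sample1", "Sample2"],
)
phylip_text = format_phylip(*resolve_alignment(genotype_data=gd))
```

Separating input resolution from formatting keeps the TypeError and ValueError checks in one place, independent of how the text is written to disk.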
snpio.io.structure_reader module
- class snpio.io.structure_reader.StructureReader(filename=None, popmapfile=None, has_popids=False, force_popmap=False, exclude_pops=None, include_pops=None, plot_format='png', prefix='snpio', verbose=False, debug=False)
Bases:
GenotypeData
This class reads STRUCTURE files and stores the SNP data, sample IDs, and populations.
StructureReader is a subclass of GenotypeData and inherits its attributes and methods. It reads STRUCTURE files and stores the SNP data, sample IDs, and populations, as well as various other attributes.
Example
>>> from snpio import StructureReader
>>>
>>> genotype_data = StructureReader(filename="data.structure", popmapfile="example.popmap", verbose=True)
>>>
>>> genotype_data.snp_data
array([["A", "T", "T", "A"], ["C", "G", "G", "C"], ["A", "T", "T", "A"]], dtype="<U1")
>>>
>>> genotype_data.samples
["Sample1", "Sample2", "Sample3", "Sample4"]
>>>
>>> genotype_data.populations
["Pop1", "Pop1", "Pop2", "Pop2"]
>>>
>>> genotype_data.num_snps
3
>>>
>>> genotype_data.num_inds
4
>>>
>>> genotype_data.popmap
{"Sample1": "Pop1", "Sample2": "Pop1", "Sample3": "Pop2", "Sample4": "Pop2"}
>>>
>>> genotype_data.popmap_inverse
{"Pop1": ["Sample1", "Sample2"], "Pop2": ["Sample3", "Sample4"]}
>>>
>>> genotype_data.ref
["A", "C", "A"]
>>>
>>> genotype_data.alt
["T", "G", "T"]
>>>
>>> genotype_data.missingness_reports()
>>>
>>> genotype_data.run_pca()
>>>
>>> genotype_data.write_structure("output.str")
- logger
Logger object.
- Type:
LoggerManager
- verbose
If True, status updates are printed.
- Type:
bool
- debug
If True, debug messages are printed.
- Type:
bool
- _has_popids
If True, the STRUCTURE file includes population IDs.
- Type:
bool
- _onerow
If True, the STRUCTURE file is in one-row format.
- Type:
bool
- load_aln()
Load the STRUCTURE file and populate SNP data, samples, and populations.
This method reads the STRUCTURE file and populates the SNP data, sample IDs, and populations. It also sets the number of SNPs and individuals.
- Raises:
AlignmentFileNotFoundError – If the STRUCTURE file is not found.
AlignmentFormatError – If the STRUCTURE file has an invalid format.
- Return type:
None
Note
This method should be called after initializing the StructureReader object.
The SNP data, sample IDs, and populations are stored in the snp_data, samples, and populations attributes, respectively.
The number of SNPs and individuals are stored in the num_snps and num_inds attributes, respectively.
The STRUCTURE file can be written to a new file using the write_structure method.
The SNP data can be accessed in the IUPAC format using the snp_data attribute.
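To make the role of the _has_popids and _onerow flags concrete, here is a rough sketch of reading the two STRUCTURE layouts. It is not the library's parser, and the file conventions are assumptions: whitespace-delimited columns, sample ID first, an optional population ID second, integer allele codes with -9 for missing data, two consecutive rows per sample in two-row format, and two consecutive columns per locus in one-row format. The function name parse_structure is hypothetical.

```python
from typing import List, Optional, Tuple

Genotypes = List[Tuple[int, int]]


def parse_structure(
    text: str, has_popids: bool = False, onerow: bool = False
) -> Tuple[List[str], List[Optional[str]], List[Genotypes]]:
    """Return sample IDs, population IDs, and per-locus allele pairs."""
    first = 2 if has_popids else 1  # index of the first allele column
    rows = [ln.split() for ln in text.splitlines() if ln.strip()]
    samples: List[str] = []
    pops: List[Optional[str]] = []
    genotypes: List[Genotypes] = []

    if onerow:
        for row in rows:
            alleles = [int(a) for a in row[first:]]
            if len(alleles) % 2:
                raise ValueError(f"{row[0]}: odd number of allele columns")
            samples.append(row[0])
            pops.append(row[1] if has_popids else None)
            # Columns 0,1 are locus 1; columns 2,3 are locus 2; and so on.
            genotypes.append(list(zip(alleles[0::2], alleles[1::2])))
    else:
        if len(rows) % 2:
            raise ValueError("two-row format needs an even number of rows")
        for top, bottom in zip(rows[0::2], rows[1::2]):
            if top[0] != bottom[0]:
                raise ValueError(f"row pair mismatch: {top[0]} vs {bottom[0]}")
            a1 = [int(a) for a in top[first:]]
            a2 = [int(a) for a in bottom[first:]]
            if len(a1) != len(a2):
                raise ValueError(f"{top[0]}: rows differ in length")
            samples.append(top[0])
            pops.append(top[1] if has_popids else None)
            genotypes.append(list(zip(a1, a2)))
    return samples, pops, genotypes


two_row = "Sample1 1 1 4 -9\nSample1 1 1 2 -9\nSample2 1 3 4 2\nSample2 1 3 4 2\n"
one_row = "Sample1 1 1 1 4 2 -9 -9\nSample2 1 3 3 4 4 2 2\n"
parsed_two = parse_structure(two_row, has_popids=True, onerow=False)
parsed_one = parse_structure(one_row, has_popids=True, onerow=True)
```

Both inputs encode the same two samples, so parsed_two and parsed_one compare equal.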
- write_structure(output_file, genotype_data=None, snp_data=None, samples=None, verbose=False)
Write the stored alignment as a STRUCTURE file.
This method writes the stored alignment as a STRUCTURE file. If genotype_data is provided, the SNP data and sample IDs are extracted from it. Otherwise, the SNP data and sample IDs must be provided.
- Parameters:
output_file (str) – Name of the output STRUCTURE file.
genotype_data (GenotypeData, optional) – GenotypeData instance.
snp_data (List[List[str]], optional) – SNP data in IUPAC format. Must be provided if genotype_data is None.
samples (List[str], optional) – List of sample IDs. Must be provided if snp_data is provided.
verbose (bool, optional) – If True, status updates are printed.
- Raises:
TypeError – If genotype_data and snp_data are both provided.
TypeError – If samples are not provided when snp_data is provided.
- Return type:
None
snpio.io.vcf_reader module
- class snpio.io.vcf_reader.VCFReader(filename=None, popmapfile=None, chunk_size=1000, force_popmap=False, exclude_pops=None, include_pops=None, plot_format='png', plot_fontsize=18, plot_dpi=300, plot_despine=True, show_plots=False, prefix='snpio', verbose=False, sample_indices=None, loci_indices=None, debug=False)
Bases:
GenotypeData
A class to read VCF files into GenotypeData objects and write GenotypeData objects to VCF files.
This class inherits from GenotypeData and provides methods to read VCF files and extract the necessary attributes.
Example
>>> from snpio import VCFReader
>>>
>>> genotype_data = VCFReader(filename="example.vcf", popmapfile="popmap.txt", verbose=True)
>>> genotype_data.snp_data
array([["A", "T", "T", "A"], ["A", "T", "T", "A"], ["A", "T", "T", "A"]], dtype="<U1")
>>>
>>> genotype_data.samples
["sample1", "sample2", "sample3", "sample4"]
>>>
>>> genotype_data.num_inds
4
>>>
>>> genotype_data.num_snps
3
>>>
>>> genotype_data.populations
["pop1", "pop1", "pop2", "pop2"]
>>>
>>> genotype_data.popmap
{"sample1": "pop1", "sample2": "pop1", "sample3": "pop2", "sample4": "pop2"}
>>>
>>> genotype_data.popmap_inverse
{"pop1": ["sample1", "sample2"], "pop2": ["sample3", "sample4"]}
>>>
>>> genotype_data.loci_indices
array([True, True, True], dtype=bool)
>>>
>>> genotype_data.sample_indices
array([True, True, True, True], dtype=bool)
>>>
>>> genotype_data.ref
["A", "A", "A"]
>>>
>>> genotype_data.alt
["T", "T", "T"]
>>>
>>> genotype_data.missingness_reports()
>>>
>>> genotype_data.run_pca()
>>>
>>> genotype_data.write_vcf("output.vcf")
- filename
The name of the VCF file to read.
- Type:
Optional[str]
- popmapfile
The name of the population map file to read.
- Type:
Optional[str]
- chunk_size
The size of the chunks to read from the VCF file.
- Type:
int
- force_popmap
Whether to force the use of the population map file.
- Type:
bool
- exclude_pops
The populations to exclude.
- Type:
Optional[List[str]]
- include_pops
The populations to include.
- Type:
Optional[List[str]]
- plot_format
The format to save the plots in.
- Type:
str
- plot_fontsize
The font size for the plots.
- Type:
int
- plot_dpi
The DPI for the plots.
- Type:
int
- plot_despine
Whether to remove the spines from the plots.
- Type:
bool
- show_plots
Whether to show the plots.
- Type:
bool
- prefix
The prefix to use for the output files.
- Type:
str
- verbose
Whether to print verbose output.
- Type:
bool
- sample_indices
The indices of the samples to read.
- Type:
np.ndarray
- loci_indices
The indices of the loci to read.
- Type:
np.ndarray
- debug
Whether to enable debug mode.
- Type:
bool
- num_records
The number of records in the VCF file.
- Type:
int
- filetype
The type of the file.
- Type:
str
- vcf_header
The VCF header.
- Type:
Optional[pysam.libcbcf.VariantHeader]
- info_fields
The VCF info fields.
- Type:
Optional[List[str]]
- resource_data
A dictionary to store resource data.
- Type:
dict
- logger
The logger object.
- Type:
logging.Logger
- snp_data
The SNP data.
- Type:
np.ndarray
- samples
The sample names.
- Type:
np.ndarray
Note
If necessary, the VCF file is bgzipped, sorted, and indexed with Tabix to ensure efficient reading.
The VCF file is read in chunks to avoid memory issues.
The VCF attributes are extracted and stored in an HDF5 file for efficient access.
The genotype data is transformed into IUPAC codes for efficient storage and processing.
- get_vcf_attributes(vcf, chunk_size=1000)
Extracts VCF attributes and returns them in an efficient manner with chunked processing.
- Parameters:
vcf (pysam.VariantFile) – The VCF file object to extract attributes from.
chunk_size (int, optional) – The size of the chunks to process. Defaults to 1000.
- Returns:
The path to the HDF5 file containing the VCF attributes, the SNP data, and the sample names.
- Return type:
Tuple[str, np.ndarray, np.ndarray]
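The chunked processing controlled by chunk_size can be pictured with a generic batching generator. This is a sketch of the general technique, not the method's actual code: iter_chunks is a hypothetical helper, and a plain range stands in for the records that would come from iterating over the VCF file object.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_chunks(records: Iterable[T], chunk_size: int = 1000) -> Iterator[List[T]]:
    """Yield successive lists of at most chunk_size records."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be a positive integer")
    it = iter(records)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


# 2500 stand-in records are processed as two full chunks and one partial chunk.
sizes = [len(chunk) for chunk in iter_chunks(range(2500), chunk_size=1000)]
```

Only one chunk is held in memory at a time, which is the property that lets large VCF files be processed without loading every record at once.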
- load_aln()
Loads the alignment from the VCF file into the VCFReader object.
This method ensures that the input VCF file is bgzipped, sorted, and indexed using Tabix. It then reads the VCF file and extracts the necessary attributes. The VCF attributes are stored in an HDF5 file for efficient access. The genotype data is transformed into IUPAC codes for efficient storage and processing.
- Return type:
None
- transform_gt(gt, ref, alts)
Transforms genotype tuples into their IUPAC codes or corresponding strings.
- Parameters:
gt (np.ndarray) – The genotype tuples to transform.
ref (str) – The reference allele.
alts (List[str]) – The alternate alleles.
- Returns:
The transformed genotype tuples.
- Return type:
np.ndarray
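As an illustration of what encoding a genotype as an IUPAC code means, the sketch below maps a single diploid genotype to one character using the standard IUPAC nucleotide ambiguity codes. It is not the transform_gt implementation, which operates on arrays: genotype_to_iupac is a hypothetical name, and returning "N" for missing calls, out-of-range allele indices, and multi-base alleles is an assumption.

```python
from typing import List, Optional, Tuple

# Standard IUPAC codes for single bases and two-base ambiguities.
IUPAC_CODES = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
}


def genotype_to_iupac(
    gt: Tuple[Optional[int], Optional[int]], ref: str, alts: List[str]
) -> str:
    """Encode one diploid genotype (allele indices) as an IUPAC character."""
    alleles = [ref] + list(alts)  # index 0 is REF, 1.. are ALT alleles
    if any(i is None or i < 0 or i >= len(alleles) for i in gt):
        return "N"  # assumption: missing or invalid calls become "N"
    bases = frozenset(alleles[i] for i in gt)
    # Multi-base alleles (for example indels) have no code here and map to "N".
    return IUPAC_CODES.get(bases, "N")


hom_ref = genotype_to_iupac((0, 0), "A", ["T"])
het = genotype_to_iupac((0, 1), "A", ["T"])
hom_alt = genotype_to_iupac((1, 1), "A", ["T"])
missing = genotype_to_iupac((None, None), "A", ["T"])
multi_alt = genotype_to_iupac((1, 2), "A", ["G", "C"])
```

A heterozygous A/T call becomes "W", so each genotype fits in one character, which is why the resulting array can use a one-character string dtype.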
- update_vcf_attributes(snp_data, sample_indices, loci_indices, samples)
Updates the VCF attributes with new data in chunks.
- Parameters:
snp_data (np.ndarray) – The SNP data to update the VCF attributes with.
sample_indices (np.ndarray) – The indices of the samples to update.
loci_indices (np.ndarray) – The indices of the loci to update.
samples (np.ndarray) – The sample names to update.
- Raises:
FileNotFoundError – If the VCF attributes file is not found.
- Return type:
None
- validate_input_vcf(filepath)
Validates the input VCF file to ensure it meets required criteria.
- Parameters:
filepath (Path) – The path to the bgzipped and sorted VCF file.
- Raises:
ValueError – If the VCF file does not meet validation criteria.
- Return type:
None
- validate_output_vcf(filepath)
Validates the output VCF file to ensure it was written correctly.
- Parameters:
filepath (Path) – The path to the bgzipped and indexed output VCF file.
- Raises:
ValueError – If the VCF file does not meet validation criteria.
- Return type:
None
- property vcf_attributes_fn: str
The path to the HDF5 file containing the VCF attributes.
- Returns:
The path to the HDF5 file containing the VCF attributes.
- Return type:
str
- write_vcf(output_filename, hdf5_file_path=None, chunk_size=1000)
Writes the GenotypeData object data to a VCF file in chunks.
This method writes the VCF data, bgzips the output file, indexes it with Tabix, and validates the output.
- Parameters:
output_filename (str) – The name of the output VCF file to write.
hdf5_file_path (str, optional) – The path to the HDF5 file containing the VCF attributes. Defaults to None.
chunk_size (int, optional) – The size of the chunks to read from the HDF5 file. Defaults to 1000.
- Returns:
The current instance of VCFReader.
- Return type: