snpio.utils package

Submodules

snpio.utils.misc module

snpio.utils.misc.get_gt2iupac()

Get a dictionary of genotype to IUPAC ambiguity codes.

Return type:

Dict[str, str]

snpio.utils.misc.get_int_iupac_dict()

Get a dictionary of IUPAC ambiguity codes to integers.

Return type:

Dict[str, int]

snpio.utils.misc.get_iupac2gt()

Get a dictionary of IUPAC ambiguity codes to genotype.

Return type:

Dict[str, str]
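
To illustrate how the genotype and IUPAC mappings relate, here is a minimal sketch. The dictionary contents, the "/" separator, and the names GT2IUPAC_SKETCH, IUPAC2GT_SKETCH, and encode_genotype are assumptions made for illustration; the dictionaries actually returned by get_gt2iupac() and get_iupac2gt() may use different keys. Only the IUPAC codes themselves are standard.

```python
# Hypothetical mapping for illustration; the real dictionary returned by
# get_gt2iupac() may use different keys (for example, a different separator).
GT2IUPAC_SKETCH = {
    "A/A": "A", "C/C": "C", "G/G": "G", "T/T": "T",
    "A/G": "R", "C/T": "Y", "C/G": "S",
    "A/T": "W", "G/T": "K", "A/C": "M",
}


def encode_genotype(gt: str) -> str:
    """Return the IUPAC code for a diploid genotype, ignoring allele order."""
    parts = sorted(gt.upper().split("/"))
    if len(parts) != 2:
        return "N"  # assumption: malformed genotypes are treated as missing
    # Keys in the sketch are stored in alphabetical allele order.
    return GT2IUPAC_SKETCH.get("/".join(parts), "N")


# Inverting the mapping gives the IUPAC-to-genotype direction.
IUPAC2GT_SKETCH = {code: gt for gt, code in GT2IUPAC_SKETCH.items()}

print(encode_genotype("G/A"))  # R
print(IUPAC2GT_SKETCH["Y"])    # C/T
```

Because each unordered genotype maps to exactly one code in this sketch, the inverse dictionary is well defined.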

snpio.utils.misc.get_onehot_dict()

Get a dictionary of IUPAC ambiguity codes to one-hot encoded vectors.

Return type:

Dict[str, List[float]]
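
The following sketch shows one plausible way a table of IUPAC codes to one-hot vectors could be built. The column order (A, C, G, T), the even split of weight for heterozygous codes, the all-zero vector for N, and the name build_onehot_sketch are all assumptions; the values returned by get_onehot_dict() may differ.

```python
from typing import Dict, List

BASES = ("A", "C", "G", "T")  # assumed column order

# Bases that each IUPAC code can stand for (standard IUPAC definitions).
IUPAC_BASES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
}


def build_onehot_sketch() -> Dict[str, List[float]]:
    """Spread each code's weight evenly across the bases it can represent."""
    table: Dict[str, List[float]] = {}
    for code, bases in IUPAC_BASES.items():
        table[code] = [1.0 / len(bases) if b in bases else 0.0 for b in BASES]
    table["N"] = [0.0, 0.0, 0.0, 0.0]  # assumption: missing data is all zeros
    return table


onehot = build_onehot_sketch()
print(onehot["A"])  # [1.0, 0.0, 0.0, 0.0]
print(onehot["R"])  # [0.5, 0.0, 0.5, 0.0]
```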

snpio.utils.misc.validate_input_type(X, return_type='array')

Validates the input type and returns it as a specified type.

This function checks if the input X is a pandas DataFrame, numpy array, or a list of lists. It then converts X to the specified return_type and returns it.

Parameters:
  • X (pandas.DataFrame, numpy.ndarray, or List[List[int]]) – The input data to validate and convert.

  • return_type (str, optional) – The type of the returned object. Supported options include: “df” (DataFrame), “array” (numpy array), and “list”. Defaults to “array”.

Returns:

The input data converted to the desired return type.

Return type:

pandas.DataFrame, numpy.ndarray, or List[List[int]]

Raises:
  • TypeError – If X is not of type pandas.DataFrame, numpy.ndarray, or List[List[int]].

  • ValueError – If an unsupported return_type is provided. Supported types are “df”, “array”, and “list”.

Example

>>> X = [[1, 2, 3], [4, 5, 6]]
>>> print(validate_input_type(X, "df"))
>>> # Outputs: a DataFrame with the data from `X`.

snpio.utils.sequence_tools module

snpio.utils.sequence_tools.count_alleles(l, vcf=False)

Counts the total number of unique alleles in a list of genotypes.

This function takes a list of IUPAC or VCF-style (e.g. 0/1) genotypes and returns the total number of unique alleles. The genotypes can be in VCF or STRUCTURE-style format.

Parameters:
  • l (List[str]) – A list of IUPAC or VCF-style genotypes.

  • vcf (bool, optional) – If True, the genotypes are in VCF format. If False, the genotypes are in STRUCTURE-style format. Defaults to False.

Returns:

The total number of unique alleles in the list.

Return type:

int

Example

>>> l = ['A/A', 'A/T', 'T/T', 'A/A', 'A/T']
>>> print(count_alleles(l, vcf=True))
>>> # Outputs: 2

Note

The function removes any instances of “-9”, “-”, “N”, -9, “.”, “?” before counting the alleles.
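
The counting logic described above can be sketched roughly as follows. This is not the library's implementation: the name count_unique_alleles, the sep parameter, and the restriction to separator-delimited genotypes (no IUPAC expansion) are assumptions made to keep the example short.

```python
from typing import Sequence

# String forms of the missing-data markers listed in the note above.
MISSING = {"-9", "-", "N", ".", "?"}


def count_unique_alleles(genotypes: Sequence, sep: str = "/") -> int:
    """Count distinct alleles, skipping missing-data markers."""
    alleles = set()
    for gt in genotypes:
        # str() lets the integer -9 be handled like the string "-9".
        for allele in str(gt).split(sep):
            if allele not in MISSING:
                alleles.add(allele)
    return len(alleles)


print(count_unique_alleles(['A/A', 'A/T', 'T/T', 'A/A', 'A/T']))  # 2
print(count_unique_alleles(['A/T', './.', -9, 'N/N']))            # 2
```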

snpio.utils.sequence_tools.get_iupac_caseless(char)

Splits an IUPAC code into its two constituent bases, assuming diploidy.

Invalid ambiguity codes are returned as N.

Parameters:

char (str) – Base to expand into diploid list.

Returns:

List of the two expanded alleles.

Return type:

List[str]
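
A minimal sketch of this kind of expansion is shown below. The name expand_iupac_sketch, the treatment of the gap character, the case-preserving behavior, and the decision to return N for three-base codes such as B are assumptions for illustration and may not match get_iupac_caseless exactly.

```python
from typing import List

# Two-allele expansions of the diploid-compatible IUPAC codes.
IUPAC_TO_ALLELES = {
    "A": ["A", "A"], "C": ["C", "C"], "G": ["G", "G"], "T": ["T", "T"],
    "R": ["A", "G"], "Y": ["C", "T"], "S": ["G", "C"],
    "W": ["A", "T"], "K": ["G", "T"], "M": ["A", "C"],
    "N": ["N", "N"], "-": ["-", "-"],
}


def expand_iupac_sketch(char: str) -> List[str]:
    """Expand one IUPAC character into two alleles, keeping the input case."""
    # Anything not in the table (including three-base codes) becomes N/N.
    alleles = IUPAC_TO_ALLELES.get(char.upper(), ["N", "N"])
    if char.islower():
        return [a.lower() for a in alleles]
    return list(alleles)


print(expand_iupac_sketch("R"))  # ['A', 'G']
print(expand_iupac_sketch("y"))  # ['c', 't']
print(expand_iupac_sketch("B"))  # ['N', 'N']
```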

snpio.utils.sequence_tools.get_major_allele(l, num=None, vcf=False)

Returns the most common alleles in a list.

This function takes a list of genotypes for one sample and returns the most common alleles in descending order. The alleles can be in VCF or STRUCTURE-style format.

Parameters:
  • l (List[str]) – A list of genotypes for one sample.

  • num (int, optional) – The number of elements to return. If None, all elements are returned. Defaults to None.

  • vcf (bool, optional) – If True, the alleles are in VCF format. If False, the alleles are in STRUCTURE-style format. Defaults to False.

Returns:

The most common alleles in descending order.

Return type:

List[str]

Example

>>> l = ['A/A', 'A/T', 'T/T', 'A/A', 'A/T']
>>> print(get_major_allele(l, vcf=True))  # Outputs: ['A', 'T']

Note

The function uses the Counter class from the collections module to count the occurrences of each allele.
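
As a rough illustration of the Counter-based approach mentioned in the note, the sketch below tallies alleles from separator-delimited genotypes. The name most_common_alleles and the sep parameter are hypothetical, and missing-data handling is omitted.

```python
from collections import Counter
from typing import List, Optional, Sequence


def most_common_alleles(
    genotypes: Sequence[str], num: Optional[int] = None, sep: str = "/"
) -> List[str]:
    """Return alleles sorted from most to least frequent."""
    counts = Counter(
        allele for gt in genotypes for allele in gt.split(sep)
    )
    # most_common(None) returns every allele; most_common(n) the top n.
    return [allele for allele, _ in counts.most_common(num)]


l = ['A/A', 'A/T', 'T/T', 'A/A', 'A/T']
print(most_common_alleles(l))         # ['A', 'T']
print(most_common_alleles(l, num=1))  # ['A']
```

In this input, A occurs six times and T four times, so A is listed first.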

snpio.utils.sequence_tools.get_revComp_caseless(char)

Returns the reverse complement of a nucleotide, while preserving case.

This function takes a nucleotide character and returns its complement according to the standard DNA base pairing rules; for a single character, the reverse complement and the complement are the same. It also handles IUPAC ambiguity codes. The case of the input character is preserved in the output.

Parameters:

char (str) – The nucleotide character to be reverse complemented. Can be uppercase or lowercase.

Returns:

The reverse complement of the input character, with the same case.

Return type:

str

Example

>>> char = 'a'
>>> print(get_revComp_caseless(char))
>>> # Outputs: 't'

Note

The function supports the following IUPAC ambiguity codes: R (A/G), Y (C/T), S (G/C), W (A/T), K (G/T), M (A/C), B (C/G/T), D (A/G/T), H (A/C/T), V (A/C/G). It also supports N (any base) and - (gap).
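
A case-preserving complement lookup along these lines could be sketched as below. The names COMPLEMENT, complement_keep_case, and reverse_complement are hypothetical, and passing unknown characters through unchanged is an assumption. The complement pairs themselves follow the standard IUPAC definitions.

```python
# Complement of each base and IUPAC ambiguity code.
COMPLEMENT = {
    "A": "T", "T": "A", "C": "G", "G": "C",
    "R": "Y", "Y": "R", "S": "S", "W": "W", "K": "M", "M": "K",
    "B": "V", "V": "B", "D": "H", "H": "D",
    "N": "N", "-": "-",
}


def complement_keep_case(char: str) -> str:
    """Complement a single character, keeping upper or lower case."""
    # Assumption: characters outside the table are returned unchanged.
    comp = COMPLEMENT.get(char.upper(), char)
    return comp.lower() if char.islower() else comp


def reverse_complement(seq: str) -> str:
    """Apply the per-character complement to a whole sequence, reversed."""
    return "".join(complement_keep_case(c) for c in reversed(seq))


print(complement_keep_case("a"))   # t
print(reverse_complement("ATGr"))  # yCAT
```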

snpio.utils.sequence_tools.remove_items(all_list, bad_list)

Removes items from a list based on another list.

This function takes a list and removes any items that are present in a second list.

Parameters:
  • all_list (List[Any]) – The list from which items are to be removed.

  • bad_list (List[Any]) – The list containing items to be removed from the first list.

Returns:

The first list with any items present in the second list removed.

Return type:

List[Any]

Example

>>> all_list = ['a', 'b', 'c', 'd']
>>> bad_list = ['b', 'd']
>>> print(remove_items(all_list, bad_list))
>>> # Outputs: ['a', 'c']

snpio.utils.sequence_tools.seqCounter(seq)

Returns a dictionary of character counts in a DNA sequence.

This function takes a DNA sequence and returns a dictionary where the keys are nucleotide characters and the values are their counts in the sequence. It also handles IUPAC ambiguity codes. The function is case-sensitive.

Parameters:

seq (str) – The DNA sequence to be counted.

Returns:

A dictionary where the keys are nucleotide characters and the values are their counts in the sequence. The dictionary also includes a ‘VAR’ key, which is the sum of the counts of all IUPAC ambiguity codes.

Return type:

Dict[str, int]

Example

>>> seq = 'ATGCRYSWKMBDHVN'
>>> print(seqCounter(seq))
{'A': 1, 'N': 1, '-': 0, 'C': 1, 'G': 1, 'T': 1, 'R': 1, 'Y': 1, 'S': 1, 'W': 1, 'K': 1, 'M': 1, 'B': 1, 'D': 1, 'H': 1, 'V': 1, 'VAR': 10}

Note

The function supports the following IUPAC ambiguity codes: R (A/G), Y (C/T), S (G/C), W (A/T), K (G/T), M (A/C), B (C/G/T), D (A/G/T), H (A/C/T), V (A/C/G). It also supports N (any base) and - (gap).
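
The counting behavior described here, including the 'VAR' total, can be approximated with the sketch below. The name seq_counter_sketch and the decision to silently skip untracked characters are assumptions, not the library's code.

```python
from typing import Dict

TRACKED = "AN-CGTRYSWKMBDHV"  # keys shown in the example output above
AMBIGUITY = "RYSWKMBDHV"      # codes summed into 'VAR'


def seq_counter_sketch(seq: str) -> Dict[str, int]:
    """Count tracked characters and add a 'VAR' total for ambiguity codes."""
    counts = {ch: 0 for ch in TRACKED}
    for ch in seq:
        if ch in counts:  # case-sensitive: 'a' is not counted as 'A'
            counts[ch] += 1
    counts["VAR"] = sum(counts[ch] for ch in AMBIGUITY)
    return counts


result = seq_counter_sketch("ATGCRYSWKMBDHVN")
print(result["VAR"])  # 10
print(result["-"])    # 0
```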

snpio.utils.sequence_tools.simplifySeq(seq)

Simplifies a DNA sequence by replacing all nucleotides and IUPAC ambiguity codes with asterisks.

This function takes a DNA sequence and returns a simplified version where all nucleotides (A, C, G, T) and IUPAC ambiguity codes (R, Y, S, W, K, M, B, D, H, V) are replaced with asterisks (*). The function is case-insensitive.

Parameters:

seq (str) – The DNA sequence to be simplified.

Returns:

The simplified sequence, where all nucleotides and IUPAC ambiguity codes are replaced with asterisks (*).

Return type:

str

Example

>>> seq = 'ATGCRYSWKMBDHVN'
>>> print(simplifySeq(seq))
>>> # Outputs: '**************N'

Note

The function supports the following IUPAC ambiguity codes: R (A/G), Y (C/T), S (G/C), W (A/T), K (G/T), M (A/C), B (C/G/T), D (A/G/T), H (A/C/T), V (A/C/G).
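
One way to express the replacement described above is a translation table, sketched below. The name simplify_seq_sketch is hypothetical, and leaving N and the gap character untouched is an assumption based on the list of supported codes; the library's actual behavior may differ.

```python
SYMBOLS = "ACGTRYSWKMBDHV"

# Map every listed symbol, in either case, to an asterisk.
_TABLE = str.maketrans(SYMBOLS + SYMBOLS.lower(), "*" * (2 * len(SYMBOLS)))


def simplify_seq_sketch(seq: str) -> str:
    """Replace nucleotides and listed ambiguity codes with '*'."""
    return seq.translate(_TABLE)


print(simplify_seq_sketch("acgtRY-N"))  # ******-N
```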

Module contents