DnaTokenizer
DnaTokenizer tokenizes raw DNA nucleotide sequences regardless of whether the input is uppercase or lowercase, uses T (thymine) or U (uracil), or includes special tokens. It also supports tokenization into n-mers and codons, so you don't have to write custom code to preprocess your data.
By default, DnaTokenizer uses the standard alphabet. If nmers is greater than 1, or codon is set to True, it will instead use the streamline alphabet.
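As a rough illustration of the case and T/U handling described above, the sketch below uses plain Python and does not call the library. The helper name normalize_dna and its keyword arguments are hypothetical; they only mirror the replace_U_with_T and do_upper_case options, and the real tokenizer may implement or order these steps differently. The sketch also ignores special tokens such as `<cls>`.

```python
def normalize_dna(
    sequence: str,
    replace_u_with_t: bool = True,
    do_upper_case: bool = True,
) -> str:
    """Hypothetical helper: normalize case and map U to T before tokenization."""
    if do_upper_case:
        sequence = sequence.upper()
    if replace_u_with_t:
        # Handle both cases so this step also works when upper-casing is disabled.
        sequence = sequence.replace("U", "T").replace("u", "t")
    return sequence


print(normalize_dna("acgu"))                          # ACGT
print(normalize_dna("acgu", replace_u_with_t=False))  # ACGU
```

With both options enabled, 'acgt' and 'acgu' normalize to the same string, which is consistent with the two inputs producing identical ids in the examples on this page.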
multimolecule.tokenisers.DnaTokenizer
Bases: Tokenizer
Tokenizer for DNA sequences.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
alphabet | Alphabet \| str \| List[str] \| None | Alphabet to use for tokenization. | None |
nmers | int | Size of kmer to tokenize. | 1 |
codon | bool | Whether to tokenize into codons. | False |
replace_U_with_T | bool | Whether to replace U with T. | True |
do_upper_case | bool | Whether to convert input to uppercase. | True |
Examples:
Python Console Session
>>> from multimolecule import DnaTokenizer
>>> tokenizer = DnaTokenizer()
>>> tokenizer('<pad><cls><eos><unk><mask><null>ACGTNXVHDBMRWSYK.*-')["input_ids"]
[1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 2]
>>> tokenizer('acgt')["input_ids"]
[1, 6, 7, 8, 9, 2]
>>> tokenizer('acgu')["input_ids"]
[1, 6, 7, 8, 9, 2]
>>> tokenizer = DnaTokenizer(replace_U_with_T=False)
>>> tokenizer('acgu')["input_ids"]
[1, 6, 7, 8, 3, 2]
>>> tokenizer = DnaTokenizer(nmers=3)
>>> tokenizer('tataaagta')["input_ids"]
[1, 84, 21, 81, 6, 8, 19, 71, 2]
>>> tokenizer = DnaTokenizer(codon=True)
>>> tokenizer('tataaagta')["input_ids"]
[1, 84, 6, 71, 2]
>>> tokenizer('tataaagtaa')["input_ids"]
Traceback (most recent call last):
ValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10
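
The difference between the nmers=3 and codon=True results above comes down to how the sequence is split. The seven ids returned for nmers=3 are consistent with a sliding window that advances one base at a time, and the three ids for codon=True are consistent with non-overlapping triplets. The following is a minimal pure-Python sketch of the two splitting schemes. The helper names split_nmers and split_codons are hypothetical and are not part of the library, and the mapping from tokens to ids is left out.

```python
from typing import List


def split_nmers(sequence: str, n: int) -> List[str]:
    """Overlapping n-mers: a window of width n moved one base at a time."""
    return [sequence[i : i + n] for i in range(len(sequence) - n + 1)]


def split_codons(sequence: str) -> List[str]:
    """Non-overlapping triplets; the length must be a multiple of 3."""
    if len(sequence) % 3 != 0:
        raise ValueError(
            "length of input sequence must be a multiple of 3 for codon "
            f"tokenization, but got {len(sequence)}"
        )
    return [sequence[i : i + 3] for i in range(0, len(sequence), 3)]


print(split_nmers("TATAAAGTA", 3))
# ['TAT', 'ATA', 'TAA', 'AAA', 'AAG', 'AGT', 'GTA']
print(split_codons("TATAAAGTA"))
# ['TAT', 'AAA', 'GTA']
```

The shared tokens TAT, AAA, and GTA line up with the ids 84, 6, and 71 that appear in both outputs of the session above.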