RnaTokenizer
RnaTokenizer is smart: it tokenizes raw RNA nucleotide sequences whether the input is uppercase or lowercase, uses U (uracil) or T (thymine), and comes with or without special tokens. It also supports tokenization into n-mers and codons, so you don't have to write complex code to preprocess your data.
By default, RnaTokenizer uses the standard alphabet. If nmers is greater than 1, or codon is set to True, it will instead use the streamline alphabet.
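The input normalization described above can be pictured with a small standalone sketch. This is not the library's implementation: `normalize_rna` is a hypothetical helper written only for illustration, and it assumes that normalization amounts to upper-casing the sequence and substituting U for T, which is what the `do_upper_case` and `replace_T_with_U` parameters suggest.

```python
def normalize_rna(
    sequence: str,
    do_upper_case: bool = True,
    replace_T_with_U: bool = True,
) -> str:
    """Hypothetical helper, not part of multimolecule.

    Assumes normalization means: optionally upper-case the input,
    then optionally map thymine (T) to uracil (U).
    """
    if do_upper_case:
        sequence = sequence.upper()
    if replace_T_with_U:
        # Handle both cases so the sketch also works when upper-casing is off.
        sequence = sequence.replace("T", "U").replace("t", "u")
    return sequence


print(normalize_rna("acgt"))                          # ACGU
print(normalize_rna("acgt", replace_T_with_U=False))  # ACGT
```

Under these assumptions, 'acgu' and 'acgt' normalize to the same string, which matches the identical input_ids the tokenizer returns for both inputs by default.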
multimolecule.tokenisers.RnaTokenizer
Bases: Tokenizer
Tokenizer for RNA sequences.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
alphabet | Alphabet \| str \| List[str] \| None | Alphabet to use for tokenization. | None |
nmers | int | Size of kmer to tokenize. | 1 |
codon | bool | Whether to tokenize into codons. | False |
replace_T_with_U | bool | Whether to replace T with U. | True |
do_upper_case | bool | Whether to convert input to uppercase. | True |
Examples:
Python Console Session
>>> from multimolecule import RnaTokenizer
>>> tokenizer = RnaTokenizer()
>>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNIXVHDBMRWSYK.*-')["input_ids"]
[1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]
>>> tokenizer('acgu')["input_ids"]
[1, 6, 7, 8, 9, 2]
>>> tokenizer('acgt')["input_ids"]
[1, 6, 7, 8, 9, 2]
>>> tokenizer = RnaTokenizer(replace_T_with_U=False)
>>> tokenizer('acgt')["input_ids"]
[1, 6, 7, 8, 3, 2]
>>> tokenizer = RnaTokenizer(nmers=3)
>>> tokenizer('uagcuuauc')["input_ids"]
[1, 83, 17, 64, 49, 96, 84, 22, 2]
>>> tokenizer = RnaTokenizer(codon=True)
>>> tokenizer('uagcuuauc')["input_ids"]
[1, 83, 49, 22, 2]
>>> tokenizer('uagcuuauca')["input_ids"]
Traceback (most recent call last):
ValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10
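
The token counts in the session above (seven tokens for nmers=3, three for codon=True on the same nine-nucleotide input) are consistent with overlapping windows of stride 1 for n-mers and non-overlapping windows of stride 3 for codons. The sketch below illustrates that difference in plain Python. `split_nmers` and `split_codons` are hypothetical helpers, not functions from multimolecule, and the windowing strategy is inferred from the example outputs rather than taken from the library source.

```python
from typing import List


def split_nmers(sequence: str, n: int) -> List[str]:
    """Hypothetical helper: overlapping n-mers with a stride of 1."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


def split_codons(sequence: str) -> List[str]:
    """Hypothetical helper: non-overlapping triplets with a stride of 3."""
    if len(sequence) % 3 != 0:
        raise ValueError(
            "length of input sequence must be a multiple of 3 "
            f"for codon tokenization, but got {len(sequence)}"
        )
    return [sequence[i:i + 3] for i in range(0, len(sequence), 3)]


print(split_nmers("UAGCUUAUC", 3))
# ['UAG', 'AGC', 'GCU', 'CUU', 'UUA', 'UAU', 'AUC']
print(split_codons("UAGCUUAUC"))
# ['UAG', 'CUU', 'AUC']
```

The codons are the first, fourth, and seventh 3-mers, which lines up with ids 83, 49, and 22 appearing at those positions in the nmers=3 output.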