ProteinTokenizer
ProteinTokenizer converts raw amino acid sequences into tokens. It accepts uppercase or lowercase input, with or without special tokens.
By default, ProteinTokenizer uses the standard alphabet.
multimolecule.tokenisers.ProteinTokenizer
Bases: Tokenizer
Tokenizer for Protein sequences.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
alphabet | `Alphabet \| str \| List[str] \| None` | Alphabet to use for tokenization. | None |
do_upper_case | bool | Whether to convert input to uppercase. | True |
Examples:
Python Console Session
>>> from multimolecule import ProteinTokenizer
>>> tokenizer = ProteinTokenizer()
>>> tokenizer('ACDEFGHIKLMNPQRSTVWYXBZJUO')["input_ids"]
[1, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 2]
>>> tokenizer('<pad><cls><eos><unk><mask><null>.*-')["input_ids"]
[1, 0, 1, 2, 3, 4, 5, 32, 33, 34, 2]
>>> tokenizer('manlgcwmlv')["input_ids"]
[1, 16, 6, 17, 15, 11, 7, 24, 16, 15, 23, 2]
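The ID layout implied by the examples above can be reproduced with a small stand-alone sketch. This is not the library's implementation: the vocabulary order is inferred only from the outputs shown above, and `toy_tokenize`, `VOCAB`, and `TOKEN_PATTERN` are hypothetical names introduced for illustration. It assumes that special tokens occupy IDs 0 to 5, that the 26 residue letters follow in the order used in the first example, and that `.`, `*`, and `-` come last.

```python
import re

# Vocabulary order inferred from the example outputs above (an assumption,
# not read from the library source).
SPECIAL_TOKENS = ["<pad>", "<cls>", "<eos>", "<unk>", "<mask>", "<null>"]
RESIDUES = list("ACDEFGHIKLMNPQRSTVWYXBZJUO")
EXTRA = [".", "*", "-"]
VOCAB = {tok: idx for idx, tok in enumerate(SPECIAL_TOKENS + RESIDUES + EXTRA)}

# Match a bracketed special token such as "<cls>" first, otherwise one character.
TOKEN_PATTERN = re.compile(r"<[a-z]+>|.", re.DOTALL)


def toy_tokenize(sequence, do_upper_case=True):
    """Hypothetical helper: map a protein string to IDs wrapped in <cls>/<eos>."""
    ids = [VOCAB["<cls>"]]
    for token in TOKEN_PATTERN.findall(sequence):
        # Upper-case residue characters only; special tokens stay lowercase.
        if do_upper_case and not token.startswith("<"):
            token = token.upper()
        # Anything outside the vocabulary falls back to <unk>.
        ids.append(VOCAB.get(token, VOCAB["<unk>"]))
    ids.append(VOCAB["<eos>"])
    return ids


print(toy_tokenize("manlgcwmlv"))
# [1, 16, 6, 17, 15, 11, 7, 24, 16, 15, 23, 2]
print(toy_tokenize("<pad><cls><eos><unk><mask><null>.*-"))
# [1, 0, 1, 2, 3, 4, 5, 32, 33, 34, 2]
```

In this sketch, passing `do_upper_case=False` sends lowercase residues to `<unk>`, since only uppercase letters are in the toy vocabulary. Whether the real tokenizer behaves the same way is not confirmed by the examples above.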