
ProteinTokenizer

ProteinTokenizer is smart: it tokenizes raw amino acids into tokens regardless of whether the input is uppercase or lowercase, and with or without special tokens.

By default, ProteinTokenizer uses the standard alphabet.
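
As a quick illustration (a sketch, not taken from the upstream docs), the default do_upper_case=True means case does not affect the output:

Python Console Session
>>> from multimolecule import ProteinTokenizer
>>> tokenizer = ProteinTokenizer()
>>> tokenizer('MANLGCWMLV')["input_ids"] == tokenizer('manlgcwmlv')["input_ids"]
True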

multimolecule.tokenisers.ProteinTokenizer

Bases: Tokenizer

Tokenizer for Protein sequences.

Parameters:

Name Type Description Default
alphabet Alphabet | str | List[str] | None

Alphabet to use for tokenization.

  • If `None`, the standard protein alphabet will be used.
  • If a string, it should correspond to the name of a predefined alphabet (see the example following this parameter list). The options include:
    • standard
    • iupac
    • streamline
  • If an `Alphabet` or a list of characters, that specific alphabet will be used.
None
do_upper_case bool

Whether to convert input to uppercase.

True
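
For instance, a predefined alphabet can be selected by name (a minimal sketch; the exact vocabulary each name yields is defined by the library's alphabet module):

Python Console Session
>>> from multimolecule import ProteinTokenizer
>>> tokenizer = ProteinTokenizer(alphabet='iupac')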

Examples:

Python Console Session
>>> from multimolecule import ProteinTokenizer
>>> tokenizer = ProteinTokenizer()
>>> tokenizer('ACDEFGHIKLMNPQRSTVWYXBZJUO')["input_ids"]
[1, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 2]
>>> tokenizer('<pad><cls><eos><unk><mask><null>.*-')["input_ids"]
[1, 0, 1, 2, 3, 4, 5, 32, 33, 34, 2]
>>> tokenizer('manlgcwmlv')["input_ids"]
[1, 16, 6, 17, 15, 11, 7, 24, 16, 15, 23, 2]
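
Since model_input_names also lists attention_mask (see the source below), each call returns a mask as well. Assuming the standard Hugging Face convention of 1 for non-padded positions, a 10-residue sequence yields 12 ones (the residues plus <cls> and <eos>):

Python Console Session
>>> tokenizer('MANLGCWMLV')["attention_mask"]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]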
Source code in multimolecule/tokenisers/protein/tokenization_protein.py
Python
class ProteinTokenizer(Tokenizer):
    """
    Tokenizer for Protein sequences.

    Args:
        alphabet: Alphabet to use for tokenization.

            - If `None`, the standard protein alphabet will be used.
            - If a `string`, it should correspond to the name of a predefined alphabet. The options include:
                + `standard`
                + `iupac`
                + `streamline`
            - If an `Alphabet` or a list of characters, that specific alphabet will be used.
        do_upper_case: Whether to convert input to uppercase.

    Examples:
        >>> from multimolecule import ProteinTokenizer
        >>> tokenizer = ProteinTokenizer()
        >>> tokenizer('ACDEFGHIKLMNPQRSTVWYXBZJUO')["input_ids"]
        [1, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 2]
        >>> tokenizer('<pad><cls><eos><unk><mask><null>.*-')["input_ids"]
        [1, 0, 1, 2, 3, 4, 5, 32, 33, 34, 2]
        >>> tokenizer('manlgcwmlv')["input_ids"]
        [1, 16, 6, 17, 15, 11, 7, 24, 16, 15, 23, 2]
    """

    model_input_names = ["input_ids", "attention_mask"]

    def __init__(
        self,
        alphabet: Alphabet | str | List[str] | None = None,
        additional_special_tokens: List | Tuple | None = None,
        do_upper_case: bool = True,
        **kwargs,
    ):
        if not isinstance(alphabet, Alphabet):
            alphabet = get_alphabet(alphabet)
        super().__init__(
            alphabet=alphabet,
            additional_special_tokens=additional_special_tokens,
            do_upper_case=do_upper_case,
            **kwargs,
        )

    def _tokenize(self, text: str, **kwargs):
        # Tokenization is character-level: each amino acid code (and any
        # single-character special symbol) becomes one token.
        if self.do_upper_case:
            text = text.upper()
        return list(text)
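
To make the id layout in the examples concrete, here is a minimal pure-Python sketch (not part of the library; the vocabulary order is reconstructed from the example outputs above, and multi-character special tokens such as <pad> are not handled):

Python
# Vocabulary reconstructed from the example outputs: special tokens occupy
# ids 0-5, amino acids take 6-31, and '.', '*', '-' take 32-34.
special_tokens = ["<pad>", "<cls>", "<eos>", "<unk>", "<mask>", "<null>"]
amino_acids = list("ACDEFGHIKLMNPQRSTVWYXBZJUO")
extras = [".", "*", "-"]
vocab = {token: index for index, token in enumerate(special_tokens + amino_acids + extras)}


def encode(text: str, do_upper_case: bool = True) -> list:
    """Character-level tokenization wrapped in <cls> ... <eos>."""
    if do_upper_case:
        text = text.upper()
    ids = [vocab.get(char, vocab["<unk>"]) for char in text]
    return [vocab["<cls>"], *ids, vocab["<eos>"]]


assert encode("manlgcwmlv") == [1, 16, 6, 17, 15, 11, 7, 24, 16, 15, 23, 2]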