RNA Alphabet

MultiMolecule provides a set of predefined alphabets for tokenization.

Standard Alphabet

The standard alphabet is an extended version of the IUPAC alphabet. It adds three symbols to the IUPAC alphabet: I, X, and *.

  • I: Inosine; a post-transcriptional modification rather than a standard RNA base. Inosine results from the deamination of adenosine, a reaction catalyzed by adenosine deaminases acting on tRNAs (ADATs).
  • X: Any base; slightly different from N, which represents an unknown base. In automatic word embedding conversion, X is initialized as the mean of A, C, G, and U, while N is not processed further.
  • *: Not used in MultiMolecule; reserved for future use.
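
The difference between X and N during embedding conversion can be sketched as follows. This is only an illustration of "mean of A, C, G, and U", not MultiMolecule's actual implementation: the 3-dimensional vectors, the `embeddings` table, and the `mean_embedding` helper are all hypothetical.

```python
from statistics import fmean

# Hypothetical 3-dimensional embeddings for the four canonical bases.
embeddings = {
    "A": [1.0, 0.0, 0.0],
    "C": [0.0, 1.0, 0.0],
    "G": [0.0, 0.0, 1.0],
    "U": [1.0, 1.0, 1.0],
}


def mean_embedding(table, symbols):
    """Average the embedding vectors of `symbols`, one dimension at a time."""
    vectors = [table[s] for s in symbols]
    return [fmean(column) for column in zip(*vectors)]


# X ("any base") is derived from the four canonical bases.
embeddings["X"] = mean_embedding(embeddings, "ACGU")

# N ("unknown base") is deliberately left alone: no entry is derived for it.
print(embeddings["X"])
```

With these toy vectors, X lands at the centroid of the four bases, while N keeps whatever value (if any) it already had.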

gap

Note that we use . to represent a gap in the sequence.

While - exists in the standard alphabet, it is not used in MultiMolecule and is reserved for future use.

Standard Code Represents
A Adenine
C Cytosine
G Guanine
U Uracil
N Unknown
I Inosine
X Any
V A, C, or G
H A, C, or U
D A, G, or U
B C, G, or U
M A or C
R A or G
W A or U
S C or G
Y C or U
K G or U
. Gap
* Not Used
- Not Used

IUPAC Alphabet

The IUPAC nucleotide code is a standard nucleotide code proposed by the International Union of Pure and Applied Chemistry (IUPAC) to represent RNA sequences.

In addition to the symbols of the streamline alphabet, it contains 10 symbols that represent ambiguity in the nucleotide sequence and 1 symbol that represents a gap.

IUPAC Code Represents
A Adenine
C Cytosine
G Guanine
U Uracil
R A or G
Y C or U
S C or G
W A or U
K G or U
M A or C
B C, G, or U
D A, G, or U
H A, C, or U
V A, C, or G
N A, C, G, or U
. Gap

Note that we use . to represent a gap in the sequence.
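
As a rough illustration of how the ambiguity codes in the table can be used, the sketch below checks whether a concrete RNA sequence is compatible with an IUPAC pattern. The `IUPAC_RNA` mapping is transcribed from the table above; the `matches` helper is a hypothetical name and is not part of MultiMolecule.

```python
# Each IUPAC code mapped to the bases it may stand for (from the table above).
IUPAC_RNA = {
    "A": "A", "C": "C", "G": "G", "U": "U",
    "R": "AG", "Y": "CU", "S": "CG", "W": "AU", "K": "GU", "M": "AC",
    "B": "CGU", "D": "AGU", "H": "ACU", "V": "ACG",
    "N": "ACGU",
}
GAP = "."


def matches(pattern: str, sequence: str) -> bool:
    """Return True if every base in `sequence` is allowed by `pattern`.

    A gap only matches a gap. A pattern symbol outside the IUPAC
    alphabet raises KeyError.
    """
    if len(pattern) != len(sequence):
        return False
    for code, base in zip(pattern.upper(), sequence.upper()):
        if code == GAP or base == GAP:
            if code != base:
                return False
        elif base not in IUPAC_RNA[code]:
            return False
    return True


print(matches("GRY", "GAC"))  # R allows A, Y allows C
print(matches("GRY", "GCC"))  # R does not allow C
```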

Streamline Alphabet

The streamline alphabet adds one symbol to the nucleobase alphabet: N, which represents an unknown nucleobase.

IUPAC Code Nucleotide
A Adenine
C Cytosine
G Guanine
U Uracil
N Unknown
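
One way to picture the relationship between the IUPAC and streamline alphabets is to collapse every ambiguity code to N. The sketch below does this under stated assumptions: `to_streamline` is a hypothetical helper, and whether MultiMolecule performs such a conversion, or how it treats gaps, is not specified here, so any other symbol is rejected.

```python
CANONICAL = set("ACGU")
# IUPAC ambiguity codes, all of which collapse to N in this sketch.
AMBIGUOUS = set("RYSWKMBDHVN")


def to_streamline(sequence: str) -> str:
    """Rewrite an IUPAC RNA sequence using only A, C, G, U, and N."""
    result = []
    for symbol in sequence.upper():
        if symbol in CANONICAL:
            result.append(symbol)
        elif symbol in AMBIGUOUS:
            result.append("N")
        else:
            # Gaps and non-IUPAC symbols are out of scope for this sketch.
            raise ValueError(f"unsupported symbol: {symbol!r}")
    return "".join(result)


print(to_streamline("acgRu"))
```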

Nucleobase Alphabet

The nucleobase alphabet is a minimal version of the RNA alphabet that includes only the four canonical nucleotides A, C, G, and U.

IUPAC Code Nucleotide
A Adenine
C Cytosine
G Guanine
U Uracil
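
Because these alphabets exist for tokenization, a minimal sketch of how an alphabet might be turned into token ids is shown below. The id assignment, `build_vocab`, and `encode` are all hypothetical and do not reflect MultiMolecule's tokenizer; a real tokenizer would typically also reserve ids for special tokens.

```python
NUCLEOBASE = "ACGU"
STREAMLINE = NUCLEOBASE + "N"


def build_vocab(alphabet: str) -> dict:
    """Assign consecutive integer ids to the symbols of an alphabet."""
    return {symbol: index for index, symbol in enumerate(alphabet)}


def encode(sequence: str, vocab: dict) -> list:
    """Convert a sequence to token ids, rejecting symbols outside the alphabet."""
    ids = []
    for symbol in sequence.upper():
        if symbol not in vocab:
            raise ValueError(f"symbol {symbol!r} is not in the alphabet")
        ids.append(vocab[symbol])
    return ids


vocab = build_vocab(STREAMLINE)
print(encode("GAUN", vocab))
```

Choosing the smaller nucleobase alphabet keeps the vocabulary minimal, at the cost of rejecting any sequence that contains an unknown base.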