RNA Alphabet¶
MultiMolecule provides a set of predefined alphabets for tokenization.
Standard Alphabet¶
The standard alphabet is an extended version of the IUPAC alphabet.
The extension adds three symbols to the IUPAC alphabet: `I`, `X`, and `*`.
- `I`: Inosine, a post-transcriptional modification that is not a standard RNA base. Inosine results from the deamination of adenosine, a reaction catalyzed by adenosine deaminases acting on tRNAs (ADATs).
- `X`: Any base. It differs slightly from `N`, which represents an unknown base. In automatic word embedding conversion, `X` is initialized as the mean of `A`, `C`, `G`, and `U`, while `N` is not further processed.
- `*`: Not used in MultiMolecule; reserved for future use.
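The initialization rule for `X` described above can be sketched in a few lines. This is only an illustration of the averaging step, not MultiMolecule's conversion code: the toy embedding table, its values, and the helper name `init_any_base` are all assumptions made for the example.

```python
# Hypothetical toy embedding table: token -> vector (values are made up).
embeddings = {
    "A": [1.0, 0.0, 0.0, 0.0],
    "C": [0.0, 1.0, 0.0, 0.0],
    "G": [0.0, 0.0, 1.0, 0.0],
    "U": [0.0, 0.0, 0.0, 1.0],
    "N": [0.5, -0.5, 0.5, -0.5],
}


def init_any_base(table):
    """Return a copy of table with X set to the element-wise mean of A, C, G, U."""
    canonical = [table[base] for base in "ACGU"]
    mean = [sum(column) / len(canonical) for column in zip(*canonical)]
    # Copy every existing vector unchanged; N is deliberately left as-is.
    converted = {token: list(vector) for token, vector in table.items()}
    converted["X"] = mean
    return converted


converted = init_any_base(embeddings)
print(converted["X"])  # [0.25, 0.25, 0.25, 0.25]
```

Note that `N` keeps whatever vector it already had, matching the statement that it is not further processed.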
Note on gaps: we use `.` to represent a gap in the sequence. While `-` exists in the standard alphabet, it is not used in MultiMolecule and is reserved for future use.
Standard Code | Represents |
---|---|
A | Adenine |
C | Cytosine |
G | Guanine |
U | Uracil |
N | Unknown |
I | Inosine |
X | Any |
V | A, C, or G |
H | A, C, or U |
D | A, G, or U |
B | C, G, or U |
M | A or C |
R | A or G |
W | A or U |
S | C or G |
Y | C or U |
K | G or U |
. | Gap |
* | Not Used |
- | Not Used |
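As a rough illustration of how the table above might be used, the sketch below checks a sequence against the standard alphabet and reports any reserved symbols. The constant and function names (`STANDARD_ALPHABET`, `RESERVED`, `check_sequence`) are hypothetical and are not part of MultiMolecule's API; only the symbols themselves come from the table.

```python
# Symbols copied from the table above; the names below are hypothetical.
STANDARD_ALPHABET = list("ACGUNIXVHDBMRWSYK.*-")
RESERVED = {"*", "-"}  # listed in the alphabet but marked "Not Used"


def check_sequence(sequence):
    """Return (unknown, reserved) symbols found in a sequence string."""
    allowed = set(STANDARD_ALPHABET)
    unknown = sorted({s for s in sequence if s not in allowed})
    reserved = sorted({s for s in sequence if s in RESERVED})
    return unknown, reserved


print(check_sequence("ACGU.NNXI"))  # ([], [])
print(check_sequence("ACG-T"))      # (['T'], ['-'])
```

The check is case-sensitive, so lowercase input would be reported as unknown; whether to normalize case first is a design choice left open here.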
IUPAC Alphabet¶
The IUPAC nucleotide code is a standard proposed by the International Union of Pure and Applied Chemistry (IUPAC) for representing nucleotide sequences; the RNA form is used here.
In addition to the symbols of the streamline alphabet, it includes 10 symbols that represent ambiguity in the nucleotide sequence and 1 symbol that represents a gap.
IUPAC Code | Represents |
---|---|
A | Adenine |
C | Cytosine |
G | Guanine |
U | Uracil |
R | A or G |
Y | C or U |
S | G or C |
W | A or U |
K | G or U |
M | A or C |
B | C, G, or U |
D | A, G, or U |
H | A, C, or U |
V | A, C, or G |
N | A, C, G, or U |
. | Gap |
Note that we use `.` to represent a gap in the sequence.
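The ambiguity codes in the IUPAC table can be read as sets of permitted bases. The following sketch encodes that reading and uses it to test whether a concrete sequence is compatible with an ambiguous pattern. The dictionary `IUPAC_RNA` and the function `matches` are illustrative names, not MultiMolecule identifiers, and the gap symbol is left out because it does not stand for a base.

```python
# Each IUPAC code mapped to the canonical bases it may represent,
# transcribed from the table above (gap symbol omitted on purpose).
IUPAC_RNA = {
    "A": "A", "C": "C", "G": "G", "U": "U",
    "R": "AG", "Y": "CU", "S": "CG", "W": "AU", "K": "GU", "M": "AC",
    "B": "CGU", "D": "AGU", "H": "ACU", "V": "ACG",
    "N": "ACGU",
}


def matches(pattern, sequence):
    """True if every base in sequence is allowed by the code at its position."""
    if len(pattern) != len(sequence):
        return False
    return all(
        base in IUPAC_RNA.get(code, "")
        for code, base in zip(pattern, sequence)
    )


print(matches("GGYRN", "GGCAU"))  # True
print(matches("GGYRN", "GGAAU"))  # False: Y allows only C or U
```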
Streamline Alphabet¶
The streamline alphabet adds one symbol to the nucleobase alphabet: `N`, which represents an unknown nucleobase.
IUPAC Code | Nucleotide |
---|---|
A | Adenine |
C | Cytosine |
G | Guanine |
U | Uracil |
N | Unknown |
Nucleobase Alphabet¶
The nucleobase alphabet is a minimal version of the RNA alphabet that includes only the four canonical nucleotides: `A`, `C`, `G`, and `U`.
IUPAC Code | Nucleotide |
---|---|
A | Adenine |
C | Cytosine |
G | Guanine |
U | Uracil |
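The four alphabets on this page are nested: each one extends the previous. A minimal sketch of that relationship, with the symbol lists transcribed from the tables, is given below. The names `ALPHABETS` and `smallest_alphabet` are hypothetical, and the symbol order is chosen for readability, so it should not be taken as the token order MultiMolecule actually uses.

```python
# Symbol lists transcribed from the tables on this page; order is illustrative.
NUCLEOBASE = list("ACGU")
STREAMLINE = NUCLEOBASE + ["N"]
IUPAC = STREAMLINE + list("RYSWKMBDHV") + ["."]
STANDARD = IUPAC + ["I", "X", "*", "-"]  # follows the standard alphabet table

# Ordered from smallest to largest (dicts preserve insertion order).
ALPHABETS = {
    "nucleobase": NUCLEOBASE,
    "streamline": STREAMLINE,
    "iupac": IUPAC,
    "standard": STANDARD,
}


def smallest_alphabet(sequence):
    """Return the name of the smallest alphabet covering every symbol, or None."""
    symbols = set(sequence)
    for name, alphabet in ALPHABETS.items():
        if symbols <= set(alphabet):
            return name
    return None


print(smallest_alphabet("ACGU"))   # nucleobase
print(smallest_alphabet("ACGNN"))  # streamline
print(smallest_alphabet("ACR.U"))  # iupac
print(smallest_alphabet("ACIXU"))  # standard
```

Picking the smallest alphabet that covers the data keeps the vocabulary compact, at the cost of being unable to represent symbols outside it.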