DNA Alphabet
MultiMolecule provides a set of predefined alphabets for tokenization.
Standard Alphabet
The standard alphabet is an extended version of the IUPAC alphabet.
This extension adds two symbols to the IUPAC alphabet: `X` and `*`.
`X`: Any base. It differs slightly from `N`, which represents an unknown base. In automatic word embedding conversion, `X` is initialized as the mean of `A`, `C`, `G`, and `T`, while `N` is not processed further.
`*`: Not used in MultiMolecule; reserved for future use.
gap
Note that we use `.` to represent a gap in the sequence.
While `-` exists in the standard alphabet, it is not used in MultiMolecule and is reserved for future use.
Standard Code | Represents |
---|---|
A | Adenine |
C | Cytosine |
G | Guanine |
T | Thymine |
N | Unknown |
X | Any |
V | A, C, or G |
H | A, C, or T |
D | A, G, or T |
B | C, G, or T |
M | A or C |
R | A or G |
W | A or T |
S | C or G |
Y | C or T |
K | G or T |
. | Gap |
* | Not Used |
- | Not Used |
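
The averaging rule for `X` described above can be sketched in plain Python. This is only an illustration under stated assumptions, not MultiMolecule's implementation: the toy vectors, the `embeddings` table, and the `mean_vector` helper are all hypothetical.

```python
# Hypothetical toy embeddings; real models use learned, higher-dimensional vectors.
embeddings = {
    "A": [1.0, 0.0, 0.0, 0.0],
    "C": [0.0, 1.0, 0.0, 0.0],
    "G": [0.0, 0.0, 1.0, 0.0],
    "T": [0.0, 0.0, 0.0, 1.0],
    "N": [0.1, 0.2, 0.3, 0.4],  # placeholder vector for the unknown base
}


def mean_vector(vectors):
    """Return the element-wise mean of equally sized vectors (hypothetical helper)."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


# X ("any base") is initialized as the mean of the four canonical bases.
embeddings["X"] = mean_vector([embeddings[base] for base in "ACGT"])

# N ("unknown base") is left untouched: it keeps whatever vector it already had.
```

With these toy vectors, `X` sits at the centre of the four canonical bases, while `N` remains an independent entry.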
IUPAC Alphabet
The IUPAC nucleotide code is a standard nucleotide code proposed by the International Union of Pure and Applied Chemistry (IUPAC) to represent DNA sequences.
In addition to the streamline alphabet, it includes 10 symbols that represent ambiguity in the nucleotide sequence and 1 symbol that represents a gap.
IUPAC Code | Represents |
---|---|
A | Adenine |
C | Cytosine |
G | Guanine |
T | Thymine |
R | A or G |
Y | C or T |
S | C or G |
W | A or T |
K | G or T |
M | A or C |
B | C, G, or T |
D | A, G, or T |
H | A, C, or T |
V | A, C, or G |
N | A, C, G, or T |
. | Gap |
Note that we use `.` to represent a gap in the sequence.
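
The ambiguity codes in the table above can be read as sets of allowed bases. The following sketch encodes the table as a dictionary and checks whether a concrete sequence is compatible with an IUPAC pattern. The names `IUPAC_DNA` and `matches` are hypothetical and not part of MultiMolecule, and letting the gap symbol match only itself is an assumption made for this example.

```python
# Each IUPAC code mapped to the bases it may stand for (taken from the table above).
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
    ".": ".",  # assumption: a gap is only compatible with a gap
}


def matches(pattern, sequence):
    """Return True if each symbol of `sequence` is allowed by the code at the same position in `pattern`."""
    if len(pattern) != len(sequence):
        return False
    return all(base in IUPAC_DNA[code] for code, base in zip(pattern, sequence))
```

For example, `matches("ARN", "AGT")` holds because `R` allows `G` and `N` allows `T`, whereas `matches("AY", "AG")` does not because `Y` only allows `C` or `T`.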
Streamline Alphabet
The streamline alphabet adds one symbol to the nucleobase alphabet, `N`, to represent an unknown nucleobase.
IUPAC Code | Nucleotide |
---|---|
A | Adenine |
C | Cytosine |
G | Guanine |
T | Thymine |
N | Unknown |
Nucleobase Alphabet
The nucleobase alphabet is a minimal version of the DNA alphabet that includes only the four canonical nucleotides `A`, `C`, `G`, and `T`.
IUPAC Code | Nucleotide |
---|---|
A | Adenine |
C | Cytosine |
G | Guanine |
T | Thymine |
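
The four alphabets on this page are nested: each one extends the previous. The sketch below writes them out as plain strings and finds the smallest alphabet that covers a given sequence. It does not use MultiMolecule's API; the constant names, the labels, and the `smallest_alphabet` helper are hypothetical and only mirror the tables above.

```python
# Alphabets built up in the order described on this page (smallest first).
NUCLEOBASE = "ACGT"
STREAMLINE = NUCLEOBASE + "N"              # adds the unknown base
IUPAC = STREAMLINE + "RYSWKMBDHV" + "."    # adds 10 ambiguity codes and the gap
STANDARD = IUPAC + "X*-"                   # adds X plus the two reserved symbols

ALPHABETS = [
    ("nucleobase", NUCLEOBASE),
    ("streamline", STREAMLINE),
    ("iupac", IUPAC),
    ("standard", STANDARD),
]


def smallest_alphabet(sequence):
    """Return the label of the smallest alphabet containing every symbol, or None."""
    symbols = set(sequence)
    for name, alphabet in ALPHABETS:
        if symbols <= set(alphabet):
            return name
    return None
```

A sequence such as `ACGTN` fits the streamline alphabet, `ACRT` needs the IUPAC alphabet, and a symbol outside all four tables (for example `U`) yields `None`.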