UME Architecture

Universal Molecular Encoder

Input Modalities

Amino Acid Sequences

Protein sequences with standard amino acid alphabet

MKTVRQERLKSIVRIL...

SMILES Strings

Chemical structure representations

CC(=O)OC1=CC=CC=C1C(=O)O

Nucleotide Sequences

DNA/RNA sequences with nucleotide alphabet

ATGCATGCTAGCTAGCTAGCTAG

3D Coordinates

Spatial molecular structure data

[x,y,z] coordinate matrices

Unified Tokenization System

Modality-aware tokenization with shared vocabulary space

Modality-Specific CLS Tokens:
<cls_amino_acid>, <cls_smiles>, <cls_nucleotide>
Shared Special Tokens:
<cls>, <eos>, <mask>, <pad>, <unk>, <sep>
Unified Vocabulary:
~1,280 tokens aligned across all modalities
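The tokenization scheme above can be sketched in plain Python. The special-token names mirror the ones listed; the vocabulary contents, ordering, and IDs below are illustrative, not UME's actual mapping:

```python
# Sketch of modality-aware tokenization over one shared vocabulary.
# Vocabulary contents and IDs are illustrative, not UME's actual mapping.

SPECIAL_TOKENS = ["<cls>", "<eos>", "<mask>", "<pad>", "<unk>", "<sep>",
                  "<cls_amino_acid>", "<cls_smiles>", "<cls_nucleotide>"]

# One shared vocab: special tokens first, then per-modality symbols
# (nucleotides A/C/G/T are already covered by the amino acid letters).
VOCAB = {tok: i for i, tok in enumerate(
    SPECIAL_TOKENS + list("ACDEFGHIKLMNPQRSTVWY")
)}

MODALITY_CLS = {
    "amino_acid": "<cls_amino_acid>",
    "smiles": "<cls_smiles>",
    "nucleotide": "<cls_nucleotide>",
}

def tokenize(sequence: str, modality: str) -> list[int]:
    """Prepend the modality-specific CLS token, append EOS,
    and map out-of-vocabulary characters to <unk>."""
    tokens = [MODALITY_CLS[modality]] + list(sequence) + ["<eos>"]
    return [VOCAB.get(t, VOCAB["<unk>"]) for t in tokens]

ids = tokenize("MKTV", "amino_acid")
```

Because every modality shares one vocabulary, the same embedding table serves all input types; only the leading CLS token tells the model which modality it is reading.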

FlexBERT Transformer Backbone

Shared parameters across all biological modalities

UME Mini: 12M parameters
UME Large: 740M parameters
Embedding Dim: 768+
Max Seq Length: 8192
Multi-Head Self-Attention:
Enhanced with Flash Attention 2 for efficient processing
Shared Parameters:
Universal representations across all modalities
Scalable Architecture:
From 12M to 740M parameters with consistent performance
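The attention core can be sketched with PyTorch's fused scaled-dot-product attention, which dispatches to a FlashAttention-style kernel when the hardware supports it (Flash Attention 2 itself is typically reached via the flash-attn package or PyTorch's SDPA backends). The dimensions below are illustrative, not UME's actual configuration:

```python
import torch
import torch.nn.functional as F

# Illustrative shapes only; not UME's actual configuration.
batch, heads, seq_len, head_dim = 2, 12, 64, 64
q = torch.randn(batch, heads, seq_len, head_dim)
k = torch.randn(batch, heads, seq_len, head_dim)
v = torch.randn(batch, heads, seq_len, head_dim)

# Fused attention: avoids materializing the full (seq_len, seq_len)
# attention matrix, which is what makes 8192-token contexts practical.
out = F.scaled_dot_product_attention(q, k, v)
```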

Training Objectives

Masked Language Modeling (MLM)

  • 25% Token Masking Strategy
  • Cross-Entropy Loss Function
  • Per-Modality Performance Metrics
  • Self-Supervised Learning
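The masking step above can be sketched as follows; the token ID constants and the exact masking policy (plain replacement with <mask>, no random-token or keep-original branches) are simplifying assumptions:

```python
import torch

# Hypothetical token IDs for illustration; -100 is the conventional
# ignore_index for torch.nn.functional.cross_entropy.
MASK_ID, PAD_ID, IGNORE = 4, 3, -100

def mask_tokens(input_ids: torch.Tensor, mask_prob: float = 0.25):
    """Replace mask_prob of non-pad tokens with <mask>; labels keep the
    original IDs only at masked positions so cross-entropy loss is
    computed only where the model must reconstruct the input."""
    labels = input_ids.clone()
    maskable = input_ids != PAD_ID
    mask = (torch.rand(input_ids.shape) < mask_prob) & maskable
    labels[~mask] = IGNORE
    masked = input_ids.clone()
    masked[mask] = MASK_ID
    return masked, mask, labels
```

The loss is then `F.cross_entropy(logits.view(-1, vocab_size), labels.view(-1), ignore_index=-100)`, which is what restricts the objective to the ~25% of masked positions.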

Contrastive Learning

  • InfoNCE Loss for Alignment
  • Symile Loss for Similarity
  • DisCo Loss for Discrimination
  • Cross-Modal Representation Learning
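Of the three losses, InfoNCE is the most standard and can be sketched compactly: paired embeddings from two modalities (e.g. a protein and its coding sequence) are positives, and all other in-batch pairings are negatives. The temperature value is illustrative, and the Symile and DisCo terms are omitted here:

```python
import torch
import torch.nn.functional as F

def info_nce(z_a: torch.Tensor, z_b: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired embeddings.
    Row i of z_a and row i of z_b are a positive pair; every other
    row is a negative. Temperature 0.07 is an illustrative choice."""
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.T / temperature      # (batch, batch) similarities
    targets = torch.arange(z_a.size(0))     # positives on the diagonal
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.T, targets)) / 2
```

Minimizing this pulls each positive pair together and pushes apart the in-batch negatives, which is what aligns the modalities in one embedding space.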

Unified Embedding Space

Cross-modal compatible representations

Sequence-Level Embeddings:
(batch_size, embedding_dim)
Token-Level Embeddings:
(batch_size, seq_len, embedding_dim)
Cross-Modal Compatibility:
All modalities share the same embedding space
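The two embedding granularities relate as follows; pooling via the CLS position is one common choice (mean pooling is another), and the dimensions are illustrative:

```python
import torch

# Illustrative dimensions, not UME's actual configuration.
batch_size, seq_len, embedding_dim = 4, 128, 768

# Token-level output of the transformer backbone:
token_embeddings = torch.randn(batch_size, seq_len, embedding_dim)

# Sequence-level embedding: here, the CLS position (index 0) of the
# token-level output; mean pooling over tokens is a common alternative.
sequence_embeddings = token_embeddings[:, 0, :]
```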

Training Datasets

M320M: 19.4M SMILES strings for chemical understanding
CALM: 7.9M nucleotide sequences for genomic learning
AMPLIFY: 448M amino acid sequences for protein modeling
Pinder: 267K 3D coordinate structures
OpenGenome2: 28.8B nucleotides for large-scale genomics
ZINC: 1.54B SMILES for comprehensive chemical coverage

Key Features

Unified Vocabulary
Modality-Aware Tokenization
Shared Transformer Parameters
Intra-Modality Alignment
Inter-Modality Alignment
Flash Attention 2
Scalable Architecture
Cross-Modal Embeddings