features.embeddings
Text embedding helper utilities (VoyageAI, OpenAI, or sentence-transformers).
These are optional conveniences; install the matching provider packages before use:
- voyage: pip install langchain-voyageai
- openai: pip install openai
- sentence-transformers: pip install sentence-transformers
- class drugs.features.embeddings.text.TextEmbedConfig(provider: Literal['voyage', 'openai', 'sentence-transformers'] = 'voyage', model: str | None = None, normalize: bool = True, max_chars_per_chunk: int = 12000, voyage_api_key: str | None = None, openai_api_key: str | None = None)
Bases: object
Configuration for text embedding providers and credentials.
- provider: Literal['voyage', 'openai', 'sentence-transformers'] = 'voyage'
- model: str | None = None
- normalize: bool = True
- max_chars_per_chunk: int = 12000
- voyage_api_key: str | None = None
- openai_api_key: str | None = None
- drugs.features.embeddings.text.make_text_embed_fn(cfg: TextEmbedConfig) → Callable[[str], ndarray]
Create a text embedding callable for the configured provider.
- Parameters:
cfg (TextEmbedConfig) – Provider selection, model name, normalization flag, and API keys.
- Returns:
Function that consumes a text string and returns a 1D float32 embedding.
- Return type:
Callable[[str], np.ndarray]
- Raises:
ValueError – If required API keys are missing or provider is unknown.
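The credential and provider checks documented above can be sketched as follows. This is a hypothetical re-implementation for illustration only: `validate_config` is not part of the library, and the exact error messages are assumptions; the real `make_text_embed_fn` performs checks like these before returning a provider-backed embedding callable.

```python
from dataclasses import dataclass
from typing import Literal, Optional

@dataclass
class TextEmbedConfig:
    provider: Literal["voyage", "openai", "sentence-transformers"] = "voyage"
    model: Optional[str] = None
    normalize: bool = True
    max_chars_per_chunk: int = 12000
    voyage_api_key: Optional[str] = None
    openai_api_key: Optional[str] = None

def validate_config(cfg: TextEmbedConfig) -> None:
    # Mirror the documented ValueError cases: missing API keys
    # for hosted providers, or an unrecognized provider name.
    if cfg.provider == "voyage" and not cfg.voyage_api_key:
        raise ValueError("voyage_api_key is required for provider='voyage'")
    if cfg.provider == "openai" and not cfg.openai_api_key:
        raise ValueError("openai_api_key is required for provider='openai'")
    if cfg.provider not in ("voyage", "openai", "sentence-transformers"):
        raise ValueError(f"unknown provider: {cfg.provider!r}")
```

The sentence-transformers provider runs locally, so it needs no API key; only the hosted providers are gated on credentials here.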
ESM protein embedding helper using fair-esm models.
Optional dependency: pip install fair-esm. Sequences are fetched from UniProt.
- drugs.features.embeddings.esm.fetch_uniprot_sequence(accession: str, *, timeout_s: int = 30) → str
Fetch a protein sequence from UniProt’s REST API in FASTA format.
- Parameters:
accession (str) – UniProt accession to fetch.
timeout_s (int, default=30) – Request timeout in seconds.
- Returns:
Amino-acid sequence string.
- Return type:
str
- Raises:
HTTPError – If the response status is not successful.
ValueError – If the FASTA payload contains no sequence lines.
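The parsing half of this helper can be sketched without touching the network. `parse_fasta_sequence` below is a hypothetical stand-alone function, not the library's API: it strips the FASTA header and joins the sequence lines, raising ValueError when no sequence lines are present, as documented above. (UniProt's REST API serves FASTA at `https://rest.uniprot.org/uniprotkb/<accession>.fasta`.)

```python
def parse_fasta_sequence(fasta_text: str) -> str:
    # Keep only non-empty lines that are not FASTA headers ('>'-prefixed),
    # then concatenate them into one amino-acid string.
    lines = [ln.strip() for ln in fasta_text.splitlines()
             if ln.strip() and not ln.startswith(">")]
    if not lines:
        raise ValueError("FASTA payload contains no sequence lines")
    return "".join(lines)
```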
- drugs.features.embeddings.esm.make_esm_embed_fn(*, model_name: str = 'esm2_t12_35M_UR50D', repr_layer: int | None = None, device: str | None = None) → Callable[[List[str]], Tensor]
Factory that builds an ESM embedding function for UniProt accessions.
- Parameters:
model_name (str, default="esm2_t12_35M_UR50D") – Name of the pretrained ESM model (loaded via esm.pretrained.<model_name>()).
repr_layer (int, optional) – Layer index to extract representations from. Defaults to the last layer.
device (str, optional) – Torch device ("cuda" or "cpu"). Auto-selects GPU if available.
- Returns:
Function that accepts a list of UniProt accessions and returns a stacked tensor of per-sequence embeddings (mean pooled token representations).
- Return type:
Callable[[List[str]], torch.Tensor]
- Raises:
ValueError – If the model name is unknown.
ImportError – If fair-esm is not installed.
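The mean pooling mentioned in the return description can be sketched in plain Python. `mean_pool` is a hypothetical illustration, not the library's API; the real helper performs this averaging on torch tensors of per-token ESM representations, producing one fixed-size vector per sequence.

```python
from typing import List

def mean_pool(token_reprs: List[List[float]]) -> List[float]:
    # Average token representations across the sequence dimension:
    # each output component d is the mean of component d over all tokens.
    n = len(token_reprs)
    dim = len(token_reprs[0])
    return [sum(tok[d] for tok in token_reprs) / n for d in range(dim)]
```

Mean pooling makes the embedding length-invariant, so proteins of different sizes map to vectors of the same dimension.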
Feature-level helpers for embeddings and caching paths.
- drugs.features.embeddings.path.default_embedding_path(identifier: str, kind: str, suffix: str) → Path
Build a default filesystem path for an embedding artifact.
- Parameters:
identifier (str) – Unique identifier for the entity (e.g., CID, ChEMBL ID).
kind (str) – Embedding type label (e.g., "text" or "protein").
suffix (str) – File suffix/extension without the leading dot.
- Returns:
Path pointing to artifacts/embeddings/<kind>_<identifier>.<suffix>, with slashes replaced to keep it filesystem-safe.
- Return type:
Path
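A minimal sketch of the documented path scheme, assuming slashes in the identifier are replaced with underscores (the exact sanitization rule is an assumption based on the description above; the real implementation may differ):

```python
from pathlib import Path

def default_embedding_path(identifier: str, kind: str, suffix: str) -> Path:
    # Replace path separators so identifiers containing '/' or '\'
    # cannot escape the artifacts/embeddings directory.
    safe = identifier.replace("/", "_").replace("\\", "_")
    return Path("artifacts") / "embeddings" / f"{kind}_{safe}.{suffix}"
```

For example, a protein embedding cached as NumPy data for accession P12345 would land at artifacts/embeddings/protein_P12345.npy.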