features.embeddings

Text embedding helper utilities (VoyageAI, OpenAI, or sentence-transformers).

These are optional conveniences; install the matching provider package before use:
  • voyage: pip install langchain-voyageai

  • openai: pip install openai

  • sentence-transformers: pip install sentence-transformers

class drugs.features.embeddings.text.TextEmbedConfig(provider: Literal['voyage', 'openai', 'sentence-transformers'] = 'voyage', model: str | None = None, normalize: bool = True, max_chars_per_chunk: int = 12000, voyage_api_key: str | None = None, openai_api_key: str | None = None)

Bases: object

Configuration for text embedding providers and credentials.

provider: Literal['voyage', 'openai', 'sentence-transformers'] = 'voyage'
model: str | None = None
normalize: bool = True
max_chars_per_chunk: int = 12000
voyage_api_key: str | None = None
openai_api_key: str | None = None

drugs.features.embeddings.text.make_text_embed_fn(cfg: TextEmbedConfig) → Callable[[str], ndarray]

Create a text embedding callable for the configured provider.

Parameters:

cfg (TextEmbedConfig) – Provider selection, model name, normalization flag, and API keys.

Returns:

Function that consumes a text string and returns a 1D float32 embedding.

Return type:

Callable[[str], np.ndarray]

Raises:

ValueError – If a required API key is missing or the provider is unknown.
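The factory pattern described above can be sketched without any provider package installed. Everything in the block below is hypothetical: SketchConfig, fake_backend, and make_embed_fn_sketch are stand-ins, not the library's code. The source does not say how max_chars_per_chunk is applied or how chunk embeddings are combined, so the fixed-width split and the averaging step are assumptions, and plain Python lists stand in for the float32 array the real function returns.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class SketchConfig:
    """Hypothetical stand-in mirroring some documented TextEmbedConfig fields."""
    provider: str = "voyage"
    normalize: bool = True
    max_chars_per_chunk: int = 12000
    voyage_api_key: Optional[str] = None
    openai_api_key: Optional[str] = None


def l2_normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fake_backend(chunk: str) -> List[float]:
    """Toy 'embedding' (not a real model): counts of letters, digits, and other characters."""
    letters = sum(c.isalpha() for c in chunk)
    digits = sum(c.isdigit() for c in chunk)
    return [float(letters), float(digits), float(len(chunk) - letters - digits)]


def make_embed_fn_sketch(cfg: SketchConfig) -> Callable[[str], List[float]]:
    # Validate provider and credentials up front, as the documented Raises clause suggests.
    if cfg.provider not in ("voyage", "openai", "sentence-transformers"):
        raise ValueError(f"unknown provider: {cfg.provider!r}")
    required_key = {"voyage": cfg.voyage_api_key, "openai": cfg.openai_api_key}
    if cfg.provider in required_key and not required_key[cfg.provider]:
        raise ValueError(f"missing API key for provider {cfg.provider!r}")

    def embed(text: str) -> List[float]:
        # Assumption: max_chars_per_chunk splits the text into fixed-width character chunks.
        step = cfg.max_chars_per_chunk
        chunks = [text[i:i + step] for i in range(0, len(text), step)] or [""]
        vectors = [fake_backend(c) for c in chunks]
        # Assumption: chunk vectors are averaged; the source does not specify this.
        mean = [sum(col) / len(vectors) for col in zip(*vectors)]
        return l2_normalize(mean) if cfg.normalize else mean

    return embed
```

With normalize left at its default of True, the returned vector has unit L2 norm, so a dot product between two embeddings can be read as a cosine similarity.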

ESM protein embedding helper using fair-esm models.

Optional dependency: pip install fair-esm. Sequences are fetched from UniProt.

drugs.features.embeddings.esm.fetch_uniprot_sequence(accession: str, *, timeout_s: int = 30) → str

Fetch a protein sequence from UniProt’s REST API in FASTA format.

Parameters:
  • accession (str) – UniProt accession to fetch.

  • timeout_s (int, default=30) – Request timeout in seconds.

Returns:

Amino-acid sequence string.

Return type:

str

Raises:
  • HTTPError – If the response status is not successful.

  • ValueError – If the FASTA payload contains no sequence lines.
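The FASTA handling described above can be sketched as follows. The function names are hypothetical, the endpoint URL is an assumption about which UniProt REST route is used, and urllib stands in for whatever HTTP client the library actually relies on. Only the parsing step needs no network access.

```python
import urllib.request


def parse_fasta_sequence(payload: str) -> str:
    """Join the sequence lines of a single-record FASTA payload.

    Header lines (starting with '>') and blank lines are skipped.
    """
    lines = [ln.strip() for ln in payload.splitlines()]
    seq_lines = [ln for ln in lines if ln and not ln.startswith(">")]
    if not seq_lines:
        raise ValueError("FASTA payload contains no sequence lines")
    return "".join(seq_lines)


def fetch_sequence_sketch(accession: str, *, timeout_s: int = 30) -> str:
    """Hypothetical fetch helper; the URL below is an assumed endpoint."""
    url = f"https://rest.uniprot.org/uniprotkb/{accession}.fasta"
    # urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        return parse_fasta_sequence(resp.read().decode("utf-8"))
```

Keeping the parser separate from the request makes the ValueError path easy to exercise with a literal string, without touching the network.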

drugs.features.embeddings.esm.make_esm_embed_fn(*, model_name: str = 'esm2_t12_35M_UR50D', repr_layer: int | None = None, device: str | None = None) → Callable[[List[str]], Tensor]

Factory that builds an ESM embedding function for UniProt accessions.

Parameters:
  • model_name (str, default="esm2_t12_35M_UR50D") – Name of the pretrained ESM model (esm.pretrained.<model_name>()).

  • repr_layer (int, optional) – Layer index to extract representations from. Defaults to the last layer.

  • device (str, optional) – Torch device ("cuda" or "cpu"). If omitted, a GPU is selected automatically when one is available.

Returns:

Function that accepts a list of UniProt accessions and returns a stacked tensor of per-sequence embeddings (mean-pooled token representations).

Return type:

Callable[[List[str]], torch.Tensor]

Raises:
  • ValueError – If the model name is unknown.

  • ImportError – If fair-esm is not installed.
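The mean-pooling step mentioned in the return description can be illustrated with plain Python lists standing in for tensors, so the sketch runs without torch or fair-esm. The helper names are hypothetical. ESM-2 tokenization adds a beginning-of-sequence token and an end-of-sequence token around each sequence; whether this library excludes them from the average is an assumption here.

```python
from typing import List, Sequence

Vector = List[float]


def mean_pool(token_reprs: Sequence[Vector], seq_len: int) -> Vector:
    """Average the per-residue vectors of one sequence.

    Position 0 is assumed to hold the BOS token, and positions after
    seq_len are assumed to hold EOS or padding; both are skipped.
    """
    residues = token_reprs[1:seq_len + 1]
    if not residues:
        raise ValueError("cannot pool an empty sequence")
    # zip(*residues) walks the vectors column by column (one column per feature).
    return [sum(col) / len(residues) for col in zip(*residues)]


def pool_batch(batch: Sequence[Sequence[Vector]], lengths: Sequence[int]) -> List[Vector]:
    """Pool every sequence in a padded batch into one fixed-size vector each."""
    return [mean_pool(reprs, n) for reprs, n in zip(batch, lengths)]
```

Pooling gives every protein a vector of the same width regardless of sequence length, which is what allows the per-sequence results to be stacked into a single tensor.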

Feature-level helpers for embeddings and caching paths.

drugs.features.embeddings.path.default_embedding_path(identifier: str, kind: str, suffix: str) → Path

Build a default filesystem path for an embedding artifact.

Parameters:
  • identifier (str) – Unique identifier for the entity (e.g., CID, ChEMBL ID).

  • kind (str) – Embedding type label (e.g., "text" or "protein").

  • suffix (str) – File suffix/extension without the leading dot.

Returns:

Path pointing to artifacts/embeddings/<kind>_<identifier>.<suffix>, with any slashes replaced so the filename stays filesystem-safe.

Return type:

Path
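A minimal sketch of the documented path layout is shown below. The function name is hypothetical, and the source does not say which character replaces a slash, so the underscore is an assumption.

```python
from pathlib import Path


def default_embedding_path_sketch(identifier: str, kind: str, suffix: str) -> Path:
    """Build artifacts/embeddings/<kind>_<identifier>.<suffix>."""
    # Assumption: slashes in the identifier become underscores so the
    # identifier cannot introduce extra directory levels.
    safe_id = identifier.replace("/", "_")
    return Path("artifacts") / "embeddings" / f"{kind}_{safe_id}.{suffix}"
```

For instance, calling the sketch with identifier "CHEMBL25", kind "text", and suffix "npy" yields a path whose final component is text_CHEMBL25.npy.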