Metadata-Version: 2.4
Name: embpy
Version: 0.1.1
Summary: A package for biological embeddings in the perturbation experimental space
Project-URL: Documentation, https://embpy.readthedocs.io/
Project-URL: Homepage, https://github.com/theislab/embpy
Project-URL: Source, https://github.com/theislab/embpy
Author: Goncalo Rei Pinto
Maintainer-email: Goncalo Rei Pinto <goncalo.pinto@helmholtz-munich.de>
License: MIT License
        
        Copyright (c) 2025, Goncalo Rei Pinto
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
License-File: LICENSE
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Requires-Python: >=3.11
Requires-Dist: anndata
Requires-Dist: biopython
Requires-Dist: broad-babel>=0.1
Requires-Dist: cirpy
Requires-Dist: huggingface-hub<1.0.0,>=0.35
Requires-Dist: ipywidgets
Requires-Dist: matplotlib
Requires-Dist: numpy<3,>=1.26
Requires-Dist: pandas>=2.2
Requires-Dist: protobuf>=3.20
Requires-Dist: pyarrow>=14.0.2
Requires-Dist: pyensembl>=2.3.13
Requires-Dist: pysam>=0.22
Requires-Dist: rdkit
Requires-Dist: requests
Requires-Dist: scikit-learn>=1.3
Requires-Dist: scipy>=1.11
Requires-Dist: seaborn
Requires-Dist: sentencepiece>=0.1.99
Requires-Dist: session-info
Requires-Dist: torch-geometric>=2.5
Requires-Dist: transformers<5.0.0,>=4.45.0
Provides-Extra: all
Requires-Dist: borzoi-pytorch>=0.4.3; extra == 'all'
Requires-Dist: enformer-pytorch>=0.8.10; extra == 'all'
Requires-Dist: h5py; extra == 'all'
Requires-Dist: jump-portrait>=0.0.7; extra == 'all'
Requires-Dist: lamindb; extra == 'all'
Requires-Dist: pertpy; extra == 'all'
Requires-Dist: pillow>=10; extra == 'all'
Requires-Dist: scanpy>=1.10; extra == 'all'
Requires-Dist: torch>=2.5.1; extra == 'all'
Provides-Extra: all-cpu
Requires-Dist: borzoi-pytorch>=0.4.3; extra == 'all-cpu'
Requires-Dist: enformer-pytorch>=0.8.10; extra == 'all-cpu'
Requires-Dist: h5py; extra == 'all-cpu'
Requires-Dist: jump-portrait>=0.0.7; extra == 'all-cpu'
Requires-Dist: lamindb; extra == 'all-cpu'
Requires-Dist: pertpy; extra == 'all-cpu'
Requires-Dist: pillow>=10; extra == 'all-cpu'
Requires-Dist: scanpy>=1.10; extra == 'all-cpu'
Requires-Dist: torch>=2.5.1; extra == 'all-cpu'
Provides-Extra: all-cu121
Requires-Dist: borzoi-pytorch>=0.4.3; extra == 'all-cu121'
Requires-Dist: enformer-pytorch>=0.8.10; extra == 'all-cu121'
Requires-Dist: h5py; extra == 'all-cu121'
Requires-Dist: jump-portrait>=0.0.7; extra == 'all-cu121'
Requires-Dist: lamindb; extra == 'all-cu121'
Requires-Dist: pertpy; extra == 'all-cu121'
Requires-Dist: pillow>=10; extra == 'all-cu121'
Requires-Dist: scanpy>=1.10; extra == 'all-cu121'
Requires-Dist: torch>=2.5.1; extra == 'all-cu121'
Provides-Extra: all-cu124
Requires-Dist: borzoi-pytorch>=0.4.3; extra == 'all-cu124'
Requires-Dist: enformer-pytorch>=0.8.10; extra == 'all-cu124'
Requires-Dist: h5py; extra == 'all-cu124'
Requires-Dist: jump-portrait>=0.0.7; extra == 'all-cu124'
Requires-Dist: lamindb; extra == 'all-cu124'
Requires-Dist: pertpy; extra == 'all-cu124'
Requires-Dist: pillow>=10; extra == 'all-cu124'
Requires-Dist: scanpy>=1.10; extra == 'all-cu124'
Requires-Dist: torch>=2.5.1; extra == 'all-cu124'
Provides-Extra: all-cu128
Requires-Dist: borzoi-pytorch>=0.4.3; extra == 'all-cu128'
Requires-Dist: enformer-pytorch>=0.8.10; extra == 'all-cu128'
Requires-Dist: h5py; extra == 'all-cu128'
Requires-Dist: jump-portrait>=0.0.7; extra == 'all-cu128'
Requires-Dist: lamindb; extra == 'all-cu128'
Requires-Dist: pertpy; extra == 'all-cu128'
Requires-Dist: pillow>=10; extra == 'all-cu128'
Requires-Dist: scanpy>=1.10; extra == 'all-cu128'
Requires-Dist: torch>=2.5.1; extra == 'all-cu128'
Provides-Extra: boltz
Requires-Dist: boltz>=2.0; extra == 'boltz'
Provides-Extra: caduceus
Requires-Dist: mamba-ssm>=2.0; extra == 'caduceus'
Provides-Extra: dev
Requires-Dist: pre-commit; extra == 'dev'
Requires-Dist: twine>=6.1; extra == 'dev'
Provides-Extra: doc
Requires-Dist: docutils!=0.18.*,!=0.19.*,>=0.8; extra == 'doc'
Requires-Dist: ipykernel; extra == 'doc'
Requires-Dist: ipython; extra == 'doc'
Requires-Dist: myst-nb>=1.1; extra == 'doc'
Requires-Dist: setuptools; extra == 'doc'
Requires-Dist: sphinx-autodoc-typehints; extra == 'doc'
Requires-Dist: sphinx-book-theme>=1; extra == 'doc'
Requires-Dist: sphinx-copybutton; extra == 'doc'
Requires-Dist: sphinx-tabs; extra == 'doc'
Requires-Dist: sphinx>=4; extra == 'doc'
Requires-Dist: sphinxcontrib-bibtex>=1; extra == 'doc'
Requires-Dist: sphinxext-opengraph; extra == 'doc'
Provides-Extra: esm3
Requires-Dist: esm>=3.2.0; extra == 'esm3'
Provides-Extra: eval
Requires-Dist: cell-eval>=0.7.2; extra == 'eval'
Provides-Extra: evo
Requires-Dist: evo-model>=0.3; extra == 'evo'
Provides-Extra: evo2
Requires-Dist: evo2; extra == 'evo2'
Provides-Extra: helical
Requires-Dist: helical; extra == 'helical'
Provides-Extra: lamindb
Requires-Dist: lamindb; extra == 'lamindb'
Provides-Extra: minimol
Requires-Dist: minimol; extra == 'minimol'
Provides-Extra: morphology
Requires-Dist: jump-portrait>=0.0.7; extra == 'morphology'
Requires-Dist: pillow>=10; extra == 'morphology'
Provides-Extra: ntv3
Requires-Dist: transformers>=5.0.0; extra == 'ntv3'
Provides-Extra: pertpy
Requires-Dist: pertpy; extra == 'pertpy'
Provides-Extra: ppi
Requires-Dist: h5py; extra == 'ppi'
Provides-Extra: scanpy
Requires-Dist: scanpy>=1.10; extra == 'scanpy'
Provides-Extra: seqmodels
Requires-Dist: borzoi-pytorch>=0.4.3; extra == 'seqmodels'
Requires-Dist: enformer-pytorch>=0.8.10; extra == 'seqmodels'
Provides-Extra: stack
Requires-Dist: arc-stack>=0.1.3; extra == 'stack'
Provides-Extra: state
Requires-Dist: arc-state>=0.10; extra == 'state'
Provides-Extra: test
Requires-Dist: coverage; extra == 'test'
Requires-Dist: pytest; extra == 'test'
Requires-Dist: pytest-mock; extra == 'test'
Provides-Extra: torch
Requires-Dist: torch>=2.5.1; extra == 'torch'
Provides-Extra: torch-cpu
Requires-Dist: torch>=2.5.1; extra == 'torch-cpu'
Provides-Extra: torch-cu121
Requires-Dist: torch>=2.5.1; extra == 'torch-cu121'
Provides-Extra: torch-cu124
Requires-Dist: torch>=2.5.1; extra == 'torch-cu124'
Provides-Extra: torch-cu128
Requires-Dist: torch>=2.5.1; extra == 'torch-cu128'
Provides-Extra: torch-cu130
Requires-Dist: torch>=2.5.1; extra == 'torch-cu130'
Description-Content-Type: text/markdown

# embpy

[![Tests][badge-tests]][tests]
[![Documentation][badge-docs]][documentation]

[badge-tests]: https://img.shields.io/github/actions/workflow/status/theislab/embpy/test.yaml?branch=main
[badge-docs]: https://img.shields.io/readthedocs/embpy
[tests]: https://github.com/theislab/embpy/actions/workflows/test.yaml
[documentation]: https://embpy.readthedocs.io/

**embpy** is a Python toolkit for generating biological embeddings with one
unified API.

Use it to embed genes, proteins, small molecules, morphology perturbations, and
single cells; annotate the resulting objects; and compare embeddings with
scverse-friendly plotting and analysis utilities.

<p align="center">
  <img src="docs/embpy_architecture.png" alt="embpy architecture" width="900"/>
</p>

## What embpy Does

- Embeds biological entities through `BioEmbedder.embed(...)`.
- Resolves biological identifiers into model-ready inputs, such as gene
  sequences, protein sequences, SMILES strings, and morphology images.
- Returns AnnData, tables, or payloads with provenance and canonical IDs.
- Stores generated embeddings outside `.X`, using `.obsm`, `.varm`, or `.uns`
  according to the entity type.
- Adds metadata annotations for genes, proteins, molecules, and cell lines.
- Provides plotting and comparison helpers for embedding quality checks.

## Install

Pixi is recommended for development and GPU work:

```bash
pixi install -e default
pixi run -e default verify
```

For a pip install:

```bash
pip install embpy
```

For optional GPU/model extras, see the
[technical guide](docs/technical.md#environments-and-installation).
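
The package metadata declares optional extras such as `torch`, `scanpy`, `pertpy`, `ppi` (h5py), `morphology` (jump-portrait, pillow), and `lamindb`, which can be requested with, for example, `pip install "embpy[scanpy]"`. As a minimal sketch that is not part of embpy itself, the snippet below uses only the standard library to report which of these optional backends are importable in the current environment. The helper name `check_optional_backends` and the selection of modules are illustrative assumptions.

```python
from importlib.util import find_spec

# Import name -> embpy extra that provides it (taken from the package metadata).
OPTIONAL_BACKENDS = {
    "torch": "torch",
    "scanpy": "scanpy",
    "pertpy": "pertpy",
    "h5py": "ppi",
    "PIL": "morphology",  # installed by the pillow distribution
    "lamindb": "lamindb",
}


def check_optional_backends(backends=OPTIONAL_BACKENDS):
    """Return {import_name: True/False} without importing the modules."""
    return {name: find_spec(name) is not None for name in backends}


if __name__ == "__main__":
    for name, available in check_optional_backends().items():
        extra = OPTIONAL_BACKENDS[name]
        status = "ok" if available else f'missing (try: pip install "embpy[{extra}]")'
        print(f"{name:10s} {status}")
```

Running this before a long embedding job gives a quick view of which model families are likely to be usable without reinstalling.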

## Quick Start

```python
from embpy import BioEmbedder

embedder = BioEmbedder(device="auto", organism="human")
```

Embed genes with multiple model families:

```python
gene_adata = embedder.embed(
    ["TP53", "EGFR", "MYC"],
    entity_type="gene",
    id_type="symbol",
    model=["hyenadna_tiny_1k", "esm2_8M", "minilm_l6_v2"],
    output="anndata",
)

gene_adata.varm.keys()
gene_adata.uns["embeddings"].keys()
```

Embed gene perturbation labels as row-aligned action embeddings:

```python
# pert_adata.obs["perturbation"] contains symbols such as TP53/MYC.
pert_adata = embedder.embed(
    pert_adata,
    entity_type="gene",
    obs_column="perturbation",
    id_type="symbol",
    model="esm2_650M",
    output="anndata",
    is_perturbation=True,
    key="X_pert_esm2_650M",
)

pert_adata.obsm["X_pert_esm2_650M"]
```
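
Conceptually, "row-aligned" means the resulting matrix has one row per observation, in the same order as `.obs`, so cells that share a perturbation label share an embedding vector. The NumPy sketch below illustrates that alignment with made-up vectors. It is not embpy's implementation: `align_to_rows`, the zero-vector fallback for labels without an embedding, and the toy values are assumptions for illustration only.

```python
import numpy as np


def align_to_rows(labels, per_entity, dim):
    """Stack one embedding per observation, in observation order.

    labels: sequence of perturbation labels, one per row of .obs
    per_entity: dict mapping label -> 1-D embedding of length `dim`
    Labels without an embedding (e.g. controls) get a zero vector here;
    that fallback is an assumption of this sketch.
    """
    out = np.zeros((len(labels), dim), dtype=np.float32)
    for row, label in enumerate(labels):
        if label in per_entity:
            out[row] = per_entity[label]
    return out


# Toy 3-dimensional embeddings for two genes (values are made up).
per_gene = {
    "TP53": np.array([1.0, 0.0, 0.0], dtype=np.float32),
    "MYC": np.array([0.0, 1.0, 0.0], dtype=np.float32),
}
obs_labels = ["TP53", "MYC", "TP53", "control"]

X_pert = align_to_rows(obs_labels, per_gene, dim=3)
print(X_pert.shape)  # (4, 3): one row per observation
```

A matrix of this shape is what fits in `.obsm`, whose entries must have exactly `n_obs` rows.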

Embed proteins:

```python
protein_adata = embedder.embed(
    ["TP53", "EGFR", "BRCA1"],
    entity_type="protein",
    id_type="symbol",
    model="esm2_8M",
    output="anndata",
)
```

Embed small molecules:

```python
smiles = [
    "CC(=O)OC1=CC=CC=C1C(=O)O",  # aspirin
    "Cn1cnc2c1c(=O)n(C)c(=O)n2C",  # caffeine
]

molecule_adata = embedder.embed(
    smiles,
    entity_type="molecule",
    id_type="smiles",
    model="morgan_fp",
    output="anndata",
    key="X_morgan_fp",
)
```

Embed cells from AnnData with model-aware preprocessing:

```python
# `adata` is an existing AnnData object with one row per cell.
cell_adata = embedder.embed(
    adata,
    entity_type="cell",
    model="pca",
    preprocessing="auto",
    output="anndata",
    key="X_pca",
)

cell_adata.obsm["X_pca"]
cell_adata.uns["embpy_cell_embeddings"]
```

Annotate and plot:

```python
from embpy import tl, pl

molecule_adata.obs["smiles"] = molecule_adata.obs_names
molecule_adata = tl.annotate_molecules(
    molecule_adata,
    column="smiles",
    sources=["structural", "bioactivity", "ontology"],
)

pl.plot_embedding_space(
    molecule_adata,
    obsm_key="X_morgan_fp",
    method="pca",
    color="mol_logp",
)
```

## Tutorials

The tutorials are organized by biological entity:

- [Genes](docs/notebooks/genes.ipynb)
- [Proteins](docs/notebooks/proteins.ipynb)
- [Small molecules](docs/notebooks/small_molecules.ipynb)
- [Cells](docs/notebooks/cells.ipynb)

Each notebook runs actual `BioEmbedder.embed(...)` calls and annotation APIs,
together with embpy's plotting and comparison utilities.

## Model Families

embpy supports models across:

- DNA and regulatory sequence models
- protein language and structure models
- small-molecule fingerprints and chemical language models
- single-cell foundation models and classical baselines
- morphology models for HPA and JUMP-style images
- text models for biological descriptions

To list the model keys available in your environment, call:

```python
embedder.list_available_models()
```

## Output Contract

`BioEmbedder.embed(...)` follows a scverse-friendly output contract:

- genes are feature-like and live in `.varm` by default
- gene perturbation labels use `is_perturbation=True` and live in `.obsm`
- proteins are feature-like and live in `.varm`
- molecules, text, sequences, and cells are observation-like and live in `.obsm`
- perturbation/action embeddings can be kept entity-aligned in `.uns`
- `.X` remains expression/count-like data or a sparse placeholder

See the [technical guide](docs/technical.md#standardized-output-contract) for
the full contract.
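
As a rough aid for checking results, the default placements in the bullet list above can be written as a small lookup. This sketch mirrors that list only: `default_slot` is a hypothetical helper, not an embpy function, and the entity-type strings `"text"` and `"sequence"` are assumed spellings (the Quick Start confirms only `"gene"`, `"protein"`, `"molecule"`, and `"cell"`).

```python
# Default AnnData slot per entity type, following the bullet list above.
_DEFAULT_SLOTS = {
    "gene": "varm",
    "protein": "varm",
    "molecule": "obsm",
    "text": "obsm",      # assumed entity_type spelling
    "sequence": "obsm",  # assumed entity_type spelling
    "cell": "obsm",
}


def default_slot(entity_type, is_perturbation=False):
    """Return the AnnData attribute expected to hold the embedding."""
    if entity_type not in _DEFAULT_SLOTS:
        raise ValueError(f"unknown entity type: {entity_type!r}")
    # Gene perturbation labels are row-aligned, so they move to .obsm.
    if entity_type == "gene" and is_perturbation:
        return "obsm"
    return _DEFAULT_SLOTS[entity_type]


print(default_slot("gene"))                        # varm
print(default_slot("gene", is_perturbation=True))  # obsm
print(default_slot("molecule"))                    # obsm
```

With a result from the Quick Start, `getattr(gene_adata, default_slot("gene"))` would then be the mapping to inspect for embedding keys.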

## Documentation

- [API reference](docs/api.md): per-function reference generated from docstrings
- [Technical guide](docs/technical.md): output contract, install matrix, package
  layout, and developer notes
- [Contributing](docs/contributing.md)
- [Changelog](docs/changelog.md)

## Citation

If you use embpy in your work, please cite the repository for now. A formal
citation will be added when the package is released.

## Contact

For questions, issues, or feature requests, open a GitHub issue or contact the
maintainers listed in the package metadata.
