Metadata-Version: 2.4
Name: manylatents-omics
Version: 0.1.2
Summary: Biological extensions for manylatents: popgen and central dogma encoders
Project-URL: Homepage, https://github.com/latent-reasoning-works/manylatents-omics
Project-URL: Documentation, https://latent-reasoning-works.github.io/manylatents-omics/
Project-URL: Repository, https://github.com/latent-reasoning-works/manylatents-omics
Author-email: César Miguel Valdez Córdova <cesar.valdez@mila.quebec>, Matthew Scicluna <matthew.scicluna@mila.quebec>
License-Expression: MIT
License-File: LICENSE
Keywords: central-dogma,dimensionality-reduction,foundation-models,machine-learning,population-genetics
Requires-Python: <3.13,>=3.11
Requires-Dist: manylatents
Provides-Extra: dev
Requires-Dist: anndata>=0.11.3; extra == 'dev'
Requires-Dist: esm>=3.0; extra == 'dev'
Requires-Dist: geosketch>=1.3; extra == 'dev'
Requires-Dist: leidenalg>=0.10; extra == 'dev'
Requires-Dist: mamba-ssm==2.2.6.post3; extra == 'dev'
Requires-Dist: orthrus; extra == 'dev'
Requires-Dist: pandas-plink>=2.2.9; extra == 'dev'
Requires-Dist: pytest>=9.0.2; extra == 'dev'
Requires-Dist: python-igraph>=1.0.0; extra == 'dev'
Requires-Dist: scanpy>=1.11.5; extra == 'dev'
Provides-Extra: docs
Requires-Dist: mkdocs-material>=9.6; extra == 'docs'
Requires-Dist: mkdocs>=1.6; extra == 'docs'
Provides-Extra: dogma
Requires-Dist: esm>=3.0; extra == 'dogma'
Requires-Dist: mamba-ssm==2.2.6.post3; extra == 'dogma'
Requires-Dist: orthrus; extra == 'dogma'
Provides-Extra: gpu
Requires-Dist: faiss-gpu-cu12; extra == 'gpu'
Provides-Extra: popgen
Requires-Dist: pandas-plink>=2.2.9; extra == 'popgen'
Provides-Extra: singlecell
Requires-Dist: anndata>=0.11.3; extra == 'singlecell'
Requires-Dist: geosketch>=1.3; extra == 'singlecell'
Requires-Dist: leidenalg>=0.10; extra == 'singlecell'
Requires-Dist: python-igraph>=1.0.0; extra == 'singlecell'
Requires-Dist: scanpy>=1.11.5; extra == 'singlecell'
Description-Content-Type: text/markdown

<div align="center">

<pre>
  A T G . . C A T        .  . .
  . G C A T . G .   -->  . .. .  -->  λ(·)
   T . A G C . A T        .  . .
     A T G . C A

     m a n y l a t e n t s - o m i c s

        from sequence to manifold
</pre>

[![license](https://img.shields.io/badge/license-MIT-FDA4AF.svg)](LICENSE)
[![python](https://img.shields.io/badge/python-3.11%20%7C%203.12-FDA4AF.svg)](https://www.python.org)
[![uv](https://img.shields.io/badge/pkg-uv-FDA4AF.svg)](https://docs.astral.sh/uv/)
[![PyPI](https://img.shields.io/badge/PyPI-manylatents--omics-FDA4AF.svg)](https://pypi.org/project/manylatents-omics/)
[![docs](https://img.shields.io/badge/docs-GitHub%20Pages-FDA4AF.svg)](https://latent-reasoning-works.github.io/manylatents-omics/)

</div>

---

Population genetics, single-cell, and foundation model encoders for [manylatents](https://github.com/latent-reasoning-works/manylatents). Extends the core dimensionality reduction (DR) framework with biological data types and domain-specific metrics.

## Install

```bash
uv add manylatents-omics
```

Optional extras:

```bash
uv add "manylatents-omics[popgen]"      # population genetics
uv add "manylatents-omics[singlecell]"  # single-cell (scanpy, anndata)
uv add "manylatents-omics[dogma]"       # protein + RNA encoders (ESM3, Orthrus)
```

> **DNA encoder (Evo2)** requires a separate venv due to torch version conflicts. See `scripts/setup-dna-venv.sh`.

Or from the core manylatents repo:

```bash
uv sync --extra omics   # installs manylatents-omics as a namespace extension
```

<details>
<summary>development install</summary>

```bash
git clone https://github.com/latent-reasoning-works/manylatents-omics.git
cd manylatents-omics && uv sync
```

</details>

## Architecture

manylatents-omics is a **namespace extension** of [manylatents](https://github.com/latent-reasoning-works/manylatents). It lives alongside the core repo and adds domain-specific modules under the `manylatents.*` namespace via `pkgutil.extend_path()`.

```
lrw/
├── manylatents/    # core DR engine
├── omics/          # this repo — popgen, singlecell, dogma encoders
└── shop/           # cluster infrastructure
```
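
The namespace mechanism described above can be illustrated with a small, self-contained sketch. It builds two throwaway distributions that both ship a package with the same name and shows that `pkgutil.extend_path()` merges them into one importable namespace. The package name `demo_latents`, the submodule names, and the `make_dist` helper are hypothetical and chosen only to avoid clashing with a real install; the actual `__init__.py` files in manylatents and this repo may differ.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# What each distribution would put in its copy of <package>/__init__.py.
INIT = (
    "from pkgutil import extend_path\n"
    "__path__ = extend_path(__path__, __name__)\n"
)


def make_dist(root: Path, package: str, submodule: str, body: str) -> Path:
    """Create a fake distribution root containing package/submodule.py."""
    pkg_dir = root / package
    pkg_dir.mkdir(parents=True)
    (pkg_dir / "__init__.py").write_text(INIT)
    (pkg_dir / f"{submodule}.py").write_text(body)
    return root


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical "core" and "omics" distributions sharing one package name.
        core = make_dist(Path(tmp) / "core", "demo_latents", "algorithms", "NAME = 'core'\n")
        omics = make_dist(Path(tmp) / "omics", "demo_latents", "popgen", "NAME = 'omics'\n")
        added = [str(core), str(omics)]
        sys.path[:0] = added
        try:
            algorithms = importlib.import_module("demo_latents.algorithms")
            popgen = importlib.import_module("demo_latents.popgen")
            package = sys.modules["demo_latents"]
            # __path__ now lists the package directory from both distributions.
            return algorithms.NAME, popgen.NAME, len(package.__path__)
        finally:
            for entry in added:
                sys.path.remove(entry)


core_name, omics_name, n_locations = demo()
print(core_name, omics_name, n_locations)
```

Both submodules resolve under the same top-level name even though they live in different directories, which is what lets an extension package add modules without the core depending on them.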

**Design decision:** The core engine stays domain-agnostic. Each "flavor pack" (omics, vision, etc.) is a separate repo/package that extends the `manylatents` namespace without polluting the core with domain-specific dependencies. Experiment configs (ClinVar pipelines, fusion sweeps, cluster resource presets) belong in downstream experiment repos, not here — this package ships only instantiation configs that define what encoders, datasets, and algorithms *are*.

## Quick start

Omics configs are auto-discovered when the package is installed:

```bash
# Single run: one algorithm on PBMC 3k
python -m manylatents.main --config-name=config \
  experiment=single_algorithm data=pbmc_3k

# Multirun sweep on a cluster: 2 datasets x 2 algorithms
python -m manylatents.main -m \
  cluster=tamia resources=gpu \
  data=hgdp,pbmc_10k algorithms/latent=umap,phate
```

## Modules

**[popgen](manylatents/popgen/)** — Population genetics via the [manifold-genetics](https://github.com/latent-reasoning-works/manifold-genetics) CSV pipeline. HGDP+1KGP, UK Biobank, All of Us. Admixture proportions, geographic metadata, QC/relatedness filtering. Requires preprocessing via manifold-genetics (a separate tool, not a Python dependency). Configs: [`popgen/configs/`](manylatents/popgen/configs/)
  - **[GeographicPreservation](manylatents/popgen/metrics/preservation.py)** — Spearman correlation between haversine and embedding distances
  - **[AdmixturePreservation](manylatents/popgen/metrics/preservation.py)** — Geodesic distance fidelity in admixture simplex vs. latent space
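
To make the idea behind GeographicPreservation concrete, here is a minimal standard-library sketch of a Spearman correlation between pairwise haversine distances and pairwise embedding distances. It is an illustration under assumptions, not the package's implementation: the function names, the use of all sample pairs, and the Euclidean embedding distance are all hypothetical choices, and the real metric may subsample or weight pairs differently.

```python
import math
from itertools import combinations


def haversine_km(a, b, radius_km=6371.0):
    """Great-circle distance between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(h))


def average_ranks(values):
    """1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def geographic_preservation(latlon, embedding):
    """Spearman rho = Pearson correlation of the ranked pairwise distances."""
    pairs = list(combinations(range(len(latlon)), 2))
    geo = [haversine_km(latlon[i], latlon[j]) for i, j in pairs]
    emb = [math.dist(embedding[i], embedding[j]) for i, j in pairs]
    return pearson(average_ranks(geo), average_ranks(emb))


# Toy data: four samples on the equator, embedded on a line in the same order.
latlon = [(0.0, 0.0), (0.0, 10.0), (0.0, 30.0), (0.0, 70.0)]
faithful = [(0.0,), (1.0,), (3.0,), (7.0,)]
scrambled = [(3.0,), (0.0,), (7.0,), (1.0,)]

faithful_score = geographic_preservation(latlon, faithful)
scrambled_score = geographic_preservation(latlon, scrambled)
print(faithful_score, scrambled_score)
```

An embedding that keeps the geographic ordering of pairwise distances scores at the top of the range, while one that shuffles the samples scores lower.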

**[singlecell](manylatents/singlecell/)** — AnnData `.h5ad` loader for scRNA-seq, scATAC-seq, CITE-seq. Ships with PBMC 3k/10k/68k and Embryoid Body. Any `.h5ad` works via `AnnDataset`. Configs: [`singlecell/configs/`](manylatents/singlecell/configs/)

**[dogma](manylatents/dogma/)** — Foundation model encoders for DNA, RNA, and protein sequences. Supports single-modality encoding, multi-layer extraction, and cross-modal fusion. All encoders inherit from [`FoundationEncoder`](manylatents/dogma/encoders/base.py) — lazy model loading, batched encoding with OOM retry, standard `fit()`/`transform()` interface. Configs: [`dogma/configs/`](manylatents/dogma/configs/)
  - **[ESM3](manylatents/dogma/encoders/esm3.py)** — Protein, 1536-dim, masked mean-pool, true batched forward
  - **[Evo2](manylatents/dogma/encoders/evo2.py)** — DNA, 1920/4096/8192-dim (1B/7B/40B), multi-layer extraction, 1M bp context
  - **[Orthrus](manylatents/dogma/encoders/orthrus_native.py)** — RNA, 256/512-dim (4-track/6-track), Mamba SSM re-implementation for mamba-ssm 2.x
  - **[AlphaGenome](manylatents/dogma/encoders/alphagenome.py)** — DNA, 1536/3072-dim (1bp/128bp), JAX-based, regulatory track predictions, chunked encoding
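
The encoder behaviour listed above (lazy model loading, batched encoding with OOM retry, and a `fit()`/`transform()` interface) can be sketched as a general pattern. Everything below is a hypothetical stand-in rather than the real `FoundationEncoder`: the class name, the `load_model` callable, the batch-halving retry policy, and the use of `MemoryError` in place of a GPU out-of-memory exception are assumptions made for illustration.

```python
from typing import Callable, List, Sequence

Model = Callable[[Sequence[str]], List[List[float]]]


class LazyBatchedEncoder:
    """Hypothetical pattern sketch; not the package's FoundationEncoder."""

    def __init__(self, load_model: Callable[[], Model], batch_size: int = 8):
        self._load_model = load_model
        self._model = None  # nothing is loaded until the first encode call
        self.batch_size = batch_size

    @property
    def model(self) -> Model:
        if self._model is None:
            self._model = self._load_model()
        return self._model

    def fit(self, sequences: Sequence[str]) -> "LazyBatchedEncoder":
        # A pretrained encoder has nothing to fit; return self for chaining.
        return self

    def transform(self, sequences: Sequence[str]) -> List[List[float]]:
        out: List[List[float]] = []
        batch_size = self.batch_size
        start = 0
        while start < len(sequences):
            batch = sequences[start:start + batch_size]
            try:
                out.extend(self.model(batch))
            except MemoryError:  # stand-in for a framework-specific OOM error
                if batch_size == 1:
                    raise  # cannot shrink further
                batch_size = max(1, batch_size // 2)
                continue  # retry the same position with a smaller batch
            start += len(batch)
        return out


# Toy model: "runs out of memory" above 2 sequences, embeds by sequence length.
load_calls = []


def load_toy_model() -> Model:
    load_calls.append(1)

    def toy_model(batch: Sequence[str]) -> List[List[float]]:
        if len(batch) > 2:
            raise MemoryError("simulated OOM")
        return [[float(len(seq))] for seq in batch]

    return toy_model


encoder = LazyBatchedEncoder(load_toy_model, batch_size=8)
loads_before = len(load_calls)
embeddings = encoder.fit([]).transform(["ATG", "ATGC", "A", "ATGCA", "AT"])
loads_after = len(load_calls)
print(loads_before, loads_after, embeddings)
```

Halving the batch on failure trades throughput for robustness: one oversized batch slows the run down rather than ending it, and the model is only loaded once work actually arrives.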

## ClinVar pipeline

Reference pipeline for variant-effect analysis via geometric methods. Encodes DNA and protein sequences flanking ClinVar variants, then applies dimensionality reduction to study how pathogenic vs. benign variants separate in embedding space. Three stages: DNA encoding, protein encoding, and geometric analysis (fusion + DR). See [docs/clinvar_pipeline.md](docs/clinvar_pipeline.md) for full details. Experiment configs live in downstream repos (e.g. `merging_dogma`), not in this package.

## Development

```bash
uv sync
pytest tests/ -v
```

## Citing

If manylatents-omics was useful in your research, a citation goes a long way:

```bibtex
@software{manylatents_omics2026,
  title     = {manyLatents-Omics: Biological Extensions for Unified Dimensionality Reduction},
  author    = {Scicluna, Matthew and Valdez C{\'o}rdova, C{\'e}sar Miguel},
  year      = {2026},
  url       = {https://github.com/latent-reasoning-works/manylatents-omics},
  license   = {MIT}
}
```

---

<div align="center">

MIT License · Latent Reasoning Works

</div>
