Metadata-Version: 2.4
Name: alphagenome-pytorch
Version: 0.3.1
Summary: PyTorch implementation of AlphaGenome
License-File: LICENSE
Requires-Python: >=3.12
Requires-Dist: numpy
Requires-Dist: safetensors
Requires-Dist: torch>=2.0.0
Provides-Extra: dev
Requires-Dist: aiohttp; extra == 'dev'
Requires-Dist: alphagenome; extra == 'dev'
Requires-Dist: chex; extra == 'dev'
Requires-Dist: dm-haiku; extra == 'dev'
Requires-Dist: einshape; extra == 'dev'
Requires-Dist: jax; extra == 'dev'
Requires-Dist: jaxlib; extra == 'dev'
Requires-Dist: kagglehub; extra == 'dev'
Requires-Dist: logomaker; extra == 'dev'
Requires-Dist: myst-parser; extra == 'dev'
Requires-Dist: orbax-checkpoint; extra == 'dev'
Requires-Dist: pandas; extra == 'dev'
Requires-Dist: pybigwig; extra == 'dev'
Requires-Dist: pyfaidx; extra == 'dev'
Requires-Dist: pyranges; extra == 'dev'
Requires-Dist: pytest-xdist; extra == 'dev'
Requires-Dist: pytest>=7.0.0; extra == 'dev'
Requires-Dist: requests; extra == 'dev'
Requires-Dist: sphinx-book-theme; extra == 'dev'
Requires-Dist: sphinx-design; extra == 'dev'
Requires-Dist: sphinx>=7.0.0; extra == 'dev'
Requires-Dist: tensorflow; extra == 'dev'
Requires-Dist: tqdm; extra == 'dev'
Provides-Extra: docs
Requires-Dist: myst-parser; extra == 'docs'
Requires-Dist: sphinx-book-theme; extra == 'docs'
Requires-Dist: sphinx-design; extra == 'docs'
Requires-Dist: sphinx>=7.0.0; extra == 'docs'
Provides-Extra: finetuning
Requires-Dist: pandas; extra == 'finetuning'
Requires-Dist: pybigwig; extra == 'finetuning'
Requires-Dist: pyfaidx; extra == 'finetuning'
Requires-Dist: tqdm; extra == 'finetuning'
Provides-Extra: inference
Requires-Dist: pybigwig; extra == 'inference'
Requires-Dist: pyfaidx; extra == 'inference'
Requires-Dist: tqdm; extra == 'inference'
Provides-Extra: jax
Requires-Dist: dm-haiku; extra == 'jax'
Requires-Dist: jax; extra == 'jax'
Requires-Dist: jaxlib; extra == 'jax'
Requires-Dist: orbax-checkpoint; extra == 'jax'
Provides-Extra: scoring
Requires-Dist: alphagenome; extra == 'scoring'
Requires-Dist: pandas; extra == 'scoring'
Requires-Dist: pyfaidx; extra == 'scoring'
Requires-Dist: pyranges; extra == 'scoring'
Requires-Dist: tqdm; extra == 'scoring'
Provides-Extra: test
Requires-Dist: aiohttp; extra == 'test'
Requires-Dist: alphagenome; extra == 'test'
Requires-Dist: chex; extra == 'test'
Requires-Dist: einshape; extra == 'test'
Requires-Dist: kagglehub; extra == 'test'
Requires-Dist: logomaker; extra == 'test'
Requires-Dist: pybigwig; extra == 'test'
Requires-Dist: pyfaidx; extra == 'test'
Requires-Dist: pyranges; extra == 'test'
Requires-Dist: pytest-xdist; extra == 'test'
Requires-Dist: pytest>=7.0.0; extra == 'test'
Requires-Dist: requests; extra == 'test'
Requires-Dist: tensorflow; extra == 'test'
Requires-Dist: tqdm; extra == 'test'
Description-Content-Type: text/markdown

# AlphaGenome PyTorch

[![PyPI](https://img.shields.io/pypi/v/alphagenome-pytorch)](https://pypi.org/project/alphagenome-pytorch/)
[![Documentation](https://readthedocs.org/projects/alphagenome-pytorch/badge/?version=latest)](https://alphagenome-pytorch.readthedocs.io)
[![Weights](https://img.shields.io/badge/%F0%9F%A4%97-Weights-yellow)](https://huggingface.co/gtca/alphagenome_pytorch)

A PyTorch port of [AlphaGenome](https://www.nature.com/articles/s41586-025-10014-0), the DNA sequence model from Google DeepMind that predicts hundreds of genomic tracks at single base-pair resolution from sequences up to 1M bp.

We strive to make it an accessible, readable, and hackable implementation — for integrating into existing PyTorch pipelines, fine-tuning on custom datasets, and building on top of.

## Installation

Installation from PyPI:

```bash
pip install alphagenome-pytorch
```

Installation from repo: 

```bash
pip install git+https://github.com/genomicsxai/alphagenome-pytorch
```

For fine-tuning (incl. BigWig data loading):

```bash
pip install alphagenome-pytorch[finetuning]  # adds pyBigWig, pyfaidx
```

## Quick Start

```python
import torch
import numpy as np
from alphagenome_pytorch import AlphaGenome

# Load pretrained model
model = AlphaGenome.from_pretrained('alphagenome.pt', device='cuda')

# Create one-hot encoded DNA sequence in NLC format (batch=1, length=131072, channels=4)
# Channels: A=0, C=1, G=2, T=3
sequence = np.random.randint(0, 4, size=(1, 131072))
dna_onehot = torch.tensor(np.eye(4)[sequence], dtype=torch.float32).cuda()

# Inference (handles dtype casting, returns float32 outputs)
outputs = model.predict(dna_onehot, organism_index=0)  # organism: 0=human, 1=mouse
```

The weights for this port are [available on Hugging Face](https://huggingface.co/gtca/alphagenome_pytorch).

### Output structure
Each genomic-track head returns a dict mapping resolution → tensor:

```python
outputs['atac'][1]           # (1, 131072, 256)   ATAC-seq  at 1 bp
outputs['atac'][128]         # (1, 1024,   256)   ATAC-seq  at 128 bp
outputs['dnase'][1]          # (1, 131072, 384)   DNase     at 1 bp
outputs['cage'][128]         # (1, 1024,   640)   CAGE      at 128 bp
outputs['chip_histone'][128] # (1, 1024,   1152)  ChIP-hist at 128 bp only
```

Contact maps are returned as a single tensor (no resolution dict):

```python
outputs['contact_maps']      # (1, 64, 64, 28)   3D chromatin contacts
```

Splice heads return dicts of tensors:

```python
outputs['splice_sites']['probs']  # (1, 131072, 5)  splice site classes
```

### Padding

Track dimensions are padded (e.g. ATAC has 167 real human
tracks but the tensor has 256 channels).  Real tracks come first; the rest
are zeros.  Use `named_outputs=True` to auto-strip padding:

```python
from alphagenome_pytorch.named_outputs import NamedOutputs, TrackMetadataCatalog
catalog = TrackMetadataCatalog.load_builtin(organism=0)
model.set_track_metadata_catalog(catalog)

named = model.predict(dna_onehot, organism_index=0, named_outputs=True)
named.atac[1].shape                  # (1, 131072, 167)  — padding removed
named.atac[1].tracks[-1].track_name  # 'UBERON:0015143 ATAC-seq'

# Filter by metadata
named.rna_seq[128].select(strand='+')
named.chip_tf[128].select(transcription_factor='CTCF')
named.atac[1].select(biosample_type='tissue', ontology_curie='UBERON:0015143')
```

## Extracting Embeddings

Use `model.encode()` to get embeddings without running prediction heads — useful for
building custom heads or analyzing representations:

```python
# Get embeddings (128bp only for efficiency)
emb = model.encode(dna_onehot, organism_index=0, resolutions=(128,))
emb['embeddings_128bp']  # (B, 1024, 3072) at 128bp
```

## Fine-tuning

Train a new head on your data with frozen trunk (linear probing) or with LoRA adapters:

```python
from alphagenome_pytorch import AlphaGenome, TransferConfig, load_trunk, prepare_for_transfer

# Load trunk, freeze, add custom heads
model = AlphaGenome()
model = load_trunk(model, 'alphagenome.pt')
model = prepare_for_transfer(model, TransferConfig(
    mode='lora',
    new_heads={'atac': {'modality': 'atac', 'num_tracks': 1}},
    lora_rank=8,
))
```

The easiest way to start with fine-tuning is to use [`scripts/finetune.py`](scripts/finetune.py) that implements a flexible CLI interface:

```bash
# LoRA fine-tuning
python scripts/finetune.py --mode lora --lora-rank 8 \
    --genome hg38.fa --modality atac --bigwig *.bw \
    --train-bed train.bed --val-bed val.bed \
    --pretrained-weights alphagenome.pt

# Multi-GPU
torchrun --nproc_per_node=4 scripts/finetune.py --mode lora ...
```

See [`examples/notebooks/finetune_linear_probe.ipynb`](examples/notebooks/finetune_linear_probe.ipynb) for an example of linear probing on ATAC-seq data.

## Numerical Parity with JAX

This port is validated against [the original JAX model](https://github.com/google-deepmind/alphagenome_research), including per-head and full forward pass output comparisons as well as loss values and gradients.

See a compiled [ARCHITECTURE_COMPARISON.md](ARCHITECTURE_COMPARISON.md) for some technical details.

## Model Outputs

| Head | Tracks (human) | Dimension (padded) | Resolutions | Description |
|------|--------|-----------|-------------|-------------|
| atac | 167 | 256 | 1bp, 128bp | Chromatin accessibility |
| dnase | 305 | 384 | 1bp, 128bp | DNase-seq |
| procap | 12 | 128 | 1bp, 128bp | Transcription initiation |
| cage | 546 | 640 | 1bp, 128bp | 5' cap RNA |
| rna_seq | 667 | 768 | 1bp, 128bp | RNA expression |
| chip_tf | 1617 | 1664 | 128bp | TF binding |
| chip_histone | 1116 | 1152 | 128bp | Histone modifications |
| contact_maps | 28 | 28 | 64×64 | 3D chromatin contacts |
| splice_sites | 5 | 5 | 1bp | Splice site classification (D+, A+, D−, A−, none) |
| splice_junctions | 734 | 734 | pairwise | Junction read counts (367 tissues × 2 strands) |
| splice_site_usage | 734 | 734 | 1bp | Fraction of transcripts using splice site |

Tracks column shows the number of real human tracks (without padding). Dimension is the raw output tensor size — padding fills the gap. When using `named_outputs=True`, padding is stripped by default. See [named outputs guide](docs/named_outputs.rst) for details.

See more information about model outputs [in the official AlphaGenome documentation](https://www.alphagenomedocs.com/exploring_model_metadata.html).

## Example Notebooks

- [Demo](examples/notebooks/alphagenome_pytorch_demo.ipynb) — Basic inference and JAX comparison
- [Variant Scoring](examples/notebooks/variant_scoring.ipynb) — Effect prediction
- [In Silico Mutagenesis](examples/notebooks/in_silico_mutagenesis.ipynb) — ISM analysis
- [TAL1 Mutation Example](examples/notebooks/TAL1_variant_effect_and_ISM.ipynb) - TAL1 variant effect and ISM (Figure 6 from AlphaGenome)
- [Fine-tuning](examples/notebooks/finetune_linear_probe.ipynb) — ATAC-seq linear probing
- [Fine-tuning](examples/notebooks/finetune_encoder_only.ipynb) — MPRA (encoder-only)

## Citation

```bibtex
@article{avsec2026alphagenome,
  title={Advancing regulatory variant effect prediction with AlphaGenome},
  author={Avsec, {\v{Z}}iga and Latysheva, Natasha and Cheng, Jun and Novati, Guido and Taylor, Kyle R and Ward, Tom and Bycroft, Clare and Nicolaisen, Lauren and Arvaniti, Eirini and Pan, Joshua and others},
  journal={Nature},
  volume={649},
  number={8099},
  pages={1206--1218},
  year={2026},
  publisher={Nature Publishing Group UK London}
}
```

<details>
<summary>bioRxiv preprint</summary>

```bibtex
@article{avsec2025alphagenome,
    title = {AlphaGenome: advancing regulatory variant effect prediction with a unified DNA sequence model},
    author = {Avsec, {\v Z}iga and Latysheva, Natasha and Cheng, Jun and ...},
    year = {2025},
    journal = {bioRxiv},
    doi = {10.1101/2025.06.25.661532}
}
```

</details>

## Acknowledgements

We acknowledge [Phil Wang](https://gitlab.com/lucidrains), [Miquel Anglada-Girotto](https://github.com/MiqG), and [Xinming Tu](https://github.com/XinmingTu) as developers of an older AlphaGenome PyTorch port unrelated to this repo. Note that the PyPI namespace is now linked to this repo.

## License

This project is a port of the [google-deepmind/alphagenome_research](https://github.com/google-deepmind/alphagenome_research) repository licensed under the Apache License, Version 2.0:

>Copyright 2026 Google LLC

The model parameters, output, and any derivatives thereof remain subject to [Google DeepMind’s AlphaGenome Model Terms](https://deepmind.google.com/science/alphagenome/model-terms).

This port is licensed under the Apache License, Version 2.0 (Apache 2.0):

>Copyright 2026 Danila Bredikhin, Martin Kjellberg, Christopher Zou, Alejandro Buendia, Xinming Tu, Anshul Kundaje

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this except in compliance with the License.
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

