Metadata-Version: 2.4
Name: diffbio
Version: 0.1.0
Summary: DiffBio: End-to-end differentiable bioinformatics pipelines built on Datarax, Artifex, Opifex, and Calibrax
Project-URL: Bug Tracker, https://github.com/avitai/DiffBio/issues
Project-URL: Documentation, https://diffbio.readthedocs.io
Project-URL: Source, https://github.com/avitai/DiffBio
Author: Mahdi Shafiei
License: MIT License
        
        Copyright (c) 2026 Mahdi Shafiei
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
License-File: LICENSE
Keywords: alignment,bioinformatics,differentiable,flax,jax,machine-learning,variant-calling
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Education
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Classifier: Topic :: Software Development
Classifier: Topic :: Software Development :: Libraries
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.11
Requires-Dist: anndata>=0.9.1
Requires-Dist: avitai-artifex>=0.1.0
Requires-Dist: beartype>=0.14.1
Requires-Dist: biopython>=1.81
Requires-Dist: calibrax>=0.1.1
Requires-Dist: chex>=0.1.7
Requires-Dist: datarax>=0.1.3
Requires-Dist: flax>=0.12.0
Requires-Dist: h5py>=3.7
Requires-Dist: jax-md>=0.2.27
Requires-Dist: jax>=0.6.1
Requires-Dist: jaxtyping>=0.2.20
Requires-Dist: numpy>=1.24
Requires-Dist: opifex>=0.1.0
Requires-Dist: optax>=0.1.4
Requires-Dist: orbax-checkpoint>=0.11.10
Requires-Dist: pre-commit>=4.3.0
Requires-Dist: rdkit>=2025.9.3
Requires-Dist: ruff>=0.1.5
Requires-Dist: scipy>=1.10
Provides-Extra: all
Requires-Dist: bandit[toml]>=1.8.6; extra == 'all'
Requires-Dist: beartype>=0.14.1; extra == 'all'
Requires-Dist: build>=1.0.3; extra == 'all'
Requires-Dist: coverage>=7; extra == 'all'
Requires-Dist: deepchem>=2.8.0; extra == 'all'
Requires-Dist: flake8-functions-names>=0.4; extra == 'all'
Requires-Dist: flake8>=7.0; extra == 'all'
Requires-Dist: griffe>=1.7.3; extra == 'all'
Requires-Dist: import-linter>=2.5; extra == 'all'
Requires-Dist: interrogate>=1.7.0; extra == 'all'
Requires-Dist: ipykernel>=6.29.5; extra == 'all'
Requires-Dist: jax[cuda12]>=0.6.1; extra == 'all'
Requires-Dist: jaxlib>=0.6.1; extra == 'all'
Requires-Dist: lineax>=0.0.8; extra == 'all'
Requires-Dist: matplotlib>=3.7; extra == 'all'
Requires-Dist: mkdocs-include-exclude-files>=0.1; extra == 'all'
Requires-Dist: mkdocs-material>=9.6.7; extra == 'all'
Requires-Dist: mkdocs>=1.6.1; extra == 'all'
Requires-Dist: mkdocstrings-python>=1.1.2; extra == 'all'
Requires-Dist: mkdocstrings>=0.28.3; extra == 'all'
Requires-Dist: optimistix>=0.0.9; extra == 'all'
Requires-Dist: ott-jax>=0.5.0; extra == 'all'
Requires-Dist: pandas>=2.0; extra == 'all'
Requires-Dist: pyfaidx>=0.8.0; extra == 'all'
Requires-Dist: pylint>=3.3.8; extra == 'all'
Requires-Dist: pymdown-extensions>=10.14.3; extra == 'all'
Requires-Dist: pynndescent>=0.5; extra == 'all'
Requires-Dist: pyright>=1.1.336; extra == 'all'
Requires-Dist: pysam>=0.22.0; extra == 'all'
Requires-Dist: pytest-asyncio>=0.23; extra == 'all'
Requires-Dist: pytest-benchmark>=4; extra == 'all'
Requires-Dist: pytest-cov>=6.1.1; extra == 'all'
Requires-Dist: pytest-env>=1.0.1; extra == 'all'
Requires-Dist: pytest-json-report>=1.5.0; extra == 'all'
Requires-Dist: pytest-randomly>=3.16.0; extra == 'all'
Requires-Dist: pytest-timeout>=2.1; extra == 'all'
Requires-Dist: pytest-xdist>=3.6; extra == 'all'
Requires-Dist: pytest>=8.3.5; extra == 'all'
Requires-Dist: python-dotenv>=1; extra == 'all'
Requires-Dist: radon>=6.0.1; extra == 'all'
Requires-Dist: ruff>=0.1.5; extra == 'all'
Requires-Dist: scib-metrics>=0.5; extra == 'all'
Requires-Dist: shellcheck-py>=0.10.0.1; extra == 'all'
Requires-Dist: squidpy>=1.4; extra == 'all'
Requires-Dist: tabulate>=0.9; extra == 'all'
Requires-Dist: torch>=1.13.0; extra == 'all'
Requires-Dist: wemake-python-styleguide>=1.0; extra == 'all'
Provides-Extra: benchmark
Requires-Dist: deepchem>=2.8.0; extra == 'benchmark'
Requires-Dist: matplotlib>=3.7; extra == 'benchmark'
Requires-Dist: pandas>=2.0; extra == 'benchmark'
Requires-Dist: pynndescent>=0.5; extra == 'benchmark'
Requires-Dist: scib-metrics>=0.5; extra == 'benchmark'
Requires-Dist: squidpy>=1.4; extra == 'benchmark'
Requires-Dist: tabulate>=0.9; extra == 'benchmark'
Provides-Extra: cuda-dev
Requires-Dist: bandit[toml]>=1.8.6; extra == 'cuda-dev'
Requires-Dist: build>=1.0.3; extra == 'cuda-dev'
Requires-Dist: coverage>=7; extra == 'cuda-dev'
Requires-Dist: flake8-functions-names>=0.4; extra == 'cuda-dev'
Requires-Dist: flake8>=7.0; extra == 'cuda-dev'
Requires-Dist: import-linter>=2.5; extra == 'cuda-dev'
Requires-Dist: interrogate>=1.7.0; extra == 'cuda-dev'
Requires-Dist: ipykernel>=6.29.5; extra == 'cuda-dev'
Requires-Dist: jax[cuda12]>=0.6.1; extra == 'cuda-dev'
Requires-Dist: jaxlib>=0.6.1; extra == 'cuda-dev'
Requires-Dist: pylint>=3.3.8; extra == 'cuda-dev'
Requires-Dist: pyright>=1.1.336; extra == 'cuda-dev'
Requires-Dist: pytest-asyncio>=0.23; extra == 'cuda-dev'
Requires-Dist: pytest-benchmark>=4; extra == 'cuda-dev'
Requires-Dist: pytest-cov>=6.1.1; extra == 'cuda-dev'
Requires-Dist: pytest-env>=1.0.1; extra == 'cuda-dev'
Requires-Dist: pytest-json-report>=1.5.0; extra == 'cuda-dev'
Requires-Dist: pytest-randomly>=3.16.0; extra == 'cuda-dev'
Requires-Dist: pytest-timeout>=2.1; extra == 'cuda-dev'
Requires-Dist: pytest-xdist>=3.6; extra == 'cuda-dev'
Requires-Dist: pytest>=8.3.5; extra == 'cuda-dev'
Requires-Dist: python-dotenv>=1; extra == 'cuda-dev'
Requires-Dist: radon>=6.0.1; extra == 'cuda-dev'
Requires-Dist: ruff>=0.1.5; extra == 'cuda-dev'
Requires-Dist: shellcheck-py>=0.10.0.1; extra == 'cuda-dev'
Requires-Dist: wemake-python-styleguide>=1.0; extra == 'cuda-dev'
Provides-Extra: dev
Requires-Dist: bandit[toml]>=1.8.6; extra == 'dev'
Requires-Dist: build>=1.0.3; extra == 'dev'
Requires-Dist: coverage>=7; extra == 'dev'
Requires-Dist: flake8-functions-names>=0.4; extra == 'dev'
Requires-Dist: flake8>=7.0; extra == 'dev'
Requires-Dist: import-linter>=2.5; extra == 'dev'
Requires-Dist: interrogate>=1.7.0; extra == 'dev'
Requires-Dist: ipykernel>=6.29.5; extra == 'dev'
Requires-Dist: pylint>=3.3.8; extra == 'dev'
Requires-Dist: pyright>=1.1.336; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.23; extra == 'dev'
Requires-Dist: pytest-benchmark>=4; extra == 'dev'
Requires-Dist: pytest-cov>=6.1.1; extra == 'dev'
Requires-Dist: pytest-env>=1.0.1; extra == 'dev'
Requires-Dist: pytest-json-report>=1.5.0; extra == 'dev'
Requires-Dist: pytest-randomly>=3.16.0; extra == 'dev'
Requires-Dist: pytest-timeout>=2.1; extra == 'dev'
Requires-Dist: pytest-xdist>=3.6; extra == 'dev'
Requires-Dist: pytest>=8.3.5; extra == 'dev'
Requires-Dist: python-dotenv>=1; extra == 'dev'
Requires-Dist: radon>=6.0.1; extra == 'dev'
Requires-Dist: ruff>=0.1.5; extra == 'dev'
Requires-Dist: shellcheck-py>=0.10.0.1; extra == 'dev'
Requires-Dist: wemake-python-styleguide>=1.0; extra == 'dev'
Provides-Extra: docs
Requires-Dist: griffe>=1.7.3; extra == 'docs'
Requires-Dist: mkdocs-include-exclude-files>=0.1; extra == 'docs'
Requires-Dist: mkdocs-material>=9.6.7; extra == 'docs'
Requires-Dist: mkdocs>=1.6.1; extra == 'docs'
Requires-Dist: mkdocstrings-python>=1.1.2; extra == 'docs'
Requires-Dist: mkdocstrings>=0.28.3; extra == 'docs'
Requires-Dist: pymdown-extensions>=10.14.3; extra == 'docs'
Provides-Extra: genomics
Requires-Dist: pyfaidx>=0.8.0; extra == 'genomics'
Requires-Dist: pysam>=0.22.0; extra == 'genomics'
Provides-Extra: gpu
Requires-Dist: jax[cuda12]>=0.6.1; extra == 'gpu'
Requires-Dist: jaxlib>=0.6.1; extra == 'gpu'
Provides-Extra: soft-ops-advanced
Requires-Dist: lineax>=0.0.8; extra == 'soft-ops-advanced'
Requires-Dist: optimistix>=0.0.9; extra == 'soft-ops-advanced'
Provides-Extra: soft-ops-ot
Requires-Dist: lineax>=0.0.8; extra == 'soft-ops-ot'
Requires-Dist: optimistix>=0.0.9; extra == 'soft-ops-ot'
Requires-Dist: ott-jax>=0.5.0; extra == 'soft-ops-ot'
Provides-Extra: test
Requires-Dist: beartype>=0.14.1; extra == 'test'
Requires-Dist: coverage>=7; extra == 'test'
Requires-Dist: pytest-asyncio>=0.23; extra == 'test'
Requires-Dist: pytest-benchmark>=4; extra == 'test'
Requires-Dist: pytest-cov>=6.1.1; extra == 'test'
Requires-Dist: pytest-env>=1.0.1; extra == 'test'
Requires-Dist: pytest-randomly>=3.16.0; extra == 'test'
Requires-Dist: pytest-timeout>=2.1; extra == 'test'
Requires-Dist: pytest-xdist>=3.6; extra == 'test'
Requires-Dist: pytest>=8.3.5; extra == 'test'
Provides-Extra: torch-io
Requires-Dist: torch>=1.13.0; extra == 'torch-io'
Description-Content-Type: text/markdown

# DiffBio

<p align="center">
  <a href="https://www.python.org/downloads/"><img src="https://img.shields.io/badge/python-3.11+-blue.svg" alt="Python 3.11+"></a>
  <a href="https://jax.readthedocs.io/"><img src="https://img.shields.io/badge/JAX-0.6.1+-green.svg" alt="JAX"></a>
  <a href="https://flax.readthedocs.io/"><img src="https://img.shields.io/badge/Flax-0.12+-orange.svg" alt="Flax"></a>
  <a href="LICENSE"><img src="https://img.shields.io/badge/license-MIT-blue.svg" alt="License"></a>
</p>

<p align="center">
  <strong>End-to-End Differentiable Bioinformatics Pipelines</strong>
</p>

<p align="center">
  Built on <a href="https://github.com/avitai/datarax">Datarax</a>, <a href="https://github.com/avitai/artifex">Artifex</a>, <a href="https://github.com/avitai/Opifex">Opifex</a>, and <a href="https://github.com/avitai/calibrax">Calibrax</a> | Powered by <a href="https://jax.readthedocs.io/">JAX</a> & <a href="https://flax.readthedocs.io/">Flax NNX</a>
</p>

---

## Overview

DiffBio is a framework for building **end-to-end differentiable bioinformatics
pipelines**. By replacing discrete operations with differentiable relaxations,
DiffBio enables gradient-based optimization through entire analysis workflows.

DiffBio is the biology-specific differentiable operator layer of a wider
JAX/NNX scientific ML ecosystem. It uses:

- **Datarax** for operator and dataflow contracts
- **Artifex** for reusable model-building and transformer components
- **Opifex** for scientific ML and advanced optimization primitives
- **Calibrax** for metrics, benchmarking, comparison, and regression control

Traditional bioinformatics pipelines use discrete operations (hard thresholds, argmax decisions) that block gradient flow. DiffBio addresses this by:

- **Soft quality filtering** using sigmoid-based weights instead of hard cutoffs
- **Differentiable pileup** with soft position assignments via temperature-controlled softmax
- **Soft alignment scoring** replacing discrete Smith-Waterman with continuous relaxations
- **End-to-end training** of complete pipelines using gradient descent

This makes it possible to learn optimal pipeline parameters directly from data rather than tuning them by hand.
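
The difference is easiest to see on a toy example. The sketch below is plain JAX (not DiffBio API) and compares the gradient of a hard quality cutoff with its sigmoid relaxation, taken with respect to the threshold:

```python
import jax
import jax.numpy as jnp

quality = jnp.array([35.0, 15.0, 28.0, 10.0])

def hard_keep_count(threshold):
    # Hard cutoff: a step function that is flat almost everywhere.
    return jnp.where(quality > threshold, 1.0, 0.0).sum()

def soft_keep_count(threshold, temperature=1.0):
    # Sigmoid relaxation: a smooth, temperature-controlled version of the step.
    return jax.nn.sigmoid((quality - threshold) / temperature).sum()

print(jax.grad(hard_keep_count)(20.0))  # 0.0 -- no signal for tuning the threshold
print(jax.grad(soft_keep_count)(20.0))  # non-zero -- the threshold can be learned
```

As the temperature shrinks, the soft weights approach the hard mask, so fidelity to the discrete operation can be traded against gradient quality.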

## Features

- **40+ Differentiable Operators** covering alignment, variant calling, single-cell analysis, epigenomics, RNA-seq, preprocessing, normalization, multi-omics, drug discovery, and protein/RNA structure
- **6 End-to-End Pipelines** for variant calling, enhanced variant calling, single-cell analysis, differential expression, perturbation, and preprocessing
- **GPU-Accelerated** computation via JAX's XLA compilation
- **Composable Architecture** built on the Datarax, Artifex, Opifex, and Calibrax stack
- **Training Utilities** with gradient clipping, custom loss functions, and synthetic data generation

For complete operator and pipeline listings, see the [Operators Overview](https://docs.avitai.bio/diffbio/user-guide/operators/overview/) and [Pipelines Overview](https://docs.avitai.bio/diffbio/user-guide/pipelines/overview/) in the documentation.

## Installation

```bash
# Clone the repository
git clone https://github.com/avitai/DiffBio.git
cd DiffBio

# Install with uv
uv sync
```
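
As a quick sanity check that the environment resolves the package, the installed version can be read back with the standard library (run it with `uv run python`); it should match the version declared in the project metadata:

```python
from importlib.metadata import version

print(version("diffbio"))  # expected: 0.1.0
```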

## Quick Start

### Using Individual Operators

```python
import jax
import jax.numpy as jnp
from flax import nnx

from diffbio.operators import DifferentiableQualityFilter
from diffbio.operators.variant.pileup import DifferentiablePileup
from diffbio.operators.alignment.smith_waterman import SmoothSmithWaterman

# Quality filtering with learnable threshold
quality_filter = DifferentiableQualityFilter(
    threshold=20.0,
    temperature=1.0,
    rngs=nnx.Rngs(0),
)

# Apply to reads: 4 reads with one quality score each
quality_scores = jnp.array([35.0, 15.0, 28.0, 10.0])
reads = jax.nn.one_hot(jnp.array([[0, 1, 2, 3]] * 4), 4)  # (4 reads, length 4, 4 bases)
data = {"reads": reads, "quality": quality_scores}

filtered_data, _, _ = quality_filter.apply(data, {}, None)
# filtered_data["weights"] contains soft weights for each read
```
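
Downstream operators consume these weights directly. As a standalone illustration (assuming one weight per read, matching the shapes above), the weights can be broadcast against the reads so that low-quality reads are down-weighted rather than discarded:

```python
# Soft down-weighting instead of hard removal keeps every read in the graph.
weights = filtered_data["weights"]               # (4,) -- one soft weight per read
weighted_reads = reads * weights[:, None, None]  # (4, 4, 4) -- broadcast over bases
```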

### Using the Variant Calling Pipeline

```python
from diffbio.pipelines import (
    VariantCallingPipeline,
    VariantCallingPipelineConfig,
    create_variant_calling_pipeline,
)

# Create pipeline with default configuration
pipeline = create_variant_calling_pipeline(
    reference_length=100,
    num_classes=3,  # ref, SNP, indel
    hidden_dim=32,
    seed=42,
)

# Build the batch (reads from the snippet above; positions and quality are placeholders)
num_reads, read_length = reads.shape[0], reads.shape[1]
positions = jnp.zeros((num_reads,), dtype=jnp.int32)  # start position of each read
quality = jnp.full((num_reads, read_length), 30.0)    # per-base quality scores

batch_data = {
    "reads": reads,           # (num_reads, read_length, 4)
    "positions": positions,   # (num_reads,)
    "quality": quality,       # (num_reads, read_length)
}

result, _, _ = pipeline.apply(batch_data, {}, None)
# result["logits"] contains per-position variant predictions
# result["probabilities"] contains class probabilities
```

### Training a Pipeline

```python
from diffbio.utils import (
    Trainer,
    TrainingConfig,
    cross_entropy_loss,
    create_synthetic_training_data,
    data_iterator,
)

# Generate synthetic training data
inputs, targets = create_synthetic_training_data(
    num_samples=100,
    num_reads=10,
    read_length=50,
    reference_length=100,
    variant_rate=0.1,
)

# Configure training
config = TrainingConfig(
    learning_rate=1e-3,
    num_epochs=50,
    log_every=10,
    grad_clip_norm=1.0,
)

# Create trainer
trainer = Trainer(pipeline, config)

# Define loss function
def loss_fn(predictions, targets):
    return cross_entropy_loss(
        predictions["logits"],
        targets["labels"],
        num_classes=3,
    )

# Train
trainer.train(
    data_iterator_fn=lambda: data_iterator(inputs, targets),
    loss_fn=loss_fn,
)

# Access trained pipeline
trained_pipeline = trainer.pipeline
```
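
The trained pipeline exposes the same `apply()` contract as before, so inference reuses the `batch_data` dict from the previous section:

```python
# Run the trained pipeline on a batch of reads.
result, _, _ = trained_pipeline.apply(batch_data, {}, None)
probabilities = result["probabilities"]  # per-position class probabilities
```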

## Architecture

DiffBio sits on a layered ecosystem rather than standing alone:

| Layer | Library | Role In DiffBio |
|---|---|---|
| Execution contracts | [Datarax](https://github.com/avitai/datarax) | Operator, data-source, and pipeline contracts |
| Modeling substrate | [Artifex](https://github.com/avitai/artifex) | Reusable transformer and generative-model components |
| Scientific ML substrate | [Opifex](https://github.com/avitai/Opifex) | Scientific optimization, operator learning, and advanced training methods |
| Evaluation substrate | [Calibrax](https://github.com/avitai/calibrax) | Metrics, benchmarking, comparison, profiling, and regression checks |
| Biology-specific layer | DiffBio | Differentiable biological operators and domain compositions |

Each DiffBio operator inherits from Datarax's `OperatorModule` and implements:

```
apply(data, state, metadata) -> (output_data, output_state, output_metadata)
```

This enables:
- **Composition**: Chain operators into pipelines
- **Batch processing**: Automatic vectorization via `apply_batch()`
- **Gradient flow**: End-to-end differentiability through the pipeline
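
To make the contract concrete, here is a minimal sketch of an operator-style module. The real base class is Datarax's `OperatorModule`, whose import path and constructor are not shown in this README, so a plain `nnx.Module` stands in for it:

```python
import jax
import jax.numpy as jnp
from flax import nnx


class SoftThresholdOperator(nnx.Module):
    """Illustrative operator following the (data, state, metadata) contract.

    A real DiffBio operator would subclass Datarax's OperatorModule rather
    than nnx.Module; the apply() signature is the same.
    """

    def __init__(self, threshold: float, temperature: float, *, rngs: nnx.Rngs):
        del rngs  # accepted only to mirror the constructor pattern used above
        self.threshold = nnx.Param(jnp.asarray(threshold))  # learnable threshold
        self.temperature = temperature

    def apply(self, data, state, metadata):
        # Weight each read by a sigmoid of its quality instead of a hard cutoff.
        weights = jax.nn.sigmoid(
            (data["quality"] - self.threshold.value) / self.temperature
        )
        return {**data, "weights": weights}, state, metadata
```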

### Operator Composition

Operators are chained by threading the `(data, state, metadata)` triple
returned by `apply()` into the next operator:

```python
data, state, metadata = quality_filter.apply(batch_data, {}, None)
data, state, metadata = pileup.apply(data, state, metadata)
data, state, metadata = classifier.apply(data, state, metadata)

# `data` is a dict of JAX arrays — read out the per-position predictions
predictions = data["logits"]
```
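
Because every stage is a soft relaxation, a scalar computed at the end of the chain has usable gradients with respect to upstream operator parameters. A hedged sketch, reusing the illustrative `pileup` and `classifier` operators above and assuming the quality filter's parameters are NNX `Param`s:

```python
from flax import nnx


def chain_loss(op):
    # Re-run the chain with `op` as the first stage and reduce to a scalar.
    data, state, metadata = op.apply(batch_data, {}, None)
    data, state, metadata = pileup.apply(data, state, metadata)
    data, state, metadata = classifier.apply(data, state, metadata)
    return data["logits"].mean()


# Gradients with respect to the quality filter's learnable parameters.
grads = nnx.grad(chain_loss)(quality_filter)
```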

## Testing

```bash
# Run all tests
uv run pytest -vv

# Run with coverage
uv run pytest -vv --cov=src/ --cov-report=term-missing

# Run specific test modules
uv run pytest tests/operators/ -vv
uv run pytest tests/pipelines/ -vv
uv run pytest tests/integration/ -vv
```

## Project Structure

```
DiffBio/
├── src/diffbio/
│   ├── core/                # Base operators, graph utils, soft ops
│   ├── operators/           # 35+ differentiable operators
│   │   ├── alignment/       # Smith-Waterman, profile HMM, soft MSA
│   │   ├── variant/         # Pileup, classifiers, CNV segmentation
│   │   ├── singlecell/      # Clustering, trajectory, velocity, GRN, ...
│   │   ├── drug_discovery/  # Fingerprints, property prediction, ADMET
│   │   ├── epigenomics/     # Peak calling, chromatin state
│   │   ├── normalization/   # VAE normalizer, UMAP, PHATE
│   │   ├── statistical/     # HMM, NB GLM, EM quantification
│   │   ├── multiomics/      # Hi-C, spatial deconvolution
│   │   └── ...              # preprocessing, protein, RNA, assembly, ...
│   ├── pipelines/           # End-to-end pipelines
│   ├── losses/              # Alignment, single-cell, statistical losses
│   ├── sources/             # Data loaders (FASTA, BAM, MolNet, ...)
│   ├── splitters/           # Dataset splitting strategies
│   └── utils/               # Training utilities
├── tests/                   # Unit, integration, and benchmark tests
├── benchmarks/              # Domain benchmarks with training + baselines
└── docs/                    # MkDocs documentation
```

## Requirements

- Python 3.11+
- JAX 0.6.1+
- Flax 0.12+
- Optax 0.1.4+
- jaxtyping 0.2.20+
- Datarax, Artifex, Opifex, and Calibrax (installed automatically from PyPI)

## License

MIT License. See [LICENSE](LICENSE) for details.

## Acknowledgments

DiffBio builds on ideas from:
- [SMURF](https://www.biorxiv.org/content/10.1101/2021.10.23.465204): Differentiable Smith-Waterman for end-to-end MSA learning
- [Datarax](https://github.com/avitai/datarax): Composable data processing framework
- [Artifex](https://github.com/avitai/artifex): Generative-model and transformer substrate
- [Opifex](https://github.com/avitai/Opifex): Scientific ML and advanced optimization substrate
- [Calibrax](https://github.com/avitai/calibrax): Benchmarking, comparison, and regression substrate
- [Flax NNX](https://flax.readthedocs.io/): Neural network library for JAX
