Metadata-Version: 2.4
Name: variantfold
Version: 1.0.0
Summary: Classify variants of uncertain significance using AlphaFold-predicted protein structures and graph neural networks.
Author: VariantFold Contributors
License-Expression: MIT
Project-URL: Homepage, https://github.com/comparativechrono/VariantFold
Project-URL: Repository, https://github.com/comparativechrono/VariantFold
Project-URL: Issues, https://github.com/comparativechrono/VariantFold/issues
Keywords: bioinformatics,alphafold,variant classification,GNN,VUS,ACMG
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: biopython>=1.80
Requires-Dist: biopandas>=0.4
Requires-Dist: numpy>=1.22
Requires-Dist: pandas>=1.4
Requires-Dist: scikit-learn>=1.0
Requires-Dist: torch>=2.0
Requires-Dist: torch-geometric>=2.3
Provides-Extra: structure
Requires-Dist: colabfold[alphafold-minus-jax]; extra == "structure"
Provides-Extra: viz
Requires-Dist: matplotlib>=3.5; extra == "viz"
Requires-Dist: seaborn>=0.12; extra == "viz"
Requires-Dist: networkx>=2.8; extra == "viz"
Requires-Dist: py3Dmol>=1.8; extra == "viz"
Provides-Extra: dgl
Requires-Dist: dgl>=1.0; extra == "dgl"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0; extra == "dev"
Requires-Dist: ruff>=0.1; extra == "dev"
Provides-Extra: all
Requires-Dist: variantfold[dev,dgl,structure,viz]; extra == "all"
Dynamic: license-file

# VariantFold

Classify **Variants of Uncertain Significance (VUS)** using AlphaFold-predicted protein structures and Graph Neural Networks.

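VariantFold predicts a 3-D structure for each variant with ColabFold/AlphaFold, converts it to a residue-level graph, and trains a Graph Convolutional Network (GCN) on known benign and pathogenic variants. The trained model then classifies each VUS as likely benign or likely pathogenic, based on the standardised ACMG-AMP variant classification system.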

## Workflow

```
ClinVar data → Parse variants → Mutate sequences → ColabFold 3-D prediction
    → PDB-to-graph conversion → Train GCN (benign vs pathogenic) → Classify VUS
```

1. **Parse** — Extract missense variants from ClinVar downloads (benign, pathogenic, VUS).
2. **Mutate** — Apply each variant to the reference protein sequence.
3. **Predict** — Run ColabFold to generate 3-D structure models for every variant.
4. **Convert** — Transform PDB files into PyTorch Geometric residue-level graphs with rich node features (one-hot amino acid, 3-D coordinates, pLDDT).
5. **Train** — Train a multi-layer GCN on the benign vs pathogenic graph dataset.
6. **Classify** — Run the trained model on VUS structures to predict likely benign / likely pathogenic with probabilities.
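
Step 2 can be pictured with a minimal sketch. This is not VariantFold's own code: the helper name `apply_missense` and the compact `R167Q`-style variant notation are assumptions made for illustration. It checks that the reference residue matches before substituting, which catches variants reported against a different transcript.

```python
def apply_missense(sequence: str, variant: str) -> str:
    """Return `sequence` with one substitution such as "R167Q" applied.

    Hypothetical helper for illustration. Positions are 1-based,
    as in HGVS protein notation.
    """
    ref, pos, alt = variant[0], int(variant[1:-1]), variant[-1]
    if not 1 <= pos <= len(sequence):
        raise ValueError(f"position {pos} is outside the sequence")
    if sequence[pos - 1] != ref:
        raise ValueError(
            f"expected {ref} at position {pos}, found {sequence[pos - 1]}"
        )
    return sequence[: pos - 1] + alt + sequence[pos:]


# Toy sequence for illustration only (not a real protein).
reference = "MPRRAENW"
mutant = apply_missense(reference, "R3K")
print(mutant)  # MPKRAENW
```

Each mutated sequence would then be passed to structure prediction in step 3.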

## Installation

```bash
# Core package (graph conversion + GCN training/inference)
pip install .

# With ColabFold for structure prediction (GPU recommended)
pip install ".[structure]"

# With visualisation tools
pip install ".[viz]"

# Everything
pip install ".[all]"
```

## Quick start — Python API

```python
from variantfold import VariantFoldConfig, VariantFoldPipeline

cfg = VariantFoldConfig(
    gene_symbol="VHL",
    entrez_email="your_email@example.com",
)

pipe = VariantFoldPipeline(cfg)
pipe.step1_parse_variants()       # Parse ClinVar files + fetch sequence
# pipe.step2_predict_structures() # Run ColabFold (slow, needs GPU); uncomment if no PDB models exist yet
pipe.step3_collect_models()       # Gather best PDB models
metrics = pipe.step4_train()      # Train GCN
print(f"Test accuracy: {metrics['accuracy']:.2%}")

vus_df = pipe.step5_classify_vus()
print(vus_df)
```

## Quick start — CLI

```bash
# Run steps 1, 3, 4, 5 (assumes PDB libraries are already populated)
variantfold run --gene VHL --email you@example.com --steps 1,3,4,5

# Standalone inference on new PDB files
variantfold predict --model variantfold_VHL/variantfold_model.pt \
                    --pdb-dir ./new_vus_pdbs/
```

## Input data

Place these files in the working directory (`./variantfold_<gene>/`):

| File | Description |
|------|-------------|
| `clinvar_result_bng.txt` | ClinVar download filtered to **benign** variants |
| `clinvar_result_ptg.txt` | ClinVar download filtered to **pathogenic** variants |
| `clinvar_result_vus.txt` | ClinVar download filtered to **VUS** *(optional)* |

Download each file from [ClinVar](https://www.ncbi.nlm.nih.gov/clinvar/) as a tab-delimited export with the default settings.
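
To see what kind of record the parser works with, here is a rough sketch of pulling missense changes out of a tab-delimited export. It is not VariantFold's parser: the function `extract_missense`, the `Name` column header, and the sample rows are assumptions for illustration, and the column layout of a real ClinVar download may differ.

```python
import csv
import io
import re

THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}

# Matches HGVS-style protein changes such as "(p.Arg167Gln)".
MISSENSE = re.compile(r"\(p\.([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2})\)")


def extract_missense(tsv_text, name_column="Name"):
    """Return one-letter missense changes (e.g. "R167Q") from TSV text.

    `name_column` is an assumed header; adjust it to the actual export.
    """
    changes = []
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        match = MISSENSE.search(row.get(name_column) or "")
        if not match:
            continue
        ref, pos, alt = match.groups()
        # Skip stop codons ("Ter") and synonymous changes.
        if ref in THREE_TO_ONE and alt in THREE_TO_ONE and ref != alt:
            changes.append(f"{THREE_TO_ONE[ref]}{pos}{THREE_TO_ONE[alt]}")
    return changes


# Made-up rows in the assumed layout, not real ClinVar records.
sample = (
    "Name\tGene(s)\n"
    "NM_000000.0(GENE1):c.500G>A (p.Arg167Gln)\tGENE1\n"
    "NM_000000.0(GENE1):c.481C>T (p.Arg161Ter)\tGENE1\n"
    "NM_000000.0(GENE1):c.340+1G>A\tGENE1\n"
)
print(extract_missense(sample))  # ['R167Q']
```

Only the first row survives: the second is a nonsense variant and the third is a splice-site change with no protein-level annotation.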

## Configuration

All parameters are set via `VariantFoldConfig`:

```python
cfg = VariantFoldConfig(
    gene_symbol="TP53",
    entrez_email="you@example.com",
    distance_threshold=6.5,   # Å, residue contact cutoff
    gcn_hidden_dim=64,        # GCN layer width
    gcn_num_layers=3,         # depth
    epochs=200,
    learning_rate=0.01,
    train_fraction=0.8,
    use_residue_features=True,  # 24-dim node features (set False for legacy 1-dim)
)
```
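
To make `distance_threshold` and the 24-dim features concrete, the sketch below builds both for a toy three-residue chain in plain Python. It rests on assumptions: the feature order (20 one-hot values, then x, y, z, then pLDDT) is one layout that adds up to 24, the choice of one coordinate per residue (for example the C-alpha atom) is a guess, and `node_features` and `contact_edges` are hypothetical names, not VariantFold functions.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def node_features(residue, xyz, plddt):
    """Assumed layout: 20 one-hot values + 3 coordinates + 1 pLDDT = 24."""
    one_hot = [1.0 if aa == residue else 0.0 for aa in AMINO_ACIDS]
    return one_hot + list(xyz) + [plddt]


def contact_edges(coords, distance_threshold=6.5):
    """Return edges (i, j) in both directions for residue pairs whose
    coordinates lie within `distance_threshold` Å of each other."""
    edges = []
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            if math.dist(coords[i], coords[j]) <= distance_threshold:
                edges += [(i, j), (j, i)]
    return edges


# Toy data: one coordinate per residue, in Å.
residues = "MKV"
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (10.0, 0.0, 0.0)]
plddt = [90.0, 85.5, 60.0]

features = [node_features(r, c, p) for r, c, p in zip(residues, coords, plddt)]
edges = contact_edges(coords, distance_threshold=6.5)
print(len(features[0]))  # 24
print(edges)             # [(0, 1), (1, 0), (1, 2), (2, 1)]
```

Residues 0 and 2 are 10 Å apart, so they share no edge at the 6.5 Å cutoff. Raising `distance_threshold` gives denser graphs, and lowering it gives sparser ones.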

## Development

```bash
pip install -e ".[dev]"
pytest
```

## Licence

MIT
