Metadata-Version: 2.4
Name: graphgwas
Version: 0.1.3
Summary: Relational fine-mapping of causal GWAS variants on a multi-omics knowledge graph
Author: Ehsan Estaji, Shi-Wei Zhao, Zhao-Yang Chen, Shuai Nie
Author-email: Jianfeng Mao <jianfeng.mao@umu.se>
Maintainer-email: Jianfeng Mao <jianfeng.mao@umu.se>
License-Expression: MIT
Project-URL: Homepage, https://github.com/jfmao/GraphGWAS
Project-URL: Documentation, https://github.com/jfmao/GraphGWAS/tree/main/docs
Project-URL: Repository, https://github.com/jfmao/GraphGWAS
Project-URL: Issues, https://github.com/jfmao/GraphGWAS/issues
Project-URL: Changelog, https://github.com/jfmao/GraphGWAS/blob/main/CHANGELOG.md
Keywords: gwas,fine-mapping,graph-database,neo4j,multi-omics,population-genetics,bayesian,susie,finemap,polygenic
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: Operating System :: POSIX :: Linux
Classifier: Operating System :: MacOS
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: click>=8.1
Requires-Dist: neo4j>=5.0
Requires-Dist: numpy>=1.24
Requires-Dist: scipy>=1.10
Requires-Dist: pandas>=2.0
Requires-Dist: statsmodels>=0.14
Requires-Dist: networkx>=3.0
Requires-Dist: matplotlib>=3.7
Requires-Dist: requests>=2.31
Provides-Extra: sumstats
Requires-Dist: pysam>=0.22; extra == "sumstats"
Provides-Extra: genotypes
Requires-Dist: cyvcf2>=0.30; extra == "genotypes"
Requires-Dist: bgen>=1.7; extra == "genotypes"
Provides-Extra: api
Requires-Dist: fastapi>=0.110; extra == "api"
Requires-Dist: uvicorn[standard]>=0.27; extra == "api"
Requires-Dist: pydantic>=2.0; extra == "api"
Provides-Extra: mcp
Requires-Dist: mcp>=1.0; extra == "mcp"
Provides-Extra: gnn
Requires-Dist: torch>=2.0; extra == "gnn"
Requires-Dist: torch-geometric>=2.4; extra == "gnn"
Requires-Dist: scikit-learn>=1.3; extra == "gnn"
Provides-Extra: hail
Requires-Dist: hail>=0.2.130; extra == "hail"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov>=4.1; extra == "dev"
Requires-Dist: ruff>=0.4; extra == "dev"
Requires-Dist: build>=1.0; extra == "dev"
Requires-Dist: twine>=4.0; extra == "dev"
Provides-Extra: all
Requires-Dist: graphgwas[api,dev,genotypes,gnn,mcp,sumstats]; extra == "all"
Dynamic: license-file

# GraphGWAS

[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)
[![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue.svg)](https://www.python.org/)
[![Citation](https://img.shields.io/badge/cite-CITATION.cff-green.svg)](CITATION.cff)

**Relational fine-mapping of causal GWAS variants on a multi-omics knowledge graph**

GraphGWAS is a graph-native fine-mapping platform built on Neo4j. It carries
multi-omics biological structure — genes, tissue-specific eQTLs, pathways,
protein–protein interactions — *through* the fine-mapping inference as a typed
factor graph, rather than collapsing it to flat per-variant annotation priors
as existing Bayesian fine-mappers do. This relational prior matches the accuracy
of SuSiE / FINEMAP / SuSiE-inf / FINEMAP-inf / SBayesRC at 6–60× the speed under
strong signal, and wins **27–2 head-to-head against SuSiE at weak signal with
tissue-specific eQTL priors**.

## Key features

- **Two new fine-mapping algorithms** with theoretical guarantees
  - **HBP** — hierarchical belief propagation on a variant→gene→pathway factor
    graph with PPI coupling; proved Banach contraction (Theorem 2); 0.02–0.08 s
    per locus
  - **GAFM** (Graph-Augmented Fine-Mapping) — LD-deconvolved evidence combined
    with a graph functional score via adaptive α; proved causal-variant ranking
    under mild LD-decay assumptions (Theorem 3)
- **Six head-to-head baselines** integrated into a common interface — SuSiE,
  FINEMAP, SuSiE-inf, FINEMAP-inf, PolyFun-proxy, SBayesRC
- **Calibrated PIPs** with 0% null false-positive rate across 100 simulations
- **Multi-omics graph** — 70.7 M variants, 20,092 GENCODE genes, 43.2 M
  GTEx v8 tissue eQTLs, 230,850 STRING interactions (combined score ≥ 700),
  370,000 ENCODE cCREs
- **Biobank-scale** — sumstats-only entry path consumes Pan-UK Biobank summary
  statistics directly via tabix over HTTPS; demonstrated on 4 ancestries
  (EUR N = 420,531; CSA, AFR, EAS)
- **Cross-species** — same codebase applies to yeast, human, *Arabidopsis*
- **Unified package** with 52-command CLI, 37-endpoint FastAPI server, and
  16-tool MCP server for AI-agent access
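
The contraction guarantee cited for HBP matters in practice because a contraction mapping has a unique fixed point, and plain iteration reaches it from any starting state. The sketch below is a generic illustration of that property on a toy three-node coupling. `fixed_point`, `toy_update`, `W`, `b` and `RHO` are hypothetical names chosen for this example, and nothing here reproduces the HBP update rule or Theorem 2.

```python
def fixed_point(update, x0, tol=1e-10, max_iter=1000):
    """Iterate x <- update(x) until the largest coordinate change is below tol."""
    x = list(x0)
    for n_iter in range(1, max_iter + 1):
        x_new = update(x)
        delta = max(abs(a - b) for a, b in zip(x_new, x))
        x = x_new
        if delta < tol:
            return x, n_iter
    raise RuntimeError("no convergence within max_iter")


# Toy coupling (an assumption for illustration): every node averages its two
# neighbours. Each row sums to 1, so scaling by RHO < 1 makes the update a
# contraction in the max norm with factor RHO.
W = [
    [0.0, 0.5, 0.5],
    [0.5, 0.0, 0.5],
    [0.5, 0.5, 0.0],
]
b = [1.0, 0.0, -1.0]  # toy local evidence per node
RHO = 0.5


def toy_update(x):
    """One synchronous update: damped neighbour average plus local evidence."""
    return [
        RHO * sum(w * xj for w, xj in zip(row, x)) + bi
        for row, bi in zip(W, b)
    ]


# Two very different starting states converge to the same fixed point.
sol_a, iters_a = fixed_point(toy_update, [0.0, 0.0, 0.0])
sol_b, iters_b = fixed_point(toy_update, [100.0, -50.0, 7.0])
print(sol_a, iters_a)
print(sol_b, iters_b)
```

Because the error shrinks by at least the factor `RHO` per sweep, the number of sweeps needed grows only logarithmically with the requested tolerance, which is the kind of behaviour a contraction proof makes predictable.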

## Quick start

```bash
# Install
git clone https://github.com/jfmao/GraphGWAS.git
cd GraphGWAS/src/python && pip install -e '.[all]'

# Fetch Pan-UKB summary statistics for sumstats-only fine-mapping (no Neo4j required)
python -c "
from graphgwas.panukb import fetch_sumstats_locus
# Fine-mapping entry point for the fetched sumstats (not called in this snippet)
from graphgwas.finemapping_v2 import hbp_finemap_from_sumstats
# Fetch BMI sumstats near FTO (GRCh37)
sumstats = fetch_sumstats_locus(
    phenocode='21001', chr='16',
    start=53720000, end=53920000,
    trait_type='continuous', modifier='irnt',
    ancestries=['EUR', 'CSA', 'AFR', 'EAS'],
)
print({anc: len(s.variants) for anc, s in sumstats.items()})
"

# Full pipeline with Neo4j + multi-omics graph:
# (1) Start Neo4j with the pre-built human dump (17 GB, from Zenodo)
# (2) Run GAFM fine-mapping on a lead variant
graphgwas finemap --chr 16 --pos 53820527 --window 100000 \
    --phenotype BMI --method l1 -o credible_set.tsv
```

## The graph schema

```
 Variant ──HAS_CONSEQUENCE──> Gene ──IN_PATHWAY──> Pathway
    │                           │
    ├── (af, qual, gt_packed)   ├── INTERACTS_WITH (STRING PPI ≥ 700)
    ├── eQTL ─────────────> Gene (tissue-specific, GTEx v8)
    ├── IN_REGULATORY ─────> RegulatoryElement  (ENCODE cCRE)
    └── FOR_VARIANT <─── AssociationResult ──IN_STUDY──> GWASStudy
```

The credible-set output is itself a graph object: each reported variant is
co-queryable with its gene, tissue and pathway neighbours in a single Cypher
traversal, eliminating the post-hoc enrichment step that flat-prior pipelines
require.
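
As a rough sketch of what such a single traversal could look like, the snippet below builds a Cypher query from the node labels and relationship types in the schema diagram and, optionally, runs it with the official `neo4j` Python driver. The property names (`chr`, `pos`, `symbol`, `name`, `tissue`) and the helper names `neighbourhood_query` and `run_query` are assumptions made for illustration; check them against the actual graph before use.

```python
def neighbourhood_query(chrom, pos):
    """Return (cypher, params) for one variant's gene, pathway and eQTL context.

    Labels and relationship types follow the schema diagram; the property
    names are assumed, not taken from the GraphGWAS documentation.
    """
    cypher = (
        "MATCH (v:Variant {chr: $chr, pos: $pos})\n"
        "OPTIONAL MATCH (v)-[:HAS_CONSEQUENCE]->(g:Gene)\n"
        "OPTIONAL MATCH (g)-[:IN_PATHWAY]->(p:Pathway)\n"
        "OPTIONAL MATCH (v)-[e:eQTL]->(:Gene)\n"
        "RETURN v.chr AS chr, v.pos AS pos,\n"
        "       collect(DISTINCT g.symbol) AS genes,\n"
        "       collect(DISTINCT p.name) AS pathways,\n"
        "       collect(DISTINCT e.tissue) AS eqtl_tissues"
    )
    return cypher, {"chr": str(chrom), "pos": int(pos)}


def run_query(uri, user, password, cypher, params):
    """Execute the query against a running Neo4j instance and return dict rows."""
    from neo4j import GraphDatabase  # imported lazily; only needed when executing

    with GraphDatabase.driver(uri, auth=(user, password)) as driver:
        with driver.session() as session:
            # Consume the result inside the session, before it is closed.
            return [record.data() for record in session.run(cypher, params)]


# Build the query for the lead variant used in the quick start.
cypher, params = neighbourhood_query(16, 53820527)
print(cypher)
print(params)
```

With a database running, `run_query("bolt://localhost:7687", user, password, cypher, params)` would return one row per matched variant; the URI shown is the Neo4j default and may differ in your deployment.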

## Three interfaces

| Interface | Use case | Entry point |
|---|---|---|
| **CLI** (52 commands, 15 groups) | interactive analysis, scripted pipelines | `graphgwas ...` |
| **REST API** (FastAPI, 37 endpoints) | web integration, programmatic access | `graphgwas api serve` |
| **MCP server** (FastMCP, 16 tools) | AI-agent access via any MCP-compatible client | `graphgwas mcp` |

Full documentation is in [`docs/manual/`](docs/manual/index.md); an end-to-end
walkthrough is in [`vignettes/fine-mapping-quickstart.md`](vignettes/fine-mapping-quickstart.md).

## Fine-mapping methods at a glance

| Method | Complexity | Typical runtime / locus | Advantage over SuSiE / intended use |
|---|---|---|---|
| **HBP** (three-layer factor graph + Banach contraction) | O(E × T) | 0.02–0.08 s | accuracy parity; 6–60× faster |
| **GAFM** (LD-deconvolved + adaptive α + graph prior) | O(n²) | 0.07 s | 27–2 at weak signal + tissue-specific eQTL priors |
| **CLGF** (cross-locus EM) | O(L × T) | locus-dependent | multi-locus shared-pathway evidence |
| **L4** (MDS embedding) | O(n² + n d) | 0.1 s | multi-signal detection |

## Documentation

- [`docs/INSTALL.md`](docs/INSTALL.md) — detailed installation guide
  (Neo4j, Python env, Hail for Pan-UKB LD, optional GNN deps)
- [`docs/manual/index.md`](docs/manual/index.md) — full CLI reference
  (52 commands across 15 groups)
- [`vignettes/fine-mapping-quickstart.md`](vignettes/fine-mapping-quickstart.md) — 15-min Pan-UKB sumstats → credible set
- [`vignettes/full-1kg-pipeline.md`](vignettes/full-1kg-pipeline.md) — 4–6 h end-to-end: raw 1000 Genomes VCF → GWAS → fine-mapping → graph-queryable credible set
- [`docs/MATHEMATICAL_PROOFS.md`](docs/MATHEMATICAL_PROOFS.md) — theorems 1–5
- [`docs/REPRODUCIBILITY.md`](docs/REPRODUCIBILITY.md) — regenerate every
  paper figure and table from a single command

## Platform scope beyond fine-mapping

GraphGWAS is a platform of which fine-mapping is the first method class
rigorously benchmarked (see the accompanying manuscript submitted to Nature
Genetics). The codebase additionally implements:

- Epistasis (M1 LD-pruned, M2 motif-filtered, M3 differential-subgraph,
  M4 dark-matter pairs) — companion manuscript in preparation
- Heritability (6 estimators including spectral, GRM-REML, conductance)
- Multivariate cross-trait analysis (r_G, G-matrix, coherence, pleiotropy)
- Polygenic risk scores (classical + pathway-weighted)
- Mendelian randomisation (IVW, Egger, weighted median)
- Gene–environment interactions (multi-environment trials)
- Heterogeneous GNN (PyTorch Geometric) and LangGraph AI-agent interface

The benchmark status of each of these method classes, including those not yet
benchmarked, is tabulated in Supplementary Note S3 of the manuscript.

## Data

Pre-built Neo4j graph databases are provided on Zenodo (DOIs to be assigned on acceptance):

| Dataset | Size | Contents |
|---|---|---|
| Human 1KG + multi-omics | 17 GB | 70.7 M variants, 3,202 samples, 20,092 genes, 43.2 M GTEx eQTLs, 230 K STRING PPIs, 370 K ENCODE cCREs |
| Yeast 1011 Genomes | 0.5 GB | 1.92 M variants, 1,011 strains, SGD gene annotations, 35 growth-trait phenotypes |

Pan-UKB summary statistics are streamed on demand via tabix over HTTPS from
the public Amazon S3 bucket `pan-ukb-us-east-1`; no authentication or bulk
download required.
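
For readers who want to see what streaming a region involves without going through GraphGWAS, the following is a minimal sketch using `pysam.TabixFile` directly. The object-key layout (`sumstats_flat_files/` for data, `sumstats_flat_files_tabix/` for indexes, and the `trait-phenocode-sex-modifier` file name) is an assumption based on the public Pan-UKB release and should be checked against the Pan-UKB phenotype manifest; `panukb_urls` and `fetch_rows` are hypothetical helper names, not part of `graphgwas.panukb`.

```python
BUCKET_URL = "https://pan-ukb-us-east-1.s3.amazonaws.com"


def panukb_urls(trait_type, phenocode, modifier="", pheno_sex="both_sexes"):
    """Build (data_url, index_url) for one phenotype's flat file.

    The key layout is assumed; confirm it in the Pan-UKB manifest.
    Empty components (for example a missing modifier) are skipped.
    """
    parts = [trait_type, str(phenocode), pheno_sex, modifier]
    name = "-".join(part for part in parts if part) + ".tsv.bgz"
    data_url = f"{BUCKET_URL}/sumstats_flat_files/{name}"
    index_url = f"{BUCKET_URL}/sumstats_flat_files_tabix/{name}.tbi"
    return data_url, index_url


def fetch_rows(data_url, index_url, chrom, start, end):
    """Stream only the rows overlapping chrom:start-end (0-based, half-open)."""
    import pysam  # imported lazily; provided by the 'sumstats' extra

    with pysam.TabixFile(data_url, index=index_url) as tbx:
        return [line.split("\t") for line in tbx.fetch(chrom, start, end)]


# Standing height-style example from the quick start: BMI (phenocode 21001, irnt).
data_url, index_url = panukb_urls("continuous", "21001", "irnt")
print(data_url)
print(index_url)
```

Calling `fetch_rows(data_url, index_url, "16", 53720000, 53920000)` would then transfer only the index and the compressed blocks covering the FTO window, which is why no bulk download is needed.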

## Citation

If you use GraphGWAS, please cite the accompanying Nature Genetics
manuscript (*Relational biological structure improves fine-mapping of causal
GWAS variants under weak signal*, submitted 2026) and the Zenodo-versioned
software release. See [`CITATION.cff`](CITATION.cff).

## License

MIT — see [`LICENSE`](LICENSE).
