Metadata-Version: 2.4
Name: FedGWAS
Version: 0.3.1
Summary: Federated genome-wide association study pipeline built with Flower and PLINK
Project-URL: Homepage, https://github.com/sitaomin1994/FedGWAS_pipeline
Project-URL: Repository, https://github.com/sitaomin1994/FedGWAS_pipeline
Project-URL: Issues, https://github.com/sitaomin1994/FedGWAS_pipeline/issues
Project-URL: Documentation, https://github.com/sitaomin1994/FedGWAS_pipeline#readme
Author: idsla
License: MIT
License-File: LICENSE
Keywords: bioinformatics,federated-learning,flower,gwas,plink
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Requires-Python: >=3.11
Requires-Dist: flwr[simulation]<1.20,>=1.19.0
Requires-Dist: mkdocs>=1.6.1
Requires-Dist: numpy>=1.21.0
Requires-Dist: pandas-plink>=2.3.1
Requires-Dist: pandas>=2.2.2
Requires-Dist: phe>=1.5.0
Requires-Dist: pycryptodomex>=3.19.0
Requires-Dist: pyplink>=1.3.7
Requires-Dist: pysnptools>=0.5.13
Requires-Dist: scipy>=1.9.0
Provides-Extra: dev
Requires-Dist: flake8-docstrings>=1.7.0; extra == 'dev'
Requires-Dist: flake8>=7.1.0; extra == 'dev'
Requires-Dist: isort>=5.13.2; extra == 'dev'
Requires-Dist: mypy>=1.10.1; extra == 'dev'
Requires-Dist: pre-commit>=3.7.1; extra == 'dev'
Requires-Dist: pylint>=3.2.5; extra == 'dev'
Requires-Dist: pytest>=8.2.2; extra == 'dev'
Requires-Dist: tox>=4.16.0; extra == 'dev'
Description-Content-Type: text/markdown

# Federated GWAS Pipeline

This repository implements a federated pipeline for Genome-Wide Association Studies (GWAS) using Flower, PLINK, and custom privacy-preserving protocols. The pipeline supports multi-stage, multi-client GWAS with reproducible outputs and structured logging.

For release verification steps, see [RELEASE.md](RELEASE.md). For implementation details and change history, see [CURRENT_VERSION.md](CURRENT_VERSION.md).

---

## Environment Setup

### Option 1: UV (recommended)

Install [uv](https://docs.astral.sh/uv/):

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```

Sync dependencies (Python 3.11+):

```bash
uv sync --python 3.11
```

Optional dev dependencies:

```bash
uv sync --extra dev
```

### Option 2: Conda

```bash
conda create -n fedgwas python=3.11 -y
conda activate fedgwas
pip install -e .
pip install "flwr[simulation]>=1.19.0,<1.20"
```

---

## PLINK

- Requires [PLINK 1.9+](https://www.cog-genomics.org/plink/1.9/).
- Download the binary for your OS and make sure `plink` is on your `PATH`. Alternatively, set the binary path in each client `config.yaml` (via `plink.path`, if your config defines that key).
- Toy reference files are under `plink/`; production runs use experiment data under `experiments/`.
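
A quick way to confirm the PLINK requirement before a run is to resolve the binary from Python. This is a minimal sketch, not part of the pipeline: `resolve_plink` and its `configured_path` argument are hypothetical names, with `configured_path` standing in for whatever value your client `config.yaml` provides.

```python
import shutil
from pathlib import Path
from typing import Optional


def resolve_plink(configured_path: Optional[str] = None) -> Optional[str]:
    """Return a usable PLINK executable path, or None if none is found.

    `configured_path` is a hypothetical stand-in for a path read from a
    client config.yaml. When it is unset, fall back to searching PATH.
    """
    if configured_path:
        candidate = Path(configured_path).expanduser()
        # For an explicit path, shutil.which checks that the file exists
        # and is executable, and returns None otherwise.
        return shutil.which(str(candidate))
    return shutil.which("plink")


if __name__ == "__main__":
    found = resolve_plink()
    print(f"plink found at {found}" if found else "plink not found on PATH")
```

If this prints "plink not found on PATH", fix the installation before starting a run.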

---

## Quick Start (Recommended: tiny_even)

The default Flower config in `pyproject.toml` points to `experiments/correctness/tiny_even/configs` (2 clients, tiny synthetic data).

### Repository layout (experiments)

```
experiments/correctness/tiny_even/
├── config.yaml
├── configs/
│   ├── server/config.yaml
│   ├── center_1/config.yaml
│   └── center_2/config.yaml
├── data/tiny/
│   ├── center_1/          # PLINK .bed/.bim/.fam per client
│   ├── center_2/
│   └── centralized_baseline/   # after generate_baseline
└── results_2/             # gitignored; current shipped config output
```

Config templates: [configs/config_template.yaml](configs/config_template.yaml).
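
To catch a missing config before launching, the layout shown above can be checked with a few lines of Python. This is a sketch under the assumption that the four `config.yaml` files in the tree are the ones a run needs; `missing_configs` is a hypothetical helper, not a function shipped with the repository.

```python
from pathlib import Path

# Relative paths copied from the tiny_even layout shown above.
EXPECTED_CONFIGS = (
    "config.yaml",
    "configs/server/config.yaml",
    "configs/center_1/config.yaml",
    "configs/center_2/config.yaml",
)


def missing_configs(experiment_dir: str) -> list[str]:
    """List the expected config files that are absent under experiment_dir."""
    root = Path(experiment_dir)
    return [rel for rel in EXPECTED_CONFIGS if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_configs("experiments/correctness/tiny_even")
    print("all configs present" if not missing else f"missing: {missing}")
```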

### 1. Generate synthetic data (if not present)

```bash
python pipeline/simulation/simulated_data/generate_synthetic_data.py \
  --scale tiny \
  --partition-strategy even \
  --seed 42 \
  --output-dir experiments/correctness/tiny_even/data
```

### 2. Generate centralized baseline

```bash
python experiments/tools/generate_baseline.py \
  experiments/correctness/tiny_even/config.yaml
```

### 3. Run federated pipeline (simulation)

```bash
flwr run . local-simulation --stream
```

Override rounds or config path:

```bash
flwr run . local-simulation --stream --run-config \
  'simulation=true num-server-rounds=100 config_path="experiments/correctness/tiny_even/configs"'
```
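
When launching runs from a script, the `--run-config` string can be assembled from a dictionary instead of being hand-quoted. The sketch below reproduces the override string from the command above; `to_run_config` and `flwr_run_argv` are hypothetical helper names, and the rendering rules (lowercase booleans, bare numbers, double-quoted strings) are inferred from that example.

```python
def to_run_config(overrides: dict) -> str:
    """Render overrides as a space-separated key=value string."""
    parts = []
    for key, value in overrides.items():
        if isinstance(value, bool):
            # bool is a subclass of int, so it must be checked first.
            rendered = "true" if value else "false"
        elif isinstance(value, (int, float)):
            rendered = str(value)
        else:
            rendered = f'"{value}"'
        parts.append(f"{key}={rendered}")
    return " ".join(parts)


def flwr_run_argv(overrides: dict, federation: str = "local-simulation") -> list[str]:
    """Build an argv list for subprocess.run; nothing is executed here."""
    return ["flwr", "run", ".", federation, "--stream",
            "--run-config", to_run_config(overrides)]


if __name__ == "__main__":
    print(flwr_run_argv({
        "simulation": True,
        "num-server-rounds": 100,
        "config_path": "experiments/correctness/tiny_even/configs",
    }))
```

Passing the list to `subprocess.run` avoids shell quoting issues with the embedded double quotes.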

Results are written under each client's `logs/` and `intermediate/` directories (paths set in per-center `config.yaml`). The shipped tiny configs currently write under `experiments/correctness/tiny_even/results_2/`; use the paths in the active center and server config files as the source of truth.

### 4. Retention (optional, automatic)

An experiment's `config.yaml` may set `retention.tier` (`minimal` | `standard` | `research`). When `auto_apply_on_complete: true`, the server prunes non-essential artifacts after the run completes. To preview retention manually with `--dry-run`:

```bash
python experiments/tools/apply_run_retention.py \
  experiments/correctness/tiny_even/results_2 \
  --config-path experiments/correctness/tiny_even/configs \
  --dry-run
```

See [RELEASE.md](RELEASE.md) for tier definitions.
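
A typo in the tier name is easy to miss, so it can help to validate the retention settings after loading the experiment config. The sketch below works on an already-parsed dictionary. It assumes that `auto_apply_on_complete` sits next to `tier` under a `retention` key and defaults to off; both are assumptions, so check the config template for the real layout. `check_retention` is a hypothetical helper.

```python
from typing import Optional

# Tier names as listed in this README.
VALID_TIERS = ("minimal", "standard", "research")


def check_retention(config: dict) -> tuple[Optional[str], bool]:
    """Return (tier, auto_apply) from a parsed experiment config.

    Assumes both settings live under a top-level `retention` mapping.
    Raises ValueError if a tier is set but is not a known name.
    """
    retention = config.get("retention") or {}
    tier = retention.get("tier")
    if tier is not None and tier not in VALID_TIERS:
        raise ValueError(f"unknown retention tier {tier!r}; expected one of {VALID_TIERS}")
    auto_apply = bool(retention.get("auto_apply_on_complete", False))
    return tier, auto_apply


if __name__ == "__main__":
    example = {"retention": {"tier": "standard", "auto_apply_on_complete": True}}
    print(check_retention(example))
```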

### 5. Evaluate against baseline

```bash
python experiments/tools/evaluation/evaluate_all.py \
  experiments/correctness/tiny_even/results_2 \
  --baseline experiments/correctness/tiny_even/data/tiny/centralized_baseline \
  --king
```

See [experiments/correctness/tiny_even/README.md](experiments/correctness/tiny_even/README.md) for expected metrics and success criteria. If you changed the output paths in the active configs, pass that results directory instead.

---

## Documentation Site

The Docusaurus site is isolated under `website/` and reads Markdown from the repository-level `docs/` directory.

```bash
cd website
npm install
npm run start   # local dev server (blocks until stopped)
npm run build   # static production build
```

---

## Three-Node Cluster Deployment

For Matpool or any 3-node layout (1 SuperLink + 2 SuperNodes), use the bundled scripts and guide:

- **Guide:** [cluster_deployment/docs/CLUSTER_USER_GUIDE.md](cluster_deployment/docs/CLUSTER_USER_GUIDE.md)
- **Scripts:** [cluster_deployment/README.md](cluster_deployment/README.md)

```bash
bash cluster_deployment/scripts/setup-cluster-node.sh   # each node
bash cluster_deployment/scripts/cluster-verify-data.sh --scale tiny --client-id 1  # each client
bash cluster_deployment/scripts/cluster-run-app.sh \
  --server-ip <SERVER_IP> --scale tiny --rounds 20
```

For the small and medium performance scales, see `experiments/performance/scales.yaml` and the per-scale READMEs under `small_even/` and `medium_even/`.

---

## Local Deployment Mode

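Local deployment needs one SuperLink, two SuperNodes, and a `flwr run` invocation. The SuperLink and SuperNodes are long-running processes, so start each command below in its own terminal, in order: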

```bash
flower-superlink --insecure
```

```bash
flower-supernode --insecure --superlink 127.0.0.1:9092 --clientappio-api-address 127.0.0.1:9094 \
  --node-config 'partition-id=0 num-partitions=2 config-file="experiments/correctness/tiny_even/configs/center_1/config.yaml"'
```

```bash
flower-supernode --insecure --superlink 127.0.0.1:9092 --clientappio-api-address 127.0.0.1:9095 \
  --node-config 'partition-id=1 num-partitions=2 config-file="experiments/correctness/tiny_even/configs/center_2/config.yaml"'
```

```bash
flwr run . local-deployment --stream
```
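
The two SuperNode commands above differ only in partition ID, ClientAppIo port, and config file, so they can be generated rather than typed. This is a sketch that rebuilds those exact commands as argv lists. `supernode_command` is a hypothetical helper, and the rule "port = 9094 + partition ID" is inferred from the two examples rather than documented anywhere.

```python
def supernode_command(
    partition_id: int,
    num_partitions: int,
    config_file: str,
    superlink: str = "127.0.0.1:9092",
    base_port: int = 9094,
) -> list[str]:
    """Build the flower-supernode argv for one client; nothing is executed."""
    node_config = (
        f"partition-id={partition_id} "
        f"num-partitions={num_partitions} "
        f'config-file="{config_file}"'
    )
    return [
        "flower-supernode", "--insecure",
        "--superlink", superlink,
        # Each SuperNode needs its own ClientAppIo address on a shared host.
        "--clientappio-api-address", f"127.0.0.1:{base_port + partition_id}",
        "--node-config", node_config,
    ]


if __name__ == "__main__":
    for pid in range(2):
        cfg = f"experiments/correctness/tiny_even/configs/center_{pid + 1}/config.yaml"
        print(" ".join(supernode_command(pid, 2, cfg)))
```

The printed lines omit shell quoting around the `--node-config` value; pass the lists to `subprocess.Popen` directly rather than pasting the output into a shell.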

---

## Advanced: Real-World Experiments

Larger studies (e.g. a 1000 Genomes subset) live under `experiments/real_world/1000genomes/`. These runs require downloading and preparing the data, take longer to complete, and need `config_path` overridden:

```bash
flwr run . local-simulation --stream --run-config \
  'config_path="experiments/real_world/1000genomes/configs"'
```

Manuscript figures and prior run outputs under `experiments/real_world/1000genomes/manuscript/` are research artifacts and are not required for the default release path.

---

## Output and Logs

- Per-client `intermediate_dir` and `log_dir` are defined in each center `config.yaml`.
- Both directories are cleared at the start of each client run to avoid stale artifacts, so copy out anything you want to keep before rerunning.
- Stage progress and errors go to per-client log files under each configured `output.log_dir`.
- Inspect PLINK outputs (`.assoc.logistic`, `.imiss`, `.frq`, KING kinship files) directly under each client's `logs/`.
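
To see which PLINK outputs a run actually produced, the suffixes listed above can be collected from a client's output directory. This is a minimal sketch: `find_plink_outputs` is a hypothetical helper, and KING kinship files are left out because their file extension is not stated here.

```python
from pathlib import Path

# Suffixes named in this README; KING output extensions are not listed there.
PLINK_SUFFIXES = (".assoc.logistic", ".imiss", ".frq")


def find_plink_outputs(directory: str) -> dict[str, list[Path]]:
    """Group files under `directory` (searched recursively) by PLINK suffix."""
    root = Path(directory)
    found: dict[str, list[Path]] = {suffix: [] for suffix in PLINK_SUFFIXES}
    if not root.is_dir():
        return found
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        for suffix in PLINK_SUFFIXES:
            if path.name.endswith(suffix):
                found[suffix].append(path)
    return found


if __name__ == "__main__":
    # Replace with the log directory configured for one of your clients.
    for suffix, paths in find_plink_outputs("logs").items():
        print(f"{suffix}: {len(paths)} file(s)")
```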

---

## Federated Protocol (Summary)

1. **Key exchange** — ECC public keys via server relay  
2. **Sync** — Encrypted seed broadcast (server cannot decrypt)  
3. **Local / global QC** — Encrypted QC shares; exclusion list computed client-side  
4. **Iterative KING** — Chunked kinship with cross-client anonymized IDs  
5. **Local LR + filtering** — Tokenized insignificant SNPs  
6. **Iterative LR** — Chunked association on filtered data  

Full stage contracts and privacy model: [CURRENT_VERSION.md](CURRENT_VERSION.md).
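
For intuition only, here is a toy illustration of why a seed shared among clients but hidden from the server is useful: clients can derive masks that cancel in the sum, so a relay sees only masked values while the total stays recoverable. This is a generic zero-sum additive masking sketch and is not the repository's protocol (see CURRENT_VERSION.md for that). All names are hypothetical, and Python's `random` module is not cryptographically secure.

```python
import random

MODULUS = 2**61 - 1  # arbitrary large modulus for the toy example


def zero_sum_masks(shared_seed: int, num_clients: int) -> list[int]:
    """Derive one mask per client from the shared seed; masks sum to 0 mod MODULUS."""
    rng = random.Random(shared_seed)
    masks = [rng.randrange(MODULUS) for _ in range(num_clients - 1)]
    # The last mask cancels all the others.
    masks.append(-sum(masks) % MODULUS)
    return masks


def mask_value(value: int, client_index: int, shared_seed: int, num_clients: int) -> int:
    """What a client would send: its value plus its own mask."""
    return (value + zero_sum_masks(shared_seed, num_clients)[client_index]) % MODULUS


def aggregate(masked_values: list[int]) -> int:
    """Sum the masked values; the masks cancel, leaving the true total."""
    return sum(masked_values) % MODULUS


if __name__ == "__main__":
    counts = [12, 30, 7]  # made-up per-client counts
    masked = [mask_value(v, i, shared_seed=42, num_clients=3) for i, v in enumerate(counts)]
    print(aggregate(masked) == sum(counts))
```

A party without the seed cannot separate an individual client's value from its mask, while the total is unaffected.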

---

## Troubleshooting

- **PLINK not found** — Install PLINK 1.9+ and verify `plink` is on `PATH` or configured in `config.yaml`.  
- **Wrong config** — Check `config_path` in `pyproject.toml` or pass `--run-config`.  
- **Empty results** — Ensure data and baseline exist under `experiments/correctness/tiny_even/data/`.  
- **Reproducibility** — Use fixed seeds in data generation and consistent `config_path` across runs.

---

## Contributing

Open issues or pull requests for bug fixes, improvements, or new features.

## Acknowledgments

Built with [Flower](https://flower.dev/), [PLINK](https://www.cog-genomics.org/plink/1.9/), and open-source Python tools.
