Metadata-Version: 2.4
Name: synomicsbench
Version: 1.0.1
License-File: LICENSE
Requires-Python: >=3.12
Requires-Dist: anonymeter>=1.0.0
Requires-Dist: baycomp>=1.0.3
Requires-Dist: codecarbon>=3.2.5
Requires-Dist: copulas>=0.12.0
Requires-Dist: ctgan>=0.10.2
Requires-Dist: deptry>=0.25.1
Requires-Dist: gseapy>=1.1.12
Requires-Dist: ipykernel>=7.2.0
Requires-Dist: joblib>=1.5.3
Requires-Dist: jupyter>=1.1.1
Requires-Dist: lab>=8.9
Requires-Dist: lifelines>=0.30.3
Requires-Dist: lightgbm>=4.6.0
Requires-Dist: matplotlib>=3.10.8
Requires-Dist: memory-profiler>=0.61.0
Requires-Dist: miceforest>=6.0.5
Requires-Dist: missingno>=0.5.2
Requires-Dist: mpl-tools>=0.4.1
Requires-Dist: mygene>=3.2.2
Requires-Dist: numba>=0.64.0
Requires-Dist: numpy>=1.26.4
Requires-Dist: pandas>=2.3.3
Requires-Dist: polar>=0.0.127
Requires-Dist: polars>=1.39.2
Requires-Dist: pytest>=9.0.3
Requires-Dist: rpy2==3.6.5
Requires-Dist: scienceplots>=2.2.1
Requires-Dist: scikit-bio>=0.7.0
Requires-Dist: scikit-learn>=1.8.0
Requires-Dist: scipy>=1.17.1
Requires-Dist: sdmetrics>=0.18.0
Requires-Dist: sdv>=1.18.0
Requires-Dist: seaborn>=0.13.2
Requires-Dist: statsmodels>=0.14.6
Requires-Dist: torch>=2.10.0
Requires-Dist: tqdm>=4.67.3
Description-Content-Type: text/markdown

# SynOmicsBench

[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.19660703.svg)](https://zenodo.org/records/19660703)
[![CI](https://github.com/trinhthechuong/SynOmicsBench/actions/workflows/ci.yml/badge.svg)](https://github.com/trinhthechuong/SynOmicsBench/actions/workflows/ci.yml)
[![Documentation](https://img.shields.io/badge/docs-GitHub%20Pages-blue.svg)](https://trinhthechuong.github.io/SynOmicsBench/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Python 3.12+](https://img.shields.io/badge/python-3.12+-red.svg)](https://www.python.org/downloads/release/python-3120/)


**SynOmicsBench** is a unified framework for benchmarking synthetic data generation (SDG) methods on clinical transcriptomic cancer cohorts.

Applying artificial intelligence to clinical transcriptomic datasets in precision oncology depends on secure data sharing, which in turn requires balancing **biological utility** against **patient privacy**. SynOmicsBench combines standardized preprocessing with multidimensional evaluation, prioritizing downstream biological validation alongside statistical fidelity and attack-based privacy assessment. It provides a reproducible decision-support tool for method selection and promotes biologically informed, privacy-aware adoption of synthetic data in precision oncology.

---

## 🔬 Framework Overview

![Framework Overview](https://github.com/user-attachments/assets/2cf2423c-dc24-4f85-b97f-2160bfc9ebf4)

SynOmicsBench compares synthetic data generation methods using a standardized pipeline that combines:

- **Standardized Preprocessing**: Automated data filtering, harmonization, and integration.
- **Multidimensional Evaluation**: Assessing Statistical Fidelity, Downstream Biological Utility, and Privacy Risk.
- **State-of-the-Art SDG Methods**: Native support for **CTGAN, TVAE, Gaussian Copula, Synthpop, and Avatars (K5/K10)**. 

---

## 🛠 Installation

SynOmicsBench can be installed in three ways, depending on your environment. **Python 3.12+** is required.

### Option 1: From PyPI (Recommended)

```bash
pip install synomicsbench
```
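
After installing, a quick sanity check can confirm that the interpreter meets the Python 3.12+ requirement and that the distribution is visible to it. The sketch below uses only the standard library; the helper name `check_environment` is hypothetical and not part of SynOmicsBench.

```python
import sys
from importlib.metadata import PackageNotFoundError, version

MIN_PYTHON = (3, 12)  # mirrors the "Python 3.12+" requirement stated above


def check_environment(dist_name: str = "synomicsbench") -> dict:
    """Report the interpreter version and the installed distribution version.

    Hypothetical helper for illustration; it is not shipped with SynOmicsBench.
    """
    try:
        installed = version(dist_name)
    except PackageNotFoundError:
        installed = None  # distribution is not installed in this environment
    return {
        "python_ok": sys.version_info >= MIN_PYTHON,
        "python_version": ".".join(map(str, sys.version_info[:3])),
        "installed_version": installed,
    }


if __name__ == "__main__":
    print(check_environment())
```

If `installed_version` is `None`, the package was likely installed into a different environment than the one running the check.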

### Option 2: From Source (GitHub)

We recommend [`uv`](https://docs.astral.sh/uv/) for fast, reliable dependency management. With `uv`, the provided `uv.lock` file ensures reproducible installations.

```bash
git clone https://github.com/trinhthechuong/SynOmicsBench.git
cd SynOmicsBench

# With uv (Fastest)
uv sync
source .venv/bin/activate

# Or with traditional pip
pip install -e .
```

### Option 3: Pre-built Container (Apptainer/Singularity)

For HPC environments or reproducible workflows, you can pull our pre-built Apptainer container, which bundles all dependencies (including heavy ML frameworks and R):

```bash
# Pull the latest SynOmicsBench container
apptainer pull synomicsbench.sif oras://ghcr.io/trinhthechuong/synomicsbench:latest

# Verify the container is working and the package is ready
apptainer exec synomicsbench.sif python -c "import synomicsbench; print('OK: SynOmicsBench is ready!')"
```
*(To use the container for your scripts, simply mount your directories via `--bind` and run your Python scripts using `apptainer exec`)*
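
As a rough illustration of the `--bind` usage described above, the sketch below assembles an `apptainer exec` command line from Python. The helper `build_apptainer_command`, the directory paths, and the script name are all hypothetical placeholders; only `apptainer exec` and `--bind host:container` come from Apptainer itself.

```python
import shlex


def build_apptainer_command(image, script, binds, python="python"):
    """Return the argv list that runs `script` inside the container.

    `binds` maps host directories to the paths where they should appear
    inside the container. Hypothetical helper, not part of SynOmicsBench.
    """
    cmd = ["apptainer", "exec"]
    for host_dir, container_dir in binds.items():
        # Each mount is passed as: --bind <host path>:<container path>
        cmd += ["--bind", f"{host_dir}:{container_dir}"]
    cmd += [str(image), python, str(script)]
    return cmd


if __name__ == "__main__":
    # Placeholder paths; replace with your own data and output directories.
    cmd = build_apptainer_command(
        image="synomicsbench.sif",
        script="/work/run_benchmark.py",
        binds={"/data/cohort": "/data", "/scratch/results": "/work"},
    )
    print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` on a machine where Apptainer is installed.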

---

## 🚀 Quick Start

The example below generates synthetic data with Gaussian Copula and evaluates its statistical fidelity:

```python
import pandas as pd
from synomicsbench.synthesizer.GaussianCopulasynthesizer import GaussianCopulasynthesizer
from synomicsbench.processing.metadata import MetaData
from synomicsbench.metrics.fidelity.UnivariateSimilarity import UnivariateSimilarity 

# 1. Load Data & Prepare Metadata
original_data = pd.read_csv("your_clinical_transcriptomic_data.csv")
ordinal_features = ["Mstage", "Tx_Start_ECOG", "numPriorTherapies"]
metadata = MetaData.get_metadata(data=original_data, ordinal_features=ordinal_features)

# 2. Generate Synthetic Data
synth = GaussianCopulasynthesizer(output_path="./results", metadata=metadata)
synthetic_data = synth.generate(
    data=original_data, 
    n_samples=original_data.shape[0]
)

# 3. Evaluate Fidelity
evaluator = UnivariateSimilarity(output_dir="./evaluation_results")
score = evaluator.get_univariate_score(
    original_data=original_data, 
    synthetic_data=synthetic_data, 
    metadata=metadata, 
    save=True
)
print(f"Overall Fidelity Score: {score:.4f}")
```
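
To try the example without a real cohort, a small placeholder CSV can be generated first. The sketch below uses only the standard library and writes random values under the three ordinal column names used above plus a few made-up expression columns. The function name, the `GENE_*` column names, the value ranges, and the `Mstage` levels are all assumptions for illustration and carry no biological meaning.

```python
import csv
import random


def write_toy_cohort(path, n_patients=50, n_genes=5, seed=0):
    """Write a random placeholder cohort to `path` and return its header.

    Hypothetical helper: the values are noise, meant only for smoke-testing
    a pipeline, not for drawing any conclusion.
    """
    rng = random.Random(seed)
    gene_cols = [f"GENE_{i}" for i in range(1, n_genes + 1)]  # made-up names
    header = ["Mstage", "Tx_Start_ECOG", "numPriorTherapies", *gene_cols]
    with open(path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(header)
        for _ in range(n_patients):
            writer.writerow([
                rng.choice(["M0", "M1"]),  # assumed stage levels
                rng.randint(0, 2),         # assumed ECOG range
                rng.randint(0, 3),         # assumed number of prior therapies
                *[round(rng.lognormvariate(1.0, 0.5), 4) for _ in gene_cols],
            ])
    return header


if __name__ == "__main__":
    write_toy_cohort("your_clinical_transcriptomic_data.csv")
```

Fixing `seed` keeps the placeholder file identical across runs, which helps when comparing outputs between environments.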

---

## 📚 Documentation

For complete API references, tutorials, and full benchmarking results, check out the **[SynOmicsBench Official Documentation](https://trinhthechuong.github.io/SynOmicsBench/)**:

- [**Getting Started**](https://trinhthechuong.github.io/SynOmicsBench/getting-started/): Step-by-step setup guides.
- [**Preprocessing Pipeline**](https://trinhthechuong.github.io/SynOmicsBench/preprocessing/): Harmonizing multimodal data.
- [**SDG Methods**](https://trinhthechuong.github.io/SynOmicsBench/synthetic-data/): Deep dive into generation models.
- [**Evaluation Framework**](https://trinhthechuong.github.io/SynOmicsBench/evaluation/): Metrics for fidelity, privacy, and biological signal preservation.

---

## 📝 Citation

If you use SynOmicsBench in your research, please cite:

> Trinh, T. C., Woillard, J. B., Uguzzoni, G., & Battail, C. (2024). **A unified benchmark of synthetic data generation for clinical and transcriptomic cancer data.** *(Manuscript in preparation)*

## 📄 License

This project is released under the MIT License.
