Metadata-Version: 2.4
Name: syng-bts
Version: 3.3.1
Summary: Synthesis of Next Generation Bulk Transcriptomic Sequencing - A data augmentation tool for transcriptomics data using deep generative models
Author: Yunhui Qi, Xinyi Wang, Yannick Dueren
Author-email: Li-Xuan Qin <qinl@mskcc.org>
Maintainer-email: Li-Xuan Qin <qinl@mskcc.org>
License-Expression: AGPL-3.0-only
Project-URL: Homepage, https://github.com/Omics-Data-Synthesis/SyNG-BTS
Project-URL: Documentation, https://syng-bts.readthedocs.io/
Project-URL: Repository, https://github.com/Omics-Data-Synthesis/SyNG-BTS
Project-URL: Issues, https://github.com/Omics-Data-Synthesis/SyNG-BTS
Keywords: transcriptomics,data-augmentation,deep-learning,VAE,GAN,generative-models,bioinformatics,RNA-seq,machine-learning
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Developers
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Typing :: Typed
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch>=2.0.0
Requires-Dist: pandas>=1.5.0
Requires-Dist: numpy>=1.23.0
Requires-Dist: scipy>=1.9.0
Requires-Dist: matplotlib>=3.6.0
Requires-Dist: seaborn>=0.12.0
Requires-Dist: tensorboardX>=2.6.0
Requires-Dist: umap-learn>=0.5.6
Requires-Dist: pyarrow>=14.0.0
Requires-Dist: scikit-learn>=1.3.0
Requires-Dist: xgboost>=2.0.0
Provides-Extra: docs
Requires-Dist: sphinx>=7.1.2; extra == "docs"
Requires-Dist: sphinx-rtd-theme>=1.3.0; extra == "docs"
Requires-Dist: nbsphinx>=0.9.0; extra == "docs"
Requires-Dist: pandoc>=2.3; extra == "docs"
Provides-Extra: dev
Requires-Dist: pytest>=8.0.0; extra == "dev"
Requires-Dist: pytest-cov>=5.0.0; extra == "dev"
Requires-Dist: ruff>=0.8.0; extra == "dev"
Requires-Dist: mypy>=1.10.0; extra == "dev"
Requires-Dist: build>=1.0.0; extra == "dev"
Requires-Dist: twine>=5.0.0; extra == "dev"
Provides-Extra: all
Requires-Dist: syng-bts[dev,docs]; extra == "all"
Dynamic: license-file

# SyNG-BTS: Synthesis of Next Generation Bulk Transcriptomic Sequencing

<!-- [![PyPI version](https://badge.fury.io/py/syng-bts.svg)](https://badge.fury.io/py/syng-bts) -->
[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
[![License: AGPL v3](https://img.shields.io/badge/License-AGPL%20v3-blue.svg)](https://www.gnu.org/licenses/agpl-3.0)
<!-- [![Documentation Status](https://readthedocs.org/projects/syng-bts/badge/?version=latest)](https://syng-bts.readthedocs.io/en/latest/?badge=latest) -->

**SyNG-BTS** is a Python package for data augmentation of bulk transcriptomic sequencing data using deep generative models. 
It synthesizes realistic transcriptomic data without relying on predefined formulas, enabling researchers to augment small pilot datasets for more robust machine learning analyses. 
SyNG-BTS supports three generative model families: variational autoencoders (VAE), generative adversarial networks (GAN), and flow-based models. 
These models are trained on a pilot dataset and can synthesize additional samples at any desired scale.

<p align="center">
  <img src="./figs/workflow.png" width="800" alt="SyNG-BTS Workflow" />
</p>

## Features

- **Multiple Generative Models**: VAE, CVAE, GAN, WGANGP, and flow-based models (MAF)
- **Unified API**: Run pilot experiments, generate synthetic data, and perform transfer learning
- **DataFrame-First API**: Accept pandas DataFrames, CSV file paths, or bundled dataset names
- **Rich Result Objects**: `SyngResult` / `PilotResult` with built-in plotting and export
- **In-Memory Pipeline**: No disk I/O by default — results stay in memory until you choose to save
- **Built-in Evaluation**: Heatmap and UMAP visualization functions
- **Sample-Size Evaluation**: Integrated [SyntheSize](https://github.com/LXQin/SyntheSize) methodology for classifier learning curves
- **Bundled Datasets**: Example TCGA datasets included for immediate experimentation

## Installation

### From PyPI (Recommended)

```bash
pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ syng-bts
```
*TODO: Update to main PyPI installation upon release.*

### From Source

```bash
git clone https://github.com/Omics-Data-Synthesis/SyNG-BTS
cd SyNG-BTS
pip install -e .
```

### Optional Dependencies

For documentation building:
```bash
pip install syng-bts[docs]
```

For development (testing, linting):
```bash
pip install syng-bts[dev]
```

## Quick Start

### Generate Synthetic Data

```python
from syng_bts import generate

# Train a VAE on bundled data and generate 500 synthetic samples
result = generate(
    data="SKCMPositive_4",   # bundled dataset name, CSV path, or DataFrame
    model="VAE1-10",         # model specification (type + kl_weight)
    new_size=500,            # number of synthetic samples
    batch_frac=0.1,          # batch fraction
    learning_rate=0.0005,    # learning rate
)

# Grouped-size controls
# - int: exact total sample count (grouped data is split by input ratio)
# - list[int]: explicit grouped counts [n_group_0, n_group_1]
#   where group_0 is the first group value encountered in input data.

# Access results in memory
print(result.generated_data.shape)   # (500, n_features)
print(result.loss.columns.tolist())  # ['kl', 'recons']
print(result.summary())

# Plot training loss (one figure per loss column)
figs = result.plot_loss()  # dict[str, Figure]

# Optionally save to disk
result.save("./my_output/")

# Load a previously saved result
from syng_bts import SyngResult
loaded = SyngResult.load("./my_output/")
```

### Run a Pilot Study

```python
from syng_bts import pilot_study

# Sweep over multiple pilot sizes (5 random draws each)
pilot = pilot_study(
    data="SKCMPositive_4",
    pilot_size=[50, 100],
    model="VAE1-10",
    batch_frac=0.1,
    learning_rate=0.0005,
)

# Access individual runs
run = pilot.runs[(50, 1)]  # (pilot_size, draw_index)
print(run.generated_data.head())

# All runs overlaid on one plot per loss column
figs = pilot.plot_loss(style="overlay_runs")       # dict[str, Figure]

# Mean ± std across runs
figs = pilot.plot_loss(style="mean_band")  # dict[str, Figure]
```

### Transfer Learning

```python
from syng_bts import transfer

# Pre-train on PRAD, fine-tune and generate on BRCA
result = transfer(
    source_data="PRAD",
    target_data="BRCA",
    new_size=500,
    model="maf",
    apply_log=True,
    epoch=10,
)

print(result.generated_data.shape)
result.save("./transfer_output/")
```

### Use DataFrame Input

```python
import pandas as pd
from syng_bts import generate

my_data = pd.read_csv("my_dataset.csv")
result = generate(
    data=my_data,
    name="my_dataset",     # used in output filenames
    model="WGANGP",
    new_size=1000,
    epoch=50,
)
```

### Evaluate Generated Data

```python
from syng_bts import generate, resolve_data, heatmap_eval, UMAP_eval

result = generate(data="SKCMPositive_4", model="VAE1-10", epoch=5)
real_data, _groups = resolve_data("SKCMPositive_4")

# Built-in heatmap on result object
fig = result.plot_heatmap()

# Standalone evaluation comparing real vs generated
fig_heatmap = heatmap_eval(real_data=real_data.head(50), generated_data=result.generated_data.head(50))
fig_umap = UMAP_eval(real_data=real_data, generated_data=result.generated_data, random_seed=42)
```

### Sample-Size Evaluation (SyntheSize)

Evaluate how classifier performance scales with sample size using the integrated
[SyntheSize](https://github.com/LXQin/SyntheSize) methodology
([R version](https://github.com/LXQin/SyntheSize),
[Python version](https://github.com/LXQin/SyntheSize_py)):

```python
from syng_bts import evaluate_sample_sizes, plot_sample_sizes, resolve_data

# Load real data with group labels
data, groups = resolve_data("BRCASubtypeSel_test")

# Evaluate classifiers at increasing sample sizes
metrics = evaluate_sample_sizes(
    data=data,
    sample_sizes=[50, 100, 150],
    groups=groups,
    n_draws=5,
    methods=["LOGIS", "RF", "XGB"],
)

# Plot inverse power-law learning curves
fig = plot_sample_sizes(metrics, n_target=200)
fig.savefig("learning_curves.png")
```

`evaluate_sample_sizes` applies `log2(x + 1)` by default (`apply_log=True`).
Set `apply_log=False` if your inputs are already log-transformed.

You can also pass a `SyngResult` directly — groups are auto-resolved:

```python
from syng_bts import generate, evaluate_sample_sizes

result = generate(data="BRCASubtypeSel_train", model="CVAE1-20", epoch=50)
metrics = evaluate_sample_sizes(result, sample_sizes=[50, 100], which="generated")
```

### List Available Datasets

```python
from syng_bts import list_bundled_datasets

print(list_bundled_datasets())
# ['SKCMPositive_4', 'BRCA', 'PRAD', 'BRCASubtypeSel', ...]
```

## Available Models

| Model | Description |
|-------|-------------|
| `VAE1-10` | Variational Auto-Encoder with 1:10 loss ratio |
| `CVAE1-10` | Conditional VAE with 1:10 loss ratio |
| `GAN` | Standard Generative Adversarial Network |
| `WGANGP` | Wasserstein GAN with Gradient Penalty |
| `maf` | Masked Autoregressive Flow |

## Dependencies

SyNG-BTS requires Python 3.10+ and the following packages:

- torch (>=2.0.0)
- pandas (>=1.5.0)
- numpy (>=1.23.0)
- scipy (>=1.9.0)
- matplotlib (>=3.6.0)
- seaborn (>=0.12.0)
- scikit-learn (>=1.3.0)
- xgboost (>=2.0.0)
- tensorboardX (>=2.6.0)
- umap-learn (>=0.5.6)
- pyarrow (>=14.0.0)

## Documentation

Full documentation is available at [syng-bts.readthedocs.io](https://syng-bts.readthedocs.io/).

- [Installation Guide](https://syng-bts.readthedocs.io/en/latest/usage.html#installation)
- [API Reference](https://syng-bts.readthedocs.io/en/latest/api.html)
- [Examples & Tutorials](https://syng-bts.readthedocs.io/en/latest/evals.html) 

<!-- TODO Update the examples and tutorials -->

## Development

### Quick Setup

```bash
git clone https://github.com/Omics-Data-Synthesis/SyNG-BTS
cd SyNG-BTS
make init-dev  # Install package + dev dependencies
```

### Makefile Commands

The project includes a Makefile for common development tasks:

| Command | Description |
|---------|-------------|
| `make help` | Show all available commands |
| `make install` | Install package in editable mode |
| `make install-dev` | Install development dependencies |
| `make init-dev` | Full dev setup (install + dev deps in venv) |
| `make test` | Run tests with pytest |
| `make test-cov` | Run tests with coverage report |
| `make lint` | Check code with ruff |
| `make format` | Auto-format code with ruff |
| `make check` | Run lint + tests |
| `make docs` | Build documentation |
| `make clean` | Remove build artifacts |

## Citation

If you use SyNG-BTS in your research, please cite:

> Qi Y, Wang X, Qin LX. Optimizing sample size for supervised machine learning with bulk transcriptomic sequencing: a learning curve approach. Brief Bioinform. 2025 Mar 4;26(2):bbaf097. doi: 10.1093/bib/bbaf097. PMID: 40072846; PMCID: PMC11899567.

**BibTeX:**
```bibtex
@article{qin2025optimizing,
  title = {Optimizing sample size for supervised machine learning with bulk transcriptomic sequencing: a learning curve approach},
  author = {Qi, Yunhui and Wang, Xinyi and Qin, Li-Xuan},
  journal = {Brief Bioinformatics},
  year = {2025},
  volume = {26},
  number = {2},
  pages = {bbaf097},
  doi = {10.1093/bib/bbaf097},
  url = {https://pmc.ncbi.nlm.nih.gov/articles/PMC11899567/}
}
```

## License

SyNG-BTS is licensed under the [GNU Affero General Public License v3.0](LICENSE).

## Acknowledgments

This package was developed at Memorial Sloan Kettering Cancer Center. We thank Sebastian Raschka for the [STAT 453 course materials](https://sebastianraschka.com/teaching/stat453-ss2021/) that provided foundational concepts for the deep generative models.

