Metadata-Version: 2.4
Name: cbioformatter
Version: 0.1.0
Summary: Streamline conversion of clinical and genomic data into cBioPortal-compatible formats
Project-URL: Homepage, https://github.com/getwilds/cbioformatter
Project-URL: Repository, https://github.com/getwilds/cbioformatter
Project-URL: Issues, https://github.com/getwilds/cbioformatter/issues
Author-email: Taylor Firman <tfirman@fredhutch.org>
License-Expression: MIT
License-File: LICENSE
Keywords: bioinformatics,cbioportal,clinical-data,data-formatting,genomics
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Requires-Python: >=3.10
Requires-Dist: jinja2>=3.1.0
Requires-Dist: pandas>=1.5.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: requests>=2.28.0
Provides-Extra: dev
Requires-Dist: ipython>=8.0.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.0.0; extra == 'dev'
Requires-Dist: pytest>=7.0.0; extra == 'dev'
Requires-Dist: ruff>=0.1.0; extra == 'dev'
Description-Content-Type: text/markdown

# cBioFormatter

A Python package for streamlined preparation and formatting of clinical and molecular genomic data for upload to cBioPortal.

## Overview

cBioFormatter converts clinical and genomic data into cBioPortal-compatible formats. Designed for data scientists with basic Python knowledge, it handles cBioPortal file formatting, validation, and metadata generation for you.

**What it does:**
- Converts clinical data (patient and sample attributes) into cBioPortal format
- Processes VCF files into MAF format for mutation data
- Generates all required metadata files automatically
- Validates your study using cBioPortal's official validator
- Creates case lists for sample grouping
- Uploads studies and gene panels into a running cBioPortal instance (optional)
- Fetches public studies and gene panels from the cBioPortal datahub (optional)

**What you need:**
- Basic Python knowledge (pandas DataFrames, module imports)
- Your clinical data (Excel, CSV, database query, anything that can be converted to a pandas DataFrame)
- VCF files for mutation data (optional)
- vcf2maf installed (for VCF processing, optional)

## Installation

```bash
pip install cbioformatter
```

**Additional requirements:**
- vcf2maf (for mutation data processing, if using VCF files) - see [vcf2maf installation guide](https://github.com/mskcc/vcf2maf)

## Development

For local development, clone the repository and install in editable mode with dev dependencies.

### Using uv (recommended)

[uv](https://docs.astral.sh/uv/) is a fast Python package manager. If you don't have it installed:

```bash
# macOS/Linux
curl -LsSf https://astral.sh/uv/install.sh | sh

# Windows
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
```

Then set up the project:

```bash
git clone https://github.com/getwilds/cbioformatter.git
cd cbioformatter
uv sync --extra dev
```

To run commands in the virtual environment:

```bash
uv run pytest              # Run tests
uv run pytest --cov        # Run tests with coverage
uv run ruff check .        # Run linter
uv run ruff format .       # Format code
uv run ipython             # Interactive Python shell (or: uv run python)
```

### Using pip

```bash
git clone https://github.com/getwilds/cbioformatter.git
cd cbioformatter
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install -e ".[dev]"
```

To run tests and linting:

```bash
pytest                     # Run tests
pytest --cov               # Run tests with coverage
ruff check .               # Run linter
ruff format .              # Format code
ipython                    # Interactive Python shell (or: python)
```

## Quick Start

### Basic Study with Clinical Data Only

```python
import pandas as pd
from cbioformatter import ClinicalStudy

# Prepare your sample-level clinical data
# (typically loaded from a CSV, Excel file, or database query)
sample_df = pd.DataFrame({
    'SAMPLE_ID': ['S001', 'S002', 'S003'],
    'PATIENT_ID': ['P001', 'P001', 'P002'],
    'TUMOR_TYPE': ['Primary', 'Metastasis', 'Primary'],
    'AGE_AT_DIAGNOSIS': [45, 45, 67]
})

# sample_df looks like:
# | SAMPLE_ID | PATIENT_ID | TUMOR_TYPE | AGE_AT_DIAGNOSIS |
# |-----------|------------|------------|------------------|
# | S001      | P001       | Primary    | 45               |
# | S002      | P001       | Metastasis | 45               |
# | S003      | P002       | Primary    | 67               |

# Prepare your patient-level clinical data (optional)
patient_df = pd.DataFrame({
    'PATIENT_ID': ['P001', 'P002'],
    'SEX': ['Female', 'Male'],
    'ETHNICITY': ['Hispanic', 'Asian']
})

# patient_df looks like:
# | PATIENT_ID | SEX    | ETHNICITY |
# |------------|--------|-----------|
# | P001       | Female | Hispanic  |
# | P002       | Male   | Asian     |

# Create and validate the study
study = ClinicalStudy(
    study_id="brca_ocdo_2026",
    name="Breast Cancer Study (Office of the Chief Data Officer 2026)",
    description="Clinical and genomic data from breast cancer patients",
    cancer_type="brca",  # must be a valid cBioPortal cancer type
    genome_build="GRCh38",  # UCSC name ("hg19", "hg38", "mm10") or alias ("GRCh37", "GRCh38", "GRCm38")
    sample_data=sample_df,
    patient_data=patient_df  # optional
)

# Validate the study (generates temp files, runs validator, cleans up)
result = study.validate()

if result.is_valid:
    print("✓ Study is valid!")
    print(f"Validation report: {result.report_path}")
    
    # Write files to disk
    study.write_files(output_dir="./my_studies")
    print("Study files written to: ./my_studies/brca_ocdo_2026/")
else:
    print("✗ Validation failed. Check the report for details:")
    print(f"Report: {result.report_path}")
```

### Study with Mutation Data

```python
import pandas as pd
from cbioformatter import ClinicalStudy

# Add VCF file paths to your sample DataFrame
sample_df = pd.DataFrame({
    'SAMPLE_ID': ['S001', 'S002', 'S003'],
    'PATIENT_ID': ['P001', 'P001', 'P002'],
    'TUMOR_TYPE': ['Primary', 'Metastasis', 'Primary'],
    'VCF_PATH': [
        '/data/vcf/S001.vcf',
        '/data/vcf/S002.vcf',
        None  # This sample has no mutation data
    ]
})

# The rest is identical - mutation data is automatically detected
study = ClinicalStudy(
    study_id="brca_ocdo_2026",
    name="Breast Cancer Study (Office of the Chief Data Officer 2026)",
    description="Clinical and genomic data from breast cancer patients",
    cancer_type="brca",
    genome_build="GRCh38",
    sample_data=sample_df
)

result = study.validate()
if result.is_valid:
    study.write_files(output_dir="./my_studies")
```

### Uploading to a Local cBioPortal Instance

If you're running a local cBioPortal instance (via Docker), you can upload the study directly:

```python
# After write_files() has produced a study directory:
study_dir = study.write_files(output_dir="./my_studies")

study.upload(
    study_dir,
    url="http://localhost:8080/",   # your cBioPortal URL
    container="cbioportal",          # docker-compose service name
)
```

Upload is an **optional** advanced step — it requires a running cBioPortal instance and Docker. Validation (`study.validate()`) does **not** require any of this; it runs locally so newcomers can format and validate without setting up infrastructure.

### Fetching Public Studies from the cBioPortal Datahub

```python
from cbioformatter import fetch_datahub_study, fetch_datahub_panel

# Download a public study (returns the path to the extracted directory)
study_dir = fetch_datahub_study("msk_impact_2017", output_dir="./studies")

# Download a gene panel definition
panel_file = fetch_datahub_panel("impact341", output_dir="./panels")
```

Useful for seeding a fresh cBioPortal instance with reference data, or for round-tripping public studies through cbioformatter for testing.

## Features

### Clinical Data Handling

**Required columns:**
- `SAMPLE_ID` in sample DataFrame (must be unique)
- `PATIENT_ID` in patient DataFrame if provided (must be unique)

**Smart defaults:**
- If `patient_data` is not provided, it's auto-generated from unique `PATIENT_ID` values in `sample_data`
- If `PATIENT_ID` column is missing from `sample_data`, each sample is assigned its own patient (`PATIENT_ID = SAMPLE_ID`)
- Column names are automatically cleaned for cBioPortal compatibility while preserving display names
- Data types are automatically inferred: NUMBER (int/float), BOOLEAN (bool), STRING (everything else)
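
The column cleaning and type inference described above can be pictured with a small sketch. This is not the package's implementation: `clean_column_name` and `infer_datatype` are hypothetical helpers, and the exact cleaning rule (upper-case, non-alphanumeric runs replaced by underscores) is an assumption. It works on plain Python lists so it runs without pandas.

```python
import re


def clean_column_name(name: str) -> str:
    """Upper-case a display name and replace runs of other characters with '_'.

    Assumed rule for illustration; the package's real cleaning may differ.
    """
    cleaned = re.sub(r"[^A-Za-z0-9]+", "_", name.strip()).strip("_")
    return cleaned.upper()


def infer_datatype(values: list) -> str:
    """Return 'BOOLEAN', 'NUMBER', or 'STRING' for a column, ignoring None."""
    present = [v for v in values if v is not None]
    # bool is a subclass of int in Python, so test for it first.
    if present and all(isinstance(v, bool) for v in present):
        return "BOOLEAN"
    if present and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    ):
        return "NUMBER"
    return "STRING"


columns = {
    "Age at Diagnosis": [45, 45, 67],
    "Tumor Type": ["Primary", "Metastasis", "Primary"],
    "Is Metastatic": [False, True, False],
}
for display_name, values in columns.items():
    print(clean_column_name(display_name), infer_datatype(values))
```

With a DataFrame, each column's values could be passed in as `df[col].tolist()`.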

**Validation:**
- Ensures all `SAMPLE_ID` values are unique
- Ensures all `PATIENT_ID` values are unique (if patient data provided)
- Validates referential integrity (all patient IDs in samples exist in patient data)
- Failures raise clear exceptions with specific issues identified
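
If you want to catch these problems before constructing a study, the same three checks can be approximated on your own data. The sketch below is an assumption-laden stand-in, not the package's code: `check_clinical_records` is a hypothetical helper that takes lists of row dictionaries (for example from `df.to_dict("records")`) and returns human-readable problems instead of raising.

```python
from collections import Counter
from typing import Optional


def check_clinical_records(samples: list, patients: Optional[list] = None) -> list:
    """Return a list of problem descriptions; an empty list means the checks passed."""
    problems = []

    # 1. SAMPLE_ID values must be unique.
    sample_counts = Counter(row["SAMPLE_ID"] for row in samples)
    duplicates = sorted(sid for sid, n in sample_counts.items() if n > 1)
    if duplicates:
        problems.append(f"duplicate SAMPLE_ID values: {duplicates}")

    if patients is not None:
        # 2. PATIENT_ID values must be unique in the patient table.
        patient_counts = Counter(row["PATIENT_ID"] for row in patients)
        duplicates = sorted(pid for pid, n in patient_counts.items() if n > 1)
        if duplicates:
            problems.append(f"duplicate PATIENT_ID values: {duplicates}")

        # 3. Every patient referenced by a sample must exist in the patient table.
        #    A sample without PATIENT_ID falls back to its SAMPLE_ID.
        referenced = {row.get("PATIENT_ID", row["SAMPLE_ID"]) for row in samples}
        missing = sorted(referenced - set(patient_counts))
        if missing:
            problems.append(f"PATIENT_ID values missing from patient data: {missing}")

    return problems


samples = [
    {"SAMPLE_ID": "S001", "PATIENT_ID": "P001"},
    {"SAMPLE_ID": "S001", "PATIENT_ID": "P001"},  # accidental duplicate row
    {"SAMPLE_ID": "S003", "PATIENT_ID": "P002"},
]
patients = [{"PATIENT_ID": "P001"}]
for problem in check_clinical_records(samples, patients):
    print(problem)
```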

### Mutation Data Processing

**Input:** VCF files (one per sample)

**How it works:**
1. Add a `VCF_PATH` column to your `sample_data` DataFrame with file paths
2. VCF files are automatically converted to MAF format using vcf2maf
3. All MAF files are concatenated into a single mutation file
4. Sample IDs are correctly mapped to `Tumor_Sample_Barcode`
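
Steps 3 and 4 can be illustrated with a standard-library sketch. It is a simplified stand-in for whatever the package does internally: `concatenate_mafs` is a hypothetical helper, it takes MAF text rather than file paths, and it assumes every per-sample MAF has the same columns.

```python
import csv
import io


def concatenate_mafs(maf_texts: dict) -> str:
    """Merge per-sample MAF text ({sample_id: text}) into one tab-separated table."""
    out = io.StringIO()
    writer = None
    for sample_id, text in maf_texts.items():
        # Skip blank lines and comment lines such as '#version 2.4'.
        lines = [ln for ln in text.splitlines() if ln and not ln.startswith("#")]
        reader = csv.DictReader(lines, delimiter="\t")
        for row in reader:
            # Map the sample ID onto the barcode column.
            row["Tumor_Sample_Barcode"] = sample_id
            if writer is None:
                # Write the header once, using the first file's columns.
                writer = csv.DictWriter(
                    out,
                    fieldnames=reader.fieldnames,
                    delimiter="\t",
                    lineterminator="\n",
                )
                writer.writeheader()
            writer.writerow(row)
    return out.getvalue()


header = "Hugo_Symbol\tTumor_Sample_Barcode\tVariant_Classification\n"
maf_a = "#version 2.4\n" + header + "TP53\tTUMOR\tMissense_Mutation\n"
maf_b = "#version 2.4\n" + header + "PIK3CA\tTUMOR\tMissense_Mutation\n"
merged = concatenate_mafs({"S001": maf_a, "S002": maf_b})
print(merged)
```

The merged table keeps a single header row, and each variant row carries the sample ID it came from.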

**Flexible data availability:**
- If `VCF_PATH` column is missing entirely → no mutation data included
- If some samples have VCF paths and others don't → mutation data included only for samples with valid paths
- At least one valid VCF path must be provided if the column exists
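
The three rules above amount to a small decision procedure. The following sketch is one way to express it under stated assumptions: `mutation_samples` is a hypothetical helper, "valid" is taken to mean a non-empty string (a real check would presumably also confirm the file exists), and rows are plain dictionaries.

```python
def mutation_samples(samples: list) -> dict:
    """Return {sample_id: vcf_path} for samples that have a usable VCF path.

    - No VCF_PATH column at all: returns {} (no mutation data).
    - Column present, some paths missing: returns only the samples with paths.
    - Column present, no usable path: raises ValueError.
    """
    if not any("VCF_PATH" in row for row in samples):
        return {}

    with_vcf = {}
    for row in samples:
        raw = row.get("VCF_PATH")
        # None, NaN (a float) and empty strings all count as "no path".
        if isinstance(raw, str) and raw.strip():
            with_vcf[row["SAMPLE_ID"]] = raw.strip()

    if not with_vcf:
        raise ValueError("VCF_PATH column is present but no sample has a usable path")
    return with_vcf


samples = [
    {"SAMPLE_ID": "S001", "VCF_PATH": "/data/vcf/S001.vcf"},
    {"SAMPLE_ID": "S002", "VCF_PATH": "/data/vcf/S002.vcf"},
    {"SAMPLE_ID": "S003", "VCF_PATH": None},
]
print(mutation_samples(samples))
```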

**Requirements:**
- vcf2maf must be installed (see [installation guide](https://github.com/mskcc/vcf2maf))
- VCF files must match the specified genome build (`GRCh37` or `GRCh38`)
- Reference genome files for vcf2maf (users provide their own reference path)

### Study Validation

The `validate()` method:
1. Creates temporary files in cBioPortal format
2. Runs the official cBioPortal validator (from [cBioPortal datahub-study-curation-tools](https://github.com/cBioPortal/datahub-study-curation-tools))
3. Generates an HTML validation report
4. Cleans up temporary files
5. Returns a validation result object

**Validation result object:**
```python
result.is_valid      # True if validation passed (clean or warnings-only)
result.report_path   # Path to HTML validation report
result.errors        # Errors AND/OR warnings emitted by the validator
```

`is_valid` is `True` for a clean validation and for warnings-only results; in the warnings-only case, `result.errors` is populated and `write_files(validate=True)` proceeds with a `UserWarning`. Errors (validator exit code 1 or 2) raise `ValidationError` from `write_files(validate=True)` and study files are not written.
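
As a rough illustration of the mapping from validator exit codes to outcomes: the text above only states that codes 1 and 2 are errors. The meanings of 0 (clean) and 3 (warnings only) in this sketch are assumptions taken from the cBioPortal validator's own documentation, and `classify_validator_exit` is a hypothetical helper, not part of cbioformatter.

```python
def classify_validator_exit(code: int) -> tuple:
    """Map a validator exit code to (is_valid, description)."""
    outcomes = {
        0: (True, "clean"),  # assumed: no errors or warnings
        3: (True, "warnings only"),  # assumed: warnings but no errors
        1: (False, "errors"),
        2: (False, "validation could not be completed"),
    }
    if code not in outcomes:
        raise ValueError(f"unexpected validator exit code: {code}")
    return outcomes[code]


for code in (0, 3, 1, 2):
    print(code, classify_validator_exit(code))
```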

**Validator acquisition:** The cBioPortal validator is AGPL-3.0 licensed and lives in a separate repository, so cbioformatter does not bundle it. On first `validate()` call, the validator is cloned into `~/.cache/cbioformatter/validator/` (~5 MB, requires `git` and internet). Subsequent calls reuse the cache.

For air-gapped or CI environments, pre-clone the validator and set `CBIOFORMATTER_VALIDATOR_PATH`:

```bash
git clone --depth 1 https://github.com/cBioPortal/datahub-study-curation-tools.git
export CBIOFORMATTER_VALIDATOR_PATH=$(pwd)/datahub-study-curation-tools/validation/validator
```
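
The lookup order described above (environment variable first, then the cache directory) could be sketched as follows. `resolve_validator_dir` is a hypothetical helper written for illustration; the package's actual resolution logic may differ in detail.

```python
import os
from pathlib import Path


def resolve_validator_dir(env=None) -> Path:
    """Return the directory where the validator is expected to live.

    `env` defaults to os.environ; passing a dict makes the function easy to test.
    """
    env = os.environ if env is None else env
    override = env.get("CBIOFORMATTER_VALIDATOR_PATH")
    if override:
        # A pre-cloned validator takes precedence over the cache.
        return Path(override).expanduser()
    return Path.home() / ".cache" / "cbioformatter" / "validator"


print(resolve_validator_dir({"CBIOFORMATTER_VALIDATOR_PATH": "/opt/validator"}))
print(resolve_validator_dir({}))
```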

### File Output

The `write_files()` method generates a complete cBioPortal study directory:

```
my_studies/
└── brca_ocdo_2026/
    ├── meta_study.txt
    ├── meta_clinical_patient.txt
    ├── data_clinical_patient.txt
    ├── meta_clinical_sample.txt
    ├── data_clinical_sample.txt
    ├── meta_mutations.txt           # if mutation data provided
    ├── data_mutations.txt           # if mutation data provided
    └── case_lists/
        ├── cases_all.txt
        └── cases_sequenced.txt      # if mutation data provided
```

**Parameters:**
- `output_dir` (default: `"."`) - Base directory for output. Study files are created in `{output_dir}/{study_id}/`
- `validate` (default: `True`) - If `True`, runs validation before writing files. Set to `False` to skip validation (use with caution).
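
To give a feel for what ends up in the study directory, here is a minimal sketch of the output path and of two of the files. The field names follow the cBioPortal file-format documentation as far as known; the helpers `study_dir`, `render_meta`, and `render_cases_all` are hypothetical, and the exact content `write_files()` produces may differ.

```python
from pathlib import Path


def study_dir(output_dir: str, study_id: str) -> Path:
    """Study files are created in {output_dir}/{study_id}/."""
    return Path(output_dir) / study_id


def render_meta(fields: dict) -> str:
    """cBioPortal meta files are plain 'key: value' lines."""
    return "".join(f"{key}: {value}\n" for key, value in fields.items())


def render_cases_all(study_id: str, sample_ids: list) -> str:
    """A case list names a group of samples; IDs are tab-separated."""
    return render_meta({
        "cancer_study_identifier": study_id,
        "stable_id": f"{study_id}_all",
        "case_list_name": "All samples",
        "case_list_description": f"All samples ({len(sample_ids)} samples)",
        "case_list_ids": "\t".join(sample_ids),
    })


meta_study = render_meta({
    "type_of_cancer": "brca",
    "cancer_study_identifier": "brca_ocdo_2026",
    "name": "Breast Cancer Study (Office of the Chief Data Officer 2026)",
    "description": "Clinical and genomic data from breast cancer patients",
    "reference_genome": "hg38",
})
cases_all = render_cases_all("brca_ocdo_2026", ["S001", "S002", "S003"])

print(study_dir("./my_studies", "brca_ocdo_2026"))
print(meta_study)
print(cases_all)
```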

### Uploading to cBioPortal (Optional)

For users running their own cBioPortal instance, cbioformatter can push studies and gene panels directly into the running server. This is a fully **optional** advanced feature — the formatting and validation features above work standalone.

**Requirements:**
- A running cBioPortal instance (typically via Docker)
- Docker accessible on your machine (`docker compose` available in your PATH)
- The host directory containing your study must be bind-mounted into the cBioPortal container

**Uploading a study:**
```python
study.upload(
    study_dir,                          # Path returned by write_files()
    url="http://localhost:8080/",       # cBioPortal instance URL
    container="cbioportal",             # docker-compose service name
    mount_path="/study",                # path inside container where study_dir is mounted
)
```

The `upload()` method invokes `metaImport.py` inside the cBioPortal container and returns a result object with the import status and a link to the HTML report.

**Uploading a gene panel:**
```python
from cbioformatter import upload_gene_panel

upload_gene_panel(
    panel_file="./panels/data_gene_panel_impact341.txt",
    container="cbioportal",
    mount_path="/study",
)
```

Gene panels are study-independent reference data — they need to be loaded into cBioPortal **before** any studies that reference them.

**Environment variable defaults:**
- `CBIOPORTAL_URL` — overrides the default `url`
- `CBIOPORTAL_CONTAINER` — overrides the default `container`
- `CBIOPORTAL_MOUNT_PATH` — overrides the default `mount_path`
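
One plausible reading of these defaults is a three-level lookup: an explicit argument, then the environment variable, then the built-in default. That an explicit argument wins over the environment variable is an assumption here, and `resolve_setting` is a hypothetical helper rather than the package's code.

```python
import os

# Built-in defaults as listed in the API reference.
BUILT_IN_DEFAULTS = {
    "url": "http://localhost:8080/",
    "container": "cbioportal",
    "mount_path": "/study",
}
ENV_NAMES = {
    "url": "CBIOPORTAL_URL",
    "container": "CBIOPORTAL_CONTAINER",
    "mount_path": "CBIOPORTAL_MOUNT_PATH",
}


def resolve_setting(name: str, explicit=None, env=None) -> str:
    """Pick a value: explicit argument, else environment variable, else built-in default."""
    if explicit is not None:
        return explicit
    env = os.environ if env is None else env
    return env.get(ENV_NAMES[name]) or BUILT_IN_DEFAULTS[name]


fake_env = {"CBIOPORTAL_URL": "http://cbio.internal:8080/"}
print(resolve_setting("url", env=fake_env))        # from the environment
print(resolve_setting("container", env=fake_env))  # built-in default
print(resolve_setting("url", "http://other/", env=fake_env))  # explicit wins
```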

### Fetching from the cBioPortal Datahub (Optional)

The [cBioPortal datahub](https://github.com/cBioPortal/datahub) hosts public studies and gene panel definitions. cbioformatter provides utilities to download them:

```python
from cbioformatter import fetch_datahub_study, fetch_datahub_panel

study_dir = fetch_datahub_study("msk_impact_2017", output_dir="./studies")
panel_file = fetch_datahub_panel("impact341", output_dir="./panels")
```

These functions return local paths; they do **not** automatically upload the fetched data. Combine with `upload()` / `upload_gene_panel()` for a complete fetch-and-load workflow.

## API Reference

### ClinicalStudy

```python
ClinicalStudy(
    study_id: str,
    name: str,
    description: str,
    cancer_type: str,
    genome_build: str,
    sample_data: pd.DataFrame,
    patient_data: pd.DataFrame | None = None
)
```

**Parameters:**
- `study_id`: Unique identifier for the study (no spaces, lowercase recommended)
- `name`: Human-readable study name
- `description`: Brief description of the study
- `cancer_type`: Valid cBioPortal cancer type (see [cBioPortal documentation](https://docs.cbioportal.org/file-formats/#cancer-type))
- `genome_build`: Reference genome build. Accepts UCSC names (`"hg19"`, `"hg38"`, `"mm10"`) or NCBI/Ensembl aliases (`"GRCh37"`, `"GRCh38"`, `"GRCm38"`); aliases are translated to the UCSC form on write since cBioPortal's validator only accepts UCSC names
- `sample_data`: pandas DataFrame with sample-level clinical attributes. Must include `SAMPLE_ID`. Optionally includes `PATIENT_ID` and `VCF_PATH`
- `patient_data`: Optional pandas DataFrame with patient-level clinical attributes. Must include `PATIENT_ID` if provided
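
The alias translation described for `genome_build` boils down to a lookup table. The sketch below shows the idea; `to_ucsc` is a hypothetical helper, and treating names as case-sensitive is an assumption.

```python
UCSC_NAMES = {"hg19", "hg38", "mm10"}
ALIASES = {"GRCh37": "hg19", "GRCh38": "hg38", "GRCm38": "mm10"}


def to_ucsc(genome_build: str) -> str:
    """Return the UCSC name for a genome build, translating NCBI/Ensembl aliases."""
    if genome_build in UCSC_NAMES:
        return genome_build
    if genome_build in ALIASES:
        return ALIASES[genome_build]
    supported = sorted(UCSC_NAMES | set(ALIASES))
    raise ValueError(f"unsupported genome build {genome_build!r}; expected one of {supported}")


for build in ("GRCh38", "hg19", "GRCm38"):
    print(build, "->", to_ucsc(build))
```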

**Methods:**

#### `validate()`
Validates the study using cBioPortal's official validator.

**Returns:** `ValidationResult` object with:
- `is_valid` (bool): Whether validation passed (`True` for clean or warnings-only results)
- `report_path` (str): Path to HTML validation report
- `errors` (list): Errors and/or warnings emitted by the validator (can be non-empty even when `is_valid` is `True`)

#### `write_files(output_dir=".", validate=True)`
Writes all study files to disk.

**Parameters:**
- `output_dir` (str): Base output directory (default: current directory)
- `validate` (bool): If True, runs validation before writing files (default: True)

**Returns:** `Path` to the created study directory (`{output_dir}/{study_id}/`)

**Raises:**
- `ValidationError` if `validate=True` and the cBioPortal validator reports errors. Study files are not written. Pass `validate=False` to skip validation.

#### `upload(study_dir, url=..., container=..., mount_path=..., report_dir=None)`
Uploads a written study directory into a running cBioPortal instance via `metaImport.py`.

**Parameters:**
- `study_dir` (str | Path): Path to the study directory produced by `write_files()`
- `url` (str): cBioPortal instance URL (default: `"http://localhost:8080/"`, or `$CBIOPORTAL_URL`)
- `container` (str): Name of the cbioportal docker-compose service (default: `"cbioportal"`, or `$CBIOPORTAL_CONTAINER`)
- `mount_path` (str): Path inside the container where `study_dir`'s parent is bind-mounted (default: `"/study"`, or `$CBIOPORTAL_MOUNT_PATH`)
- `report_dir` (str | Path, optional): Where to save the HTML import report (default: alongside `study_dir`)

**Returns:** `UploadResult` object with:
- `success` (bool): Whether import succeeded
- `report_path` (str): Path to HTML import report
- `errors` (list): List of import errors if upload failed

**Raises:**
- `RuntimeError` if Docker is not running or the container cannot be reached
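
Because `mount_path` refers to where the parent of `study_dir` is bind-mounted, the study's location inside the container can be derived from the directory name alone. A minimal sketch of that path translation, using a hypothetical helper `container_study_path`:

```python
from pathlib import Path, PurePosixPath


def container_study_path(study_dir: str, mount_path: str = "/study") -> str:
    """Translate a host study directory into its path inside the container.

    Only the final directory name carries over, since the parent directory
    is what is mounted at `mount_path`.
    """
    return str(PurePosixPath(mount_path) / Path(study_dir).name)


print(container_study_path("./my_studies/brca_ocdo_2026"))
print(container_study_path("/data/studies/brca_ocdo_2026/", mount_path="/import"))
```

If the printed path does not exist inside the container, the bind mount is the first thing to check.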

### Module-level functions

#### `upload_gene_panel(panel_file, container=..., mount_path=...)`
Imports a single gene panel definition file into a running cBioPortal instance via `importGenePanel.pl`.

**Parameters:**
- `panel_file` (str | Path): Path to the panel definition file
- `container` (str): docker-compose service name (default: `"cbioportal"`, or `$CBIOPORTAL_CONTAINER`)
- `mount_path` (str): Path inside the container where `panel_file`'s parent is mounted (default: `"/study"`, or `$CBIOPORTAL_MOUNT_PATH`)

#### `fetch_datahub_study(study_id, output_dir=".")`
Downloads and extracts a public study from the cBioPortal datahub.

**Parameters:**
- `study_id` (str): Datahub study ID (e.g., `"msk_impact_2017"`, `"chol_tcga"`)
- `output_dir` (str | Path): Where to extract the study (default: current directory)

**Returns:** `Path` to the extracted study directory

**Raises:**
- `ValueError` if the study ID is not found in the datahub

#### `fetch_datahub_panel(panel_name, output_dir=".")`
Downloads a public gene panel definition from the cBioPortal datahub.

**Parameters:**
- `panel_name` (str): Datahub panel name (e.g., `"impact341"`, `"impact468"`)
- `output_dir` (str | Path): Where to save the file (default: current directory)

**Returns:** `Path` to the downloaded panel file

**Raises:**
- `ValueError` if the panel name is not found in the datahub

## Example Workflow

See the [example notebook](examples/basic_usage.ipynb) for a complete walkthrough using simulated data.

## Supported Data Types (Current Version)

- ✅ Clinical data (patient and sample attributes)
- ✅ Mutation data (VCF → MAF conversion)
- ⏳ Copy number alterations (CNA) - planned for future release
- ⏳ Gene expression data - planned for future release
- ⏳ Methylation data - planned for future release

## Supported Workflows

- ✅ Format clinical and genomic data into cBioPortal-compatible files
- ✅ Validate study files locally (no cBioPortal instance required)
- ✅ Upload studies into a running cBioPortal instance (optional, requires Docker)
- ✅ Import gene panel definitions (optional, requires Docker)
- ✅ Fetch public studies and panels from the cBioPortal datahub (optional)

## Requirements

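- Python 3.10+
- pandas, Jinja2, PyYAML, and requests (installed automatically as dependencies)
- git and internet access (for the first `validate()` call, unless `CBIOFORMATTER_VALIDATOR_PATH` is set)
- vcf2maf (optional, for VCF processing)
- Docker with a running cBioPortal instance (optional, only for upload features)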

## External Tools

This package relies on the following external tools for mutation data processing:

**vcf2maf** (optional, for VCF processing):
- Required only if you're including mutation data from VCF files
- See [vcf2maf installation guide](https://github.com/mskcc/vcf2maf) for setup instructions
- Requires a reference genome (GRCh37 or GRCh38)

## Troubleshooting

### Common Issues

**"SAMPLE_ID duplicates found"**
- Ensure all values in your `SAMPLE_ID` column are unique
- Check for accidentally duplicated rows in your data

**"PATIENT_ID 'P123' not found in patient data"**
- Every patient ID referenced in sample data must exist in patient data
- If you didn't provide patient data, this shouldn't happen (it's auto-generated)

**"VCF file not found: /path/to/file.vcf"**
- Check that all file paths in the `VCF_PATH` column are correct
- Use absolute paths, or make sure relative paths resolve from your current working directory

**"vcf2maf not found"**
- Install vcf2maf following the [installation guide](https://github.com/mskcc/vcf2maf)
- Ensure vcf2maf is available in your PATH
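
A quick way to check whether the external tools are visible on your PATH is `shutil.which`. The sketch below is a generic diagnostic, not part of cbioformatter; `find_missing_tools` is a hypothetical helper, and the executable names it probes (including `vcf2maf.pl`) are assumptions about how the tools are installed on your system.

```python
import shutil


def find_missing_tools(candidates: dict) -> list:
    """Return the labels for which none of the candidate executables is on PATH.

    `candidates` maps a label to a list of executable names to try.
    """
    missing = []
    for label, names in candidates.items():
        if not any(shutil.which(name) for name in names):
            missing.append(label)
    return missing


missing = find_missing_tools({
    "vcf2maf": ["vcf2maf", "vcf2maf.pl"],  # needed for VCF processing
    "git": ["git"],                        # needed to fetch the validator
    "docker": ["docker"],                  # needed only for upload
})
print("Missing tools:", missing or "none")
```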

**Validation fails with complex errors**
- Review the HTML validation report at the path provided
- Common issues: incorrect cancer type, malformed column names, missing required fields

## Contributing

Contributions welcome! Please see [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.

## License

MIT License - see [LICENSE](LICENSE) for details.

## Citation

If you use cBioFormatter in your research, please mention the GitHub repository:

> cBioFormatter: https://github.com/getwilds/cbioformatter

**Future aim:** We plan to submit cBioFormatter to the [Journal of Open Source Software (JOSS)](https://joss.theoj.org/) for peer review. Once published, a formal citation will be provided here.

## Contact

**Fred Hutch users:**
- FH-Data Slack: [#cbioportal-support](https://fhdata.slack.com/archives/C088E41ARV3) channel (or reach out to Taylor Firman or Emma Bishop)
- [Research Computing Data House Call](https://calendly.com/data-house-calls/computing?back=1&month=2026-01)

**External users:**
- Email: wilds@fredhutch.org
- **Issues:** [GitHub Issues](https://github.com/getwilds/cbioformatter/issues)
- **Questions:** [GitHub Discussions](https://github.com/getwilds/cbioformatter/discussions)

## Acknowledgments

- Built to support the Fred Hutch Cancer Center cBioPortal instance
- Uses cBioPortal's official validation tools
- Part of the [WILDS](https://getwilds.org/) ecosystem
