Metadata-Version: 2.4
Name: ct-validation
Version: 0.1.0
Summary: Benchmarking gene-indication evidence against clinical trial outcomes
Keywords: clinical-trials,target-validation,drug-discovery,genetics,enrichment
Author: Klim Kostiuk, Daniel Igumnov, Peter Fedichev, Amir Feizi
Author-email: Klim Kostiuk <2601074@gmail.com>
License-Expression: MIT
License-File: LICENSE
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Requires-Dist: duckdb>=1.4.3
Requires-Dist: numpy>=2.3.3
Requires-Dist: pandas>=2.3.3
Requires-Dist: pyarrow>=22.0.0
Requires-Dist: pyyaml>=6.0.3
Requires-Dist: scipy>=1.16.2
Requires-Dist: typer>=0.24.0
Requires-Dist: chembl-webresource-client ; extra == 'fetch'
Requires-Dist: huggingface-hub ; extra == 'fetch'
Requires-Dist: joblib ; extra == 'fetch'
Requires-Dist: tqdm ; extra == 'fetch'
Requires-Dist: hail ; extra == 'genebass'
Requires-Dist: mcp[cli]>=1.0 ; extra == 'mcp'
Requires-Dist: polars>=1.5 ; extra == 'parse'
Requires-Dist: requests ; extra == 'parse'
Requires-Dist: tqdm ; extra == 'parse'
Requires-Dist: cyvcf2 ; extra == 'parse'
Requires-Dist: matplotlib ; extra == 'plot'
Requires-Python: >=3.11
Project-URL: Homepage, https://github.com/gero-science/ct-validation
Project-URL: Issues, https://github.com/gero-science/ct-validation/issues
Project-URL: Repository, https://github.com/gero-science/ct-validation
Provides-Extra: fetch
Provides-Extra: genebass
Provides-Extra: mcp
Provides-Extra: parse
Provides-Extra: plot
Description-Content-Type: text/markdown

# <img width="300" alt="ct-validation" src="https://github.com/user-attachments/assets/fc5443b6-e93b-4fbb-a841-fffca899eedf" />

An open framework for benchmarking gene-indication evidence against clinical trial outcomes.


`ct-validation` tests whether a set of gene-indication pairs is enriched for clinical success. It computes risk ratios and odds ratios with confidence intervals across clinical phase transitions and supports semantic disease matching through ontology-based similarity.

> **Paper:** Kostiuk K, Igumnov D, Fedichev P, Feizi A. _ct-validation: an open framework for benchmarking gene-indication evidence against clinical trial outcomes._ (2026)

## Installation

Requires Python 3.11+.

```bash
pip install ct-validation
```

Optional extras:

```bash
pip install "ct-validation[plot]"     # forest plot visualization
pip install "ct-validation[mcp]"      # MCP server for agent workflows
pip install "ct-validation[parse]"    # data source parsers
pip install "ct-validation[fetch]"    # ChEMBL fetching script dependencies
pip install "ct-validation[genebass]" # Hail dependency for the Genebass parser
```

## Quick start

### Python API

```python
import ct_validation as ctv

results = ctv.validate(
    clinical_trials="data/clinical_trials/gene_indication_max_phase.parquet",
    targets="data/genetic_evidence/genetic_evidence.parquet",
    similarity_lookup="data/mappings/efo_similarity_lookup_0.5.parquet",
)
print(results)
#   phase_label  n_yes   n_no  rr  rr_ci_lower  rr_ci_upper  ...
```

Batch mode — compare multiple evidence sources at once:

```python
results = ctv.validate(
    clinical_trials="data/clinical_trials/gene_indication_max_phase.parquet",
    targets=[
        "data/genetic_evidence/gwas_catalog.parquet",
        "data/genetic_evidence/clinvar.parquet",
        "data/genetic_evidence/omim.parquet",
    ],
    similarity_lookup="data/mappings/efo_similarity_lookup_0.5.parquet",
)
# returns a list of DataFrames, one per evidence source
```

Prioritized mode — test whether a novel source adds value over an established baseline:

```python
results = ctv.validate(
    clinical_trials="data/clinical_trials/gene_indication_max_phase.parquet",
    targets="data/genetic_evidence/novel_score.parquet",
    baseline_evidence="data/genetic_evidence/established_genetics.parquet",
    similarity_lookup="data/mappings/efo_similarity_lookup_0.5.parquet",
)
# pairs supported only by baseline are excluded
```

Expand a disease set using semantic similarity:

```python
expanded = ctv.get_expanded_disease_set(
    efo_ids={"EFO:0000270", "EFO:0000384"},
    similarity_pairs="data/mappings/efo_similarity_lookup_0.5.parquet",
    similarity_threshold=0.8,
)
```
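
As a rough illustration of what such an expansion involves (not the package's implementation), the sketch below scans `(efo_id_1, efo_id_2, similarity)` rows and keeps every neighbour whose similarity exceeds the threshold. The function name `expand_ids`, the toy disease IDs, and the similarity values are hypothetical, and the choice of a strict comparison and a symmetric lookup are assumptions.

```python
def expand_ids(seed_ids, pairs, threshold=0.8):
    """Return seed_ids plus every term linked to one of them with similarity > threshold."""
    expanded = set(seed_ids)
    for id_1, id_2, sim in pairs:
        if sim <= threshold:
            continue
        # Assumption: the lookup is symmetric, so a hit in either column pulls in the other term.
        if id_1 in seed_ids:
            expanded.add(id_2)
        if id_2 in seed_ids:
            expanded.add(id_1)
    return expanded


# Toy lookup rows (efo_id_1, efo_id_2, similarity); IDs and values are made up.
toy_pairs = [
    ("D1", "D3", 0.91),
    ("D4", "D2", 0.85),
    ("D1", "D5", 0.62),
]
expanded_toy = expand_ids({"D1", "D2"}, toy_pairs)
print(sorted(expanded_toy))
```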

### CLI

```bash
# With config file
ct-validation --config configs/default.yaml

# With explicit arguments
ct-validation \
    --clinical-trials ct.parquet \
    --targets evidence.parquet \
    --similarity-lookup similarity.parquet \
    -o results/

# Batch mode (multiple evidence sources)
ct-validation \
    --clinical-trials ct.parquet \
    --targets gwas.parquet --targets clinvar.parquet --targets omim.parquet \
    -o results/
```

### MCP server

```bash
ct-validation-mcp
```

Exposes two tools for agent-based workflows:

- `ct_validate` — compute phase-transition enrichment
- `expand_disease_set` — expand EFO IDs via semantic similarity

## Input schemas

| Input               | Columns                              | Description                                        |
| ------------------- | ------------------------------------ | -------------------------------------------------- |
| `clinical_trials`   | `gene`, `efo_id`, `max_phase`        | Target-indication pairs with highest phase reached |
| `targets`           | `gene`, `efo_id`                     | Gene-indication pairs with supporting evidence     |
| `similarity_lookup` | `efo_id_1`, `efo_id_2`, `similarity` | Pairwise EFO similarity (optional)                 |
| `baseline_evidence` | `gene`, `efo_id`                     | Baseline evidence for prioritized mode (optional)  |
| `gene_universe`     | one gene per line (text file)        | Restrict analysis to these genes (optional)        |

All inputs can be passed as Parquet files or pandas DataFrames, except `gene_universe`, which is a text file or a Python set.
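
For the `gene_universe` text format, a minimal sketch of the one-gene-per-line layout and how it maps to a Python set is shown here. The helper `load_gene_universe` and the example gene symbols are illustrative only; skipping blank lines and stripping whitespace are assumptions, not documented behaviour of the package.

```python
import tempfile
from pathlib import Path


def load_gene_universe(path):
    """Read one gene symbol per line, ignoring blank lines and surrounding whitespace."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip() for line in handle if line.strip()}


# Write a toy universe file, then load it back as a set.
with tempfile.TemporaryDirectory() as tmp:
    universe_path = Path(tmp) / "gene_universe.txt"
    universe_path.write_text("PCSK9\nIL6R\n\nTYK2\n", encoding="utf-8")
    universe = load_gene_universe(universe_path)

print(sorted(universe))
```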

## Output schema

| Column                             | Description                                  |
| ---------------------------------- | -------------------------------------------- |
| `phase_from`, `phase_to`           | Phase transition (e.g. 1→2, 1→4)             |
| `n_yes`, `n_no`                    | Pairs entering phase (with/without evidence) |
| `x_yes`, `x_no`                    | Of those, pairs reaching `phase_to`          |
| `rate_yes`, `rate_no`              | Progression rates (`x / n` per group)        |
| `rr`, `rr_ci_lower`, `rr_ci_upper` | Risk ratio with 95% CI (Katz log method)     |
| `or`, `or_ci_lower`, `or_ci_upper` | Odds ratio with 95% CI (Woolf logit method)  |

## Enrichment logic

For each phase transition, target-indication pairs that reached at least the starting phase are divided into supported and unsupported groups. The risk ratio is:

```
RR = (x_yes / n_yes) / (x_no / n_no)
```
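
The ratio above, together with the interval methods named in the output schema, can be sketched in a few lines. This is a textbook version of the Katz log interval for the risk ratio and the Woolf logit interval for the odds ratio, not the package's own code; the function names and the counts are made up, and the sketch does not handle zero cells (how the package treats them is not stated here).

```python
import math

Z_95 = 1.959963984540054  # two-sided 95% quantile of the standard normal


def risk_ratio_katz(x_yes, n_yes, x_no, n_no, z=Z_95):
    """Risk ratio with a Katz log-method confidence interval."""
    rr = (x_yes / n_yes) / (x_no / n_no)
    se_log = math.sqrt(1 / x_yes - 1 / n_yes + 1 / x_no - 1 / n_no)
    return rr, math.exp(math.log(rr) - z * se_log), math.exp(math.log(rr) + z * se_log)


def odds_ratio_woolf(x_yes, n_yes, x_no, n_no, z=Z_95):
    """Odds ratio with a Woolf logit-method confidence interval."""
    a, b = x_yes, n_yes - x_yes  # supported: progressed / did not progress
    c, d = x_no, n_no - x_no     # unsupported: progressed / did not progress
    odds = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return odds, math.exp(math.log(odds) - z * se_log), math.exp(math.log(odds) + z * se_log)


# Hypothetical counts: 30 of 100 supported pairs progress, 150 of 1000 unsupported pairs do.
rr, rr_lo, rr_hi = risk_ratio_katz(30, 100, 150, 1000)
odds, or_lo, or_hi = odds_ratio_woolf(30, 100, 150, 1000)
print(f"RR = {rr:.2f} [{rr_lo:.2f}, {rr_hi:.2f}]")
print(f"OR = {odds:.2f} [{or_lo:.2f}, {or_hi:.2f}]")
```

Both intervals are symmetric on the log scale, which is why the bounds are computed by exponentiating `log(ratio) ± z * se_log`.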

A risk ratio greater than one indicates that supported pairs are more likely to progress than unsupported ones. When a similarity lookup is provided, a pair (gene, disease) counts as supported if the evidence contains a pair (gene, disease') for the same gene and the similarity between disease and disease' is above the threshold (default 0.8).
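
The grouping and counting described in this section can be sketched as follows. This is an illustration under stated assumptions rather than the package's implementation: the helper names, toy genes, disease IDs, and similarity values are all hypothetical, exact matches are treated as supported, and missing similarity pairs are treated as dissimilar.

```python
def is_supported(gene, disease, evidence, similarity, threshold=0.8):
    """True if evidence holds (gene, disease) or (gene, d') with similarity(disease, d') > threshold."""
    for ev_gene, ev_disease in evidence:
        if ev_gene != gene:
            continue
        if ev_disease == disease:
            return True
        # Look the pair up in either orientation; missing pairs count as dissimilar.
        sim = similarity.get((disease, ev_disease), similarity.get((ev_disease, disease), 0.0))
        if sim > threshold:
            return True
    return False


def transition_counts(trials, evidence, similarity, phase_from, phase_to, threshold=0.8):
    """Count pairs entering phase_from (n_*) and reaching phase_to (x_*), split by support."""
    counts = {"n_yes": 0, "x_yes": 0, "n_no": 0, "x_no": 0}
    for gene, disease, max_phase in trials:
        if max_phase < phase_from:
            continue  # the pair never entered the starting phase
        group = "yes" if is_supported(gene, disease, evidence, similarity, threshold) else "no"
        counts["n_" + group] += 1
        if max_phase >= phase_to:
            counts["x_" + group] += 1
    return counts


# Toy inputs: (gene, disease, max_phase) trials, evidence pairs, and disease similarities.
trials = [
    ("GENE_A", "D1", 4),
    ("GENE_A", "D2", 2),
    ("GENE_B", "D1", 1),
    ("GENE_C", "D3", 4),
    ("GENE_C", "D4", 1),
]
evidence = {("GENE_A", "D1"), ("GENE_C", "D5")}
similarity = {("D3", "D5"): 0.9, ("D4", "D5"): 0.6, ("D1", "D2"): 0.5}

counts = transition_counts(trials, evidence, similarity, phase_from=1, phase_to=4)
print(counts)
```

In this toy data, `("GENE_C", "D3")` is supported only through its similarity to `D5`, while `("GENE_C", "D4")` falls below the threshold and lands in the unsupported group.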

### Prioritized mode

When `baseline_evidence` is provided, pairs supported _only_ by the baseline are excluded. This tests whether a novel evidence source adds predictive value beyond an established benchmark.
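
One way to read this rule is sketched here. The function `prioritized_groups` and the toy pairs are hypothetical; the sketch uses exact pair matching only (no similarity lookup) and assumes that pairs supported by both the novel source and the baseline stay in the supported group.

```python
def prioritized_groups(trial_pairs, novel, baseline):
    """Split trial pairs into supported, unsupported, and excluded (baseline-only) groups."""
    supported, unsupported, excluded = [], [], []
    for pair in trial_pairs:
        if pair in novel:
            supported.append(pair)    # novel evidence, with or without baseline support
        elif pair in baseline:
            excluded.append(pair)     # baseline-only support: dropped from the comparison
        else:
            unsupported.append(pair)  # no evidence from either source
    return supported, unsupported, excluded


# Toy (gene, disease) pairs; all identifiers are made up.
trial_pairs = [("GENE_A", "D1"), ("GENE_B", "D1"), ("GENE_C", "D2"), ("GENE_D", "D3")]
novel = {("GENE_A", "D1"), ("GENE_B", "D1")}
baseline = {("GENE_B", "D1"), ("GENE_C", "D2")}

supported, unsupported, excluded = prioritized_groups(trial_pairs, novel, baseline)
print(supported, unsupported, excluded)
```

Removing the baseline-only pairs keeps them out of the unsupported group, so the comparison is between novel-supported pairs and pairs with no evidence from either source.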

## Visualization

```python
import ct_validation as ctv

results = ctv.validate(...)
ctv.forest_plot(results, metric="rr", title="Phase I → Approved")
```

## Data source parsers

The `scripts/` directory contains reproducible parsers for public databases:

**Genetic evidence** (`scripts/parse/genetic_evidence/`):

- GWAS Catalog — genome-wide significant associations (p < 1e-8)
- ClinVar — pathogenic/likely pathogenic variants
- OMIM — established molecular basis (mapping code 3)
- Open Targets — genetic evidence streams (score ≥ 0.5)
- Genebass — exome-wide associations (p ≤ 1e-7)

**Clinical trials** (`scripts/parse/clinical_trials/`):

- ChEMBL — gene-drug and drug-indication links (pChEMBL > 7.0)
- Open Targets — known drug and indication data
- STITCH — high-confidence activation/inhibition links
- DGIdb — drug-gene interactions
- TrialPanorama — interventional studies

**Ontology** (`scripts/r/`):

- EFO semantic similarity matrix (Lin + Resnik information content)
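
For readers unfamiliar with the measure, the sketch below shows one common reading of Lin similarity with Resnik-style information content: `IC(t) = -log p(t)` and `sim(a, b) = 2 * IC(MICA) / (IC(a) + IC(b))`, where MICA is the most informative common ancestor. It is written in Python on a made-up toy hierarchy; the actual scripts in `scripts/r/` are in R and may differ in how term frequencies and ancestors are derived.

```python
import math

# Toy is-a hierarchy (term -> parents) and annotation counts; all values are made up.
parents = {"root": [], "cardio": ["root"], "neuro": ["root"], "cad": ["cardio"], "mi": ["cardio"]}
direct_counts = {"root": 0, "cardio": 2, "neuro": 4, "cad": 6, "mi": 4}


def ancestors(term):
    """Return the term together with all of its ancestors."""
    seen, stack = {term}, [term]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Propagate counts upward so each term also covers its descendants.
total_counts = {term: 0 for term in parents}
for term, n in direct_counts.items():
    for anc in ancestors(term):
        total_counts[anc] += n

n_total = total_counts["root"]
ic = {term: -math.log(total_counts[term] / n_total) for term in parents}


def lin_similarity(a, b):
    """Lin similarity based on the most informative common ancestor (MICA)."""
    ic_mica = max(ic[t] for t in ancestors(a) & ancestors(b))
    denom = ic[a] + ic[b]
    return 2 * ic_mica / denom if denom > 0 else 0.0


print(lin_similarity("cad", "mi"), lin_similarity("cad", "neuro"))
```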

See [DATA_SOURCES.md](DATA_SOURCES.md) for download links, versions, and fetching instructions.

Configure paths in `configs/parsing.yaml` and run:

```bash
python scripts/parse/run_parsing.py
```

## Configuration

See `configs/default.yaml` for validation settings and `configs/parsing.yaml` for data source paths. All config values can be overridden via CLI arguments.

## License

MIT
