Metadata-Version: 2.4
Name: oncoref
Version: 1.8.72
Summary: Curated cancer reference data: ontology, TMB, incidence/mortality, and expression
Author-email: Alex Rubinsteyn <alex.rubinsteyn@unc.edu>
License-Expression: Apache-2.0
Project-URL: Homepage, https://github.com/pirl-unc/oncoref
Project-URL: Repository, https://github.com/pirl-unc/oncoref
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Operating System :: OS Independent
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas
Requires-Dist: numpy
Requires-Dist: pyarrow
Requires-Dist: PyYAML
Requires-Dist: pyensembl
Provides-Extra: plots
Requires-Dist: matplotlib; extra == "plots"
Requires-Dist: adjustText; extra == "plots"
Requires-Dist: matplotlib_venn>=1.0; extra == "plots"
Provides-Extra: dev
Requires-Dist: pytest; extra == "dev"
Requires-Dist: pytest-cov; extra == "dev"
Requires-Dist: ruff; extra == "dev"
Requires-Dist: matplotlib; extra == "dev"
Requires-Dist: adjustText; extra == "dev"
Requires-Dist: matplotlib_venn>=1.0; extra == "dev"
Dynamic: license-file

# oncoref

[![Tests](https://github.com/pirl-unc/oncoref/actions/workflows/tests.yml/badge.svg)](https://github.com/pirl-unc/oncoref/actions/workflows/tests.yml)
[![PyPI](https://img.shields.io/pypi/v/oncoref.svg)](https://pypi.org/project/oncoref/)

Curated cancer reference data — cancer-type ontology, tumor mutational burden
(TMB), incidence/mortality, checkpoint-inhibitor (ICI) response, per-cohort
RNA-seq expression, HPA normal-tissue expression, and HPA-derived
cancer-testis antigen references — behind one small Python API, a data
fetch/cache CLI, and a set of reference plots.

## Where oncoref fits

The openvax/PIRL tools are split by ownership boundary, not by file format.
Newcomers should think of the stack as three layers:

- **oncoref** is the empirical base: what is true, measured, or canonically named
  about cancer types, cohorts, genes, and reference datasets. It owns gene
  identity and canonicalization, the cancer-type ontology/registry, expression
  reference data and normalization, epidemiology, TMB, ICI/aPD1 response, and
  source-anchored CTA facts. If a row has an `n`, a confidence interval, a
  source cohort, or a PMID/DOI anchoring a measurement, it is usually an oncoref
  fact.
- [**pirlygenes**](https://github.com/pirl-unc/pirlygenes) owns curated gene
  sets and panels: which genes are useful for a purpose. That includes
  lineage/family/compartment/supertype panels, discriminators, surfaceome, TME
  and stem-cell markers, response-signature panels, target-to-therapy registries,
  and other opinionated selections keyed to oncoref cancer codes and gene IDs. An
  empty set can be a valid pirlygenes answer.
- [**trufflepig**](https://github.com/pirl-unc/trufflepig) owns per-sample
  interpretation: QC narration, library-prep/source warnings, deconvolution,
  scoring, and rule tables that fire against one tumor sample.

That division keeps the dependency direction simple: pirlygenes and trufflepig
can depend on oncoref, but oncoref never imports its consumers. When a shared
fact is wrong, incomplete, or poorly anchored, fix it in oncoref. When a panel
or rule is purpose-specific, keep it in the downstream package and key it to
oncoref's ontology and gene identifiers.

## oncoref is the base layer

`oncoref` is **designed as the base layer** of the openvax/PIRL stack — the
intended upstream home for shared cancer reference mechanics and data, meant to
become a common dependency of
[pirlygenes](https://github.com/pirl-unc/pirlygenes),
[trufflepig](https://github.com/pirl-unc/trufflepig), and
[tsarina](https://github.com/pirl-unc/tsarina). Adoption is staged: downstream
packages can delegate parity-clean primitives while keeping their own curated
tables, packaged artifacts, and compatibility APIs until those surfaces are
ready to move. Architecturally oncoref stays at the bottom: it depends only on
pandas / numpy / pyarrow / PyYAML, it **never imports its consumers** (data and
logic flow only downward), and shared definitions should be fixed or exposed
here rather than reimplemented separately downstream.

Use oncoref for shared questions about

- **gene expression of cancer samples** — per-cohort RNA-seq in a normalized,
  comparable space: summary stats, tail-weighted percentiles, and medoid/exemplar
  samples per cancer type/subtype. Downstream packages may still keep packaged
  expression artifacts and compatibility wrappers while parity checks converge;
- **HPA protein / RNA** normal-tissue expression;
- the **HPA-derived cancer-testis antigen call** — the HPA tissue-restriction
  call over the candidate list (HPA-only; no pirlygenes therapy/MS curation
  layer);
- the **ontology of cancer types** — codes, the parent/child hierarchy, subtypes,
  families, characteristic driver fusions, and the cross-cutting MSI/POLE/HPV
  groupings; and
- **checkpoint-inhibitor response rates** and **TMB** per cancer type.

Everything keys on the cancer-type registry. The small curated tables ship in the
wheel; the heavy per-cohort expression bundle downloads on first use from
oncoref's own GitHub Release.

## Install

```bash
pip install oncoref
```

## Python API

The flat `oncoref` namespace remains available for compatibility and quick
interactive use. For new code, prefer the semantic submodules in
[docs/api.md](docs/api.md); they make it clearer whether you are working with
the cancer ontology, cohorts, ICI response, CTA coverage, generic antigen-panel
coverage, or CTA-specific peptides.

```python
import oncoref as od

od.resolve_cancer_type("prostate")        # -> "PRAD"
od.cancer_type_info("SARC_RMS_ARMS")      # full registry record + burden + tmb
od.cancer_tmb("LUAD_EGFR")                # 6.9  (inherited from LUAD)
od.cancer_burden("pancreas", metric="us_mortality_pct")
od.burden_category("SARC_OS")             # -> "bone_and_joint" (incidence/mortality bucket)
od.cancer_ici_response("SKCM")            # 42  (anti-PD-1 ORR %; fallback aPD-1 → aPD-L1 → combo)
od.cancer_ici_response("SKCM", regimen="PD-1+CTLA-4")   # 57.6  (pin a regimen)

# Cancer-testis antigens (HPA-derived tissue-restriction):
od.cta_gene_names()                       # expressed CTA symbols (MAGEA4, CT83, …)
od.cta_evidence()                         # full HPA restriction table

# Per-cohort expression percentiles (downloads the data bundle on first use):
od.cohort_gene_percentiles("PRAD")        # per-gene p0…p100 vector (within-cohort)
od.within_sample_top_fraction("PRAD")     # per-gene frac of samples top-5% (within-sample)
```

### Domains

- **Cancer ontology** — `oncoref.cancer_ontology`: `cancer_type_registry`,
  `resolve_cancer_type`, `cancer_type_records`, `cancer_type_codes`,
  `cancer_type_path`, `cancer_type_reference_data`, tree/family/lineage helpers,
  molecular subtype groups, source-scoped evidence resolution, matched normal
  tissue helpers, `viral_status`, `fusion_status`.
- **Cohorts** — `oncoref.cohorts`: `cohort_registry`, `cohort_aggregates`,
  `cohort_source_version`, and mixture-cohort helpers.
- **TMB** — `cancer_tmb`, `cancer_tmb_df` (parent-chain inheritance).
- **Incidence / mortality** — `cancer_burden`, `burden_category` (ACS / GLOBOCAN).
- **Checkpoint response** — `oncoref.ici_response`: regimen-aware ORR anchors,
  anti-PD-1 shortcuts, endpoint estimates, and pooled response summaries.
- **Expression** — `cohort_gene_percentiles`, `within_sample_top_fraction`,
  `representative_cohort_samples` over the lazy-downloaded per-cohort bundle.
- **Clean TPM / normalization** — `oncoref.normalization` for the 16/9/75
  compartment transform and `oncoref.gene_families` for clean-TPM censored
  compartment IDs plus the biological housekeeping denominator panel.
- **Cancer-testis antigens** — `oncoref.cta`: `cta_gene_names`/`cta_gene_ids`,
  `cta_evidence`, `synthesize_restriction` (HPA-only tissue-restriction; MS
  evidence stays in the target-selection layer).
- **CTA coverage / peptides** — `oncoref.cta_coverage` for patient coverage and
  `oncoref.cta_peptides` for CTA-specific 9-mer count maps and load.
- **Generic antigen-panel coverage** — `oncoref.antigen_coverage` for explicit
  non-CTA gene lists supplied by the caller. This computes coverage for a panel;
  it does not make oncoref the owner of downstream panel curation.
- **HPA normal tissue** — `hpa_rna_consensus`, `hpa_normal_tissue` (IHC),
  `hpa_single_cell`, and per-gene lookups (`gene_tissue_ntpm`,
  `gene_protein_tissues`, `gene_cell_type_ntpm`) over HPA v23, fetched on demand
  (`oncoref data fetch hpa`).
- **Genome reference** — `canonical_gene_id`, `canonical_gene_symbol`,
  `display_gene_name`, `short_gene_name`, `canonical_gene_id_and_name`,
  `find_gene_id_by_name`, `find_gene_name_from_ensembl_{gene,transcript}_id`,
  and `aggregate_gene_expression` (symbol/synonym/legacy-Ensembl-ID ↔ canonical
  Ensembl-ID resolution). pyensembl ships with the package, but some
  pyensembl-backed symbol resolution needs a downloaded human release once:
  `pyensembl install --release 111 --species homo_sapiens` (the accessors return
  `None` until then).
- **Legacy/compat response signatures** — `oncoref.response_signatures` ships a
  small historical checkpoint-response signature surface used by oncoref plots.
  Treat it as transitional: new therapy-response signature panels belong in
  pirlygenes, and this small surface should not be extended in oncoref unless it
  is recast as source-anchored empirical fact/provenance rows.
- **Plots** (`pip install oncoref[plots]`) — `oncoref.plots.apd1_vs_tmb`,
  `apd1_orr_bars`, `incidence_vs_mortality`, the CTA/coverage figures, and
  `oncoref.cta_curation_plots.render`.

## CLI

```bash
oncoref cancer-type prostate     # registry info as JSON
oncoref tmb LUAD_EGFR            # 6.9
oncoref ici SKCM                # 42  (--regimen to pin, --all-regimens to compare)
oncoref burden pancreas --metric us_mortality_pct
oncoref cta --count             # number of expressed CTAs
oncoref plot apd1-vs-tmb --out apd1_vs_tmb.png
oncoref plot patient-coverage --gene-set cta --out coverage_out
oncoref plot cta-curation --out cta_curation_out

# managed data downloads/cache:
oncoref data list               # every wheel/bundle/HPA/source dataset
oncoref data status bundle      # expression-bundle cache state (no download)
oncoref data metadata           # package/data/cache/release contract JSON
oncoref data fetch bundle       # download the large expression bundle
oncoref data fetch hpa          # download HPA reference data (RNA / IHC / single-cell)
oncoref data dir bundle         # where the data bundle is cached
oncoref data prune --yes        # delete stale bundle version caches
oncoref version
```

## Development

```bash
./develop.sh   # editable install with dev extras
./format.sh    # ruff format
./lint.sh      # ruff check + format --check
./test.sh      # lint + pytest with coverage
```

## License

Apache 2.0.
