Metadata-Version: 2.4
Name: tessera-foundation
Version: 0.1.2
Summary: TESSERA: a foundation model for the cancer genome.
Author-email: John-William Sidhom <johnwilliamsidhom@gmail.com>
License: PolyForm Noncommercial License 1.0.0
        
        <https://polyformproject.org/licenses/noncommercial/1.0.0>
        
        ## Acceptance
        
        In order to get any license under these terms, you must agree
        to them as both strict obligations and conditions to all
        your licenses.
        
        ## Copyright License
        
        The licensor grants you a copyright license for the
        software to do everything you might do with the software
        that would otherwise infringe the licensor's copyright
        in it for any permitted purpose.  However, you may
        only distribute the software according to [Distribution
        License](#distribution-license) and make changes or new works
        based on the software according to [Changes and New Works
        License](#changes-and-new-works-license).
        
        ## Distribution License
        
        The licensor grants you an additional copyright license
        to distribute copies of the software.  Your license
        to distribute covers distributing the software with
        changes and new works permitted by [Changes and New Works
        License](#changes-and-new-works-license).
        
        ## Notices
        
        You must ensure that anyone who gets a copy of any part of
        the software from you also gets a copy of these terms or the
        URL for them above, as well as copies of any plain-text lines
        beginning with `Required Notice:` that the licensor provided
        with the software.  For example:
        
        > Required Notice: Copyright 2026 NewYork-Presbyterian and Weill Cornell Medicine.
        > TESSERA is licensed for academic and non-commercial use only.
        > Commercial licensing: contact NewYork-Presbyterian's technology transfer office.
        
        ## Changes and New Works License
        
        The licensor grants you an additional copyright license to
        make changes and new works based on the software for any
        permitted purpose.
        
        ## Patent License
        
        The licensor grants you a patent license for the software that
        covers patent claims the licensor can license, or becomes able
        to license, that you would infringe by using the software.
        
        ## Noncommercial Purposes
        
        Any noncommercial purpose is a permitted purpose.
        
        ## Personal Uses
        
        Personal use for research, experiment, and testing for
        the benefit of public knowledge, personal study, private
        entertainment, hobby projects, amateur pursuits, or religious
        observance, without any anticipated commercial application,
        is use for a permitted purpose.
        
        ## Noncommercial Organizations
        
        Use by any charitable organization, educational institution,
        public research organization, public safety or health
        organization, environmental protection organization,
        or government institution is use for a permitted purpose
        regardless of the source of funding or obligations resulting
        from the funding.
        
        ## Fair Use
        
        You may have "fair use" rights for the software under the
        law. These terms do not limit them.
        
        ## No Other Rights
        
        These terms do not allow you to sublicense or transfer any of
        your licenses to anyone else, or prevent the licensor from
        granting licenses to anyone else.  These terms do not imply
        any other licenses.
        
        ## Patent Defense
        
        If you make any written claim that the software infringes or
        contributes to infringement of any patent, your patent license
        for the software granted under these terms ends immediately. If
        your company makes such a claim, your patent license ends
        immediately for work on behalf of your company.
        
        ## Violations
        
        The first time you are notified in writing that you have
        violated any of these terms, or done anything with the software
        not covered by your licenses, your licenses can nonetheless
        continue if you come into full compliance with these terms,
        and take practical steps to correct past violations, within
        32 days of receiving notice.  Otherwise, all your licenses
        end immediately.
        
        ## No Liability
        
        ***As far as the law allows, the software comes as is, without
        any warranty or condition, and the licensor will not be liable
        to you for any damages arising out of these terms or the use
        or nature of the software, under any kind of legal claim.***
        
        ## Definitions
        
        The **licensor** is the individual or entity offering these
        terms, and the **software** is the software the licensor makes
        available under these terms.
        
        **You** refers to the individual or entity agreeing to these
        terms.
        
        **Your company** is any legal entity, sole proprietorship,
        or other kind of organization that you work for, plus all
        organizations that have control over, are under the control of,
        or are under common control with that organization.  **Control**
        means ownership of substantially all the assets of an entity,
        or the power to direct its management and policies by vote,
        contract, or otherwise.  Control can be direct or indirect.
        
        **Your licenses** are all the licenses granted to you for the
        software under these terms.
        
        **Use** means anything you do with the software requiring one
        of your licenses.
        
        ---
        
        Required Notice: Copyright 2026 NewYork-Presbyterian and Weill Cornell Medicine.
        TESSERA is licensed for academic and non-commercial use only.
        Commercial licensing: contact NewYork-Presbyterian's technology transfer office.
        
Project-URL: Homepage, https://github.com/JW-Sidhom-Lab/tessera
Project-URL: Repository, https://github.com/JW-Sidhom-Lab/tessera
Project-URL: Model weights, https://huggingface.co/JW-Sidhom-Lab/tessera-foundation
Project-URL: Issues, https://github.com/JW-Sidhom-Lab/tessera/issues
Keywords: cancer-genomics,foundation-model,self-supervised-learning,tcga,somatic-variants,copy-number-alterations,bioinformatics
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Operating System :: OS Independent
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: tensorflow>=2.16
Requires-Dist: numpy
Requires-Dist: pandas>=2.0
Requires-Dist: scipy>=1.10
Requires-Dist: scikit-learn>=1.3
Requires-Dist: pyfaidx>=0.7
Requires-Dist: pyliftover>=0.4
Requires-Dist: tqdm>=4.66
Requires-Dist: huggingface_hub>=0.20
Dynamic: license-file

<p align="center">
  <img src="logo.png" alt="TESSERA logo" width="220">
</p>

<p align="center">
  <em>Tumour Embeddings via Self-Supervised Encoding and Reconstruction of Alterations</em><br>
  A foundation model for the cancer genome.
</p>

---

TESSERA is a self-supervised foundation model jointly pretrained on somatic single-nucleotide variants (SNVs) and copy-number alterations (CNAs) from the TCGA Pan-Cancer Atlas. A single learned representation, produced once and reused without retraining, supports variant pathogenicity prediction, pan-cancer tumour-type classification, unsupervised molecular subtyping, prognostic stratification, and counterfactual treatment-effect estimation.

This repository contains the reference implementation, a pointer to the pretrained weights, and the end-to-end analysis pipelines that accompany the TESSERA manuscript.

## Quick start

```bash
pip install tessera-foundation
```

```python
import tessera, pandas as pd

snv_df = pd.read_csv("snv.csv")    # cols: Tumor_Sample_Barcode, Chromosome,
                                   # Start_Position, Reference_Allele,
                                   # Tumor_Seq_Allele2, vaf
cna_df = pd.read_csv("cna.csv")    # cols: Tumor_Sample_Barcode, Chromosome,
                                   # Start, End, Segment_Mean

result = tessera.featurize(
    snv_df=snv_df, cna_df=cna_df,
    variant="joint_snv_cna_noloh",        # or "joint_snv_cna" (with-LoH)
    from_assembly="GRCh37",               # "GRCh38" triggers UCSC liftover
    quantile_normalize_to_tcga=False,     # set True for panel/cell-line data
)

result.snv_features      # (n_variants, 1169)  per-variant embeddings
result.cna_features      # (n_segments, 688)   per-segment embeddings
```

The first call downloads the requested model variant from the Hugging Face Hub (~185 MB); the first call with SNV input also downloads the GRCh37 reference genome (~3 GB). Both are cached locally.

**CSV column conventions:**

- **SNV**: `Tumor_Sample_Barcode`, `Chromosome` (no `chr` prefix), `Start_Position`, `Reference_Allele`, `Tumor_Seq_Allele2`, plus either `vaf` or both `t_alt_count` + `t_ref_count`. Single-base substitutions only.
- **CNA**: `Tumor_Sample_Barcode`, `Chromosome`, `Start`, `End`, `Segment_Mean` (log2 ratio); optional `LOH` column triggers the with-LoH variant.
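
The conventions above can be enforced before calling `featurize`. The following is a minimal sketch, not part of the package: `prepare_snv_table` is a hypothetical helper, the coordinates are toy values, and computing `vaf` as `t_alt_count / (t_alt_count + t_ref_count)` is the usual definition but is an assumption about how TESSERA derives it internally.

```python
import pandas as pd

def prepare_snv_table(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: coerce a MAF-like table to the SNV column conventions."""
    out = df.copy()
    # Drop a leading "chr" so that "chr17" becomes "17".
    out["Chromosome"] = out["Chromosome"].astype(str).str.replace(r"^chr", "", regex=True)
    # Keep single-base substitutions only: one A/C/G/T base on each side, and they differ.
    bases = {"A", "C", "G", "T"}
    ref = out["Reference_Allele"].astype(str).str.upper()
    alt = out["Tumor_Seq_Allele2"].astype(str).str.upper()
    out = out.loc[ref.isin(bases) & alt.isin(bases) & (ref != alt)].copy()
    # Derive vaf from read counts when it is not already present (assumed formula).
    if "vaf" not in out.columns:
        depth = out["t_alt_count"] + out["t_ref_count"]
        out = out.loc[depth > 0].copy()  # rows with zero depth have no defined vaf
        out["vaf"] = out["t_alt_count"] / (out["t_alt_count"] + out["t_ref_count"])
    return out.reset_index(drop=True)

# Toy input: two substitutions and one deletion that should be filtered out.
raw = pd.DataFrame({
    "Tumor_Sample_Barcode": ["S1", "S1", "S2"],
    "Chromosome": ["chr17", "7", "chrX"],
    "Start_Position": [1000, 2000, 3000],
    "Reference_Allele": ["C", "T", "AT"],
    "Tumor_Seq_Allele2": ["T", "G", "A"],
    "t_alt_count": [30, 10, 5],
    "t_ref_count": [70, 30, 5],
})
clean = prepare_snv_table(raw)
print(clean[["Tumor_Sample_Barcode", "Chromosome", "vaf"]])
```

Filtering up front makes it explicit which rows are dropped, rather than leaving that to the model's input validation.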

### When to set `quantile_normalize_to_tcga=True`

TESSERA was pretrained on TCGA whole-exome ABSOLUTE `Segment_Mean` values (median 0.000, IQR [0, +0.51]). Inputs whose log2-ratio distribution differs from this should be rank-mapped (quantile-normalized) onto the TCGA reference before inference.

| Input type | Setting | Why |
|---|---|---|
| TCGA-like whole-exome ABSOLUTE | `False` (default) | Same distribution the model was pretrained on. |
| Panel sequencing (MSK-IMPACT, MSK-CHORD, GENIE) | **`True`** | Panel coverage compresses log2-ratios toward zero (KS = 0.38 vs TCGA). |
| Cell-line data (DepMap, CCLE) | **`True`** | Raw log2-ratios are right-shifted; DepMap median ≈ +1.0 vs TCGA's 0.0 (KS = 0.72). |

The bundled reference (`tessera/data/cna_sorted.npy`, 7 MB, 1.8 M segments) is loaded automatically when `True`. The helper `tessera.data.preprocessing.quantile_normalize_to_tcga` is also exposed if you'd rather pre-normalize.
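
To make the rank-mapping idea concrete, here is a minimal NumPy sketch of quantile normalization against a sorted reference, together with a two-sample KS statistic of the kind quoted in the table. It is an illustration under stated assumptions, not the package's implementation: both function names are hypothetical, `reference` is a synthetic stand-in for the bundled TCGA array, and the real helper may treat ties and interpolation differently.

```python
import numpy as np

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    grid = np.concatenate([a, b])  # the maximum gap is attained at a data point
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))

def rank_map_to_reference(values, reference_sorted):
    """Replace each value by the reference value at the same empirical quantile."""
    values = np.asarray(values, dtype=float)
    reference_sorted = np.asarray(reference_sorted, dtype=float)
    # Quantile of every input value within its own cohort, strictly inside (0, 1).
    # Ties are broken by input order rather than averaged.
    order = np.argsort(values, kind="stable")
    ranks = np.empty(values.size, dtype=float)
    ranks[order] = np.arange(values.size)
    quantiles = (ranks + 0.5) / values.size
    # Look the same quantiles up in the sorted reference distribution.
    ref_quantiles = (np.arange(reference_sorted.size) + 0.5) / reference_sorted.size
    return np.interp(quantiles, ref_quantiles, reference_sorted)

rng = np.random.default_rng(0)
reference = np.sort(rng.normal(0.0, 0.3, size=5000))  # synthetic stand-in, NOT the TCGA array
shifted = rng.normal(1.0, 0.3, size=2000)             # right-shifted cohort, cell-line-like
mapped = rank_map_to_reference(shifted, reference)

ks_before = ks_statistic(shifted, reference)
ks_after = ks_statistic(mapped, reference)
print(f"KS before: {ks_before:.2f}, KS after: {ks_after:.3f}")
```

The mapping preserves the rank order of segments within the input cohort and changes only their scale, which is why it suits inputs that are shifted or compressed relative to TCGA.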

### Lower-level building blocks

```python
from tessera import load_pretrained, lift_snv, lift_cna

# snv_df and cna_df are the DataFrames loaded in the quick start above.
model = load_pretrained("joint_snv_cna_noloh")          # download + instantiate
snv_df, _ = lift_snv(snv_df, from_assembly="GRCh38")    # no-op when from_assembly="GRCh37"
result = model.featurize(snv_df=snv_df, cna_df=cna_df)  # reuse without re-downloading
```

UCSC chain files for liftover are downloaded on first use to `~/.cache/pyliftover/`; offline environments can supply a local file via `chain_file=` or the `TESSERA_LIFTOVER_CHAIN` env var.
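
For offline jobs it can help to check the chain-file configuration before any analysis starts. The sketch below is a hypothetical pre-flight check, not a TESSERA function: it only reads the `TESSERA_LIFTOVER_CHAIN` variable named above and confirms the file exists. How the package itself resolves the variable, including its precedence relative to `chain_file=`, is not documented here and is not assumed.

```python
import os
from pathlib import Path

def find_offline_chain(env=None):
    """Return the chain file named by TESSERA_LIFTOVER_CHAIN, or None if unset.

    Raises FileNotFoundError when the variable is set but the file is missing,
    so a misconfigured offline job fails early.
    """
    env = os.environ if env is None else env
    raw = env.get("TESSERA_LIFTOVER_CHAIN")
    if not raw:
        return None
    path = Path(raw).expanduser()
    if not path.is_file():
        raise FileNotFoundError(f"TESSERA_LIFTOVER_CHAIN points to a missing file: {path}")
    return path

chain = find_offline_chain()
print("offline chain file:", chain if chain is not None else "not configured")
```

A non-`None` result is a path that could then be passed through the `chain_file=` argument mentioned above.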

## Reproducing the manuscript

For training, downstream analyses, and figure generation, clone the repo:

```bash
git clone https://github.com/JW-Sidhom-Lab/tessera.git
cd tessera
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
bash tessera/ref_genomes/download_ref_genomes.sh
```

The pipeline runs in three stages:

1. **Data preparation** ([`data/`](data/README.md)): per-cohort download instructions, source-table provenance, and the builders that turn raw releases into the analysis-ready CSVs.
2. **Foundation-model pretraining** ([`scripts/tcga_pancan_*/`](scripts/README.md)): trains the SNV models, the CNA models, and the joint SNV+CNA InfoNCE-aligned foundation model on the TCGA Pan-Cancer Atlas.
3. **Downstream analyses** ([`scripts/`](scripts/README.md)): variant-pathogenicity calibration, cross-platform validation, tumour-type classification, prognostic stratification, doubly-robust counterfactual treatment-effect estimation, and cell-line transfer.
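
For readers unfamiliar with the InfoNCE alignment mentioned in stage 2, the following NumPy sketch shows the general form of a symmetric InfoNCE loss over paired embeddings. It is a generic illustration, not the training code in `scripts/`: the function name, the temperature value, the symmetric averaging, and the use of one SNV and one CNA embedding per sample are all assumptions.

```python
import numpy as np

def info_nce(snv_emb, cna_emb, temperature=0.1):
    """Symmetric InfoNCE over a batch of paired (SNV, CNA) embeddings.

    Row i of snv_emb and row i of cna_emb are assumed to come from the same sample.
    """
    # L2-normalise so the dot product is a cosine similarity.
    a = snv_emb / np.linalg.norm(snv_emb, axis=1, keepdims=True)
    b = cna_emb / np.linalg.norm(cna_emb, axis=1, keepdims=True)
    logits = a @ b.T / temperature  # (batch, batch); the diagonal holds the true pairs

    def cross_entropy_on_diagonal(z):
        z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
        log_softmax = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_softmax))

    # Average the SNV-to-CNA and CNA-to-SNV directions.
    return 0.5 * (cross_entropy_on_diagonal(logits) + cross_entropy_on_diagonal(logits.T))

rng = np.random.default_rng(0)
snv = rng.normal(size=(8, 16))
cna_matched = snv + 0.01 * rng.normal(size=(8, 16))  # near-identical partner embeddings
cna_mismatched = np.roll(cna_matched, 1, axis=0)     # every row paired with the wrong sample

aligned = info_nce(snv, cna_matched)
mismatched = info_nce(snv, cna_mismatched)
print(f"aligned: {aligned:.3f}, mismatched: {mismatched:.3f}")
```

The loss is low when each sample's two embeddings are more similar to each other than to any other sample in the batch, which is what pulls the two modalities into a shared space.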

[`scripts/README.md`](scripts/README.md) and [`data/README.md`](data/README.md) hold the per-directory tables linking each script and cohort to the relevant manuscript section.

## Repository layout

```
tessera/
├── tessera/                        # foundation-model package
│   ├── base.py                     # BaseModel: shared data + training infrastructure
│   ├── input_keys.py               # input-key helpers
│   ├── model.py                    # TESSERA: foundation-model class
│   ├── data/
│   │   └── preprocessing.py        # SNV/CNA tokenization, FASTA lookup, sample bagging
│   ├── layers/                     # custom Keras layers (attention, masking, MIL, ...)
│   ├── training/                   # training utilities (callbacks, losses, schedules)
│   └── ref_genomes/                # reference-genome download script + indices
├── data/                           # per-cohort data preparation pipelines (data/README.md)
├── scripts/                        # analysis pipelines backing the manuscript figures (scripts/README.md)
└── README.md
```

## Citing TESSERA

If you use TESSERA in your work, please cite:

> *citation pending publication*

A BibTeX entry will be added on acceptance.

## License

This repository is distributed under the **PolyForm Noncommercial License 1.0.0** (see [`LICENSE`](LICENSE)). Use is permitted for academic research, education, public-research-organization use, and personal experimentation; commercial use is not permitted without a separate license.

- Pretrained foundation-model weights are released on the Hugging Face Hub under **CC-BY-NC-4.0** (non-commercial, attribution required).
- Pretrained weights for downstream clinical task heads (CRC and PDAC treatment-effect models) remain available on request under a Data Use Agreement.
- Patents covering clinical applications of TESSERA are assigned to NewYork-Presbyterian; commercial licensing inquiries should be directed to NYP's technology transfer office.

## Lab

TESSERA is developed in the [JW Sidhom Lab](https://github.com/JW-Sidhom-Lab) at Weill Cornell Medicine.

For questions, collaborations, or commercial-licensing enquiries, contact the corresponding author.
