Metadata-Version: 2.4
Name: tessera-foundation
Version: 0.1.0
Summary: TESSERA: a foundation model for the cancer genome (joint SNV+CNA self-supervised pretraining).
Author-email: John-William Sidhom <johnwilliamsidhom@gmail.com>
License: PolyForm Noncommercial License 1.0.0
        
        <https://polyformproject.org/licenses/noncommercial/1.0.0>
        
        ## Acceptance
        
        In order to get any license under these terms, you must agree
        to them as both strict obligations and conditions to all
        your licenses.
        
        ## Copyright License
        
        The licensor grants you a copyright license for the
        software to do everything you might do with the software
        that would otherwise infringe the licensor's copyright
        in it for any permitted purpose.  However, you may
        only distribute the software according to [Distribution
        License](#distribution-license) and make changes or new works
        based on the software according to [Changes and New Works
        License](#changes-and-new-works-license).
        
        ## Distribution License
        
        The licensor grants you an additional copyright license
        to distribute copies of the software.  Your license
        to distribute covers distributing the software with
        changes and new works permitted by [Changes and New Works
        License](#changes-and-new-works-license).
        
        ## Notices
        
        You must ensure that anyone who gets a copy of any part of
        the software from you also gets a copy of these terms or the
        URL for them above, as well as copies of any plain-text lines
        beginning with `Required Notice:` that the licensor provided
        with the software.  For example:
        
        > Required Notice: Copyright 2026 NewYork-Presbyterian and Weill Cornell Medicine.
        > TESSERA is licensed for academic and non-commercial use only.
        > Commercial licensing: contact NewYork-Presbyterian's technology transfer office.
        
        ## Changes and New Works License
        
        The licensor grants you an additional copyright license to
        make changes and new works based on the software for any
        permitted purpose.
        
        ## Patent License
        
        The licensor grants you a patent license for the software that
        covers patent claims the licensor can license, or becomes able
        to license, that you would infringe by using the software.
        
        ## Noncommercial Purposes
        
        Any noncommercial purpose is a permitted purpose.
        
        ## Personal Uses
        
        Personal use for research, experiment, and testing for
        the benefit of public knowledge, personal study, private
        entertainment, hobby projects, amateur pursuits, or religious
        observance, without any anticipated commercial application,
        is use for a permitted purpose.
        
        ## Noncommercial Organizations
        
        Use by any charitable organization, educational institution,
        public research organization, public safety or health
        organization, environmental protection organization,
        or government institution is use for a permitted purpose
        regardless of the source of funding or obligations resulting
        from the funding.
        
        ## Fair Use
        
        You may have "fair use" rights for the software under the
        law. These terms do not limit them.
        
        ## No Other Rights
        
        These terms do not allow you to sublicense or transfer any of
        your licenses to anyone else, or prevent the licensor from
        granting licenses to anyone else.  These terms do not imply
        any other licenses.
        
        ## Patent Defense
        
        If you make any written claim that the software infringes or
        contributes to infringement of any patent, your patent license
        for the software granted under these terms ends immediately. If
        your company makes such a claim, your patent license ends
        immediately for work on behalf of your company.
        
        ## Violations
        
        The first time you are notified in writing that you have
        violated any of these terms, or done anything with the software
        not covered by your licenses, your licenses can nonetheless
        continue if you come into full compliance with these terms,
        and take practical steps to correct past violations, within
        32 days of receiving notice.  Otherwise, all your licenses
        end immediately.
        
        ## No Liability
        
        ***As far as the law allows, the software comes as is, without
        any warranty or condition, and the licensor will not be liable
        to you for any damages arising out of these terms or the use
        or nature of the software, under any kind of legal claim.***
        
        ## Definitions
        
        The **licensor** is the individual or entity offering these
        terms, and the **software** is the software the licensor makes
        available under these terms.
        
        **You** refers to the individual or entity agreeing to these
        terms.
        
        **Your company** is any legal entity, sole proprietorship,
        or other kind of organization that you work for, plus all
        organizations that have control over, are under the control of,
        or are under common control with that organization.  **Control**
        means ownership of substantially all the assets of an entity,
        or the power to direct its management and policies by vote,
        contract, or otherwise.  Control can be direct or indirect.
        
        **Your licenses** are all the licenses granted to you for the
        software under these terms.
        
        **Use** means anything you do with the software requiring one
        of your licenses.
        
        ---
        
        Required Notice: Copyright 2026 NewYork-Presbyterian and Weill Cornell Medicine.
        TESSERA is licensed for academic and non-commercial use only.
        Commercial licensing: contact NewYork-Presbyterian's technology transfer office.
        
Project-URL: Homepage, https://github.com/JW-Sidhom-Lab/tessera
Project-URL: Repository, https://github.com/JW-Sidhom-Lab/tessera
Project-URL: Model weights, https://huggingface.co/JW-Sidhom-Lab/tessera-foundation
Project-URL: Issues, https://github.com/JW-Sidhom-Lab/tessera/issues
Keywords: cancer-genomics,foundation-model,self-supervised-learning,tcga,somatic-variants,copy-number-alterations,bioinformatics
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Operating System :: OS Independent
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: tensorflow>=2.16
Requires-Dist: numpy
Requires-Dist: pandas>=2.0
Requires-Dist: scipy>=1.10
Requires-Dist: scikit-learn>=1.3
Requires-Dist: pyfaidx>=0.7
Requires-Dist: pyliftover>=0.4
Requires-Dist: tqdm>=4.66
Requires-Dist: huggingface_hub>=0.20
Dynamic: license-file

<p align="center">
  <img src="logo.png" alt="TESSERA logo" width="220">
</p>

<p align="center">
  <em>Tumour Embeddings via Self-Supervised Encoding and Reconstruction of Alterations</em><br>
  A foundation model for the cancer genome.
</p>

---

TESSERA is a self-supervised foundation model jointly pretrained on somatic single-nucleotide variants (SNVs) and copy-number alterations (CNAs) from the TCGA Pan-Cancer Atlas. A single learned representation, produced once and reused without retraining, supports variant pathogenicity prediction, pan-cancer tumour-type classification, unsupervised molecular subtyping, prognostic stratification, and counterfactual treatment-effect estimation.

This repository contains the reference implementation, the pretrained-weights pointer, the inference utilities described in the accompanying paper, and the end-to-end analysis pipelines that reproduce every panel of Figures 1-6 and Supplementary Figures 1-12.

## Quick start

The fastest way to use TESSERA is the public inference API on Hugging Face, which needs no local installation of the model. Upload SNV and/or CNA data and get back per-token reconstruction predictions and embeddings:

🔗 **Inference API**: [huggingface.co/spaces/JW-Sidhom-Lab/tessera](https://huggingface.co/spaces/JW-Sidhom-Lab/tessera) *(coming soon)*

From Python (`pip install gradio_client`):

```python
import time
from gradio_client import Client, handle_file

client = Client("JW-Sidhom-Lab/tessera")        # the public Spaces URL also works

# Submit returns (status_html, job_id) immediately; inference runs async
_, job_id = client.predict(
    handle_file("snv.csv"),         # SNV CSV; or None
    handle_file("cna.csv"),         # CNA CSV; or None. At least one required.
    True,                           # apply TCGA quantile normalization to CNA
    "you@example.com",              # email address for the download link
    "GRCh37",                       # genome assembly: "GRCh37" or "GRCh38"
    api_name="/submit",
)

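# Poll for completion (the same URL also gets emailed when the job finishes)
while True:
    status = client.predict(job_id, api_name="/status")
    if status["status"] in ("done", "failed"):
        break
    time.sleep(10)

# Stop here on failure instead of reading a download URL that may not exist
if status["status"] == "failed":
    raise RuntimeError(f"TESSERA job {job_id} failed")

print(status["url"])    # 24h pre-signed S3 download URL with the result ZIP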
```
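
The polling loop above waits indefinitely. For unattended scripts, a bounded wait may be preferable. The following is a minimal sketch, not part of the TESSERA package: `wait_for_job` and its parameters are hypothetical names, and the only thing it assumes about the API is the `"done"` / `"failed"` status values shown above.

```python
import time


def wait_for_job(fetch_status, timeout_s=3600.0, interval_s=10.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll `fetch_status()` until the job finishes or `timeout_s` elapses.

    `fetch_status` is any zero-argument callable returning a dict with a
    "status" key (hypothetical helper; the terminal values "done" and
    "failed" are the ones used in the example above).
    """
    deadline = clock() + timeout_s
    while True:
        status = fetch_status()
        if status["status"] in ("done", "failed"):
            return status
        if clock() >= deadline:
            raise TimeoutError(f"job still not finished after {timeout_s} s")
        sleep(interval_s)
```

With the client from the example, this would be called as `wait_for_job(lambda: client.predict(job_id, api_name="/status"))`. Passing the status call in as a function keeps the helper independent of `gradio_client` and easy to exercise without network access.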

The API serves the foundation-model outputs only (per-token embeddings + per-token reconstruction predictions, returned as `.npy` files inside the result ZIP). Downstream task heads (tumour-type classifier, treatment-effect score) are available on request under a Data Use Agreement.
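
Once the result ZIP has been downloaded, the arrays can be read without unpacking it to disk. This is a sketch under one stated assumption: the member names inside the ZIP are not documented here, so the hypothetical helper `load_npy_arrays` simply loads every `.npy` member it finds and keys the result by member name.

```python
import io
import zipfile

import numpy as np


def load_npy_arrays(zip_source):
    """Load every .npy member of a ZIP archive into a {member_name: array} dict.

    `zip_source` may be a filesystem path or a binary file-like object.
    Hypothetical helper; member names are whatever the archive contains.
    """
    arrays = {}
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if name.endswith(".npy"):
                # Buffer the member in memory so np.load gets a seekable stream.
                payload = io.BytesIO(archive.read(name))
                arrays[name] = np.load(payload, allow_pickle=False)
    return arrays
```

`allow_pickle=False` is a deliberately conservative choice for files fetched over the network; it restricts loading to plain numeric arrays.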

CSV column conventions:

- **SNV**: `Tumor_Sample_Barcode`, `Chromosome` (no `chr` prefix), `Start_Position`, `Reference_Allele`, `Tumor_Seq_Allele2`, plus either `vaf` or both `t_alt_count` + `t_ref_count`. Single-base substitutions only.
- **CNA**: `Tumor_Sample_Barcode`, `Chromosome`, `Start`, `End`, `Segment_Mean` (log2 ratio); optional `LOH` column triggers the with-LoH model variant.
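
The conventions above can be checked locally before uploading. The sketch below uses only the standard library; the function names are hypothetical, the sample row is made up for illustration, and the checks cover only what the list above states (required columns, no `chr` prefix, single-base substitutions). The server may enforce additional rules.

```python
import csv
import io

# Column names copied from the conventions listed above.
SNV_REQUIRED = ["Tumor_Sample_Barcode", "Chromosome", "Start_Position",
                "Reference_Allele", "Tumor_Seq_Allele2"]
CNA_REQUIRED = ["Tumor_Sample_Barcode", "Chromosome", "Start", "End",
                "Segment_Mean"]


def check_snv_header(columns):
    """Return a list of problems in an SNV header (empty list means OK)."""
    cols = set(columns)
    problems = [f"missing column: {c}" for c in SNV_REQUIRED if c not in cols]
    has_counts = {"t_alt_count", "t_ref_count"} <= cols
    if "vaf" not in cols and not has_counts:
        problems.append("need either 'vaf' or both 't_alt_count' and 't_ref_count'")
    return problems


def check_cna_header(columns):
    """Return a list of problems in a CNA header (empty list means OK)."""
    cols = set(columns)
    return [f"missing column: {c}" for c in CNA_REQUIRED if c not in cols]


def check_snv_row(row):
    """Row-level checks for one SNV record (a dict keyed by column name)."""
    problems = []
    if row["Chromosome"].lower().startswith("chr"):
        problems.append("Chromosome must not carry a 'chr' prefix")
    ref, alt = row["Reference_Allele"], row["Tumor_Seq_Allele2"]
    if len(ref) != 1 or len(alt) != 1 or ref not in "ACGT" or alt not in "ACGT":
        problems.append("only single-base substitutions are accepted")
    return problems


# Illustrative input (made-up sample with a 'chr' prefix left in by mistake).
sample = io.StringIO(
    "Tumor_Sample_Barcode,Chromosome,Start_Position,Reference_Allele,Tumor_Seq_Allele2,vaf\n"
    "S1,chr7,140453136,A,T,0.31\n"
)
reader = csv.DictReader(sample)
header_problems = check_snv_header(reader.fieldnames)
row_problems = [p for row in reader for p in check_snv_row(row)]
print(header_problems, row_problems)
```

For a real file, replace the `io.StringIO` object with `open("snv.csv", newline="")`.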

## Local installation

For users who want to run inference offline, integrate TESSERA into a custom pipeline, or retrain on their own data:

```bash
# Clone
git clone https://github.com/JW-Sidhom-Lab/tessera.git
cd tessera

# Recommended: a virtual environment so deps don't clash with system Python
python3 -m venv .venv && source .venv/bin/activate

# Install all dependencies
pip install -r requirements.txt

# Download reference genome (default: GRCh37)
bash tessera/ref_genomes/download_ref_genomes.sh
```

`requirements.txt` covers the foundation-model package, all manuscript-reproduction scripts (pretraining, classifiers, prognostic / predictive-biomarker analyses), and the Gradio inference API. A trimmer subset for deploying only the inference API is at [`inference_api/requirements.txt`](inference_api/requirements.txt).

Weights are hosted on the Hugging Face Hub at [huggingface.co/JW-Sidhom-Lab/tessera-foundation](https://huggingface.co/JW-Sidhom-Lab/tessera-foundation) under CC-BY-NC-4.0. The shortest path from raw dataframes to feature tensors is the `featurize` one-liner. It downloads the weights on first call (cached afterwards), lifts coordinates from other assemblies to GRCh37 (hg19), builds the dataset, and runs both per-modality feature heads:

```python
import tessera

result = tessera.featurize(
    snv_df=snv_df,                      # columns: Tumor_Sample_Barcode, Chromosome, Start_Position,
                                        #          Reference_Allele, Tumor_Seq_Allele2, vaf
    cna_df=cna_df,                      # columns: Tumor_Sample_Barcode, Chromosome, Start, End, Segment_Mean
    variant="joint_snv_cna_noloh",      # or "joint_snv_cna" for the with-LoH variant
    from_assembly="GRCh38",             # "GRCh37" / "hg19" is a no-op; otherwise UCSC liftover runs
)

result.snv_features      # (n_variants, 1169)  per-variant embeddings, row-aligned with result.snv_table
result.cna_features      # (n_segments, 688)   per-segment embeddings, row-aligned with result.cna_table
result.liftover_stats    # {"snv": {"n_in", "n_out", "n_dropped"}, "cna": {...}}
```
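
The column comment in the example lists `vaf` for `snv_df`. If a table carries only read counts, the column can be derived first. This is a sketch, assuming the usual definition VAF = alt reads / (alt reads + ref reads); `ensure_vaf` is a hypothetical helper, and whether `featurize` also accepts the raw count columns directly is not stated here.

```python
import pandas as pd


def ensure_vaf(snv_df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `snv_df` with a `vaf` column, derived from counts if absent.

    Assumes VAF = t_alt_count / (t_alt_count + t_ref_count). Rows with zero
    total depth get NaN rather than a misleading value.
    """
    if "vaf" in snv_df.columns:
        return snv_df.copy()
    alt = snv_df["t_alt_count"].astype(float)
    ref = snv_df["t_ref_count"].astype(float)
    depth = alt + ref
    out = snv_df.copy()
    out["vaf"] = (alt / depth).where(depth > 0)
    return out
```

Rows with NaN in `vaf` would then need to be dropped or reviewed before featurization; which of the two is appropriate depends on the cohort.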

For finer-grained control, the individual building blocks are also exposed:

```python
from tessera import load_pretrained, lift_snv, lift_cna

model = load_pretrained("joint_snv_cna_noloh")    # download + instantiate, ~3 s cold
snv_df, _ = lift_snv(snv_df, from_assembly="GRCh38")    # identity if from_assembly=="GRCh37"
cna_df, _ = lift_cna(cna_df, from_assembly="GRCh38")
result = model.featurize(snv_df=snv_df, cna_df=cna_df)  # repeat without re-downloading
```

UCSC chain files are downloaded on first use and cached at `~/.cache/pyliftover/`; offline environments can point the loader at a bundled chain file via the `chain_file=` argument or the `TESSERA_LIFTOVER_CHAIN` environment variable.
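
In an offline pipeline it can help to confirm that the bundled chain file exists before any liftover is attempted. The helper below is a hypothetical sketch and not part of the package. It assumes an explicit argument should take precedence over the `TESSERA_LIFTOVER_CHAIN` environment variable, which is a guess at the loader's behaviour rather than something documented here.

```python
import os
from pathlib import Path


def resolve_chain_file(chain_file=None, env=os.environ):
    """Pick a local chain file from an explicit path or the environment.

    Returns a Path, or None when neither source is set (in which case the
    default download-and-cache behaviour would apply). The precedence order
    (argument first, then TESSERA_LIFTOVER_CHAIN) is an assumption.
    """
    candidate = chain_file or env.get("TESSERA_LIFTOVER_CHAIN")
    if candidate is None:
        return None
    path = Path(candidate).expanduser()
    if not path.is_file():
        raise FileNotFoundError(f"chain file not found: {path}")
    return path
```

Failing early with a clear error avoids a silent fallback to a network download on a machine that has no network access.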

## Reproducing the manuscript

Every published panel is backed by a script in this repository. The
pipeline runs in three stages:

1. **Data preparation** ([`data/`](data/README.md)): per-cohort
   download instructions, source-table provenance, and the
   `create_training_data*.py` / `build_<cohort>_metadata.py` builders
   that turn raw releases into the analysis-ready CSVs.
2. **Foundation-model pretraining**
   ([`scripts/tcga_pancan_*/`](scripts/README.md)): trains the SNV
   models, the CNA models, and the joint SNV+CNA InfoNCE-aligned
   foundation model on the TCGA Pan-Cancer Atlas.
3. **Downstream analyses** ([`scripts/`](scripts/README.md)):
   variant-pathogenicity (Fig. 1 h-o), cross-platform validation
   (Fig. 1 f-g, Fig. 2 d), tumour-type classification (Fig. 3,
   Fig. 4 b-e), prognostic UMAP + joint Cox (Fig. 5), doubly-robust
   counterfactual treatment-effect (Fig. 6 a-m), and DepMap
   cell-line transfer (Fig. 6 n).

[`scripts/README.md`](scripts/README.md) and
[`data/README.md`](data/README.md) hold the full per-directory tables
mapping each script and cohort to its manuscript figure.

## Repository layout

```
tessera/
├── tessera/                        # foundation-model package
│   ├── base.py                     # BaseModel: shared data + training infrastructure
│   ├── input_keys.py               # input-key helpers
│   ├── model.py                    # TESSERA: foundation-model class
│   ├── data/
│   │   └── preprocessing.py        # SNV/CNA tokenization, FASTA lookup, sample bagging
│   ├── layers/                     # custom Keras layers (attention, masking, MIL, ...)
│   ├── training/                   # training utilities (callbacks, losses, schedules)
│   └── ref_genomes/                # reference-genome download script + indices
├── data/                           # per-cohort data preparation pipelines (data/README.md)
├── scripts/                        # analysis pipelines backing the manuscript figures (scripts/README.md)
└── README.md
```

## Citing TESSERA

If you use TESSERA in your work, please cite:

> *citation pending publication*

A BibTeX entry will be added on acceptance.

## License

This repository is distributed under the **PolyForm Noncommercial License 1.0.0** (see [`LICENSE`](LICENSE)). Use is permitted for academic research, education, public-research-organization use, and personal experimentation; commercial use is not permitted without a separate license. Pretrained foundation-model weights are released on the Hugging Face Hub under **CC-BY-NC-4.0** (non-commercial, attribution required). Pretrained weights for downstream clinical task heads (CRC and PDAC treatment-effect models) remain available on request under a Data Use Agreement. Patents covering clinical applications of TESSERA are assigned to NewYork-Presbyterian; commercial-licensing enquiries should be directed to NewYork-Presbyterian's technology transfer office.

## Lab

TESSERA is developed in the [JW Sidhom Lab](https://github.com/JW-Sidhom-Lab) at Weill Cornell Medicine.

For questions, collaborations, or commercial-licensing enquiries, contact the corresponding author.
