Metadata-Version: 2.4
Name: tcr-pmhc-analyzer
Version: 0.1.1
Summary: Analyzer for TCR-pMHC binding predictor outputs
Project-URL: Homepage, https://github.com/qbic-pipelines/tcr-pmhc-analyzer
Project-URL: Repository, https://github.com/qbic-pipelines/tcr-pmhc-analyzer.git
Project-URL: Issues, https://github.com/qbic-pipelines/tcr-pmhc-analyzer/issues
Author: Mark Polster
License-Expression: MIT
License-File: LICENSE
Keywords: analysis,peptide,tcr,tcr-pMHC,tcr-peptide,tcr-pmhc
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Requires-Python: >=3.10
Requires-Dist: matplotlib>=3.8.0
Requires-Dist: pandas>=3.0.0
Requires-Dist: rich-click>=1.9.6
Requires-Dist: scikit-learn>=1.4.0
Provides-Extra: dev
Requires-Dist: ruff==0.14.14; extra == 'dev'
Description-Content-Type: text/markdown

# tcr-pmhc-analyzer

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
[![PyPI version](https://img.shields.io/pypi/v/tcr-pmhc-analyzer.svg)](https://pypi.org/project/tcr-pmhc-analyzer/)

Analyzer for TCR-pMHC binding predictor outputs. Merges predictions from multiple models into a unified table, detects data leakage against bundled training sets, identifies seen/unseen peptides, and benchmarks model performance via ROC curves.

## Installation

```bash
pip install tcr-pmhc-analyzer
```

For development:

```bash
git clone https://github.com/qbic-pipelines/tcr-pmhc-analyzer.git
cd tcr-pmhc-analyzer
pip install -e ".[dev]"
```

## Input format

Both commands accept a tab-separated (TSV) configuration file via `--input`, with two required columns:

| Column  | Description                              |
|---------|------------------------------------------|
| `model` | Name of the prediction model             |
| `file`  | Path to the model's prediction output    |

Example `input.tsv`:

```tsv
model	file
ergo2	results/ergo2_predictions.csv
mixtcrpred	results/mixtcrpred_predictions.csv
t2pmhc-gcn	results/t2pmhc_gcn_predictions.csv
```

Each prediction file must contain the following columns:

| Column          | Description                                         |
|-----------------|-----------------------------------------------------|
| `identifier`    | Unique sample identifier (used for merging)         |
| `binding_score` | Model's predicted binding score                     |
| `binder`        | Ground truth label (0/1), required for benchmarking |
| `peptide`       | Peptide sequence                                    |
| `cdr3a`         | CDR3 alpha chain sequence                           |
| `cdr3b`         | CDR3 beta chain sequence                            |
| `va`, `vb`      | V gene alpha/beta                                   |
| `ja`, `jb`      | J gene alpha/beta                                   |
| `mhc`           | MHC allele                                          |
| `organism`      | Source organism                                     |
| `mhc_class`     | MHC class                                           |
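
A quick pre-flight check can catch missing columns before either command runs. The sketch below uses only the Python standard library. The helper names `missing_columns` and `check_inputs` are hypothetical and not part of tcr-pmhc-analyzer, and the code assumes prediction files are comma-separated, as the `.csv` extension in the example suggests. It also treats `binder` as always required, which is stricter than needed if you only run `create-analyzer-table`.

```python
import csv

# Column names taken from the two tables above.
CONFIG_COLUMNS = {"model", "file"}
PREDICTION_COLUMNS = {
    "identifier", "binding_score", "binder", "peptide", "cdr3a", "cdr3b",
    "va", "vb", "ja", "jb", "mhc", "organism", "mhc_class",
}


def missing_columns(path, required, delimiter):
    """Return the required column names absent from the file's header row."""
    with open(path, newline="") as handle:
        header = next(csv.reader(handle, delimiter=delimiter), [])
    return sorted(required - set(header))


def check_inputs(config_path):
    """Map each problematic file path to the list of columns it lacks."""
    missing = missing_columns(config_path, CONFIG_COLUMNS, "\t")
    if missing:
        # Without `model` and `file` the prediction files cannot be located.
        return {config_path: missing}
    problems = {}
    with open(config_path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            # Assumption: prediction files are comma-separated.
            missing = missing_columns(row["file"], PREDICTION_COLUMNS, ",")
            if missing:
                problems[row["file"]] = missing
    return problems
```

Calling `check_inputs("input.tsv")` returns an empty dictionary when every listed file has all expected columns. Paths in the `file` column are resolved relative to the current working directory here.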

## Commands

### `create-analyzer-table`

Merges predictions from multiple models into a single table with rank-normalized scores, data leakage annotations, and seen-peptide flags.

```bash
tcr-pmhc-analyzer create-analyzer-table [OPTIONS]
```

| Option              | Short | Required | Description                                                 |
|---------------------|-------|----------|-------------------------------------------------------------|
| `--input PATH`      | `-i`  | Yes      | Path to TSV config file with `model` and `file` columns     |
| `--output PATH`     | `-o`  | Yes      | Output file path (`.csv` or `.tsv`)                         |
| `--ergo-version`    |       | If ergo2 | ERGO training data version: `vdjdb` or `mcpas`              |

**Example:**

```bash
tcr-pmhc-analyzer create-analyzer-table \
  -i input.tsv \
  -o analyzer_table.csv \
  --ergo-version vdjdb
```

**Output columns added:**
- `binding_score_{model}` — raw binding score per model
- `rank_score_{model}` — rank-normalized score in [0, 1] (1 = highest)
- `sample_in_train_{model}` — `True` if the sample appears in the model's training data (data leakage)
- `seen_in_{model}` — `True` if the peptide was seen in the model's training data

### `benchmark`

Generates ROC curve plots comparing model performance, split by seen vs. unseen peptides. Samples flagged as data leakage are removed automatically before analysis.

```bash
tcr-pmhc-analyzer benchmark [OPTIONS]
```

| Option              | Short | Required | Description                                                        |
|---------------------|-------|----------|--------------------------------------------------------------------|
| `--input PATH`      | `-i`  | *        | Path to TSV config file with `model` and `file` columns            |
| `--table PATH`      |       | *        | Path to a pre-created analyzer table (alternative to `--input`)    |
| `--output PATH`     | `-o`  | Yes      | Output directory for ROC curve plots                               |
| `--ergo-version`    |       | If ergo2 | ERGO training data version: `vdjdb` or `mcpas`                     |
| `--models`          | `-m`  | No       | Space-separated list of models to benchmark (default: all available)|

\* Either `--input` or `--table` must be provided.

**Examples:**

```bash
# Benchmark from raw predictions (input.tsv lists ergo2, so --ergo-version is required)
tcr-pmhc-analyzer benchmark -i input.tsv -o results/ --ergo-version vdjdb

# Benchmark from a pre-created analyzer table
tcr-pmhc-analyzer benchmark --table analyzer_table.csv -o results/

# Benchmark specific models only
tcr-pmhc-analyzer benchmark -i input.tsv -o results/ --ergo-version vdjdb -m "ergo2 mixtcrpred t2pmhc-gcn"
```

**Output files:**
- `roc_curve_unseen.png` — ROC curves for peptides unseen by all selected models
- `roc_curve_seen.png` — ROC curves for peptides seen by all selected models
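
To make the seen/unseen split concrete, here is a minimal standard-library sketch of the filtering described above, working on rows of an analyzer table. It is an illustration, not the package's implementation. The function names are hypothetical, and three points are assumptions: a sample is dropped if it is flagged as leaked for any selected model, the raw `binding_score_{model}` column is the one scored, and the flag columns hold real booleans (values read from a CSV file would first need converting from strings).

```python
def split_for_benchmark(rows, models):
    """Drop leaked samples, then split the rest into seen and unseen subsets."""
    # Assumption: a leak in any selected model removes the sample.
    clean = [
        row for row in rows
        if not any(row[f"sample_in_train_{m}"] for m in models)
    ]
    # "Seen" means seen by all selected models; "unseen" means seen by none.
    seen = [r for r in clean if all(r[f"seen_in_{m}"] for m in models)]
    unseen = [r for r in clean if not any(r[f"seen_in_{m}"] for m in models)]
    return seen, unseen


def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one binder and one non-binder")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy analyzer-table rows for a single hypothetical model named "a".
rows = [
    {"binder": 1, "binding_score_a": 0.9, "sample_in_train_a": False, "seen_in_a": True},
    {"binder": 0, "binding_score_a": 0.2, "sample_in_train_a": False, "seen_in_a": True},
    {"binder": 1, "binding_score_a": 0.8, "sample_in_train_a": True, "seen_in_a": True},
    {"binder": 1, "binding_score_a": 0.4, "sample_in_train_a": False, "seen_in_a": False},
    {"binder": 0, "binding_score_a": 0.6, "sample_in_train_a": False, "seen_in_a": False},
]
seen, unseen = split_for_benchmark(rows, ["a"])
auc_seen = roc_auc([r["binder"] for r in seen], [r["binding_score_a"] for r in seen])
auc_unseen = roc_auc([r["binder"] for r in unseen], [r["binding_score_a"] for r in unseen])
```

With several models selected, peptides seen by some models but not others fall into neither subset under this reading of "seen by all" and "unseen by all".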

## Supported models

| Model          | Training data                          |
|----------------|----------------------------------------|
| `ergo2`        | mcpas or vdjdb (specify with `--ergo-version`) |
| `mixtcrpred`   | 146 pMHC training set                  |
| `t2pmhc-gcn`   | t2pmhc core training set               |
| `t2pmhc-gat`   | t2pmhc core training set               |
| `tabr-bert`    | TCR-pMHC training set                  |
| `tulip-tcr`    | TULIP training set                     |
| `atm-tcr`      | ATM-TCR training set                   |

## How it works

1. **Merge**: Prediction outputs from multiple models are merged on the `identifier` column into a single DataFrame.
2. **Rank normalization**: Each model's `binding_score` is rank-normalized to [0, 1], where 1 corresponds to the highest score. Ranks are assigned in descending order, and tied scores share their average rank. NaN values are preserved.
3. **Data leakage detection**: Each sample is checked against bundled training data to flag samples that appear in a model's training set.
4. **Seen peptide detection**: Each peptide is checked against training data to identify whether it was seen during model training.
5. **Benchmarking**: ROC curves are generated after removing leaked samples, separately for seen and unseen peptides.
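
The rank normalization in step 2 can be sketched as follows. This is one plausible reading rather than the package's actual code: the function name `rank_normalize` is hypothetical, and the scaling `1 - (rank - 1) / (n - 1)` is an assumption chosen so that descending ranks with average tie-breaking map the highest score to 1 and the lowest to 0.

```python
import math


def rank_normalize(scores):
    """Map scores to [0, 1] so the highest score gets 1.0; NaN stays NaN."""
    valid = [(s, i) for i, s in enumerate(scores) if not math.isnan(s)]
    result = [math.nan] * len(scores)
    n = len(valid)
    if n == 0:
        return result
    if n == 1:
        # Arbitrary choice for a single score: treat it as the top rank.
        result[valid[0][1]] = 1.0
        return result
    # Sort descending so the best score receives rank 1.
    valid.sort(key=lambda pair: pair[0], reverse=True)
    pos = 0
    while pos < n:
        end = pos
        # Extend the group over all entries tied with the current score.
        while end + 1 < n and valid[end + 1][0] == valid[pos][0]:
            end += 1
        # Tied entries share the average of their 1-based ranks.
        avg_rank = (pos + 1 + end + 1) / 2
        for _, idx in valid[pos:end + 1]:
            result[idx] = 1.0 - (avg_rank - 1) / (n - 1)
        pos = end + 1
    return result


normalized = rank_normalize([0.9, 0.1, 0.5, 0.5, math.nan])
```

Because each model is normalized independently, the resulting values are comparable across models even when their raw score ranges differ.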

## Citations

If you use tcr-pmhc-analyzer in your research, please cite the underlying prediction models:

**ATM-TCR**
> Cai, M. et al. (2022). ATM-TCR: TCR-Epitope Binding Affinity Prediction Using a Multi-Head Self-Attention Model. *Frontiers in Immunology*, 13, 893247. https://doi.org/10.3389/fimmu.2022.893247

**ERGO-II**
> Springer, I. et al. (2021). Contribution of T Cell Receptor Alpha and Beta CDR3, MHC Typing, V and J Genes to Peptide Binding Prediction. *Frontiers in Immunology*, 12, 664514. https://doi.org/10.3389/fimmu.2021.664514

**MIXTCRpred**
> Croce, G. et al. (2024). Deep learning predictions of TCR-epitope interactions reveal epitope-specific chains in dual alpha T cells. *Nature Communications*, 15, 3211. https://doi.org/10.1038/s41467-024-47461-8

**t2pmhc**
> Polster, M. et al. (2026). t2pmhc: A Structure-Informed Graph Neural Network to Predict TCR-pMHC Binding. *bioRxiv*. https://doi.org/10.64898/2026.02.27.708137

**TABR-BERT**
> Zhang, J. et al. (2024). Accurate TCR-pMHC interaction prediction using a BERT-based transfer learning method. *Briefings in Bioinformatics*, 25(1), bbad436. https://doi.org/10.1093/bib/bbad436

**TULIP**
> Meynard-Piganeau, B. et al. (2024). TULIP: A transformer-based unsupervised language model for interacting peptides and T cell receptors that generalizes to unseen epitopes. *Proceedings of the National Academy of Sciences*, 121(13). https://doi.org/10.1073/pnas.2316401121

## License

[MIT](LICENSE)