Metadata-Version: 2.4
Name: mhctools
Version: 3.0.0
Summary: Python interface to running command-line and web-based MHC binding predictors
Author-email: Alex Rubinsteyn <alex@openvax.org>, Julia Kodysh <julia@openvax.org>, Tim O'Donnell <tim@openvax.org>
License-Expression: Apache-2.0
Project-URL: Homepage, https://github.com/openvax/mhctools
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Operating System :: OS Independent
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.7
Requires-Dist: pandas>=0.13.1
Requires-Dist: varcode>=0.5.9
Requires-Dist: pyensembl<3.0.0,>=2.3.0
Requires-Dist: sercol>=0.0.2
Requires-Dist: mhcflurry>=2.0.0
Requires-Dist: mhcgnomes>=3.4.0
Provides-Extra: dev
Requires-Dist: build; extra == "dev"
Requires-Dist: ruff; extra == "dev"
Requires-Dist: pytest; extra == "dev"
Requires-Dist: pytest-cov; extra == "dev"
Requires-Dist: twine; extra == "dev"
Requires-Dist: wheel; extra == "dev"
Dynamic: license-file

[![Tests](https://github.com/openvax/mhctools/actions/workflows/tests.yml/badge.svg)](https://github.com/openvax/mhctools/actions/workflows/tests.yml)
<a href="https://pypi.python.org/pypi/mhctools/">
<img src="https://img.shields.io/pypi/v/mhctools.svg?maxAge=1000" alt="PyPI" />
</a>

# mhctools

Python interface to running command-line and web-based MHC binding predictors.

## Installation

```sh
pip install mhctools
```

For MHCflurry support, also run:

```sh
mhcflurry-downloads fetch
```

## Quick start

```python
from mhctools import NetMHCpan41

predictor = NetMHCpan41(alleles=["HLA-A*02:01", "HLA-B*07:02"])

# predict for specific peptides
results = predictor.predict(["SIINFEKL", "GILGFVFTL"])

# results is a list of PeptidePreds — one per peptide
for pp in results:
    best = pp.best_affinity
    if best:
        print(f"{best.peptide} -> {best.allele} IC50={best.value:.1f}nM")
```

## Python API

### Predicting peptides

`predict()` takes a list of peptide sequences and returns a `list[PeptidePreds]`.
Each `PeptidePreds` contains `Pred` objects for every allele and measurement
kind the predictor supports.

```python
from mhctools import NetMHCpan41

predictor = NetMHCpan41(alleles=["HLA-A*02:01", "HLA-B*07:02"])
results = predictor.predict(["SIINFEKL", "GILGFVFTL"])

pp = results[0]
pp.best_affinity                # Pred with highest affinity score
pp.best_affinity.allele         # "HLA-A*02:01"
pp.best_affinity.value          # IC50 in nM
pp.best_affinity.score          # higher = better (~0-1)
pp.best_affinity.percentile_rank  # lower = better (0-100)

pp.best_affinity_by_rank        # Pred with lowest percentile rank
pp.best_presentation            # best EL/presentation score
pp.best_presentation_by_rank    # best EL percentile rank
pp.best_stability               # best pMHC stability (if available)
pp.best_stability_by_rank

# filter by kind or allele
pp.filter(kind=Kind.pMHC_affinity)
pp.filter(allele="HLA-A*02:01")
```

NetMHCpan 4.1 automatically emits both `pMHC_affinity` and `pMHC_presentation`
predictions per peptide-allele pair.

### Scanning proteins

`predict_proteins()` takes a dictionary of protein sequences and returns
`{sequence_name: list[PeptidePreds]}`:

```python
proteins = predictor.predict_proteins(
    {"TP53": "MEEPQSDPSVEPPLSQETFS...", "KRAS": "MTEYKLVVVGAGGVGKS..."},
    peptide_lengths=[9, 10],
)

for pp in proteins["TP53"]:
    best = pp.best_affinity
    if best and best.value < 500:
        print(f"  offset={best.offset} {best.peptide} IC50={best.value:.0f}")
```

### DataFrames

Every level has a `_dataframe` variant that flattens to a pandas DataFrame
with consistent columns:

```python
df = predictor.predict_dataframe(["SIINFEKL"], sample_name="pat001")
df = predictor.predict_proteins_dataframe({"TP53": "MEEPQ..."}, sample_name="pat001")
```

Columns: `sample_name`, `peptide`, `n_flank`, `c_flank`,
`source_sequence_name`, `offset`, `predictor_name`, `predictor_version`,
`allele`, `kind`, `score`, `value`, `percentile_rank`.

### Multi-sample predictions

`MultiSample` runs a predictor across multiple samples, each with its own
HLA genotype:

```python
from mhctools import MultiSample, NetMHCpan41

ms = MultiSample(
    samples={
        "pat001": ["HLA-A*02:01", "HLA-B*07:02"],
        "pat002": ["HLA-A*01:01", "HLA-B*08:01"],
    },
    predictor_class=NetMHCpan41,
)

# {sample_name: list[PeptidePreds]}
results = ms.predict(["SIINFEKL", "GILGFVFTL"])

# {sample_name: {seq_name: list[PeptidePreds]}}
protein_results = ms.predict_proteins({"TP53": "MEEPQ..."})

# flat DataFrames with sample_name column
df = ms.predict_dataframe(["SIINFEKL"])
df = ms.predict_proteins_dataframe({"TP53": "MEEPQ..."})
```

### Measurement kinds

The `Kind` enum describes what biological quantity a `Pred` measures:

| Kind | Meaning |
|---|---|
| `pMHC_affinity` | Peptide-MHC binding affinity |
| `pMHC_presentation` | Likelihood of surface presentation (EL) |
| `pMHC_stability` | Peptide-MHC complex stability |
| `cellular_presentation` | Cross-allele presentation (e.g. MHCflurry) |
| `antigen_processing` | Combined processing score |
| `proteasome_cleavage` | Proteasomal cleavage score |
| `tap_transport` | TAP transport score |
| `erap_trimming` | ERAP trimming score |

### The Pred object

Every prediction is a frozen, self-contained `Pred` dataclass:

```python
from mhctools import Pred, Kind

pred = Pred(
    kind=Kind.pMHC_affinity,
    score=0.85,           # ~0-1, higher = better
    peptide="SIINFEKL",
    allele="HLA-A*02:01",
    value=120.5,          # IC50 in nM
    percentile_rank=0.8,
    source_sequence_name="TP53",
    offset=42,
    predictor_name="netMHCpan",
    predictor_version="4.1",
)
```

`score` is always higher-is-better. `value` is in native units (nM for
affinity, hours for stability). `percentile_rank` is always optional,
0-100, lower = stronger.

## Supported predictors

| Predictor | Kinds produced | Requires |
|---|---|---|
| `NetMHCpan` / `NetMHCpan41` | affinity + presentation | [NetMHCpan](http://www.cbs.dtu.dk/services/NetMHCpan/) |
| `NetMHCpan4` | affinity or presentation | NetMHCpan 4.0 |
| `NetMHCpan3` / `NetMHCpan28` | affinity | older NetMHCpan |
| `NetMHC` / `NetMHC3` / `NetMHC4` | affinity | [NetMHC](http://www.cbs.dtu.dk/services/NetMHC/) |
| `NetMHCIIpan` | affinity or presentation | [NetMHCIIpan](http://www.cbs.dtu.dk/services/NetMHCIIpan/) |
| `NetMHCcons` | affinity | [NetMHCcons](http://www.cbs.dtu.dk/services/NetMHCcons/) |
| `NetMHCstabpan` | stability | [NetMHCstabpan](http://www.cbs.dtu.dk/services/NetMHCstabpan/) |
| `MHCflurry` | affinity | `pip install mhcflurry` + `mhcflurry-downloads fetch` |
| `MixMHCpred` | presentation | [MixMHCpred](https://github.com/GfellerLab/MixMHCpred) |
| `RandomBindingPredictor` | affinity | (built-in) |
| `NetChop` | cleavage | [NetChop](http://www.cbs.dtu.dk/services/NetChop/) |

## Commandline examples

### Prediction for user-supplied peptide sequences

```sh
mhctools --sequence SIINFEKL SIINFEKLQ --mhc-predictor netmhc --mhc-alleles A0201
```

### Automatically extract peptides as subsequences of specified length

```sh
mhctools --sequence AAAQQQSIINFEKL --extract-subsequences --mhc-peptide-lengths 8-10 --mhc-predictor mhcflurry --mhc-alleles A0201
```

## Legacy API

The old `predict_peptides()` and `predict_subsequences()` methods still work
and return `BindingPredictionCollection` objects:

```python
predictor = NetMHCpan(alleles=["A*02:01"])
collection = predictor.predict_subsequences(
    {"1L2Y": "NLYIQWLKDGGPSSGRPPPS"},
    peptide_lengths=[9],
)
df = collection.to_dataframe()

for bp in collection:
    if bp.affinity < 100:
        print("Strong binder: %s" % bp)
```

To convert legacy results to the new types:

```python
preds = collection.to_preds()           # list of Pred
pp_list = collection.to_peptide_preds() # list of PeptidePreds
```
