Metadata-Version: 2.4
Name: itgap
Version: 0.1.0
Summary: Integrated TCR-Gene-Antigen Prediction: dataset tooling and models for TCR-antigen recognition.
Project-URL: Homepage, https://github.com/mlizhangx/ITGAP
Project-URL: Repository, https://github.com/mlizhangx/ITGAP
Project-URL: Issues, https://github.com/mlizhangx/ITGAP/issues
Author-email: Li Zhang Lab <mlizhang@gmail.com>
License: MIT
License-File: LICENSE
Keywords: antigen-prediction,bioinformatics,immunology,machine-learning,single-cell,tcr
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Requires-Python: >=3.10
Requires-Dist: matplotlib>=3.10
Requires-Dist: numpy>=1.26
Requires-Dist: pandas>=2.2
Requires-Dist: scikit-learn>=1.5
Provides-Extra: all
Requires-Dist: anndata>=0.10; extra == 'all'
Requires-Dist: scanpy>=1.10; extra == 'all'
Requires-Dist: tensorflow-macos>=2.9; (platform_system == 'Darwin' and platform_machine == 'arm64') and extra == 'all'
Requires-Dist: tensorflow>=2.9; (platform_system != 'Darwin' or platform_machine != 'arm64') and extra == 'all'
Provides-Extra: dev
Requires-Dist: build>=1.2; extra == 'dev'
Requires-Dist: pytest>=7; extra == 'dev'
Requires-Dist: twine>=5; extra == 'dev'
Provides-Extra: gex
Requires-Dist: anndata>=0.10; extra == 'gex'
Requires-Dist: scanpy>=1.10; extra == 'gex'
Provides-Extra: tf
Requires-Dist: tensorflow-macos>=2.9; (platform_system == 'Darwin' and platform_machine == 'arm64') and extra == 'tf'
Requires-Dist: tensorflow>=2.9; (platform_system != 'Darwin' or platform_machine != 'arm64') and extra == 'tf'
Description-Content-Type: text/markdown

# ITGAP — Integrated TCR-Gene-Antigen Prediction

`itgap` is a Python package for building TCR-peptide datasets from the 10x
Genomics CD8+ T-cell multi-omics benchmark and training TCR-antigen
recognition models that integrate gene expression (GEX), V/J gene usage, and
CDR3 sequence information.

It bundles:

- `NegativeSamplingTool` — two-stage synthetic negative TCR-peptide sampling.
- Sequence encoding utilities (Atchley factors + positional encoding) with the
  Atchley table shipped as a package resource.
- Autoencoder + encoder–decoder integration models for combining sequence,
  V/J, and GEX modalities.
- Residual-MLP binary classifiers, sklearn baselines (logistic regression,
  random forest), and standard evaluation/plotting helpers.
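
To make the sequence-encoding bullet more concrete, here is a minimal sketch of a standard sinusoidal positional encoding concatenated with per-residue factor vectors. It is an illustration under stated assumptions, not the package's implementation: the function name `sinusoidal_positional_encoding`, the concatenation step, the encoding width of 8, and the example CDR3 string are all hypothetical, and the per-residue factors are random stand-ins (Atchley factors are 5-dimensional, but the real values come from the packaged table).

```python
import numpy as np


def sinusoidal_positional_encoding(length: int, dim: int) -> np.ndarray:
    """Return a (length, dim) matrix of sine/cosine position features."""
    positions = np.arange(length)[:, None]                # shape (length, 1)
    # Geometric frequency schedule: 1 / 10000^(2i / dim)
    freqs = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)
    angles = positions * freqs                            # shape (length, ceil(dim / 2))
    encoding = np.zeros((length, dim))
    encoding[:, 0::2] = np.sin(angles)                    # even columns
    encoding[:, 1::2] = np.cos(angles[:, : dim // 2])     # odd columns
    return encoding


cdr3 = "CASSLGQAYEQYF"  # illustrative sequence only

# Stand-in for per-residue Atchley factors (5 values per amino acid).
# Random numbers are used here; they are NOT the real Atchley table.
rng = np.random.default_rng(0)
residue_factors = rng.normal(size=(len(cdr3), 5))

pe = sinusoidal_positional_encoding(len(cdr3), dim=8)

# Assumption: position features are concatenated to the residue features.
features = np.concatenate([residue_factors, pe], axis=1)
print(features.shape)
```

Concatenation keeps the residue and position signals in separate columns; adding them element-wise is an equally common choice when both have the same width.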

## Install

Core install (small footprint; depends only on `numpy`, `pandas`,
`scikit-learn`, and `matplotlib`):

```bash
pip install itgap
```

Optional extras:

```bash
pip install 'itgap[gex]'   # adds scanpy + anndata for h5ad loading
pip install 'itgap[tf]'    # adds tensorflow (or tensorflow-macos on Apple Silicon)
pip install 'itgap[all]'   # everything
```

`itgap[tf]` resolves to `tensorflow-macos>=2.9` on macOS arm64 and to
`tensorflow>=2.9` elsewhere.
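
If it is unclear which distribution pip will pick on a given machine, the selection rule can be reproduced with the standard library. The sketch below mirrors the two environment markers using `platform.system()` and `platform.machine()`; the helper name `tensorflow_distribution` is hypothetical and not part of `itgap`.

```python
import platform
from typing import Optional


def tensorflow_distribution(
    system: Optional[str] = None, machine: Optional[str] = None
) -> str:
    """Return the distribution name the `tf` extra would select."""
    # Default to the values of the running interpreter.
    system = platform.system() if system is None else system
    machine = platform.machine() if machine is None else machine
    if system == "Darwin" and machine == "arm64":
        return "tensorflow-macos"
    return "tensorflow"


print(tensorflow_distribution())
```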

## Data

The package ships only the small `atchley.txt` reference table. The large
benchmark file `merge_gex_all_donors_all_peptides_meta.h5ad` (~200 MB) is
**not** included; download it from the
[10x Genomics CD8+ T-cell multi-omics dataset](https://www.10xgenomics.com/datasets)
and pass its path to `NegativeSamplingTool(data_dir=...)` or to
`load_dataset(h5ad_path=...)`. The pre-computed CSV embeddings used in the
example notebooks live in the project repository under `examples/data/`.
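
Because the benchmark file has to be downloaded separately, it can help to check for it before constructing the tool. The following is a small sketch using only the standard library; `find_benchmark_file` is a hypothetical helper, not an `itgap` function, and the temporary directory merely stands in for a real download location.

```python
import tempfile
from pathlib import Path

H5AD_NAME = "merge_gex_all_donors_all_peptides_meta.h5ad"


def find_benchmark_file(data_dir) -> Path:
    """Return the path to the benchmark file, or raise a clear error."""
    path = Path(data_dir).expanduser() / H5AD_NAME
    if not path.is_file():
        raise FileNotFoundError(
            f"{H5AD_NAME} not found in {data_dir!r}; "
            "download it from the 10x Genomics dataset page first."
        )
    return path


# Demo with an empty placeholder file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / H5AD_NAME).touch()
    found = find_benchmark_file(tmp)
    print(found.name)
```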

## Quickstart

Generate a negative-sampled training set:

```python
from itgap import NegativeSamplingTool

tool = NegativeSamplingTool(
    data_dir="path/to/10x",   # contains merge_gex_all_donors_all_peptides_meta.h5ad
    negative_ratio=3.0,
    random_seed=42,
)
result = tool.create_combined_dataset(negative_ratio=3.0)
print(result["dataset"].shape, result["statistics"])
```
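
For intuition about what a negative-sampled set contains, the sketch below draws mismatched TCR-peptide pairs that never occur among the positives. It is a generic illustration, not the two-stage procedure that `NegativeSamplingTool` implements; the function name `sample_mismatched_negatives` and the placeholder identifiers are hypothetical.

```python
import random


def sample_mismatched_negatives(positives, negative_ratio=3.0, seed=42):
    """Sample (tcr, peptide) pairs that are absent from `positives`."""
    rng = random.Random(seed)
    positive_set = set(positives)
    tcrs = sorted({tcr for tcr, _ in positives})
    peptides = sorted({pep for _, pep in positives})
    # Enumerate every unobserved pairing (fine for a toy example; a large
    # dataset would need rejection sampling instead).
    candidates = [
        (tcr, pep)
        for tcr in tcrs
        for pep in peptides
        if (tcr, pep) not in positive_set
    ]
    n_wanted = min(len(candidates), round(negative_ratio * len(positives)))
    return rng.sample(candidates, n_wanted)


# Placeholder identifiers, not real sequences.
positives = [
    ("tcr_a", "pep_1"),
    ("tcr_b", "pep_2"),
    ("tcr_c", "pep_3"),
    ("tcr_d", "pep_4"),
]
negatives = sample_mismatched_negatives(positives, negative_ratio=2.0)
print(len(negatives))
```

Treating every unobserved pairing as a negative is a simplifying assumption: an unobserved pair is not guaranteed to be a true non-binder.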

Train a residual-MLP classifier on assembled features (requires `itgap[tf]`):

```python
from itgap import (
    load_atchley, build_residual_mlp, compile_and_train, evaluate_classifier,
)

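# X_train, y_train, X_val, y_val, X_test, and y_test are feature matrices and
# binary labels that you have assembled beforehand; they are not defined here.
word_vectors, aa_idx = load_atchley()   # uses the packaged atchley.txt
model = build_residual_mlp(input_dim=X_train.shape[1])
history = compile_and_train(model, X_train, y_train, X_val, y_val, epochs=50)
metrics = evaluate_classifier(model, X_test, y_test)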
```
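
The training snippet expects `X_train`, `y_train`, `X_val`, `y_val`, `X_test`, and `y_test` to exist already. One way to obtain them from a single feature matrix is a seeded random split, sketched here with NumPy. The helper `train_val_test_split`, the 70/15/15 proportions, and the random placeholder data are assumptions for illustration; the example notebooks may split the data differently.

```python
import numpy as np


def train_val_test_split(X, y, val_frac=0.15, test_frac=0.15, seed=42):
    """Shuffle row indices once and cut them into three disjoint parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_test = int(round(test_frac * len(X)))
    n_val = int(round(val_frac * len(X)))
    test_idx = order[:n_test]
    val_idx = order[n_test:n_test + n_val]
    train_idx = order[n_test + n_val:]
    return (
        (X[train_idx], y[train_idx]),
        (X[val_idx], y[val_idx]),
        (X[test_idx], y[test_idx]),
    )


# Placeholder data: random features and labels, NOT real embeddings.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 32))
y = rng.integers(0, 2, size=200)

(X_train, y_train), (X_val, y_val), (X_test, y_test) = train_val_test_split(X, y)
print(X_train.shape, X_val.shape, X_test.shape)
```

This split is not stratified; with strongly imbalanced labels, a stratified split such as scikit-learn's `train_test_split(..., stratify=y)` is usually the safer choice.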

## Command-line

Installing the package exposes a console script:

```bash
itgap-negative-sampling   # runs NegativeSamplingTool with default settings
```

## Examples

End-to-end Jupyter notebooks live in the
[`examples/`](examples/) directory of the repository:

- `data_preparation_notebook.ipynb` — build the labeled dataset.
- `tcr_beta_prediction_notebook.ipynb` — beta-chain only model.
- `tcr_alpha_beta_prediction_notebook.ipynb` — alpha + beta + GEX + VJ.

## Development

```bash
pip install -e '.[dev,all]'
pytest
python -m build
```

## License

MIT. See [LICENSE](LICENSE).

## Citation

If you use ITGAP in a publication, please cite the project repository:
<https://github.com/mlizhangx/ITGAP>.
