Metadata-Version: 2.4
Name: proteinfp
Version: 0.1.1
Summary: End-to-end protein function prediction and drug candidate design
Author: ProteinFP Contributors
License: MIT
Project-URL: Homepage, https://github.com/wowcowdowjones/proteinFP2
Project-URL: Repository, https://github.com/wowcowdowjones/proteinFP2
Project-URL: Bug Tracker, https://github.com/wowcowdowjones/proteinFP2/issues
Keywords: bioinformatics,drug-discovery,protein,cheminformatics
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: requests>=2.28
Requires-Dist: tqdm>=4.64
Requires-Dist: numpy>=1.24
Requires-Dist: click>=8.1
Requires-Dist: pyyaml>=6.0
Requires-Dist: colorlog>=6.7
Requires-Dist: biopython>=1.81
Requires-Dist: scipy>=1.10
Requires-Dist: scikit-learn>=1.3
Provides-Extra: structure
Requires-Dist: freesasa>=2.1; extra == "structure"
Provides-Extra: ml
Requires-Dist: torch>=2.0; extra == "ml"
Requires-Dist: xgboost>=2.0; extra == "ml"
Requires-Dist: lightgbm>=4.0; extra == "ml"
Requires-Dist: fair-esm>=2.0; extra == "ml"
Provides-Extra: chem
Requires-Dist: rdkit>=2023.3; extra == "chem"
Provides-Extra: grn
Requires-Dist: scanpy>=1.9; extra == "grn"
Requires-Dist: anndata>=0.9; extra == "grn"
Provides-Extra: sim
Requires-Dist: openmm>=8.0; extra == "sim"
Provides-Extra: all
Requires-Dist: proteinfp[structure]; extra == "all"
Requires-Dist: proteinfp[ml]; extra == "all"
Requires-Dist: proteinfp[chem]; extra == "all"
Requires-Dist: proteinfp[grn]; extra == "all"
Requires-Dist: proteinfp[sim]; extra == "all"
Provides-Extra: dev
Requires-Dist: pytest>=7.4; extra == "dev"
Requires-Dist: pytest-cov>=4.1; extra == "dev"
Requires-Dist: black>=23.0; extra == "dev"
Requires-Dist: ruff>=0.1; extra == "dev"
Requires-Dist: mypy>=1.5; extra == "dev"

# ProteinFP

**End-to-end protein function prediction and drug candidate design.**

Give it a UniProt ID and get back active sites, druggable pockets, allosteric sites, EC classification, GO terms, PPI partners, and therapy modality recommendations. With AutoDock Vina installed, it also evolves drug candidate molecules. It works for any protein, any disease, any organism.

```bash
pip install proteinfp
proteinfp --uniprot P28593   # Trypanothione reductase (Chagas disease)
```

```
  Protein    : Trypanothione reductase
  Gene       : TPR
  Organism   : Trypanosoma cruzi
  Confidence : VERY HIGH

  Top function     : Trypanothione is the parasite analog of glutathione
  Enzyme           : yes — EC 1.8.1.12
  Pockets          : 10 (all druggability > 0.90)
  Therapy          : SMALL_MOLECULE → active site inhibitor
```

---

## What it does

ProteinFP runs its prediction modules in sequence (14 in the core pipeline, plus optional ones) and fuses their outputs into a single ranked, confidence-weighted report.

| Module | What it predicts |
|--------|-----------------|
| 01 | AlphaFold structure + UniProt metadata |
| 02 | Surface charge, hydrophobicity, SASA |
| 03 | Catalytic residues and active site motifs |
| 04 | Druggable binding pockets (geometry + druggability score) |
| 05 | Allosteric sites (elastic network model) |
| 06 | Chemical environment of each site |
| 07 | Sequence homologs with known function (BLAST + InterPro) |
| 08 | ESM-2 protein language model embeddings (650M parameters) |
| 09 | GO term prediction (Molecular Function, Biological Process, Cellular Component) |
| 10 | Enzyme class prediction — ML ensemble (XGBoost + LightGBM + MLP, ~97% accuracy) |
| 11 | Structural analogs via Foldseek (finds same-fold proteins regardless of sequence) |
| 12 | Protein-protein interactions (STRING DB) |
| 13 | Consensus report — fuses all evidence into a ranked, confidence-scored output |
| 14 | Molecular dynamics — RMSF, flexibility, cryptic pockets *(needs OpenMM)* |
| 15 | De novo molecular design — evolutionary drug candidate generation *(needs Vina + RDKit)* |
| 17 | Post-translational modification sites and their functional consequences |
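
As a rough illustration of what confidence-weighted fusion can look like, the sketch below combines per-module predictions with a noisy-OR rule. It is not ProteinFP's actual consensus algorithm: the `Evidence` class, the `fuse` function, and the example scores are all hypothetical, and the rule assumes the modules err independently.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    module: str        # module number, e.g. "07" (hypothetical field)
    label: str         # predicted annotation, e.g. an EC number
    confidence: float  # the module's own confidence in [0, 1]


def fuse(evidence: list[Evidence]) -> list[tuple[str, float]]:
    """Combine evidence per label with a noisy-OR rule and rank the labels."""
    # miss[label] = probability that every supporting module is wrong
    miss = defaultdict(lambda: 1.0)
    for ev in evidence:
        miss[ev.label] *= 1.0 - ev.confidence
    fused = {label: 1.0 - m for label, m in miss.items()}
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)


# Made-up inputs: two modules agree on one EC number, a third disagrees.
ranked = fuse([
    Evidence("07", "EC 1.8.1.12", 0.80),
    Evidence("10", "EC 1.8.1.12", 0.50),
    Evidence("11", "EC 1.8.1.7", 0.60),
])
for label, score in ranked:
    print(f"{label}: {score:.2f}")
```

Agreement between modules pushes a label's fused score above any single module's confidence, which is the behaviour one would want from a consensus step.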

**GRN + SIM pipeline** (disease-aware mode — requires scRNA-seq data):

| Module | What it does |
|--------|-------------|
| GRN-01 | scRNA-seq preprocessing — HVG selection, QC filtering |
| GRN-02 | GENIE3 gene regulatory network reconstruction |
| GRN-03 | Therapy modality decision — surface vs intracellular, ADC vs small molecule |
| SIM-01 | Tumor cell environment inference from marker gene expression |
| SIM-02 | Protein conformational ensemble in that environment |
| SIM-03 | Drug distribution across cell compartments |
| SIM-04 | Binding probability under real physiological conditions |
| SIM-05 | GRN perturbation — network-level consequences of drug binding |
| SIM-06 | Pharmacological scoring — efficacy, selectivity, resistance risk, grade A–F |

---

## Installation

```bash
pip install proteinfp
```

The **core pipeline** (Modules 01–13 and 17) works out of the box. Optional features are installed as extras:

```bash
# Quotes keep shells such as zsh from expanding the square brackets
pip install "proteinfp[ml]"        # ESM-2 embeddings + ML EC classifier
pip install "proteinfp[structure]" # SASA/DSSP surface analysis
pip install "proteinfp[chem]"      # De novo molecular design (RDKit)
pip install "proteinfp[grn]"       # GRN/scRNA-seq modules (scanpy)
pip install "proteinfp[sim]"       # Molecular dynamics (OpenMM)
pip install "proteinfp[all]"       # Everything
```

For de novo design you also need [AutoDock Vina](https://vina.scripps.edu/downloads/).

Check what's available on your machine:

```bash
proteinfp --check-deps
```
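
A similar check can be done from Python. The sketch below is not the code behind `--check-deps`; it only shows one way to probe for the optional packages without importing them. The `OPTIONAL_MODULES` mapping and the `check_optional` function are hypothetical, and the import names are assumed from the package names (for example, `fair-esm` is imported as `esm`).

```python
from importlib.util import find_spec

# Extra name -> top-level import names (assumed mapping, not from ProteinFP).
OPTIONAL_MODULES = {
    "structure": ["freesasa"],
    "ml": ["torch", "xgboost", "lightgbm", "esm"],
    "chem": ["rdkit"],
    "grn": ["scanpy", "anndata"],
    "sim": ["openmm"],
}


def check_optional() -> dict[str, list[str]]:
    """Return, for each extra, the modules that cannot be found."""
    return {
        extra: [name for name in modules if find_spec(name) is None]
        for extra, modules in OPTIONAL_MODULES.items()
    }


for extra, missing in check_optional().items():
    status = "ok" if not missing else "missing: " + ", ".join(missing)
    print(f"{extra:<10} {status}")
```

`find_spec` only locates a package, so the check stays fast even for heavy dependencies such as `torch`.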

---

## Quick start

```bash
# Any protein, just a UniProt ID
proteinfp --uniprot P04637       # TP53 (human tumour suppressor)
proteinfp --uniprot P28593       # Trypanothione reductase (Chagas disease)
proteinfp --uniprot P9WGR1       # InhA (drug-resistant TB)

# Force re-run even if report already exists
proteinfp --uniprot P04637 --force

# With therapy decision + de novo molecule design
proteinfp --uniprot P28593 --therapy --denovo --vina /path/to/vina

# With molecular dynamics
proteinfp --uniprot P28593 --md

# Show all modules and their status
proteinfp --list-modules
```

Reports are saved to `data/reports/{UNIPROT}_report.json` and `data/reports/{UNIPROT}_report.txt`.
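
The JSON report can be picked up by downstream scripts. This is a minimal sketch that assumes the default report location and a JSON object at the top level; `load_report` is a hypothetical helper, not part of the package, and the report schema is not assumed beyond that.

```python
import json
from pathlib import Path


def load_report(uniprot_id: str, reports_dir: str = "data/reports") -> dict:
    """Load the JSON report written for one UniProt accession."""
    path = Path(reports_dir) / f"{uniprot_id}_report.json"
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)


# Only runs if a report for P28593 has already been generated.
if (Path("data/reports") / "P28593_report.json").exists():
    report = load_report("P28593")
    print(sorted(report))  # list the top-level keys
```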

---

## Therapy mode

After the core pipeline runs, `--therapy` makes modality decisions automatically:

- **Surface protein** → antibody path: ranks epitope candidates by immunogenicity and accessibility
- **Intracellular with druggable pocket** → small molecule path: triggers de novo design
- **Epigenetic regulator** → adds PROTAC degrader as secondary recommendation
- **Allosteric site only** → allosteric small molecule

```bash
proteinfp --uniprot P28593 --therapy --denovo --vina pipeline/vina.exe
```

```
  → Primary modality : SMALL_MOLECULE
  → Confidence       : HIGH
  • Intracellular with druggable pocket P1 (vol=1800Å³, drug=0.90)
  • Enzyme (EC 1.8.1.12) — active site inhibition most direct mechanism
```
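
The four rules at the start of this section can be read as a small decision function. The sketch below is an interpretation, not the code in `therapy.py`: the `ProteinSummary` flags, the rule precedence, and every label except `SMALL_MOLECULE` are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ProteinSummary:
    # Hypothetical flags; the real pipeline derives these from its modules.
    is_surface: bool = False
    has_druggable_pocket: bool = False
    has_allosteric_site: bool = False
    is_epigenetic_regulator: bool = False


def recommend(p: ProteinSummary) -> list[str]:
    """Return recommended modalities, primary first."""
    if p.is_surface:
        picks = ["ANTIBODY"]
    elif p.has_druggable_pocket:
        picks = ["SMALL_MOLECULE"]
    elif p.has_allosteric_site:
        picks = ["ALLOSTERIC_SMALL_MOLECULE"]
    else:
        picks = []
    if p.is_epigenetic_regulator:
        picks.append("PROTAC_DEGRADER")  # added as a secondary recommendation
    return picks


# An intracellular enzyme with a druggable pocket, as in the example above.
print(recommend(ProteinSummary(has_druggable_pocket=True)))
```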

---

## Disease-agnostic design

The pipeline works on any protein from any organism. To switch disease context, edit one file:

```yaml
# config/disease_config.yaml
disease:
  name: "TB"
  organism: "Mycobacterium tuberculosis"
  organism_id: 83332

data:
  scrnaseq_input: "data/grn/input/your_mtb_data.csv"

driver_genes:
  - katG   # isoniazid activator (resistance mutations)
  - inhA   # isoniazid target
  - rpoB   # rifampicin target
  - gyrA   # fluoroquinolone target
```

Ready-to-use configs for LUAD (lung adenocarcinoma), CRC (colorectal cancer), TB, and leishmaniasis are included in the file.
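
Before a long run it can help to sanity-check an edited config. The sketch below validates a config that has already been parsed into a dict (for example with `yaml.safe_load`). `validate_disease_config` is a hypothetical helper, and treating the keys from the example above as required is an assumption rather than documented behaviour.

```python
def validate_disease_config(cfg: dict) -> list[str]:
    """Return a list of problems; an empty list means the config looks usable."""
    problems = []
    disease = cfg.get("disease") or {}
    for key in ("name", "organism", "organism_id"):
        if key not in disease:
            problems.append(f"disease.{key} is missing")
    if "organism_id" in disease and not isinstance(disease["organism_id"], int):
        problems.append("disease.organism_id should be an integer")
    if not (cfg.get("data") or {}).get("scrnaseq_input"):
        problems.append("data.scrnaseq_input is missing")
    genes = cfg.get("driver_genes")
    if not isinstance(genes, list) or not genes:
        problems.append("driver_genes should be a non-empty list")
    return problems


# Same values as the YAML example, in the form a YAML parser would return.
config = {
    "disease": {
        "name": "TB",
        "organism": "Mycobacterium tuberculosis",
        "organism_id": 83332,
    },
    "data": {"scrnaseq_input": "data/grn/input/your_mtb_data.csv"},
    "driver_genes": ["katG", "inhA", "rpoB", "gyrA"],
}
print(validate_disease_config(config) or "config looks complete")
```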

---

## Project structure

```
proteinfp/
├── config/
│   ├── config.yaml            ← paths, API thresholds, tool settings
│   └── disease_config.yaml    ← switch disease/organism here
├── pipeline/                  ← Modules 01–17
├── proteinfp/                 ← CLI package (pip install proteinfp)
│   ├── cli.py                 ← proteinfp --uniprot X
│   ├── orchestrator.py        ← runs all modules gracefully
│   ├── therapy.py             ← therapy decision + epitope + de novo
│   └── deps.py                ← optional dependency checker
├── sim/                       ← SIM-01 to SIM-07 (whole-cell simulation)
├── grn/                       ← GRN-01 to GRN-03 (gene regulatory network)
├── utils/                     ← config loader, PDB parser
├── tests/                     ← test suite (pytest)
├── validation/                ← validation against known drug-protein pairs
├── train/                     ← ML model training scripts
├── models/                    ← EC classifier ensemble (metadata only in repo)
└── pyproject.toml             ← pip install configuration
```

---

## Running tests

```bash
python -m pytest tests/ -v
```

---

## Reproducibility

All outputs are deterministic given the same input. Every inference step saves a JSON file to `data/intermediate/`, so individual modules can be re-run or inspected without rerunning the full pipeline.
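
One way to check this on your own machine is to compare the JSON files from two runs. The sketch below assumes you have copied `data/intermediate/` from each run into its own directory. `json_digest` and `changed_files` are hypothetical helpers, and any file that embeds a timestamp or absolute path will show up as changed.

```python
import hashlib
import json
from pathlib import Path


def json_digest(path: Path) -> str:
    """Hash a JSON file's content, ignoring key order and whitespace."""
    data = json.loads(path.read_text(encoding="utf-8"))
    canonical = json.dumps(data, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def changed_files(run_a: Path, run_b: Path) -> list[str]:
    """Names of JSON files in run_a that are absent or different in run_b."""
    changed = []
    for file_a in sorted(run_a.glob("*.json")):
        file_b = run_b / file_a.name
        if not file_b.exists() or json_digest(file_a) != json_digest(file_b):
            changed.append(file_a.name)
    return changed
```

Calling `changed_files(Path("run1"), Path("run2"))` on two such copies returns an empty list when every intermediate file matches.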

---

## Citation

If you use ProteinFP in your research, please cite this repository. A methods paper describing the pipeline is in preparation.

---

## License

MIT
