Metadata-Version: 2.4
Name: wtfdtb
Version: 0.1.0
Summary: Inverse virtual screening — dock one ligand against a whole protein library via GNINA.
Author: Chandragupt Sharma
License-Expression: MIT
Keywords: bioinformatics,cheminformatics,docking,drug-discovery,target-fishing,virtual-screening
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: MacOS
Classifier: Operating System :: POSIX :: Linux
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Classifier: Topic :: Scientific/Engineering :: Chemistry
Requires-Python: >=3.10
Requires-Dist: biopython>=1.80
Requires-Dist: dimorphite-dl>=1.3
Requires-Dist: gemmi
Requires-Dist: meeko>=0.5
Requires-Dist: openmm>=8.0
Requires-Dist: pandas>=2.0
Requires-Dist: pdb-tools>=2.5
Requires-Dist: pdb2pqr>=3.6
Requires-Dist: pdbfixer>=1.9
Requires-Dist: prolif>=2.0
Requires-Dist: rdkit
Requires-Dist: requests
Requires-Dist: tqdm
Requires-Dist: typer>=0.9
Provides-Extra: dev
Requires-Dist: pytest-cov; extra == 'dev'
Requires-Dist: pytest>=7.0; extra == 'dev'
Requires-Dist: ruff; extra == 'dev'
Description-Content-Type: text/markdown

# WTFDTB — High-Throughput Inverse Virtual Screening

> **Target Fishing**: Dock a single small-molecule ligand against a library of protein structures using ML pocket detection (P2Rank) and CNN-rescored docking (GNINA).

![Python 3.10+](https://img.shields.io/badge/python-3.10%2B-blue)
![License: MIT](https://img.shields.io/badge/license-MIT-green)
![Status: Alpha](https://img.shields.io/badge/status-alpha-orange)

---

## What Is This?

Traditional virtual screening docks many ligands against one protein target. **WTFDTB flips this**: it docks **one ligand** against **many proteins** to answer the question — *"What targets does this drug bind?"*

This is called **inverse virtual screening** (or *target fishing*), and it's essential for:

- **Drug repurposing** — finding new uses for existing drugs
- **Off-target prediction** — identifying potential side effects  
- **Polypharmacology** — understanding multi-target drug activity
- **Natural product target deconvolution** — identifying targets for bioactive compounds

WTFDTB automates the entire workflow from a raw ligand file to a ranked CSV of protein targets with interaction fingerprints — no manual intervention needed.

---

## Pipeline Architecture

The pipeline runs in five sequential phases:

```
  ┌──────────────┐    ┌────────────────────┐    ┌──────────────────┐
  │  1. Ligand   │───▶│  2. Receptor       │───▶│  3. Pocket       │
  │     Prep     │    │     Curation       │    │     Detection    │
  │              │    │     (parallel)     │    │                  │
  │ Dimorphite-DL│    │ PDBFixer + PDB2PQR │    │     P2Rank       │
  │ RDKit + Meeko│    │ + PROPKA + Meeko   │    │     (Java ML)    │
  └──────────────┘    └────────────────────┘    └──────────────────┘
                                                         │
         ┌───────────────────────────────────────────────┘
         ▼
  ┌──────────────────┐    ┌──────────────────────┐
  │  4. Docking      │───▶│  5. Post-Docking     │
  │     (parallel)   │    │     Analysis         │
  │                  │    │                      │
  │     GNINA        │    │ ProLIF + Pandas      │
  │  (CNN-rescored)  │    │ Filter → Rank → CSV  │
  └──────────────────┘    └──────────────────────┘
```

### Phase Details

| Phase | Module | Tools | What It Does |
|-------|--------|-------|--------------|
| **1. Ligand Prep** | `ligand_prep.py` | Dimorphite-DL, RDKit, Meeko | Enumerate protonation states at target pH, generate 3D conformer (ETKDGv3 + MMFF94), produce PDBQT with Gasteiger charges |
| **2. Receptor Curation** | `receptor_curation.py` | PDBFixer, PDB2PQR, PROPKA, pdb-tools | Download PDB from RCSB, strip HETATM/water, repair missing heavy atoms, protonate at target pH, parallelised across all targets |
| **3. Pocket Detection** | `pocket_detection.py` | P2Rank (Java) | ML-based druggable pocket prediction that needs no binding-site templates; predicts and ranks candidate binding pockets for each protein |
| **4. Docking** | `docking.py` | GNINA (C++) | CNN-rescored molecular docking for each pocket × ligand combination, parallelised with ProcessPoolExecutor |
| **5. Post-Docking** | `post_dock.py` | ProLIF, Pandas | Compute interaction fingerprints (H-bond, hydrophobic, π-stacking, salt bridge), apply CNNscore filter, rank by Vina affinity with CNNaffinity as tie-breaker, export CSV |

---

## Installation

### Option A: Conda / Mamba (Recommended)

```bash
# Create the environment and install WTFDTB; GNINA, P2Rank and Java are installed separately (see External Dependencies)
mamba create -n wtfdtb python=3.12
mamba activate wtfdtb
pip install -e .
```

### Option B: From Source (Development)

```bash
git clone https://github.com/ChandraguptSharma07/WTFDTB.git
cd WTFDTB
python -m venv .venv
source .venv/bin/activate    # Linux/macOS
pip install -e ".[dev]"
```

### External Dependencies

These binaries must be available on `PATH`:

| Tool | Purpose | Install |
|------|---------|---------|
| **GNINA** | CNN-rescored docking engine | [github.com/gnina/gnina](https://github.com/gnina/gnina) or `mamba install gnina` |
| **P2Rank** | ML pocket detection | [github.com/rdk/p2rank](https://github.com/rdk/p2rank) — requires Java ≥ 11 |
| **Java ≥ 11** | Required by P2Rank | `mamba install openjdk` |

Set `PRANK_HOME` to the P2Rank installation directory if it's not on your PATH:

```bash
export PRANK_HOME=/path/to/p2rank_2.4.2
```
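
One way a wrapper could locate the P2Rank launcher is to check `PATH` first and fall back to `PRANK_HOME`. The sketch below only illustrates that lookup order and is not WTFDTB's own code: `find_prank` is a hypothetical helper, and it assumes the launcher script is called `prank`, as in the P2Rank release archives.

```python
import os
import shutil
from pathlib import Path


def find_prank(env=None):
    """Return the path to the P2Rank launcher, or None if it cannot be found.

    Hypothetical helper: looks on PATH first, then in $PRANK_HOME/prank.
    """
    env = os.environ if env is None else env

    # 1. Is a `prank` executable on PATH?
    on_path = shutil.which("prank", path=env.get("PATH"))
    if on_path:
        return Path(on_path)

    # 2. Fall back to the directory named by PRANK_HOME.
    home = env.get("PRANK_HOME")
    if home:
        candidate = Path(home) / "prank"
        if candidate.is_file():
            return candidate

    return None
```

Calling `find_prank()` with no arguments reads the real process environment; passing a dict makes the lookup easy to test.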

---

## Quick Start

### Basic Usage

```bash
# Screen aspirin against three example targets
echo "1EQG
2HZI
3K5V" > targets.txt

wtfdtb screen \
  --ligand aspirin.sdf \
  --targets targets.txt \
  --output results.csv
```

### Using PDB IDs from a Text File

```bash
# targets.txt — one PDB ID per line
wtfdtb screen \
  --ligand my_compound.smi \
  --targets targets.txt \
  --output hits.csv \
  --ph 7.4 \
  --exhaustiveness 8 \
  --workers 4
```
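
A targets file is plain text with one PDB ID per line. As a rough sketch of how such a file could be read and sanity-checked, the function below accepts classic four-character IDs (a digit followed by three alphanumerics). The name `read_pdb_ids`, the skipping of blank lines and `#` comments, and the de-duplication are all assumptions for illustration, not documented WTFDTB behaviour.

```python
import re

# Classic PDB IDs: one digit followed by three letters or digits.
PDB_ID = re.compile(r"^[0-9][A-Za-z0-9]{3}$")


def read_pdb_ids(text):
    """Return unique, upper-cased PDB IDs in first-seen order."""
    ids = []
    seen = set()
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # assumed: blank lines and comments are ignored
        if not PDB_ID.match(line):
            raise ValueError(f"line {lineno}: {line!r} is not a 4-character PDB ID")
        pdb_id = line.upper()
        if pdb_id not in seen:
            seen.add(pdb_id)
            ids.append(pdb_id)
    return ids
```

Failing early on a malformed ID avoids discovering the typo only after a download attempt.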

### Using a Directory of PDB Files

```bash
# Directory containing .pdb files
wtfdtb screen \
  --ligand drug.sdf \
  --targets ./protein_library/ \
  --output results.csv
```

### SMILES Input

The ligand can be a `.smi` file with SMILES notation:

```bash
echo "CC(=O)Oc1ccccc1C(=O)O aspirin" > aspirin.smi
wtfdtb screen --ligand aspirin.smi --targets targets.txt -o results.csv
```
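
By convention, each line of a `.smi` file holds a SMILES string, whitespace, and an optional molecule name. A minimal parser sketch follows; `parse_smi_line` and its `default_name` fallback are hypothetical, and WTFDTB's own reader may differ.

```python
def parse_smi_line(line, default_name="ligand"):
    """Split one .smi line into (smiles, name)."""
    parts = line.strip().split(maxsplit=1)
    if not parts:
        raise ValueError("empty SMILES line")
    smiles = parts[0]
    # Everything after the first whitespace run is treated as the name.
    name = parts[1].strip() if len(parts) > 1 else default_name
    return smiles, name
```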

---

## CLI Reference

```
wtfdtb screen [OPTIONS]
```

| Flag | Type | Default | Description |
|------|------|---------|-------------|
| `--ligand`, `-l` | Path | *required* | Input ligand file (`.sdf`, `.mol`, `.mol2`, `.smi`) |
| `--targets`, `-t` | Path | *required* | Protein target library — directory of `.pdb` files or text file of PDB IDs |
| `--output`, `-o` | Path | `results.csv` | Output CSV path for ranked docking results |
| `--ph` | float | `7.4` | Target pH for ligand and receptor protonation |
| `--box-size` | int | `25` | Side length (Å) of the cubic docking search box |
| `--cnn-model` | str | `default` | GNINA CNN model (`default`, `dense`, or path to weights) |
| `--cnn-score-threshold` | float | `0.5` | Minimum CNNscore (0–1) to accept a pose |
| `--min-interactions` | int | `1` | Minimum protein-ligand interactions to keep a pose (0 = no filter) |
| `--workers`, `-w` | int | CPU count | Parallel workers for receptor curation and docking |
| `--exhaustiveness` | int | `8` | GNINA search exhaustiveness (higher = slower, more thorough) |
| `--verbosity` | int | `1` | Logging: 0 = quiet, 1 = normal, 2 = debug |
| `--version`, `-v` | — | — | Show version and exit |

---

## Output Format

The output CSV is primarily ranked by **Vina affinity** (ascending = tighter predicted binding in kcal/mol), with **CNNaffinity** (pKd) used to break ties:

| Column | Description |
|--------|-------------|
| `rank` | Overall rank (1 = best predicted binder) |
| `pdb_id` | Target protein PDB ID |
| `pocket` | Binding pocket name (from P2Rank) |
| `pose_rank` | Pose rank within this pocket (from GNINA) |
| `cnn_score` | GNINA CNN confidence score (0–1, higher = more accurate pose) |
| `cnn_affinity` | GNINA CNN-predicted binding affinity (pKd, higher = tighter) |
| `vina_affinity` | AutoDock Vina scoring function affinity (kcal/mol, lower = tighter) |
| `hbond` | Number of hydrogen bonds (donor + acceptor) |
| `hydrophobic` | Number of hydrophobic contacts |
| `pi_stacking` | Number of π-stacking / cation-π interactions |
| `salt_bridge` | Number of salt bridges (anionic + cationic) |
| `total_interactions` | Sum of all interaction types |

Example output:

```csv
rank,pdb_id,pocket,pose_rank,cnn_score,cnn_affinity,vina_affinity,hbond,hydrophobic,pi_stacking,salt_bridge,total_interactions
1,1EQG,pocket3,1,0.89,7.2,-6.5,3,4,1,0,8
2,1EQG,pocket7,1,0.82,6.5,-6.1,2,2,1,0,5
3,2HZI,pocket1,2,0.76,6.8,-5.9,2,3,0,1,6
```
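
Because the output is a plain CSV, it can be inspected with the standard library alone. The sketch below uses the column names from the table above to keep the best-ranked row per target, and it checks that `total_interactions` equals the sum of the four interaction columns. `best_per_target` is a hypothetical helper, not part of WTFDTB.

```python
import csv
import io

INTERACTION_COLUMNS = ("hbond", "hydrophobic", "pi_stacking", "salt_bridge")


def best_per_target(csv_text):
    """Map each pdb_id to its best-ranked row (lowest `rank`)."""
    best = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Consistency check: the total must equal the sum of the parts.
        counts = sum(int(row[col]) for col in INTERACTION_COLUMNS)
        if counts != int(row["total_interactions"]):
            raise ValueError(f"rank {row['rank']}: interaction counts do not add up")
        pdb_id = row["pdb_id"]
        if pdb_id not in best or int(row["rank"]) < int(best[pdb_id]["rank"]):
            best[pdb_id] = row
    return best
```

For a file on disk, `best_per_target(open("results.csv").read())` would give one row per protein, which is often the more useful view when several pockets of the same target score well.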

---

## Project Structure

```
WTFDTB/
├── pyproject.toml               # Package metadata, dependencies, entry point
├── recipe/
│   └── meta.yaml                # Bioconda / Conda-Forge recipe
├── src/
│   └── wtfdtb/
│       ├── __init__.py           # Version string
│       ├── cli.py                # Typer CLI — screen command + all flags
│       ├── ligand_prep.py        # Phase 1: SMILES/SDF → protonated 3D PDBQT
│       ├── receptor_curation.py  # Phase 2: PDB → cleaned, protonated receptor
│       ├── pocket_detection.py   # Phase 3: P2Rank ML pocket prediction
│       ├── docking.py            # Phase 4: GNINA CNN-rescored docking
│       ├── post_dock.py          # Phase 5: ProLIF interactions + ranking
│       ├── pipeline.py           # Orchestrator: wires Phases 1–5
│       └── utils.py              # PDB fetcher, logging, shared helpers
├── tests/
│   └── ...
└── README.md
```

---

## Tech Stack

| Layer | Tool | Purpose |
|-------|------|---------|
| **CLI** | [Typer](https://typer.tiangolo.com/) | Type-hinted CLI with auto-generated `--help` |
| **Ligand Protonation** | [Dimorphite-DL](https://github.com/durrantlab/dimorphite_dl) | pH-dependent protonation state enumeration |
| **Cheminformatics** | [RDKit](https://www.rdkit.org/) | 3D conformer generation (ETKDGv3), MMFF94 minimisation |
| **PDBQT Generation** | [Meeko](https://github.com/forlilab/Meeko) | Gasteiger charges, torsion tree for AutoDock-family |
| **PDB Parsing** | [Biopython](https://biopython.org/) | REMARK 465 parsing for quality gating |
| **PDB Cleaning** | [pdb-tools](https://github.com/haddocking/pdb-tools) | Strip HETATM, waters, alternate conformations |
| **Structure Repair** | [PDBFixer](https://github.com/openmm/pdbfixer) (OpenMM) | Model missing heavy atoms |
| **Receptor Protonation** | [PDB2PQR](https://github.com/Electrostatics/pdb2pqr) + PROPKA | Rigorous pKa-based protonation |
| **Pocket Detection** | [P2Rank](https://github.com/rdk/p2rank) | ML-based pocket prediction (Java) |
| **Docking** | [GNINA](https://github.com/gnina/gnina) | CNN-rescored docking (a fork of smina / AutoDock Vina that adds CNN scoring) |
| **Interaction Fingerprints** | [ProLIF](https://github.com/chemosim-lab/ProLIF) | H-bond, hydrophobic, π-stacking, salt bridge detection |
| **Data** | [Pandas](https://pandas.pydata.org/) | Filtering, ranking, CSV export |
| **Parallelism** | `concurrent.futures` | ProcessPoolExecutor for receptors + docking |

---

## How It Works (In Detail)

### Phase 1: Ligand Preparation

1. Read input ligand (SMILES string or SDF/MOL file)
2. Enumerate physiological protonation states at the target pH using Dimorphite-DL
3. Generate 3D coordinates using RDKit's ETKDGv3 algorithm
4. Energy-minimise with the MMFF94 force field
5. Convert to PDBQT format (Gasteiger charges + torsion tree) via Meeko

### Phase 2: Receptor Curation

For each protein target (downloaded from RCSB or provided as local PDB):

1. Strip all HETATM records and water molecules using pdb-tools
2. Repair missing heavy atoms using PDBFixer (OpenMM)
3. Assign protonation states at the target pH (`--ph`) using PDB2PQR with PROPKA
4. Write the curated receptor PDB

This phase runs in parallel across all targets using ProcessPoolExecutor.
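
Step 1 amounts to a record-level filter over the PDB text. The pipeline uses pdb-tools for this; the standard-library stand-in below is only meant to show the idea. `strip_hetatm_and_waters` is a hypothetical name, and the set of water residue names is an assumption.

```python
# Residue names commonly used for water (assumed list, not exhaustive).
WATER_NAMES = {"HOH", "WAT", "H2O", "DOD"}


def strip_hetatm_and_waters(pdb_text):
    """Drop HETATM records and any ATOM records that belong to water."""
    kept = []
    for line in pdb_text.splitlines():
        record = line[:6].strip()          # columns 1-6: record type
        if record == "HETATM":
            continue
        # Columns 18-20 hold the residue name in fixed-width PDB format.
        if record == "ATOM" and line[17:20].strip() in WATER_NAMES:
            continue
        kept.append(line)
    return "\n".join(kept) + "\n"
```

Note that this removes every hetero group, including cofactors and metal ions, which matches the "strip all HETATM records" wording above.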

### Phase 3: Pocket Detection

1. Run P2Rank on all curated receptors in batch mode
2. Parse P2Rank output to extract binding pocket centers (X, Y, Z coordinates)
3. Each pocket defines a docking search box for Phase 4

P2Rank uses machine learning (random forests on surface features) to detect druggable pockets without requiring known binding site templates.
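
Steps 2 and 3 can be sketched with the standard library. The code assumes P2Rank's `*_predictions.csv` layout, in which header names and values are padded with spaces and the pocket centre is stored in `center_x`, `center_y` and `center_z`. `Pocket`, `parse_p2rank_predictions` and `search_box` are hypothetical names, not WTFDTB's API.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class Pocket:
    name: str
    center: tuple  # (x, y, z) in Å


def parse_p2rank_predictions(csv_text):
    """Extract pocket names and centres from a P2Rank predictions CSV."""
    reader = csv.reader(io.StringIO(csv_text))
    header = [h.strip() for h in next(reader)]  # headers are space-padded
    pockets = []
    for fields in reader:
        if not fields:
            continue
        row = dict(zip(header, (f.strip() for f in fields)))
        center = tuple(float(row[k]) for k in ("center_x", "center_y", "center_z"))
        pockets.append(Pocket(row["name"], center))
    return pockets


def search_box(center, size=25.0):
    """Return ((xmin, xmax), (ymin, ymax), (zmin, zmax)) for a cubic box."""
    half = size / 2.0
    return tuple((c - half, c + half) for c in center)
```

With the default `--box-size` of 25 Å, each pocket centre expands to a cube reaching 12.5 Å in every direction.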

### Phase 4: Molecular Docking

For each (receptor, pocket) combination:

1. Build GNINA command-line arguments with pocket center and box size
2. Run GNINA with CNN rescoring enabled
3. Parse output SDF to extract per-pose CNNscore, CNNaffinity, and Vina affinity

This phase runs in parallel using ProcessPoolExecutor. GNINA rescores docking poses with convolutional neural networks trained on protein-ligand complexes; its authors report better pose ranking than the classical Vina scoring function (McNutt et al., 2021).
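
The three steps can be sketched as follows. The command uses GNINA options inherited from smina (`-r`, `-l`, `-o`, `--center_*`, `--size_*`, `--exhaustiveness`) plus `--cnn_scoring rescore`. The parser assumes the per-pose SDF tags `minimizedAffinity`, `CNNscore` and `CNNaffinity`. Both function names and the mapping to CSV column names are illustrative assumptions, not WTFDTB's actual code.

```python
import re

# SDF data-item header, e.g. "> <CNNscore>" or ">  <CNNscore>".
TAG_HEADER = re.compile(r"^>\s*<([^>]+)>")

# Assumed mapping from GNINA SDF tags to the output CSV column names.
TAG_TO_COLUMN = {
    "minimizedAffinity": "vina_affinity",
    "CNNscore": "cnn_score",
    "CNNaffinity": "cnn_affinity",
}


def build_gnina_command(receptor, ligand, center, out_sdf,
                        box_size=25, exhaustiveness=8):
    """Assemble a GNINA argument list for one pocket (cubic search box)."""
    cmd = ["gnina", "-r", str(receptor), "-l", str(ligand)]
    for axis, value in zip("xyz", center):
        cmd += [f"--center_{axis}", f"{value:.3f}", f"--size_{axis}", str(box_size)]
    cmd += ["--exhaustiveness", str(exhaustiveness),
            "--cnn_scoring", "rescore", "-o", str(out_sdf)]
    return cmd


def parse_pose_scores(sdf_text):
    """Return one dict of scores per pose, in file order."""
    poses = []
    for record in sdf_text.split("$$$$"):   # "$$$$" terminates each SDF record
        lines = record.splitlines()
        scores = {}
        # The value of a data item is on the line after its header.
        for header, value in zip(lines, lines[1:]):
            match = TAG_HEADER.match(header)
            if match and match.group(1) in TAG_TO_COLUMN:
                scores[TAG_TO_COLUMN[match.group(1)]] = float(value.strip())
        if len(scores) == len(TAG_TO_COLUMN):
            poses.append(scores)
    return poses
```

Returning the command as a list keeps it safe to pass to `subprocess.run` without shell quoting.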

### Phase 5: Post-Docking Analysis

1. **CNNscore filter**: Discard poses below the threshold (default 0.5)
2. **Interaction profiling**: Use ProLIF to compute protein-ligand interaction fingerprints (H-bonds, hydrophobic contacts, π-stacking, salt bridges, cation-π)
3. **Interaction filter**: Discard poses with fewer interactions than `--min-interactions`
4. **Ranking**: Sort remaining poses by Vina affinity (kcal/mol, ascending) then CNN affinity (pKd, descending)
5. **Export**: Write ranked results to CSV
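
The filter and ranking logic (steps 1, 3 and 4) can be illustrated without Pandas. This is a minimal sketch assuming each pose is a dict keyed by the output CSV column names; `rank_poses` is a hypothetical function, and the real pipeline performs these steps with Pandas.

```python
def rank_poses(poses, cnn_score_threshold=0.5, min_interactions=1):
    """Filter poses, then sort by Vina affinity and CNN affinity."""
    kept = [
        p for p in poses
        if p["cnn_score"] >= cnn_score_threshold
        and p["total_interactions"] >= min_interactions
    ]
    # Vina affinity ascending (more negative = tighter); ties are broken
    # by CNN affinity descending (higher pKd = tighter).
    kept.sort(key=lambda p: (p["vina_affinity"], -p["cnn_affinity"]))
    return [{"rank": i, **p} for i, p in enumerate(kept, start=1)]
```

Setting `min_interactions=0` disables the interaction filter, mirroring the `--min-interactions` flag.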

---

## Development

### Setup

```bash
git clone https://github.com/ChandraguptSharma07/WTFDTB.git
cd WTFDTB
python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
```

### Running Tests

```bash
pytest
```

### Code Quality

```bash
ruff check src/
ruff format src/
```

### Building the Conda Package

```bash
conda build recipe/
```

---

## Supported Platforms

| Platform | Status | Notes |
|----------|--------|-------|
| **Linux x86_64** | ✅ Supported | Primary platform. GNINA binary available via conda-forge. |
| **macOS** | ⚠️ Partial | Python pipeline works; GNINA must be compiled from source. |
| **Windows (WSL)** | ⚠️ Partial | Works through Windows Subsystem for Linux. |

---

## Citation

If you use WTFDTB in your research, please cite:

```bibtex
@software{wtfdtb2025,
  title  = {WTFDTB: High-Throughput Inverse Virtual Screening},
  author = {Chandragupt Sharma},
  year   = {2025},
  url    = {https://github.com/ChandraguptSharma07/WTFDTB}
}
```

And the key tools in the pipeline:

- **GNINA**: McNutt et al. *J. Cheminformatics* 13, 43 (2021)
- **P2Rank**: Krivák & Hoksza. *J. Cheminformatics* 10, 39 (2018)
- **ProLIF**: Bouysset & Fiorucci. *J. Cheminformatics* 13, 72 (2021)
- **RDKit**: [rdkit.org](https://www.rdkit.org/)

---

## License

MIT — see [LICENSE](LICENSE) for details.
