Metadata-Version: 2.4
Name: BSPDA
Version: 0.2.0
Summary: A 3D Protein Pocket Detection & Classification Pipeline using geometry and electrostatic methods
Author-email: Alberto Domingo <alberto.domingo01@estudiant.upf.edu>, Diego Maquieira <diego.maquieira01@estudiant.upf.edu>
Requires-Python: >=3.9
Description-Content-Type: text/markdown
Requires-Dist: numpy
Requires-Dist: scipy
Requires-Dist: rdkit
Requires-Dist: freesasa
Requires-Dist: biopython

# BSPDA: Binding-Site(s) Predictor by Diego and Alberto
**Authors:** Diego Maquieira Vidal and Alberto Domingo Gómez

## Introduction
Welcome to **BSPDA** (our PYT-SBI project). The aim of this pipeline is to predict the position of binding sites within the 3D structure of a given protein model.

While the primary focus is geometric prediction (locating surface cavities where a ligand could bind), BSPDA also performs complex chemical analysis. Using Solvent Accessible Surface Area (SASA) and studying partial charges, the pipeline suggests the type of ligand involved and lists the specific amino acids lining the pocket.

PocketPicker tool was used as the main reference to build BSPDA [1].

This tutorial will walk you through the modular architecture of the pipeline.

---

### Prerequisites
Before running this tutorial, ensure you have propperly installed the package.

```bash
pip install BSPDA
```

---

## Step 1: Setup and Data Preparation
First, it is needed to download the target protein and process it. PDB files often contain "noise" for the methods of this pipeline, such as crystallized water molecules, buffer ions or existing ligands. 

If hetero-atoms are not eliminated, the geometric algorithm will view the pockets as "already full" and fail to detect them.

```python
print("1. Initializing and Cleaning Protein Structure...")

import BSPDA as bp
import Chem
import numpy as np

# Define output directory and target PDB ID
output_dir = "eval_results/1bid"
input_pdb = "1bid"

# Initialize the Structure object
struct = bp.ProteinStructure(pdb_id=input_pdb, work_dir=output_dir)

# After initialization, we can download its PDB (This will generate its RDKit Mol object as well)
struct.downloadPDB()

# Strip water, ions, and co-crystallized ligands to leave empty cavities
struct.strip_htm()

# Calculate Van der Waals radii for all remaining protein atoms
ptable = Chem.GetPeriodicTable()
vdw_radii = np.array([ptable.GetRvdw(atom.GetAtomicNum()) for atom in struct.mol.GetAtoms()])

print(f"Successfully loaded {pdb_target} with {struct.coords.shape[0]} heavy atoms.")
```

---

## Step 2: 3D Grid Generation & Flood Fill
Before searching pockets, empty spaces need to be defined. 
The `GridGenerator` creates a 3D bounding box around the protein and fills it with a voxel grid (1.0 Å resolution). It then measures the distance from every voxel to the nearest protein atom. 

Using a 3D flood-fill algorithm, it distinguishes between the infinite bulk solvent outside the protein and the deep clefts/internal cavities.

```python

print("2. Generating 3D Grid and Isolating Surface Probes...")

# Initialize grid generator with protein coordinates and VDW radii
grid_gen = bp.GridGenerator(struct.coords, vdw_radii)

# Apply flood-fill to find points that are empty, connected to the surface, but near the protein
surface_probes = grid_gen.get_valid_surface_points()

print(f"Generated {len(surface_probes)} valid surface probe points.")
```

---

## Step 3: Ray-Casting Scoring & Clustering
Not all surface space is a binding pocket. To determine how "deep" or "buried" a point is, the `PocketDetector` casts 30 uniformly distributed rays outward from each probe. 

If a ray hits a protein atom, it counts as a hit. Probes with high hit counts (high "buriedness") are deep inside clefts. Deep probes are filtered and grouped into spatial clusters.

```python
print("3. Scoring Buriedness and Clustering Deep Pockets...")

detector = bp.PocketDetector(struct.coords)

# Calculate ray-casting scores for all surface probes
detector.score_probes(surface_probes)

# Filter probes that are deeply buried (Score > 18 out of 30)
# This step is not neccesary because because the threashold applied is 18 by default
deep_mask = detector.scores > 18
deep_probes = surface_probes[deep_mask]
deep_scores = detector.scores[deep_mask]

# Group neighboring deep probes into distinct clusters (Minimum volume: 40 Å³)
clusters = detector.cluster_pockets(surface_probes)

print(f"Detected {len(clusters)} distinct binding pockets.")
```

---

## Step 4: Pocket(s) Classification (SASA-Based)
Once the geometric boundaries of the pockets have been defined, their biochemistry will be studied. 

The `PocketClassifier` utilizes `freesasa` to calculate the exposed surface area of the atoms lining the pocket. By combining this with RDKit's Gasteiger partial charges, one can calculate the percentage of **unpolar**, **positive**, and **negative** surface area.

Based on these established literature standards, pockets are classified regarding their biological and chemical roles. For instance, high localized percentages of negative or positive surface area dictate the binding behavior of specific co-factors, metal ions, or highly polar parts of larger ligands [2]. Furthermore, empirical NMR screening data highlights that the physical volume and hydrophobic properties of a cavity are the primary indicators of its druggability, as they are required to stabilize the aromatic rings common in pharmacology [4]. In this way, valid drug pockets must meet minimum volume thresholds (typically larger than 30-40 Å³, with primary druggable clefts often exceeding 500 Å³), while smaller or highly charged cavities tend to be related to the transient binding of ions or certain solvent molecules [4]. Ultimately, this study of spatial volume, surface area, and atomic partial charges tries to use the most important features used in recent methodologies [3].

```python
print("4. Analyzing Surface Chemistry and Lining Residues...\n")

# Initialize the classifier (this automatically computes Gasteiger charges and SASA)
classifier = bp.PocketClassifier(struct.mol)

# Analyze all geometric clusters
analyzed_pockets = classifier.analyze_clusters(deep_probes, deep_scores, clusters)

# Print out the detailed report for each pocket
for p in analyzed_pockets:
    print(f"=== POCKET {p.pocket_id} ===")
    print(f"  Classification:  {p.classification}")
    print(f"  Volume:          {p.volume} A^3")
    print(f"  Chemical Makeup: {p.p_unp:.1f}% Unpolar | {p.p_pos:.1f}% Positive | {p.p_neg:.1f}% Negative")
    print(f"  Lining Residues: {', '.join(p.lining_residues)}\n")
```

---

## Step 5: Output Generation & PyMOL Visualization
The `Exporter` class maps the deep probes back into a standard PDB file format, cleverly storing the "buriedness score" in the B-factor column so we can apply color gradients.

It also automatically generates four distinct `.pml` scripts to view the protein surface, electrostatics, and pocket grids in PyMOL.

```python
import os

print("5. Exporting PDB and PyMOL Visualization Scripts...")

pdb_filename = f"scored_pockets_{input_pdb}.pdb"
pdb_out_path = os.path.join(output_dir, pdb_filename)

# Save the synthetic pocket probes into a unified PDB file
bp.Exporter.save_pdb(struct.mol, analyzed_pockets, pdb_out_path)

# Generate the 4 scenes PyMOL visualization script
Exporter.write_pymol_script(output_dir, pdb_filename, len(analyzed_pockets), input_pdb)

print("Pipeline Complete! Check your output directory for the PyMOL scripts.")
```

### Visualizing Results
Navigate to your output directory and open the generated PyMol script:
1. `F1 grey_surface_gradient`: standard view, pocket depth shown via heat map.
2. `F2 grey_surface_discrete`: standard view, pockets colored by unique ID.
3. `F3 electro_surface_gradient`: protein colored by charge (red/blue), depth heat map.
4. `F4 electro_surface_discrete`: protein colored by charge, pockets colored by ID.

```bash
pymol render_pockets_1BID.pml
```

---

### *1-Programatic use*
For batch validations (like the PocketPicker validation script), it is not needed to write step by step the steps above. One can utilize the orchestration functions defined in `__init__.py`, for example, a list of PDB ids can be provided.

```python
import BSPDA

pbd_list = ["1BID","1BIB"]

for id, pockets in BSPDA.get_multiple_protein_pockets(pdb_list).items():
    for p in pockets:
        print(f"   [POCKET {id} {p.pocket_id}] {p.classification}")
        print(f"     Vol: {p.volume} A^3 | Unp: {p.p_unp:.1f}% | Pos: {p.p_pos:.1f}% | Neg: {p.p_neg:.1f}%")
```

### *2-Command line use*

This program can be run with a single PDB id in bash, though, an output dir can be added using -o name

```bash
find-pockets 1BID
# Or
find-pockets 1BID -o out
```
## References

[1] Weisel, M., Proschak, E., & Schneider, G. (2007). PocketPicker: Analysis of ligand binding-sites with shape descriptors. Chemistry Central Journal, 1(1), 7. https://doi.org/10.1186/1752-153X-1-7

[2] Dundas, J., Adamian, L., & Liang, J. (2011). Structural signatures of enzyme binding pockets from order-independent surface alignment: A study of metalloendopeptidase and NAD binding proteins. Journal of Molecular Biology, 406(5), 713–729. https://doi.org/10.1016/j.jmb.2010.12.005

[3] Wang, D. D., Wu, W., & Wang, R. (2024). Structure-based, deep-learning models for protein–ligand binding affinity prediction. Journal of Cheminformatics, 16, 2. https://doi.org/10.1186/s13321-023-00795-9

[4] Hajduk, P. J., Huth, J. R., & Fesik, S. W. (2005). Druggability indices for protein targets derived from NMR-based screening data. Journal of Medicinal Chemistry, 48(7), 2518–2525. https://doi.org/10.1021/jm049131r
