Metadata-Version: 2.4
Name: BSPDA
Version: 0.1.0
Summary: A 3D Protein Pocket Detection & Classification Pipeline using geometry and electrostatic methods
Author-email: Alberto Domingo <alberto.domingo01@estudiant.upf.edu>, Diego Maquieira <diego.maquieira01@estudiant.upf.edu>
Requires-Python: >=3.9
Description-Content-Type: text/markdown
Requires-Dist: numpy
Requires-Dist: scipy
Requires-Dist: rdkit
Requires-Dist: freesasa
Requires-Dist: biopython

# BSPDA: Binding-Site(s) Predictor by Diego and Alberto
**Authors:** Diego Maquieira Vidal and Alberto Domingo Gómez

## Introduction
Welcome to **BSPDA** (our PYT-SBI project). The aim of this pipeline is to predict the position of binding sites within the 3D structure of a given protein model.

While the primary focus is geometric prediction (locating surface cavities where a ligand could bind), BSPDA also performs complex chemical analysis. Using Solvent Accessible Surface Area (SASA) and studying partial charges, the pipeline suggests the type of ligand involved and lists the specific amino acids lining the pocket.

PocketPicker tool was used as the main reference to build BSPDA [1].

This tutorial will walk you through the modular architecture of the pipeline.

---

### Prerequisites
Before running this tutorial, ensure you have propperly installed the package.

```python
# Import the packages needed for this markdown.
import numpy as np
from rdkit import Chem
# Import the modular classes of the BSPDA pipeline
from structure import ProteinStructure
from geometry import GridGenerator
from detector import PocketDetector
from classifier import PocketClassifier
from io import Exporter
```

---

## Step 1: Setup and Data Preparation
First, it is needed to download the target protein and process it. PDB files often contain "noise" for the methods of this pipeline, such as crystallized water molecules, buffer ions or existing ligands. 

If hetero-atoms are not eliminated, the geometric algorithm will view the pockets as "already full" and fail to detect them.

```python
print("1. Initializing and Cleaning Protein Structure...")

# Define output directory and target PDB ID
output_dir = "eval_results/1bid"
pdb_target = "1bid"

# Initialize the Structure object
struct = ProteinStructure(pdb_id=pdb_target, work_dir=output_dir)

# Download the PDB and load it into RDKit
struct.downloadPDB()

# Strip water, ions, and co-crystallized ligands to leave empty cavities
struct.strip_htm()

# Calculate Van der Waals radii for all remaining protein atoms
ptable = Chem.GetPeriodicTable()
vdw_radii = np.array([ptable.GetRvdw(atom.GetAtomicNum()) for atom in struct.mol.GetAtoms()])

print(f"Successfully loaded {pdb_target} with {struct.coords.shape[0]} heavy atoms.")
```

---

## Step 2: 3D Grid Generation & Flood Fill
Before searching pockets, empty spaces need to be defined. 
The `GridGenerator` creates a 3D bounding box around the protein and fills it with a voxel grid (1.0 Å resolution). It then measures the distance from every voxel to the nearest protein atom. 

Using a 3D flood-fill algorithm, it distinguishes between the infinite bulk solvent outside the protein and the deep clefts/internal cavities.

```python
print("2. Generating 3D Grid and Isolating Surface Probes...")

# Initialize grid generator with protein coordinates and VDW radii
grid_gen = GridGenerator(struct.coords, vdw_radii, resolution=1.0, padding=8.0)

# Apply flood-fill to find points that are empty, connected to the surface, but near the protein
surface_probes = grid_gen.get_valid_surface_points()

print(f"Generated {len(surface_probes)} valid surface probe points.")
```

---

## Step 3: Ray-Casting Scoring & Clustering
Not all surface space is a binding pocket. To determine how "deep" or "buried" a point is, the `PocketDetector` casts 30 uniformly distributed rays outward from each probe. 

If a ray hits a protein atom, it counts as a hit. Probes with high hit counts (high "buriedness") are deep inside clefts. Deep probes are filtered and grouped into spatial clusters.

```python
print("3. Scoring Buriedness and Clustering Deep Pockets...")

detector = PocketDetector(struct.coords)

# Calculate ray-casting scores for all surface probes
detector.score_probes(surface_probes)

# Filter probes that are deeply buried (Score > 18 out of 30)
deep_mask = detector.scores > 18
deep_probes = surface_probes[deep_mask]
deep_scores = detector.scores[deep_mask]

# Group neighboring deep probes into distinct clusters (Minimum volume: 40 Å³)
clusters = detector.cluster_pockets(surface_probes)

print(f"Detected {len(clusters)} distinct binding pockets.")
```

---

## Step 4: Pocket(s) Classification (SASA-Based)
Once the geometric boundaries of the pockets have been defined, their biochemistry will be studied. 

The `PocketClassifier` utilizes `freesasa` to calculate the exposed surface area of the atoms lining the pocket. By combining this with RDKit's Gasteiger partial charges, one can calculate the percentage of **unpolar**, **positive**, and **negative** surface area.

Based on these established literature standards, pockets are classified regarding their biological and chemical roles. For instance, high localized percentages of negative or positive surface area dictate the binding behavior of specific co-factors, metal ions, or highly polar parts of larger ligands [2]. Furthermore, empirical NMR screening data highlights that the physical volume and hydrophobic properties of a cavity are the primary indicators of its druggability, as they are required to stabilize the aromatic rings common in pharmacology [4]. In this way, valid drug pockets must meet minimum volume thresholds (typically larger than 30-40 Å³, with primary druggable clefts often exceeding 500 Å³), while smaller or highly charged cavities tend to be related to the transient binding of ions or certain solvent molecules [4]. Ultimately, this study of spatial volume, surface area, and atomic partial charges tries to use the most important features used in recent methodologies [3].

```python
print("4. Analyzing Surface Chemistry and Lining Residues...\n")

# Initialize the classifier (this automatically computes Gasteiger charges and SASA)
classifier = PocketClassifier(struct.mol)

# Analyze all geometric clusters
analyzed_pockets = classifier.analyze_clusters(deep_probes, deep_scores, clusters)

# Print out the detailed report for each pocket
for p in analyzed_pockets:
    print(f"=== POCKET {p.pocket_id} ===")
    print(f"  Classification:  {p.classification}")
    print(f"  Volume:          {p.volume} A^3")
    print(f"  Chemical Makeup: {p.p_unp:.1f}% Unpolar | {p.p_pos:.1f}% Positive | {p.p_neg:.1f}% Negative")
    print(f"  Lining Residues: {', '.join(p.lining_residues)}\n")
```

---

## Step 5: Output Generation & PyMOL Visualization
The `Exporter` class maps the deep probes back into a standard PDB file format, cleverly storing the "buriedness score" in the B-factor column so we can apply color gradients.

It also automatically generates four distinct `.pml` scripts to view the protein surface, electrostatics, and pocket grids in PyMOL.

```python
print("5. Exporting PDB and PyMOL Visualization Scripts...")

pdb_filename = "scored_pockets.pdb"
pdb_out_path = f"{output_dir}/{pdb_filename}"

# Save the synthetic pocket probes into a unified PDB file
Exporter.save_pdb(struct.mol, analyzed_pockets, pdb_out_path)

# Generate the 4 PyMOL visualization permutations
Exporter.write_pymol_scripts(output_dir, pdb_filename, len(analyzed_pockets))

print("Pipeline Complete! Check your output directory for the PyMOL scripts.")
```

### Visualizing Results
Navigate to your output directory and open any of the generated scripts in PyMOL:
1. `1_grey_surface_gradient_dots.pml`: standard view, pocket depth shown via heat map.
2. `2_grey_surface_discrete_dots.pml`: standard view, pockets colored by unique ID.
3. `3_electro_surface_gradient_dots.pml`: protein colored by charge (red/blue), depth heat map.
4. `4_electro_surface_discrete_dots.pml`: protein colored by charge, pockets colored by ID.

---

### *1-Line Execution in the terminal*
For batch validations (like the PocketPicker validation script), it is not needed to write step by step the steps above. One can utilize the orchestration function defined in `__init__.py`:

```python
from pocket_finder import run_pipeline

# Runs steps 1-5 automatically and returns the list of PocketProperties dataclasses
pockets = run_pipeline(input_pdb="structure_files/pdb1bid.ent", output_dir="results_1bid")
```
## References

[1] Weisel, M., Proschak, E., & Schneider, G. (2007). PocketPicker: Analysis of ligand binding-sites with shape descriptors. Chemistry Central Journal, 1(1), 7. https://doi.org/10.1186/1752-153X-1-7

[2] Dundas, J., Adamian, L., & Liang, J. (2011). Structural signatures of enzyme binding pockets from order-independent surface alignment: A study of metalloendopeptidase and NAD binding proteins. Journal of Molecular Biology, 406(5), 713–729. https://doi.org/10.1016/j.jmb.2010.12.005

[3] Wang, D. D., Wu, W., & Wang, R. (2024). Structure-based, deep-learning models for protein–ligand binding affinity prediction. Journal of Cheminformatics, 16, 2. https://doi.org/10.1186/s13321-023-00795-9

[4] Hajduk, P. J., Huth, J. R., & Fesik, S. W. (2005). Druggability indices for protein targets derived from NMR-based screening data. Journal of Medicinal Chemistry, 48(7), 2518–2525. https://doi.org/10.1021/jm049131r
