Metadata-Version: 2.4
Name: drugclip
Version: 0.1.2
Summary: Multimodal Graph-Text Contrastive Learning for Drug Design
Home-page: https://huggingface.co/homerquan/DrugClip
Author: BioTarget Contributors
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.9
Description-Content-Type: text/markdown
Requires-Dist: torch>=2.0
Requires-Dist: torch_geometric>=2.3
Requires-Dist: pandas
Requires-Dist: tqdm
Requires-Dist: rdkit
Requires-Dist: transformers
Requires-Dist: huggingface_hub
Dynamic: author
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# DrugCLIP: Multimodal Graph-Text Contrastive Learning for Drug Design 🧬✨

[![Python](https://img.shields.io/badge/Python-3.9%2B-blue.svg)](https://www.python.org/)
[![PyTorch](https://img.shields.io/badge/PyTorch-2.0%2B-ee4c2c.svg)](https://pytorch.org/)
[![RDKit](https://img.shields.io/badge/RDKit-cheminformatics-green.svg)](https://www.rdkit.org/)

**DrugCLIP** is a core deep-learning package that performs contrastive alignment between 3D molecular structures and textual therapeutic/clinical descriptions. It powers AI drug-discovery pipelines (such as [BioTarget](https://github.com/your-org/biotarget)) by scoring novel molecular geometries against clinical goals (e.g., binding affinity) and failure modes (e.g., toxicity).

---

## 🧩 Model Architecture

DrugCLIP maps text descriptions and 3D molecular point clouds into a shared 128-dimensional latent space. It serves as a *surrogate* multi-objective scoring function, estimating both binding potential and toxicity.

* **Graph Encoder**: [SchNet](https://github.com/atomistic-machine-learning/schnetpack) (3D Message Passing Neural Network), extracting features from atomic numbers (`z`) and 3D coordinates (`pos`).
* **Text Encoder**: DistilBERT (`distilbert-base-uncased`), projecting clinical natural language queries into semantic embeddings.
* **Loss Function**: InfoNCE (Contrastive Loss) matching paired batches of molecular geometries with their corresponding text records.
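The contrastive objective above can be sketched as follows. The encoders here are stand-in linear projections (the real model uses SchNet and DistilBERT); all dimensions except the shared 128-dim space and the 768-dim DistilBERT hidden size are illustrative:

```python
import torch
import torch.nn.functional as F

def info_nce(mol_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired (molecule, text) embeddings."""
    mol = F.normalize(mol_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = mol @ txt.t() / temperature             # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_m2t = F.cross_entropy(logits, targets)      # molecule -> text
    loss_t2m = F.cross_entropy(logits.t(), targets)  # text -> molecule
    return 0.5 * (loss_m2t + loss_t2m)

# Stand-in projections into the shared 128-dim space (input dims hypothetical):
mol_proj = torch.nn.Linear(256, 128)   # would follow the SchNet readout
txt_proj = torch.nn.Linear(768, 128)   # 768 = DistilBERT hidden size

mol_emb = mol_proj(torch.randn(8, 256))
txt_emb = txt_proj(torch.randn(8, 768))
loss = info_nce(mol_emb, txt_emb)
```

Each diagonal entry of the similarity matrix is a true (molecule, text) pair; the off-diagonal entries in the same batch serve as negatives.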

---

## 💾 Installation

To install DrugCLIP as a standalone pip package:

```bash
git clone https://github.com/your-org/drugclip.git
cd drugclip
pip install -e .
```

After installation, the `drugclip` CLI entry point is available on your `PATH`.

---

## 📊 Dataset Preparation

DrugCLIP requires supervised data that pairs molecular structures with clinical outcomes and textual descriptions. Out of the box, it supports preparing:

1. **MolTextNet** (Structure-to-text mapping)
2. **TDC Tox21** (Experimental toxicity assays)
3. **TDC ClinTox** (FDA clinical failure records)
4. **ChEMBL** (Unlabeled chemical lookup libraries)

To download all datasets directly into `data/`:
```bash
drugclip data download all
```
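For illustration, a single paired training record might look like the following minimal sketch. The field names follow the `z`/`pos` convention the graph encoder consumes; the molecule, coordinates, and description text are made up, and the package's actual on-disk schema may differ:

```python
import torch

# Hypothetical shape of one paired record: 3D geometry plus description text.
# Water (H2O) with rough coordinates; a real pipeline would derive geometry
# from SMILES via RDKit conformer generation.
record = {
    "z": torch.tensor([8, 1, 1]),                   # atomic numbers: O, H, H
    "pos": torch.tensor([[ 0.0000, 0.0000, 0.0],
                         [ 0.9572, 0.0000, 0.0],
                         [-0.2400, 0.9270, 0.0]]),  # coordinates in Angstroms
    "text": "A small polar solvent molecule.",      # paired description
}
assert record["z"].shape[0] == record["pos"].shape[0]  # one position per atom
```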

---

## ⚡ High-Performance Pre-Training

DrugCLIP is optimized for HPC and single-node multi-GPU setups. The PyTorch data loaders use asynchronous pinned-memory transfers, forward passes run under Automatic Mixed Precision (AMP) via `torch.amp.autocast`, and 3D molecular geometries are precomputed across all available CPU cores with a `ProcessPoolExecutor` before GPU encoding.
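
A minimal sketch of such a loop, assuming pinned-memory loading and `torch.amp.autocast` (this is illustrative, not the package's actual trainer; a `torch.amp.GradScaler` would typically be added for fp16 training):

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

device = "cuda" if torch.cuda.is_available() else "cpu"

# Mock dataset; pin_memory enables async host-to-device copies on GPU.
ds = TensorDataset(torch.randn(64, 16), torch.randint(0, 2, (64,)))
loader = DataLoader(ds, batch_size=16, pin_memory=(device == "cuda"))

model = torch.nn.Linear(16, 2).to(device)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

for x, y in loader:
    x = x.to(device, non_blocking=True)  # async copy from pinned memory
    y = y.to(device, non_blocking=True)
    opt.zero_grad(set_to_none=True)
    with torch.amp.autocast(device_type=device, enabled=(device == "cuda")):
        loss = F.cross_entropy(model(x), y)
    loss.backward()
    opt.step()
```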

To train the contrastive alignment model:

```bash
# Full training run
drugclip train align

# Or validate your hardware limits with synthetic data
drugclip train align --quick-validate
```

Checkpoints are saved automatically to `runs/align/best.ckpt`, relative to the directory you run the command from.

---

## 🔬 Standalone Inference

You can run isolated drug-retrieval inference directly through the DrugCLIP CLI:

```bash
drugclip infer --goal-text "A highly selective and safe kinase inhibitor" --top-n 5
```
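
Under the hood, retrieval of this kind amounts to ranking pre-computed molecule embeddings by cosine similarity to the encoded goal text. A sketch with mock 128-dim embeddings (`top_n`, `library`, and `goal` are illustrative names, not the package API):

```python
import torch
import torch.nn.functional as F

def top_n(goal_emb: torch.Tensor, mol_embs: torch.Tensor, n: int = 5):
    """Return indices and scores of the n molecules closest to the goal text."""
    sims = F.cosine_similarity(goal_emb.unsqueeze(0), mol_embs, dim=-1)
    scores, idx = sims.topk(min(n, mol_embs.size(0)))
    return idx.tolist(), scores.tolist()

# Mock embedding library standing in for encoded candidate molecules.
library = F.normalize(torch.randn(100, 128), dim=-1)
goal = F.normalize(torch.randn(128), dim=-1)     # would come from DistilBERT
indices, scores = top_n(goal, library, n=5)
```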
