Metadata-Version: 2.4
Name: molcore-chem
Version: 0.7.0
Requires-Dist: torch>=2.2
Requires-Dist: numpy>=1.26
Requires-Dist: rdkit>=2023.9
Requires-Dist: torch-geometric>=2.5
Requires-Dist: pyarrow>=13.0
Requires-Dist: anthropic>=0.28 ; extra == 'agent'
Requires-Dist: molcore-chem[pandas,optuna,pretrained,bio,design,server,agent] ; extra == 'all'
Requires-Dist: transformers>=4.30 ; extra == 'bio'
Requires-Dist: sentencepiece>=0.1.99 ; extra == 'bio'
Requires-Dist: pytdc>=0.4.1 ; extra == 'bio'
Requires-Dist: scikit-learn>=1.3 ; extra == 'bio'
Requires-Dist: scipy>=1.11 ; extra == 'bio'
Requires-Dist: pandas>=2.0 ; extra == 'bio'
Requires-Dist: scikit-learn>=1.3 ; extra == 'design'
Requires-Dist: scipy>=1.11 ; extra == 'design'
Requires-Dist: requests>=2.28 ; extra == 'design'
Requires-Dist: meeko>=0.4 ; extra == 'design'
Requires-Dist: maturin ; extra == 'dev'
Requires-Dist: pytest ; extra == 'dev'
Requires-Dist: pytest-benchmark ; extra == 'dev'
Requires-Dist: pandas>=2.0 ; extra == 'dev'
Requires-Dist: nbformat>=5.0 ; extra == 'dev'
Requires-Dist: optuna>=3.0 ; extra == 'dev'
Requires-Dist: anthropic>=0.28 ; extra == 'dev'
Requires-Dist: ruff>=0.4 ; extra == 'dev'
Requires-Dist: optuna>=3.0 ; extra == 'optuna'
Requires-Dist: pandas>=2.0 ; extra == 'pandas'
Requires-Dist: molfeat>=0.9 ; extra == 'pretrained'
Requires-Dist: datamol>=0.12 ; extra == 'pretrained'
Requires-Dist: transformers>=4.30 ; extra == 'pretrained'
Requires-Dist: sentencepiece>=0.1.99 ; extra == 'pretrained'
Requires-Dist: fastapi>=0.110 ; extra == 'server'
Requires-Dist: uvicorn>=0.27 ; extra == 'server'
Requires-Dist: pydantic>=2.0 ; extra == 'server'
Provides-Extra: agent
Provides-Extra: all
Provides-Extra: bio
Provides-Extra: design
Provides-Extra: dev
Provides-Extra: optuna
Provides-Extra: pandas
Provides-Extra: pretrained
Provides-Extra: server
License-File: LICENSE
Summary: AI-native cheminformatics: Rust core + RDKit bridge + Python AI API
Requires-Python: >=3.11
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM

# molcore-chem

**AI-native cheminformatics toolkit** — Rust-accelerated fingerprints and PyG conversion, with full RDKit compatibility and a built-in MCP server.

[![CI](https://github.com/Anteneh-T-Tessema/molcore/actions/workflows/ci.yml/badge.svg)](https://github.com/Anteneh-T-Tessema/molcore/actions/workflows/ci.yml)
[![PyPI](https://img.shields.io/pypi/v/molcore-chem)](https://pypi.org/p/molcore-chem)
[![Python 3.11+](https://img.shields.io/badge/python-3.11%2B-blue)](https://www.python.org/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)

```bash
pip install molcore-chem
```

---

## Overview

molcore extends RDKit workflows rather than replacing them. The hot paths — fingerprint generation and PyTorch Geometric graph conversion — are rewritten in Rust using Rayon parallelism and zero-copy array transfer, while standardization, descriptors, and scaffold splitting delegate to RDKit through an isolated bridge layer.

| Capability | Implementation | Notes |
| --- | --- | --- |
| ECFP4 fingerprints | Rust (Rayon + u64 bit-packing) | 35–132× faster than RDKit |
| PyG graph conversion | Rust (IntoPyArray → torch.from_numpy) | 4.3× faster, zero-copy |
| Tanimoto matrix | Rust (Rayon + popcount) | 4.3–29× faster at scale |
| Standardization, descriptors, scaffold split | RDKit (via rdkit_bridge.py) | Parity speed, cleaner API |

---

## Quickstart

```python
from molcore.molecule import Mol
from molcore.pipeline import featurize_smiles
from molcore.predictor import PropertyPredictor
from molcore.io import MolDataset
import numpy as np

# smiles_list: list[str] of SMILES; logp_values: one float label per SMILES (both user-supplied)

# Parse — immutable, Rust-backed
mol = Mol.from_smiles("CC(=O)Oc1ccccc1C(=O)O")   # aspirin
data = mol.to_pyg()                                # PyG Data, zero-copy, 9 node features

# Batch fingerprints — Rust Rayon parallel
fps = featurize_smiles(smiles_list, backend="rust")   # (N, 2048) uint8 Tensor

# Full dataset pipeline
ds = MolDataset.from_smiles(smiles_list, compute_fps=True, compute_desc=True)
ds.labels = np.array(logp_values, dtype=np.float32)
train_ds, val_ds, test_ds = ds.scaffold_split()

# Train GCN with MC Dropout uncertainty
pred = PropertyPredictor(hidden=64, epochs=100)
pred.fit(train_ds, val_dataset=val_ds)
means, stds = pred.predict_with_uncertainty(["CCO", "c1ccccc1"], n_samples=30)
```

**[Open in Colab →](https://colab.research.google.com/github/Anteneh-T-Tessema/molcore/blob/main/examples/quickstart.ipynb)**

---

## Benchmarks

All numbers were measured on Apple M-series (arm64), CPU-only, Python 3.12.

### ECFP4 Fingerprints

| Batch size | molcore (Rust) | RDKit | Speedup |
| --- | --- | --- | --- |
| 1 000 SMILES | 1.3M mol/s | 14 800 mol/s | **88×** |
| 10 000 SMILES | 2.0M mol/s | 15 100 mol/s | **132×** |

### Tanimoto Similarity Matrix

| Query × Library | molcore (Rust) | RDKit BulkTanimoto | Speedup |
| --- | --- | --- | --- |
| 50 × 1 000 | 31M pairs/s | 7.3M pairs/s | **4.3×** |
| 500 × 10 000 | 224M pairs/s | 7.7M pairs/s | **29×** |

### End-to-End Pre-training Pipeline (500 molecules)

| Step | molcore | RDKit | Speedup |
| --- | --- | --- | --- |
| Standardize | 242 ms | 225 ms | ~parity |
| ECFP4 fingerprints | 1.1 ms | 37.3 ms | **35×** |
| 7 Lipinski descriptors | 124 ms | 114 ms | ~parity |
| Scaffold split | 33 ms | 35 ms | ~parity |
| PyG conversion (200 mols) | 3.3 ms | 14.4 ms | **4.3×** |

### GNN Property Prediction — ESOL Solubility

ESOL dataset (Delaney 2004, 1128 molecules), scaffold split. A scaffold split is substantially harder
than the random split used in published MoleculeNet baselines, so these results are not directly
comparable to the published RMSE ≈ 0.58. In this run, the Optuna-tuned configuration scored slightly
worse than the first one on both metrics.

| Configuration | RMSE | R² |
| --- | --- | --- |
| GCN, hidden=64, 3 layers, 300 epochs | 1.038 | 0.727 |
| Optuna-tuned (30 trials): hidden=128, 4 layers | 1.090 | 0.709 |
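
Why a scaffold split is harder can be seen from how such a split is usually built: molecules are grouped by scaffold and each group goes to exactly one subset, so test-set scaffolds never appear in training. The sketch below assumes the scaffold strings have already been computed (for example, Murcko scaffolds from RDKit). `scaffold_split_indices` is a hypothetical helper written for illustration, not molcore's `scaffold_split`.

```python
from collections import defaultdict


def scaffold_split_indices(scaffolds, train_frac=0.8, val_frac=0.1):
    """Split indices so that every scaffold group lands in exactly one subset."""
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    # Largest groups first, a common convention for scaffold splits.
    ordered = sorted(groups.values(), key=len, reverse=True)

    n = len(scaffolds)
    train, val, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= train_frac * n:
            train.extend(group)
        elif len(val) + len(group) <= val_frac * n:
            val.extend(group)
        else:
            test.extend(group)
    return train, val, test


# Toy input: 10 molecules sharing 4 scaffolds (illustrative values only).
scaffolds = ["c1ccccc1"] * 6 + ["c1ccncc1"] * 2 + ["C1CCCCC1", "c1ccoc1"]
train, val, test = scaffold_split_indices(scaffolds)
print(train, val, test)   # [0, 1, 2, 3, 4, 5, 6, 7] [8] [9]
```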

---

## Features

### Billion-Scale Streaming Screen

Screen libraries that do not fit in RAM using any `Iterable[str]` of SMILES — file iterators,
database cursors, or generators. Peak memory is `O(chunk_size × nbits/8)`.

```python
from molcore.streaming import stream_screen, StreamingScreen

def from_file(path):
    with open(path) as fh:
        for line in fh:
            yield line.strip().split()[0]

# Tanimoto similarity + SMARTS filter in a single pass
hits = stream_screen(
    from_file("chembl_34.smi"),
    query="c1ccc(N)cc1",
    query_smarts="[NH2]",
    threshold=0.4,
    chunk_size=10_000,
    progress=True,
)
for smiles, tanimoto_score in hits:
    print(smiles, tanimoto_score)

# Stateful version — screen multiple chunks, inspect running stats
screen = StreamingScreen(query="c1ccc(N)cc1", threshold=0.4)
for chunk in my_chunks:
    chunk_hits = screen.screen_chunk(chunk)
    save_hits(chunk_hits)
print(screen.stats)  # {n_screened, n_hits, hit_rate, elapsed_s, rate_mol_s}
```
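
To make the memory bound concrete, here is a minimal sketch of the chunking idea. It is not molcore's implementation: `iter_chunks` and `peak_fingerprint_bytes` are hypothetical helpers, and the sketch assumes bit-packed fingerprints (`nbits/8` bytes per molecule), as the bound above implies.

```python
from itertools import islice
from typing import Iterable, Iterator


def iter_chunks(smiles: Iterable[str], chunk_size: int) -> Iterator[list[str]]:
    """Yield successive lists of at most `chunk_size` SMILES from any iterable."""
    it = iter(smiles)
    while chunk := list(islice(it, chunk_size)):
        yield chunk


def peak_fingerprint_bytes(chunk_size: int, nbits: int = 2048) -> int:
    """Bytes needed to hold one chunk of bit-packed fingerprints."""
    return chunk_size * nbits // 8


# A generator of 25 toy SMILES is consumed one chunk at a time.
toy_smiles = (f"C{'C' * i}O" for i in range(25))
sizes = [len(chunk) for chunk in iter_chunks(toy_smiles, chunk_size=10)]
print(sizes)                             # [10, 10, 5]
print(peak_fingerprint_bytes(10_000))    # 2560000
```

With the `chunk_size=10_000` used above and 2048-bit fingerprints, that works out to roughly 2.56 MB of fingerprint data per chunk, independent of the library size.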

### MCP Server

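Any MCP-compatible host (Claude Desktop, Continue, Cursor) can invoke molcore tools directly.
With the HTTP transport the host does not need its own Python installation; the stdio transport
shown below launches the server through a local Python interpreter.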

```bash
molcore mcp                                    # stdio transport
molcore mcp --transport http --port 8765       # HTTP transport
```

**Claude Desktop** — add to `claude_desktop_config.json`:

```json
{
  "mcpServers": {
    "molcore": {
      "command": "python",
      "args": ["-m", "molcore.mcp_server"],
      "env": {}
    }
  }
}
```

Nine tools are exposed: `featurize`, `screen_smarts`, `screen_similarity`, `admet_screen`,
`synthesizability`, `generate`, `retro_score`, `active_suggest`, and `pareto_optimize`.

### SDF and Parquet I/O

```python
from molcore.io import MolDataset

ds = MolDataset.from_sdf("library.sdf")
ds = MolDataset.from_sdf("library.sdf", compute_fps=True, compute_desc=True)
ds.write_sdf("output.sdf")
ds.write_parquet("library.parquet")           # Arrow columnar, snappy-compressed
ds2 = MolDataset.read_parquet("library.parquet")
```

### Pandas Integration

Requires the `pandas` extra: `pip install "molcore-chem[pandas]"`.

```python
import molcore.pandas_tools as mpt

df = mpt.load_sdf("library.sdf")                  # DataFrame with 'Mol' + 'smiles' columns
df = mpt.add_descriptors(df, preset="lipinski")   # MolWt, LogP, TPSA, HBD, HBA, …
df = mpt.add_fingerprints(df, kind="ecfp4")       # adds 'fp' column
df = mpt.filter_by_smarts(df, "c1ccncc1")         # keep rows containing a pyridine ring
df = mpt.standardize_smiles(df)                   # strip salts → neutralize → canonical tautomer
```

### Descriptors

```python
from molcore.rdkit_bridge import calc_named_descriptors

arr, names = calc_named_descriptors(smiles, preset="lipinski")   # 7 descriptors
arr, names = calc_named_descriptors(smiles, preset="druglike")   # 15 descriptors
arr, names = calc_named_descriptors(smiles, preset="all")        # ~200 descriptors
arr, names = calc_named_descriptors(smiles, names=["MolWt", "TPSA", "BertzCT"])
```

Each call returns `arr`, an `(N, D)` float32 array, together with `names`, the matching descriptor names.

### Fingerprint Types

```python
from molcore.pipeline import featurize_smiles

fps = featurize_smiles(smiles, kind="ecfp4")                # (N, 2048), Rust parallel
fps = featurize_smiles(smiles, kind="maccs")                # (N, 167)
fps = featurize_smiles(smiles, kind="atom_pairs")           # (N, 2048)
fps = featurize_smiles(smiles, kind="topological_torsions") # (N, 2048)
fps = featurize_smiles(smiles, kind="rdkit")                # (N, 2048) RDKit path-based
```

### 2D Depiction

```python
from molcore.molecule import Mol
from molcore.io import MolDataset

mol = Mol.from_smiles("CC(=O)Oc1ccccc1C(=O)O")
mol              # renders inline in Jupyter via _repr_svg_
mol.to_png("aspirin.png")

ds = MolDataset.from_sdf("library.sdf")
ds               # renders 8-molecule grid inline
ds.draw_grid(n=20, mols_per_row=4)
```

### Standardization

```python
from molcore.rdkit_bridge import standardize

clean = standardize("[Na+].OC(=O)c1ccccc1")   # benzoic acid; RDKit's canonical form is "O=C(O)c1ccccc1"
# strips salts → neutralizes charges → canonical tautomer → canonical SMILES
```

### MCS and R-Group Decomposition

```python
from molcore.rdkit_bridge import find_mcs, rgroup_decompose

smarts = find_mcs(["CC(=O)Oc1ccccc1", "CC(=O)Oc1ccc(F)cc1", "CC(=O)Oc1ccc(Cl)cc1"])

rows = rgroup_decompose("c1ccc([*:1])cc1", smiles_list)
# → [{"Core": "c1ccccc1", "R1": "F"}, {"Core": "c1ccccc1", "R1": "Cl"}, ...]
```

### GCN Predictor with MC Dropout Uncertainty

```python
from molcore.predictor import PropertyPredictor

pred = PropertyPredictor(hidden=64, n_layers=3, epochs=100, dropout=0.1)
pred.fit(train_ds, val_dataset=val_ds, verbose=True)

predictions = pred.predict(smiles_list)                          # numpy array
means, stds = pred.predict_with_uncertainty(smiles_list, n_samples=30)

pred.save("logp_model.pt")
pred2 = PropertyPredictor.load("logp_model.pt")
```
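
MC Dropout estimates uncertainty by leaving dropout switched on at prediction time and summarizing several stochastic forward passes. The sketch below shows that idea with a toy pure-Python model. `mc_dropout_predict` and `toy_forward` are hypothetical names, and this is not how `PropertyPredictor` is implemented internally.

```python
import random
import statistics


def toy_forward(x, rng, weights=(0.5, -1.0, 2.0), p=0.1):
    """Hypothetical one-layer model: inverted dropout on the inputs, then a dot product."""
    kept = [xi / (1.0 - p) if rng.random() >= p else 0.0 for xi in x]
    return sum(w * k for w, k in zip(weights, kept))


def mc_dropout_predict(forward, x, n_samples=30, seed=0):
    """Run `forward` n_samples times with dropout active; return (mean, std)."""
    rng = random.Random(seed)
    samples = [forward(x, rng) for _ in range(n_samples)]
    return statistics.fmean(samples), statistics.pstdev(samples)


mean, std = mc_dropout_predict(toy_forward, [1.0, 2.0, 3.0], n_samples=30)
print(f"mean={mean:.3f} std={std:.3f}")

# With dropout disabled (p=0) every pass is identical, so the spread collapses to zero.
det_mean, det_std = mc_dropout_predict(
    lambda x, rng: toy_forward(x, rng, p=0.0), [1.0, 2.0, 3.0]
)
print(det_mean, det_std)   # 4.5 0.0
```

The standard deviation is only a proxy for model uncertainty: it grows with the dropout rate and shrinks to zero when dropout is off.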

### Drug-Target Interaction Prediction

```python
from molcore import DTIDataset, DTIPredictor

ds = DTIDataset(
    smiles    = ["CC(=O)O",    "c1ccccc1"],
    sequences = ["MKTLLILAVL", "ACDEFGHIKL"],
    labels    = [6.5,           7.2],          # pIC50
)

train, val, test = ds.scaffold_split(train_frac=0.8, val_frac=0.1)

pred = DTIPredictor(hidden=64, n_layers=3, epochs=100, model_type="gcn")
pred.fit(train, val_dataset=val)

affinities = pred.predict(["CCO"], ["MKTLLILAVL"])   # (N,) float32 pIC50
metrics    = pred.score(test)                         # {r2, mae, rmse, n}
```
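
The keys returned by `pred.score(test)` match the usual regression metrics. For reference, the sketch below computes them from their textbook definitions. `regression_metrics` is a hypothetical helper, and the definitions are assumed to be the conventional ones rather than taken from molcore's code.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute R^2, MAE, and RMSE from paired true and predicted values."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n
    ss_res = sum(r * r for r in residuals)               # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    return {
        "r2": 1.0 - ss_res / ss_tot,   # undefined if all true values are equal
        "mae": sum(abs(r) for r in residuals) / n,
        "rmse": math.sqrt(ss_res / n),
        "n": n,
    }


metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
print(metrics)
```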

`model_type` accepts `"gcn"`, `"gat"`, or `"gin"`. ESM-2 protein embeddings are available
via `pip install "molcore-chem[bio]"` (the quotes keep shells such as zsh from expanding the brackets).

---

## Installation

```bash
pip install molcore-chem
```

Requires Python 3.11+. RDKit and PyTorch are declared dependencies — no manual conda setup
required. Pre-compiled Rust extensions are included in the wheel.

**GPU (CUDA 12.1):**

```bash
pip install torch --index-url https://download.pytorch.org/whl/cu121   # install the CUDA build first
pip install molcore-chem                                               # reuses the torch already installed
```

### Build from Source

```bash
git clone https://github.com/Anteneh-T-Tessema/molcore
cd molcore
./setup_dev.sh    # creates .venv, builds Rust extension, runs tests
source .venv/bin/activate
```

Building from source requires Rust 1.70+, which can be installed with rustup:

```bash
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
```

---

## Architecture

```text
SMILES strings
  │
  ▼  Rust ingest (RDKit-backed aromaticity perception)
  │  — sanitize, kekulize, ring perception, implicit H
  ▼
petgraph StableGraph (immutable after construction)
  │
  ├─▶ ecfp4_batch()          → (N × 2048) uint8  ─▶ torch.from_numpy()  ─▶ Tensor
  │   Rayon parallel · u64 bit-pack · hardware popcount · 35–132× faster
  │
  ├─▶ mol_to_graph_arrays()  → node_feats (9-dim), edge_index, edge_attr ─▶ PyG Data
  │   Zero-copy IntoPyArray · 4.3× faster than manual Python construction
  │
  └─▶ tanimoto_matrix()      → (Q × L) float32
      Rayon parallel · u64 popcount · 29× faster at scale

Python layer (molcore/)
  molecule.py      — frozen Mol dataclass (FrozenInstanceError on mutation)
  pipeline.py      — featurize_smiles() batch-first entry point
  rdkit_bridge.py  — all RDKit calls isolated here (one file to update)
  io.py            — MolDataset: SDF + Parquet + DataFrame bridge
  predictor.py     — PropertyPredictor: 3-layer GCN + MC Dropout
  dti.py           — DTIPredictor: GCN/GAT/GIN ligand + 1D-CNN protein encoder
  pandas_tools.py  — DataFrame-first API for existing RDKit workflows
  agentic_rag.py   — ChemRAG: iterative chemical literature retrieval
```

### Design Invariants

1. `Mol` is always immutable — transforms return new instances.
2. RDKit is never in hot paths — all RDKit calls are isolated to `rdkit_bridge.py`.
3. All Rust→Python array transfers use `IntoPyArray` — no Python-side copy loops.
4. Batch API is primary — per-molecule methods are convenience wrappers.
5. Backend flags are explicit — `"rust"` or `"rdkit"` is always caller-supplied.
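
The first invariant can be illustrated with the standard library. The sketch below uses a hypothetical `ToyMol` class, not molcore's `Mol`, to show how a frozen dataclass rejects mutation and how a transform returns a new instance instead.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class ToyMol:
    """Hypothetical stand-in for an immutable molecule record."""
    smiles: str
    name: str = ""


aspirin = ToyMol("CC(=O)Oc1ccccc1C(=O)O")

try:
    aspirin.name = "aspirin"             # mutation is rejected
except FrozenInstanceError as exc:
    print(type(exc).__name__)            # FrozenInstanceError

named = replace(aspirin, name="aspirin")     # a transform returns a new instance
print(named is aspirin, aspirin.name == "")  # False True
```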

---

## Development

```bash
maturin develop --release --features extension-module   # build Rust extension
cargo test -p molcore-core                              # Rust unit tests
pytest tests/ evals/ -q                                 # 1061 Python/eval tests
python benchmarks/prove_scale.py                        # throughput benchmark (JSON)
python benchmarks/bench_e2e.py --n 1000                 # end-to-end benchmark
ruff check molcore/                                     # lint
```

---

## Documentation

- **[Quickstart notebook](examples/quickstart.ipynb)** — [Open in Colab](https://colab.research.google.com/github/Anteneh-T-Tessema/molcore/blob/main/examples/quickstart.ipynb)
- **[Migrating from RDKit](docs/migrating_from_rdkit.md)** — API mapping for common RDKit patterns
- **[End-to-end GNN example](examples/end_to_end_gnn.py)** — ESOL solubility benchmark
- **[Virtual screening pipeline](examples/virtual_screening_pipeline.py)**

---

## License

MIT — see [LICENSE](LICENSE).

