Metadata-Version: 2.4
Name: casper-descriptor
Version: 0.1.0
Summary: CASPER: Conformer-Averaged Surface Property Encoded Representation -- a tunable 3D molecular descriptor
Author: dehaenw
License: MIT
Project-URL: Homepage, https://github.com/dehaenw/casper
Keywords: cheminformatics,molecular-descriptors,QSAR,rdkit,3D,VSA,VolSurf
Classifier: Programming Language :: Python :: 3
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Chemistry
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.21
Requires-Dist: rdkit>=2022.9
Requires-Dist: scikit-learn>=1.0
Provides-Extra: viz
Requires-Dist: py3Dmol>=2.0; extra == "viz"
Requires-Dist: matplotlib>=3.5; extra == "viz"
Provides-Extra: jazzy
Requires-Dist: jazzy>=0.0.12; extra == "jazzy"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Dynamic: license-file

# CASPER

**C**onformer-**A**veraged **S**urface **P**roperty **E**ncoded **R**epresentation — a tunable 3D molecular descriptor.

CASPER turns a molecule into a fixed-length feature vector by:

1. generating ETKDG conformers (no force-field minimisation needed by default),
2. building a van-der-Waals dot surface for each conformer,
3. **colouring** each surface point by an atomic property (partial charge, logP, molar refractivity, TPSA, H-bond donor/acceptor, …),
4. **encoding** the coloured surface as a property *histogram* (a 3D generalisation of MOE-style VSA descriptors) and/or a density-invariant spatial *autocorrelation* (VolSurf-flavoured), and
5. **pooling** across conformers.

Unlike 2D VSA descriptors, CASPER is built on real 3D conformer surfaces, and unlike single-conformer VolSurf it averages over a conformer ensemble. Every step is exposed as a tunable parameter, and the per-conformer "bag" can be returned un-pooled for multi-instance / key-instance learning.
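
Steps 4 and 5 can be illustrated on toy data. The sketch below is not CASPER's implementation: it assumes each surface point carries a property value and an area weight, and the helper `area_histogram` is a hypothetical name. It uses only the standard library, so it runs without CASPER or RDKit installed.

```python
import math

def area_histogram(values, areas, lo, hi, n_bins):
    """Sum surface area per property bin; out-of-range values are clamped."""
    width = (hi - lo) / n_bins
    hist = [0.0] * n_bins
    for value, area in zip(values, areas):
        idx = math.floor((value - lo) / width)
        idx = min(max(idx, 0), n_bins - 1)   # clamp to the first/last bin
        hist[idx] += area
    return hist

# Two toy conformers: (property value per point, area per point)
conformers = [
    ([-0.4, -0.1, 0.2, 0.3], [1.0, 2.0, 1.5, 0.5]),
    ([-0.3, 0.1, 0.1, 0.4], [1.0, 1.0, 2.0, 1.0]),
]

# Step 4: one histogram per conformer
per_conf = [area_histogram(v, a, lo=-0.5, hi=0.5, n_bins=4) for v, a in conformers]

# Step 5: mean-pool each bin across conformers
pooled = [sum(col) / len(col) for col in zip(*per_conf)]
print(pooled)
```

Because every point lands in exactly one bin, each per-conformer histogram sums to that conformer's total surface area.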

## Install

```bash
pip install casper-descriptor
# with visualisation extras:
pip install "casper-descriptor[viz]"
```

Core dependencies: `numpy`, `rdkit`, `scikit-learn`.

## Quick start

```python
import casper

# one molecule -> one fixed-length vector (default config)
v = casper.featurize("CC(=O)Oc1ccccc1C(=O)O")

# tune the construction
cfg = casper.CasperConfig(
    n_confs=10,
    properties=("gasteiger", "abs_charge", "logp", "mr", "tpsa"),
    encoding=("hist", "autocorr"),
    n_bins=12, autocorr_bins=8, autocorr_max_dist=16.0,
    conf_pool=("mean", "max"),
)
v = casper.featurize("CCO", cfg)

# names carry full provenance: e.g. "mean:gasteiger|ac[0.0,2.0)A"
v, names = casper.featurize("CCO", cfg, return_names=True)

# batch, parallel across molecules
smiles_list = ["CCO", "CC(=O)Oc1ccccc1C(=O)O", "CC(=O)Nc1ccc(O)cc1"]
X = casper.featurize_many(smiles_list, cfg, n_jobs=-1)

# sklearn transformer (drops into Pipeline / GridSearchCV)
from casper import CasperFeaturizer
ft = CasperFeaturizer(n_confs=10, density=16, encoding=("hist", "autocorr"))
X = ft.fit_transform(smiles_list)
```

## Multi-instance learning (un-pooled bag)

For key-instance detection, get the per-conformer instances without pooling:

```python
# cfg and smiles_list carry over from Quick start
bag, conformer_ids, names = casper.featurize_bag("CCO", cfg)
# bag: (K, d) array, one CASPER vector per conformer; K varies per molecule
# conformer_ids: trace a flagged key-instance back to its 3D geometry (deterministic for a fixed seed)

bags, names = casper.featurize_bags(smiles_list, cfg)   # ragged list of (K_i, d)
```

`casper.featurize(...)` is exactly `pool()` applied to this bag.
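
The relationship between a bag and the pooled vector can be sketched in plain Python. This is an illustration under assumptions, not the package's `pool()`: the helper names `pool_bag` and `key_conformer` and the feature names are hypothetical, and only `mean`/`max`/`min` are covered.

```python
from statistics import fmean

POOLERS = {"mean": fmean, "max": max, "min": min}

def pool_bag(bag, names, conf_pool=("mean", "max")):
    """Collapse a (K, d) bag into one vector of length d * len(conf_pool)."""
    columns = list(zip(*bag))            # d columns, each holding K values
    vector, pooled_names = [], []
    for op in conf_pool:
        vector.extend(POOLERS[op](col) for col in columns)
        pooled_names.extend(f"{op}:{name}" for name in names)
    return vector, pooled_names

def key_conformer(bag, feature_idx):
    """Row index of the conformer with the largest value of one feature."""
    return max(range(len(bag)), key=lambda k: bag[k][feature_idx])

bag = [[1.0, 4.0], [3.0, 0.0], [2.0, 2.0]]   # K=3 conformers, d=2 features
names = ["prop|hist[0]", "prop|hist[1]"]     # hypothetical feature names
vector, pooled_names = pool_bag(bag, names)
print(vector, pooled_names, key_conformer(bag, 1))
```

With an un-pooled bag, the row returned by `key_conformer` is what you would look up in `conformer_ids` to recover the geometry behind a `max`-pooled feature.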

## Feature visualization

When a model flags a CASPER feature as important, you can see *what surface region
it measures* — every feature name carries full provenance back to the surface.

```python
import casper
cfg = casper.CasperConfig(properties=("gasteiger",), encoding=("hist", "autocorr"),
                          autocorr_max_dist=16.0)  # 8 default bins over 16 angstrom give the 2 angstrom bins named below

# write a static PNG (matplotlib) and/or interactive HTML (py3Dmol)
casper.explain_feature("CC(=O)Oc1ccccc1C(=O)O", "mean:gasteiger|ac[2.0,4.0)A",
                       cfg, png="feature.png", html="feature.html")

# in a notebook, omit png/html for an inline interactive py3Dmol view
casper.explain_feature("CCO", "mean:gasteiger|hist[0.17,0.25)", cfg)
```

- **Histogram bins** highlight the surface points whose property falls in the bin
  (the highlighted area provably equals the feature's value).
- **Autocorrelation bins** show either a per-point *contribution* score
  (`autocorr_mode="contribution"`, default) or the literal contributing
  point-pairs at that separation (`autocorr_mode="pairs"`).

Requires the viz extra: `pip install "casper-descriptor[viz]"`.
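
The two autocorrelation views described above can be pictured with a toy calculation. The definition used here is an assumption made for illustration (pair products of property values within a distance bin, each product split equally between its two points); CASPER's actual scoring may differ, and both helper names are hypothetical.

```python
from itertools import combinations
from math import dist

def pairs_in_bin(coords, lo, hi):
    """Index pairs (i, j), i < j, whose separation falls in [lo, hi)."""
    return [(i, j) for i, j in combinations(range(len(coords)), 2)
            if lo <= dist(coords[i], coords[j]) < hi]

def point_contributions(coords, values, lo, hi):
    """Per-point score: half of every pair product the point takes part in."""
    score = [0.0] * len(coords)
    for i, j in pairs_in_bin(coords, lo, hi):
        half = 0.5 * values[i] * values[j]
        score[i] += half
        score[j] += half
    return score

coords = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 4.0, 0.0), (10.0, 0.0, 0.0)]
values = [1.0, 2.0, -1.0, 0.5]

pairs = pairs_in_bin(coords, 2.0, 4.5)                  # "pairs"-style view
scores = point_contributions(coords, values, 2.0, 4.5)  # "contribution"-style view
print(pairs, scores)
```

Under this definition the per-point scores sum to the total of the pair products in the bin, so both views describe the same quantity at different granularity.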

## Key parameters (`CasperConfig`)

| parameter | default | what it does |
|---|---|---|
| `n_confs` | 10 | ETKDG conformers per molecule |
| `optimize` | `"none"` | `"none"` (raw ETKDG, fast) / `"mmff"` / `"uff"` |
| `properties` | `("gasteiger","logp","mr")` | which atomic properties colour the surface |
| `probe` | `0.0` | `0.0` = VdW surface; `1.4` = water-accessible |
| `density` | `16` | surface dots per atom (knee of the accuracy/cost curve; cost is ~quadratic via autocorr) |
| `encoding` | `("hist",)` | `"hist"` and/or `"autocorr"` |
| `n_bins` | 12 | histogram bins per property |
| `autocorr_bins`, `autocorr_max_dist` | 8, 12.0 | distance bins / radial extent for autocorrelation |
| `autocorr_normalize` | `True` | density-invariant mean-per-bin (recommended) vs legacy area-weighted sum |
| `autocorr_range` | `None` | per-property `(max_dist, n_bins)` override |
| `conf_pool` | `("mean",)` | `mean`/`max`/`min`/`std`/`boltzmann`, concatenated |
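
As a rough guide to how these parameters drive the vector length, the sketch below assumes one block of `n_bins` and/or `autocorr_bins` features per property, repeated once per pooling operation, with no `autocorr_range` override. That layout is an assumption inferred from the parameter descriptions, and `expected_length` is a hypothetical helper, not part of the package.

```python
def expected_length(n_properties, encoding, n_bins, autocorr_bins, n_pools):
    """Feature count under the assumed layout: properties x bins x poolers."""
    per_property = 0
    if "hist" in encoding:
        per_property += n_bins
    if "autocorr" in encoding:
        per_property += autocorr_bins
    return n_properties * per_property * n_pools

# Table defaults: 3 properties, histogram only, 12 bins, mean pooling
default_len = expected_length(3, ("hist",), 12, 8, 1)

# Quick-start config: 5 properties, both encodings, mean + max pooling
tuned_len = expected_length(5, ("hist", "autocorr"), 12, 8, 2)
print(default_len, tuned_len)
```

Comparing this estimate with `len(names)` from `return_names=True` is a quick way to check whether the assumed layout holds for your configuration.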

Add your own colouring:

```python
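import numpy as np
import casper
# callable: RDKit mol -> one value per atom; (lo, hi): presumably the value range used for binning
casper.register_property("my_prop", lambda mol: np.array([...]), (lo, hi))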
```

### Optional: kallisto / Jazzy properties

Five extra per-atom colourings derived from [Jazzy](https://jazzy.readthedocs.io)
(kallisto EEQ charges) add signal orthogonal to the built-ins — a different charge
model (`eeq`), a real charge-dependent dynamic polarisability (`alp`, unlike a
per-element constant), and *continuous* H-bond strengths (`sa` acceptor, `sdc`/`sdx`
donor) rather than binary flags:

```python
import casper.jazzy_properties        # registers: eeq, alp, sa, sdc, sdx
cfg = casper.CasperConfig(properties=("gasteiger", "eeq", "alp", "sa", "sdc", "sdx"))
v = casper.featurize("CC(=O)Nc1ccc(O)cc1", cfg)
```

They are computed once per molecule on CASPER's own geometry (cached), so all five
cost roughly one kallisto evaluation. Requires `pip install "casper-descriptor[jazzy]"`
(note: jazzy pins `numpy<2`).

## Notes

- **`density=16`** is the default because cost grows roughly quadratically with the number of surface points (the autocorrelation is O(points²)) and accuracy plateaus there; a lower density (12) degrades generalisation, while higher densities (24 or 32) cost more for no measurable gain.
- The normalized autocorrelation is **density-invariant**, so changing `density` does not silently rescale features.
- Conformer count `K` is data-dependent (ETKDG + RMS pruning). Set `prune_rms=0` for a fixed `K`.
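
The density-invariance point can be checked on a toy example. The sketch assumes the normalised autocorrelation is the mean of pair products in a distance bin and the legacy form is their plain sum (the area weighting is omitted for brevity); `autocorr_bin` is a hypothetical helper, not CASPER's code.

```python
from itertools import combinations
from math import dist

def autocorr_bin(coords, values, lo, hi, normalize=True):
    """Mean (normalize=True) or sum of pair products with separation in [lo, hi)."""
    products = [values[i] * values[j]
                for i, j in combinations(range(len(coords)), 2)
                if lo <= dist(coords[i], coords[j]) < hi]
    if not products:
        return 0.0
    total = sum(products)
    return total / len(products) if normalize else total

coords = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
values = [0.5, -0.5]

# Double the point density by duplicating every surface point
dense_coords, dense_values = coords * 2, values * 2

sparse_norm = autocorr_bin(coords, values, 1.0, 3.0)
dense_norm = autocorr_bin(dense_coords, dense_values, 1.0, 3.0)
sparse_sum = autocorr_bin(coords, values, 1.0, 3.0, normalize=False)
dense_sum = autocorr_bin(dense_coords, dense_values, 1.0, 3.0, normalize=False)
print(sparse_norm, dense_norm, sparse_sum, dense_sum)
```

Doubling the points quadruples the number of pairs in the bin, which is why the summed form scales with density while the mean stays put, and also why cost grows quadratically.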

## License

MIT
