Metadata-Version: 2.4
Name: chemap
Version: 0.3.1
Summary: Library for computing molecular fingerprint based similarities as well as dimensionality reduction based chemical space visualizations. 
License-Expression: MIT
License-File: LICENSE
Author: Florian Huber
Author-email: florian.huber@hs-duesseldorf.de
Requires-Python: >=3.11,<3.14
Classifier: Programming Language :: Python :: 3
Classifier: Operating System :: OS Independent
Provides-Extra: cpu
Provides-Extra: gpu-cu12
Provides-Extra: gpu-cu13
Requires-Dist: cuml-cu12 (>=25.6.0) ; (platform_system == "Linux") and (extra == "gpu-cu12")
Requires-Dist: cuml-cu13 (>=26.0.0) ; (platform_system == "Linux") and (extra == "gpu-cu13")
Requires-Dist: cupy-cuda12x (>=13.0.0) ; (platform_system == "Linux") and (extra == "gpu-cu12")
Requires-Dist: cupy-cuda13x (>=13.0.0) ; (platform_system == "Linux") and (extra == "gpu-cu13")
Requires-Dist: joblib (>=1.3.2)
Requires-Dist: map4 (>=1.1.3)
Requires-Dist: matplotlib (>=3.10.1)
Requires-Dist: numba (>=0.61.2)
Requires-Dist: numpy (>=2.1.0)
Requires-Dist: pandas (>=2.2.1)
Requires-Dist: pooch (>=1.8.2)
Requires-Dist: pynndescent (>=0.5.13) ; extra == "cpu"
Requires-Dist: rdkit (>=2024.9.6)
Requires-Dist: scikit-fingerprints (>=1.15.0)
Requires-Dist: scipy (>=1.14.2)
Requires-Dist: tqdm (>=4.67.1)
Requires-Dist: umap-learn (>=0.5.8) ; extra == "cpu"
Description-Content-Type: text/markdown


<img src="./materials/chemap_logo_green_pink.png" width="400">

![GitHub License](https://img.shields.io/github/license/matchms/chemap?color=00B050)
[![PyPI](https://img.shields.io/pypi/v/chemap?color=00B050)](https://pypi.org/project/chemap/)
![GitHub Actions Workflow Status](https://img.shields.io/github/actions/workflow/status/matchms/chemap/CI_build_and_matrix_tests.yml?color=00B050)
[![Powered by RDKit](https://img.shields.io/badge/Powered%20by-RDKit-3838ff.svg?logo=data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQBAMAAADt3eJSAAAABGdBTUEAALGPC/xhBQAAACBjSFJNAAB6JgAAgIQAAPoAAACA6AAAdTAAAOpgAAA6mAAAF3CculE8AAAAFVBMVEXc3NwUFP8UPP9kZP+MjP+0tP////9ZXZotAAAAAXRSTlMAQObYZgAAAAFiS0dEBmFmuH0AAAAHdElNRQfmAwsPGi+MyC9RAAAAQElEQVQI12NgQABGQUEBMENISUkRLKBsbGwEEhIyBgJFsICLC0iIUdnExcUZwnANQWfApKCK4doRBsKtQFgKAQC5Ww1JEHSEkAAAACV0RVh0ZGF0ZTpjcmVhdGUAMjAyMi0wMy0xMVQxNToyNjo0NyswMDowMDzr2J4AAAAldEVYdGRhdGU6bW9kaWZ5ADIwMjItMDMtMTFUMTU6MjY6NDcrMDA6MDBNtmAiAAAAAElFTkSuQmCC)](https://www.rdkit.org/)
    
# chemap - Mapping chemical space
Library for computing molecular fingerprint-based similarities and dimensionality-reduction-based chemical space visualizations.

## Installation
`chemap` can be installed using pip.
```bash
pip install chemap
```
To include UMAP support on either CPU or GPU, choose one of the following options:
- CPU version: ```pip install "chemap[cpu]"```
- GPU version (CUDA 12): ```pip install "chemap[gpu-cu12]"```
- GPU version (CUDA 13): ```pip install "chemap[gpu-cu13]"```
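
To check which of the optional UMAP backends ended up in your environment, a small sketch like the following can help. It is not part of `chemap`: the helper name `detect_umap_backend` is hypothetical, and it only assumes that the GPU extras install the `cuml` module and the CPU extra installs the `umap` module (from `umap-learn`).

```python
import importlib.util


def detect_umap_backend() -> str:
    """Return "gpu", "cpu", or "none" depending on which optional packages are importable.

    Hypothetical helper for illustration; chemap itself may detect backends differently.
    """
    # Assumption: the gpu-cu12 / gpu-cu13 extras provide the `cuml` module.
    if importlib.util.find_spec("cuml") is not None:
        return "gpu"
    # Assumption: the cpu extra provides the `umap` module via umap-learn.
    if importlib.util.find_spec("umap") is not None:
        return "cpu"
    return "none"


backend = detect_umap_backend()
print(f"Available UMAP backend: {backend}")
```

If this prints `none`, only the base installation is present and one of the extras above is needed before computing UMAP coordinates.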

## Fingerprint computations
Fingerprints can be computed using generators from `RDKit` or `scikit-fingerprints`. 
This includes popular fingerprint types such as:

### Path-based and circular fingerprints
- RDKit fingerprints
- Morgan fingerprints

### Predefined substructure fingerprints
- MACCS fingerprints
- PubChem fingerprints
- Klekota-Roth fingerprints

### Topological distance based fingerprints
- Atom pair fingerprints
- MAP4 fingerprints



Here is a code example:

```python
import numpy as np
import scipy.sparse as sp
from rdkit.Chem import rdFingerprintGenerator
from skfp.fingerprints import MAPFingerprint, AtomPairFingerprint

from chemap import compute_fingerprints, DatasetLoader, FingerprintConfig


ds_loader = DatasetLoader()
smiles = ds_loader.load("tests/data/smiles.csv")

# ----------------------------
# RDKit: Morgan (folded, dense)
# ----------------------------
morgan = rdFingerprintGenerator.GetMorganGenerator(radius=3, fpSize=4096)
X_morgan = compute_fingerprints(
    smiles,
    morgan,
    config=FingerprintConfig(
        count=False,
        folded=True,
        return_csr=False,   # dense numpy
        invalid_policy="raise",
    ),
)
print("RDKit Morgan:", X_morgan.shape, X_morgan.dtype)

# -----------------------------------
# RDKit: RDKitFP (folded, CSR sparse)
# -----------------------------------
rdkitfp = rdFingerprintGenerator.GetRDKitFPGenerator(fpSize=4096)
X_rdkitfp_csr = compute_fingerprints(
    smiles,
    rdkitfp,
    config=FingerprintConfig(
        count=False,
        folded=True,
        return_csr=True,    # SciPy CSR
        invalid_policy="raise",
    ),
)
assert sp.issparse(X_rdkitfp_csr)
print("RDKit RDKitFP (CSR):", X_rdkitfp_csr.shape, X_rdkitfp_csr.dtype, "nnz=", X_rdkitfp_csr.nnz)

# --------------------------------------------------
# scikit-fingerprints: MAPFingerprint (folded, dense)
# --------------------------------------------------
# MAPFingerprint is a MinHash-like fingerprint (different from MAP4 lib).
map_fp = MAPFingerprint(fp_size=4096, count=False, sparse=False)
X_map = compute_fingerprints(
    smiles,
    map_fp,
    config=FingerprintConfig(
        count=False,
        folded=True,
        return_csr=False,
        invalid_policy="raise",
    ),
)
print("skfp MAPFingerprint:", X_map.shape, X_map.dtype)

# ----------------------------------------------------
# scikit-fingerprints: AtomPairFingerprint (folded, CSR)
# ----------------------------------------------------
atom_pair = AtomPairFingerprint(fp_size=4096, count=False, sparse=False, use_3D=False)
X_ap_csr = compute_fingerprints(
    smiles,
    atom_pair,
    config=FingerprintConfig(
        count=False,
        folded=True,
        return_csr=True,
        invalid_policy="raise",
    ),
)
assert sp.issparse(X_ap_csr)
print("skfp AtomPair (CSR):", X_ap_csr.shape, X_ap_csr.dtype, "nnz=", X_ap_csr.nnz)

# (Optional) convert CSR -> dense if you need a NumPy array downstream:
X_ap = X_ap_csr.toarray().astype(np.float32, copy=False)
```
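
The `folded=True` setting used above refers to fixed-size fingerprints. As a rough intuition for what folding means in general, the sketch below maps arbitrary integer feature IDs onto a fixed number of bits with a modulo operation. This is a generic illustration with hypothetical names and made-up feature IDs, not `chemap`'s or RDKit's actual implementation.

```python
def fold_fingerprint(feature_ids, fp_size):
    """Fold arbitrary integer feature IDs into a binary vector of length fp_size."""
    bits = [0] * fp_size
    for feature_id in feature_ids:
        # Different IDs can land on the same position (a "bit collision").
        bits[feature_id % fp_size] = 1
    return bits


# Hypothetical hashed substructure IDs of one molecule (unfolded representation).
unfolded = [3, 1027, 2050, 98765]
folded = fold_fingerprint(unfolded, fp_size=1024)

# 3 and 1027 collide on bit 3, so four features yield only three set bits.
print(len(folded), sum(folded))
```

Larger values of `fp_size` reduce such collisions at the cost of bigger vectors, whereas unfolded fingerprints keep every feature ID distinct.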

## UMAP Chemical Space Visualization
`chemap` provides functions to compute UMAP coordinates based on molecular fingerprints.
Depending on your system and installation, there are two options.
`create_chem_space_umap_gpu` uses the very fast `cuml` library, but only supports "cosine" as
a metric and requires folded (fixed-size) fingerprints.
`create_chem_space_umap` is a numba-based variant that is still optimized, but much slower than
the GPU version. In return, it supports Tanimoto as a metric and can also handle unfolded
fingerprints.
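
To see why the choice between cosine and Tanimoto matters, the following sketch compares both on two binary fingerprints represented as sets of "on" bit indices. The functions and example bit sets are illustrative assumptions and are not taken from `chemap`; returning `0.0` for two empty fingerprints is one possible convention.

```python
import math


def tanimoto(a: set, b: set) -> float:
    """Tanimoto (Jaccard) similarity of two binary fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def cosine(a: set, b: set) -> float:
    """Cosine similarity of two binary fingerprints given as sets of on-bits."""
    denom = math.sqrt(len(a) * len(b))
    return len(a & b) / denom if denom else 0.0


# Made-up on-bit indices of two fingerprints.
fp_a = {1, 5, 9, 42}
fp_b = {5, 9, 42, 77, 128, 300}

print(f"Tanimoto: {tanimoto(fp_a, fp_b):.3f}")  # 3 shared bits / 7 total bits = 0.429
print(f"Cosine:   {cosine(fp_a, fp_b):.3f}")    # 3 / sqrt(4 * 6) = 0.612
```

Both metrics agree on identical and on fully disjoint fingerprints, but they weight partial overlap differently, so the resulting UMAP embeddings can differ.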

Example usage of `create_chem_space_umap`:
```python
from rdkit.Chem import rdFingerprintGenerator
from chemap.plotting import create_chem_space_umap, scatter_plot_hierarchical_labels

data_plot = create_chem_space_umap(
    data_compounds,  # dataframe with smiles and class/subclass etc. information
    col_smiles="smiles",
    inplace=False,
    x_col="x",
    y_col="y",
    fpgen=rdFingerprintGenerator.GetMorganGenerator(radius=9, fpSize=4096),
)

# Plot
fig, ax, _, _ = scatter_plot_hierarchical_labels(
    data_plot,
    x_col="x",
    y_col="y",
    superclass_col="Superclass",
    class_col="Class",
    low_superclass_thres=2500,
    low_class_thres=5000,
    max_superclass_size=10_000,
)
```



