Metadata-Version: 2.4
Name: ChemicalDice
Version: 1.0.5
Summary: A dynamic, high-performance cheminformatics framework integrating 6 distinct molecular embeddings into a robust unified latent representation.
Author: ChemicalDice Team
Author-email: ChemicalDice Team <author@example.com>
License: MIT
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Chemistry
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.20.3
Requires-Dist: pandas>=1.4.3
Requires-Dist: tqdm>=4.65
Requires-Dist: requests>=2.32.4
Requires-Dist: rdkit>=2022.3.1
Requires-Dist: scikit-learn>=1.2.2
Requires-Dist: xgboost>=2.0.0
Requires-Dist: lightgbm>=4.0.0
Provides-Extra: training
Requires-Dist: torch>=2.2.1; extra == "training"
Requires-Dist: h5py>=3.13.0; extra == "training"
Requires-Dist: huggingface-hub>=0.20.0; extra == "training"
Provides-Extra: deployment
Requires-Dist: fastapi>=0.100.0; extra == "deployment"
Requires-Dist: uvicorn>=0.20.0; extra == "deployment"
Requires-Dist: pydantic>=2.0.0; extra == "deployment"
Requires-Dist: torch>=2.2.1; extra == "deployment"
Requires-Dist: huggingface-hub>=0.20.0; extra == "deployment"
Provides-Extra: descriptors
Requires-Dist: signaturizer==1.1.14; extra == "descriptors"
Requires-Dist: descriptastorus==2.6.1; extra == "descriptors"
Requires-Dist: mordred==1.2.0; extra == "descriptors"
Requires-Dist: transformers==4.40.1; extra == "descriptors"
Requires-Dist: multitasking==0.0.11; extra == "descriptors"
Requires-Dist: psutil>=6.0.0; extra == "descriptors"
Provides-Extra: all
Requires-Dist: ChemicalDice[deployment,descriptors,training]; extra == "all"
Dynamic: author
Dynamic: license-file

# **Chemical Dice Integrator (CDI)**  
**CDI (Chemical Dice Integrator)** is a high-performance deep learning framework designed to unify heterogeneous chemical representations into a single, information-rich latent space. By fusing six complementary molecular embeddings, CDI produces a consolidated vector optimized for large-scale cheminformatics, bioinformatics, and AI-driven molecular discovery tasks.

##  **Overview**

CDI extends the **Chemical Dice Integrator** featurization ecosystem by performing unsupervised integration of **six distinct molecular embeddings**:

-  **Quantum Descriptors**  
-  **Bioactivity Signatures**  
-  **Language Model Embeddings**  
-  **Graph-Derived Representations**  
-  **Physicochemical Profiles**  
-  **2D Molecular Image Features**  

Each compound’s six feature types are combined to create a **single latent embedding** that captures chemical, structural, and biological semantics. These embeddings can be directly used for tasks such as **QSAR modeling**, **virtual screening**, **drug-target interaction prediction**, and **bioactivity clustering**.


### **Installation**

#### **1. Prerequisites & System Requirements**
*   **Python** (v3.8 or higher)
*   **RDKit** (v2022.3.1 or higher): [https://www.rdkit.org/](https://www.rdkit.org/)
*   **pandas** (v1.4.3 or higher): [https://pandas.pydata.org/](https://pandas.pydata.org/)
*   **numpy** (v1.20.3 or higher): [https://numpy.org](https://numpy.org)
*   **tqdm** (v4.65 or higher): [https://pypi.org/project/tqdm/](https://pypi.org/project/tqdm/)
*   **requests** (v2.32.4 or higher): [https://pypi.org/project/requests/](https://pypi.org/project/requests/)

#### **2. Install Python Dependencies**

Open a terminal or Jupyter notebook and run the following command to install the required Python packages.

```bash
pip install numpy pandas rdkit tqdm requests
```

#### **3. Install the ChemicalDice Python Package**

```bash
pip install ChemicalDice
```

### **Usage**

#### **Feature Extraction from a CSV File**

The primary function, `smiles_to_embeddings.collect_features_from_csv`, processes a CSV file containing SMILES strings, validates and canonicalizes them, and streams the data to the ChemicalDice API to generate molecular embeddings.



**Step 1: Prepare Your Input CSV**

Your input file must meet the following requirements:

* **Column Name:** The file **must** contain a column named exactly `SMILES`.
* **File Size:** The input file size must not exceed **20 MB**.

**Example `smiles.csv`**:
```csv
SMILES,Compound_ID
CCO,Ethanol
Cc1ccccc1,Toluene
C1CCCCC1,Cyclohexane
```
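
Both requirements can be checked locally before anything is sent to the API. The following is a minimal sketch, not part of the ChemicalDice package: the helper name `check_input_csv` is hypothetical, and it assumes that "20 MB" means 20,000,000 bytes (the stricter reading), which may differ from how the service counts.

```python
import csv
import os

# Assumption: "20 MB" is read as 20,000,000 bytes, the stricter interpretation.
MAX_BYTES = 20 * 1000 * 1000


def check_input_csv(path):
    """Return a list of problems found in the input CSV (empty list: looks fine)."""
    problems = []

    # Requirement: the file must not exceed 20 MB.
    size = os.path.getsize(path)
    if size > MAX_BYTES:
        problems.append(f"file is {size} bytes, above the 20 MB limit")

    # Requirement: a column named exactly `SMILES` (the check is case-sensitive).
    with open(path, newline="", encoding="utf-8") as handle:
        header = next(csv.reader(handle), [])
    if "SMILES" not in header:
        problems.append(f"no column named exactly 'SMILES' (found: {header})")

    return problems
```

Calling `check_input_csv("smiles.csv")` on the example file above should return an empty list; a header such as `smiles` in lowercase would be reported as a problem.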

**Step 2: Run the Feature Extraction**


```python
from ChemicalDice import smiles_to_embeddings

# Generate embeddings from CSV 
CDI_embeddings = smiles_to_embeddings.collect_features_from_csv(
    filepath="smiles.csv",
    convert_to_canonical=True
)

# CDI_embeddings is a pandas.DataFrame; save it to CSV
CDI_embeddings.to_csv("CDI_embeddings.csv", index=False)
```

#### **Function Details: `smiles_to_embeddings.collect_features_from_csv`**

*   **Purpose**: Processes a CSV file to generate molecular feature embeddings.
*   **Input**: Path to a CSV file with a `SMILES` column.
*   **Process**:
    1.  **Validation**: Uses RDKit to validate each SMILES string. Invalid entries are flagged and skipped.
    2.  **Canonicalization (optional)**: The `SMILES` column in your input CSV is converted to canonical SMILES. To skip this step, set the `convert_to_canonical` argument to `False`.
    3.  **Feature Extraction**: The CSV is streamed to the ChemicalDice API, which returns a data frame of molecular features.
*   **Output**: A dataframe where the first column contains the input **SMILES**, other columns correspond to the extracted features, and rows correspond to successfully processed molecules.  
This standardized output can be used directly for downstream tasks such as QSAR modeling, clustering, virtual screening, or integration into machine learning pipelines.
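
For downstream modeling, the saved output usually needs to be split into identifiers and a numeric feature matrix. The sketch below illustrates this using the layout described above (first column SMILES, remaining columns features). The helper name `split_embeddings` is hypothetical, and it assumes every feature column holds numeric values; it uses only the standard library so it runs without extra dependencies.

```python
import csv


def split_embeddings(csv_path):
    """Split a saved embeddings CSV into SMILES, feature names, and a numeric matrix."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        rows = [row for row in reader if row]  # ignore blank lines

    feature_names = header[1:]         # every column after the first
    smiles = [row[0] for row in rows]  # first column holds the SMILES
    # Assumption: all feature values are numeric and parse as floats.
    matrix = [[float(value) for value in row[1:]] for row in rows]
    return smiles, feature_names, matrix
```

With the file written in Step 2, `split_embeddings("CDI_embeddings.csv")` would return the SMILES list alongside a list-of-lists matrix that can be handed to a library such as scikit-learn.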

### **Troubleshooting & Notes**

*   **Backup Your Data**: The input CSV file is modified in-place. Always work on a copy of your **original data** to prevent data loss.
*   **Invalid SMILES**: Molecules with invalid SMILES are skipped during processing and do not appear in the output feature dataframe. To see which entries were invalid, check the function's messages or the `is_valid` column of your overwritten CSV.
*   **Network Connection**: A stable internet connection is required to communicate with the ChemicalDice API.
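
The first two notes can be handled with a small amount of standard-library code. This is a sketch under stated assumptions, not part of the ChemicalDice package: the helper names `make_working_copy` and `rows_flagged_invalid` are hypothetical, and the exact values stored in `is_valid` are not documented here, so the second helper assumes they read as `True`/`False` or `1`/`0`.

```python
import csv
import shutil
from pathlib import Path


def make_working_copy(original, suffix="_working"):
    """Copy the input CSV and return the copy's path, leaving the original untouched."""
    original = Path(original)
    copy_path = original.with_name(f"{original.stem}{suffix}{original.suffix}")
    shutil.copy2(original, copy_path)
    return copy_path


def rows_flagged_invalid(processed_csv):
    """Return the rows of a processed CSV whose `is_valid` value reads as false.

    Assumption: `is_valid` holds values such as True/False or 1/0.
    Returns an empty list if the column is absent.
    """
    falsy = {"false", "0", "no"}
    with open(processed_csv, newline="", encoding="utf-8") as handle:
        return [
            row
            for row in csv.DictReader(handle)
            if "is_valid" in row and str(row["is_valid"]).strip().lower() in falsy
        ]
```

The path returned by `make_working_copy("smiles.csv")` can be passed (as a string) to the `filepath` argument of `smiles_to_embeddings.collect_features_from_csv`, and the same path can afterwards be given to `rows_flagged_invalid` to list the skipped entries.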

For technical issues, please ensure all prerequisites are met and your configuration is correct. For API-related problems, contact the ChemicalDice service administrators.


---

## **CDI Bot**
**Chemical Dice Integrator — Conversational Molecular Embedding Platform**

CDI Bot is a fully containerised, LLM-powered web application that gives researchers and chemists a natural-language interface to the Chemical Dice Integrator (CDI).

> [!TIP]
> **Watch the CDI Bot in action:**
> [![Watch the video](https://img.youtube.com/vi/3NaBBTviEsA/0.jpg)](https://www.youtube.com/watch?v=3NaBBTviEsA)
---

For all other detailed information, please visit our **[complete documentation](https://the-ahuja-lab.github.io/ChemicalDice/)**.
