Metadata-Version: 2.4
Name: molalkit
Version: 1.0.1
Summary: MolALKit: A Toolkit for Active Learning in Molecular Data.
Author-email: Yan Xiang <yan.xiang@duke.edu>
License: The copyrights of this software are owned by Duke University. As such, two licenses for this software are offered:
        1. An open-source license under the GPLv2 license for non-commercial academic use.
        2. A custom license with Duke University, for commercial use or uses without the GPLv2 license restrictions. 
        
        As a recipient of this software, you may choose which license to receive the code under. 
        Outside contributions to the Duke-owned codebase cannot be accepted unless the contributor transfers the copyright to those changes over to Duke University.
        To enter a custom license agreement without the GPLv2 license restrictions, please contact the Digital Innovations department at the 
        Duke Office for Translation & Commercialization (OTC) (https://otc.duke.edu/digital-innovations/) at otcquestions@duke.edu.
        
        Please note that this software is distributed AS IS, WITHOUT ANY WARRANTY; and without the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
        
Project-URL: source, https://github.com/xiangyan93/molalkit
Project-URL: PyPi, https://pypi.org/project/molalkit
Keywords: chemistry,active machine learning,molecular property prediction,drug discovery
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Operating System :: OS Independent
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: mgktools==3.3.0
Requires-Dist: selfies
Requires-Dist: xgboost
Dynamic: license-file


# MolALKit: A Toolkit for Active Learning in Molecular Data.
This software package is a toolkit for active learning on molecular data.

## Installation
Check the GPU and CUDA requirements at [mgktools](https://github.com/Xiangyan93/mgktools) for the marginalized graph kernel model. Non-CUDA installation is not supported.

Python 3.12, CUDA 12.4, and GCC 11.2 are recommended.
### Minimum installation
```
pip install git+https://gitlab.com/Xiangyan93/graphdot.git@v0.8.2 molalkit
```
### Support Chemprop
```
pip install torch==2.6.0 --index-url https://download.pytorch.org/whl/cu124
pip install git+https://github.com/Xiangyan93/chemprop4molalkit.git@v0.0.0
```
### Support GraphGPS
```
pip install torch-scatter torch-sparse torch-geometric pytorch-lightning yacs torchmetrics performer-pytorch ogb git+https://github.com/Xiangyan93/graphgps4molalkit.git@v0.0.0 -f https://data.pyg.org/whl/torch-2.6.0+cu124.html
```
### Support MolFormer
```
pip install transformers pytorch-fast-transformers git+https://github.com/Xiangyan93/molformer4molalkit.git@v0.0.0
APEX_CPP_EXT=1 APEX_CUDA_EXT=1 pip install -v --no-build-isolation git+https://github.com/NVIDIA/apex.git
```
Then download pretrained model at https://github.com/IBM/molformer.

## QuickStart
A GPU is required for the graph kernel model.
Setting up the environment and running the demo takes about 10 minutes.
- [Google Colab notebook](https://colab.research.google.com/drive/11thNx7RkGbGMe_TgieWCEShYy-h32Khv?usp=sharing).

## Data
**MolALKit** currently supports active learning only for single-task datasets, which can be either classification or regression tasks.

### Custom Dataset
The data file must be in CSV format with a header row, structured as follows:
```
smiles,p_np
[Cl].CC(C)NCC(O)COc1cccc2ccccc12,1
C(=O)(OC(C)(C)C)CCCc1ccc(cc1)N(CCCl)CCCl,1
...
```
The following arguments are required to run active learning:
```
--data_path <dataset.csv> --smiles_columns <smiles> --targets_columns <target> --task_type <classification/regression>
```
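As a sketch, a minimal dataset file in the required format can be written with the Python standard library (the SMILES strings and column names below are illustrative):

```python
import csv

# Illustrative rows: one SMILES column and one binary target column,
# matching the header format expected by --smiles_columns / --targets_columns.
rows = [
    ("CCO", 1),        # ethanol (label is made up for illustration)
    ("c1ccccc1", 0),   # benzene
    ("CC(=O)O", 1),    # acetic acid
]

with open("dataset.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["smiles", "p_np"])  # header row is required
    writer.writerows(rows)
```

This file would then be passed as `--data_path dataset.csv --smiles_columns smiles --targets_columns p_np --task_type classification`.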

### Public Dataset
The toolkit incorporates several popular public datasets, such as MoleculeNet and TDC, which can be used directly via `--data_public <dataset name>`.

Here is the list of available datasets:
```
from molalkit.data.datasets import AVAILABLE_DATASETS
print(AVAILABLE_DATASETS)
```

### ActiveLearning/Validation Split
Our code supports several methods for splitting data into an active learning set and a validation set. 
The active learning set is used for active learning, and the validation set is used for evaluating the performance of the active learning model.
* **random**  The data will be split randomly.
* **scaffold_order** With this approach, the data is split based on molecular scaffolds, ensuring that the same scaffold never appears in both the active learning and validation sets. 
The scaffold containing the most molecules is placed in the active learning set. This method aligns with the implementation in DeepChem and is independent of random seeds.
* **scaffold_random** In this method, the placement of scaffolds in either the active learning set or the validation set is done randomly. 
This split is dependent on random seeds and introduces an element of randomness into the scaffold split.

The following arguments are required for data split:
```
--split_type <random/scaffold_order/scaffold_random> --split_sizes <active learning set ratio> <validation set ratio> --seed <random seed>
```
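The **scaffold_order** logic described above can be sketched in pure Python. Here molecules are grouped by a precomputed scaffold key (in practice this would be a Bemis-Murcko scaffold computed with RDKit; plain strings stand in for scaffolds here), groups are sorted by size so the largest scaffold lands in the active learning set first, and groups are assigned greedily until the requested ratio is reached. This is an illustrative sketch, not MolALKit's internal implementation:

```python
from collections import defaultdict

def scaffold_order_split(scaffolds, al_ratio):
    """Deterministic scaffold split: the largest scaffold groups go to the
    active learning (AL) set first, until al_ratio of the items is reached.

    scaffolds: list of scaffold keys, one per molecule (index = molecule id).
    Returns (al_indices, val_indices).
    """
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)
    # Largest groups first; member indices break ties deterministically,
    # so the result is independent of any random seed.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g))
    al, val = [], []
    target = al_ratio * len(scaffolds)
    for group in ordered:
        (al if len(al) + len(group) <= target else val).extend(group)
    return al, val

# Toy example: scaffold "A" is the most common, so its molecules are
# placed in the active learning set; no scaffold spans both sets.
al, val = scaffold_order_split(["A", "A", "A", "B", "B", "C"], al_ratio=0.5)
```

A **scaffold_random** variant would instead shuffle `ordered` with a seeded random generator before the greedy assignment, which is why that split depends on `--seed`.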

## Machine Learning Model
The machine learning model used in this package is specified by a JSON config file. 
Here is the list of built-in machine learning models:
```
from molalkit.models.configs import AVAILABLE_MODELS
print(AVAILABLE_MODELS)
```
The model config files are placed in [molalkit/models/configs](https://github.com/RekerLab/MolALKit/tree/main/molalkit/models/configs). 
The following arguments are required for choosing machine learning models:
```
--model_configs <model_config_file>
```

## First Example
Here's an example of running active learning using MolALKit with the BACE dataset, a 50:50 scaffold split, and Random Forest as the machine learning model:
```
molalkit_run --data_public bace --metrics roc_auc mcc accuracy precision recall f1_score --model_configs RandomForest_Morgan_Config --split_type scaffold_order --split_sizes 0.5 0.5 --evaluate_stride 10 --seed 0 --save_dir bace --init_size 2 --select_method explorative --s_batch_size 1 --max_iter 100
```
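The `--metrics` flag above requests several classification metrics. As a reminder of one of them, the Matthews correlation coefficient (`mcc`) over binary predictions can be computed as follows (a pure-Python sketch of the standard formula, not MolALKit's internal implementation):

```python
from math import sqrt

def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0/1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Conventionally defined as 0 when any confusion-matrix margin is empty.
    return (tp * tn - fp * fn) / denom if denom else 0.0

print(mcc([1, 0, 1, 0], [1, 0, 1, 0]))  # perfect predictions -> 1.0
```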
