Metadata-Version: 2.4
Name: molgen
Version: 0.1.0
Summary: A lightweight toolkit for de novo molecular generation (SMILES/SELFIES; CharRNN, MolGPT, VAE)
Author: Daoyuan Li
License: MIT
Project-URL: Homepage, https://github.com/DaoyuanLi2816/Molecule-Generator
Project-URL: Repository, https://github.com/DaoyuanLi2816/Molecule-Generator
Project-URL: Issues, https://github.com/DaoyuanLi2816/Molecule-Generator/issues
Keywords: molecular-generation,smiles,vae,cheminformatics,deep-learning,drug-discovery
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Chemistry
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch>=2.0
Requires-Dist: numpy>=1.21
Requires-Dist: pandas>=1.5
Requires-Dist: scikit-learn>=1.0
Requires-Dist: rdkit>=2022.9
Requires-Dist: tqdm>=4.60
Provides-Extra: selfies
Requires-Dist: selfies>=2.1; extra == "selfies"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: ruff>=0.4; extra == "dev"
Requires-Dist: pre-commit>=3.5; extra == "dev"
Dynamic: license-file

<p align="center">
  <img src="https://raw.githubusercontent.com/DaoyuanLi2816/Molecule-Generator/main/docs/banner.svg" alt="molgen — lightweight de novo molecular generation: SMILES and SELFIES tokenizers, Transformer β-TC-VAE / CharRNN / MolGPT generators, MOSES-style evaluation." width="880">
</p>

<div align="center">

[![CI](https://github.com/DaoyuanLi2816/Molecule-Generator/actions/workflows/ci.yml/badge.svg)](https://github.com/DaoyuanLi2816/Molecule-Generator/actions/workflows/ci.yml)
[![Python](https://img.shields.io/badge/python-3.9%2B-blue.svg)](https://www.python.org/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)
[![Code style: ruff](https://img.shields.io/badge/code%20style-ruff-000000.svg)](https://github.com/astral-sh/ruff)

</div>

A lightweight, modern toolkit for **de novo molecular generation** with deep
sequence models. It provides atom-level SMILES and SELFIES tokenizers, several
generator architectures, a mixed-precision training loop, configurable
sampling, and a MOSES-style evaluation suite. It is small enough to train on a
single GPU in minutes while still reflecting current practice.

![molgen — a molecular generation pipeline: input, tokenize, model, sample, evaluate](https://raw.githubusercontent.com/DaoyuanLi2816/Molecule-Generator/main/assets/pipeline.png)

## Features

- **Representations** — atom-aware regex SMILES tokenizer and a SELFIES
  tokenizer (every sequence decodes to a valid molecule).
- **Models** — a Transformer β-TC-VAE, a GRU/LSTM `CharRNN`, and a
  decoder-only `MolGPT`.
- **Training** — teacher-forced loop with AdamW, gradient clipping, and
  automatic mixed precision (AMP) on CUDA.
- **Sampling** — autoregressive generation with temperature, top-k, and
  top-p (nucleus) filtering.
- **Metrics** — validity, uniqueness, novelty, internal diversity, unique
  scaffolds, SNN, and QED / logP / MW / SA-score property summaries.
- **Tooling** — `molgen` CLI, a bundled sample dataset, tests, CI, and ruff.
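
To illustrate what "atom-aware" tokenization means, here is a minimal sketch
using only the standard library. The pattern and the `tokenize_smiles` helper
are assumptions made for this example; the regex used by `SmilesTokenizer` may
differ.

```python
import re

# Hypothetical pattern for illustration only; molgen's own regex may differ.
# Order matters: bracket atoms and two-letter halogens are tried before
# single letters, so "Cl" is one token rather than "C" followed by "l".
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|%\d{2}|[A-Za-z]|\d|[=#\-+\\/:~@?>*$().])"
)


def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into atom-level tokens."""
    tokens = SMILES_TOKEN_PATTERN.findall(smiles)
    # findall silently skips unmatched characters, so check that the tokens
    # reproduce the input exactly.
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in SMILES: {smiles!r}")
    return tokens


print(tokenize_smiles("C[C@@H](N)C(=O)O"))
```

A character-level tokenizer would split `Cl` and `[C@@H]` into several
meaningless symbols; keeping them whole gives the model one token per atom.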

## Installation

```bash
git clone https://github.com/DaoyuanLi2816/Molecule-Generator.git
cd Molecule-Generator
pip install -e .            # add ".[selfies]" for SELFIES, ".[dev]" for tests
```

## Quickstart (Python)

```python
from molgen.data import build_dataloaders, load_sample_smiles
from molgen.tokenizers import SmilesTokenizer
from molgen.molgpt import MolGPT
from molgen.trainer import TrainConfig, train_language_model
from molgen.sampling import sample
from molgen.metrics import evaluate_generation

smiles = load_sample_smiles()                      # bundled sample, or your own list
tokenizer = SmilesTokenizer.from_smiles(smiles)
train_loader, val_loader = build_dataloaders(smiles, tokenizer, augment=True)

model = MolGPT(tokenizer.vocab_size, pad_idx=tokenizer.pad_id)
train_language_model(model, train_loader, val_loader, TrainConfig(epochs=20), pad_idx=tokenizer.pad_id)

generated = sample(model, tokenizer, num_samples=1000, top_p=0.95)
print(evaluate_generation(generated, reference=smiles))
```
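
The `top_p=0.95` argument above enables nucleus sampling. The sketch below
shows, in plain Python, how temperature, top-k, and top-p filtering can reshape
a next-token distribution. `filter_probs` and `sample_token` are hypothetical
helpers written for this example; `molgen.sampling` operates on tensors and may
order or combine the filters differently.

```python
import math
import random


def filter_probs(logits, temperature=1.0, top_k=None, top_p=None):
    """Return a renormalized distribution after temperature/top-k/top-p filtering."""
    # Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Token indices sorted from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        order = order[:top_k]  # keep only the k most probable tokens
    if top_p is not None:
        # Keep the smallest prefix whose cumulative probability reaches top_p.
        # Assumption: mass is measured on the pre-top-k probabilities;
        # implementations differ on this detail.
        kept, cumulative = [], 0.0
        for i in order:
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        order = kept

    # Zero out everything that was filtered and renormalize the rest.
    mass = sum(probs[i] for i in order)
    filtered = [0.0] * len(probs)
    for i in order:
        filtered[i] = probs[i] / mass
    return filtered


def sample_token(logits, rng=random, **kwargs):
    """Draw one token index from the filtered distribution."""
    probs = filter_probs(logits, **kwargs)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

Lower temperatures and smaller `top_k` / `top_p` values concentrate sampling on
high-probability tokens, which usually raises validity at the cost of diversity.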

## Quickstart (CLI)

```bash
molgen train  --data molecules.smi --model molgpt --epochs 20 --out model.pt
molgen sample --checkpoint model.pt --num 1000 --top-p 0.95 --out generated.smi
molgen eval   --generated generated.smi --reference molecules.smi
```

## Example output

Training `MolGPT` on the bundled (synthetic) sample and sampling 300 molecules
produces a report like:

```text
n_generated: 300
validity: 0.30
uniqueness: 0.96
novelty: 0.90
internal_diversity: 0.90
unique_scaffolds: 0.32
snn: 0.47
properties: {'qed': 0.52, 'logp': 1.71, 'mol_weight': 133.2, 'sa_score': 2.70}
```

These numbers reflect the tiny bundled sample; train on a larger dataset such
as MOSES, QM9, or ZINC for stronger models. (SELFIES mode guarantees 100%
validity.)
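
For readers unfamiliar with the report fields, the sketch below shows how
`uniqueness` and `novelty` are commonly defined in MOSES-style evaluation. It
assumes the inputs are already valid, canonical SMILES strings; the two
functions are illustrative stand-ins, and `evaluate_generation` may compute
these quantities differently.

```python
def uniqueness(valid_smiles):
    """Fraction of valid molecules that are distinct."""
    if not valid_smiles:
        return 0.0
    return len(set(valid_smiles)) / len(valid_smiles)


def novelty(valid_smiles, reference):
    """Fraction of distinct valid molecules that do not appear in the reference set."""
    unique = set(valid_smiles)
    if not unique:
        return 0.0
    return len(unique - set(reference)) / len(unique)


# Assumed to be canonical SMILES already; comparing non-canonical strings
# would count the same molecule more than once.
generated = ["CCO", "CCO", "CCN", "c1ccccc1"]
reference = ["CCO", "CCC"]
print(uniqueness(generated))          # 3 distinct out of 4 generated
print(novelty(generated, reference))  # 2 of the 3 distinct molecules are new
```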

## Visualizations

Both figures come from **real model output** and are reproducible with
`python scripts/make_figures.py` (trains a SELFIES MolGPT on the bundled sample).

**Generated molecules** — structures sampled directly from the trained model:

![Molecules generated by the model](https://raw.githubusercontent.com/DaoyuanLi2816/Molecule-Generator/main/assets/generated_molecules.png)

**Goal-directed generation** — from a *single* base model, fine-tuning toward
the most (or least) drug-like molecules steers the generated QED distribution
in **both** directions (a ~0.15 QED span) and moves the samples through QED-vs-SA
property space. The model can be steered toward a target rather than only
imitating its training data:

![Bidirectional QED steering and movement through QED–SA property space](https://raw.githubusercontent.com/DaoyuanLi2816/Molecule-Generator/main/assets/controlled_generation.png)

## Models

| Model | Module | Description |
|-------|--------|-------------|
| `CharRNN` | `molgen.char_rnn` | GRU/LSTM next-token language model (classic strong baseline) |
| `MolGPT` | `molgen.molgpt` | Decoder-only Transformer with causal attention |
| `BetaTCVAE` | `molgen.vae` | Transformer VAE for reconstruction and latent interpolation |

Both `CharRNN` and `MolGPT` train and sample through the same trainer/sampler.

## Latent-space exploration (VAE)

The original VAE workflow is still available for generating molecules near a
seed or interpolating between two molecules in latent space:

![VAE encoder, latent space, and decoder](https://raw.githubusercontent.com/DaoyuanLi2816/Molecule-Generator/main/molecule.png)

```bash
python -m molgen.synthetic     # build a synthetic dataset (molecules.csv)
python -m molgen.vae           # train the VAE
python -m molgen.generate      # perturb the latent space
python -m molgen.interpolate   # interpolate between two molecules
```
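
Conceptually, interpolating between two molecules means encoding both to latent
vectors, walking along the line between them, and decoding each intermediate
point. The sketch below covers only the middle step, using plain Python lists.
`interpolate_latents` is a hypothetical helper for illustration, not
necessarily what `molgen.interpolate` does internally.

```python
def interpolate_latents(z_start, z_end, num_steps):
    """Return num_steps latent vectors spaced evenly from z_start to z_end."""
    if len(z_start) != len(z_end):
        raise ValueError("latent vectors must have the same dimension")
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2 to include both endpoints")
    path = []
    for step in range(num_steps):
        t = step / (num_steps - 1)  # goes from 0.0 to 1.0 inclusive
        path.append([(1 - t) * a + t * b for a, b in zip(z_start, z_end)])
    return path


# Toy 2-dimensional latents; a real VAE latent has many more dimensions.
for z in interpolate_latents([0.0, 2.0], [1.0, 0.0], num_steps=3):
    print(z)
```

Each vector on the path would then be passed to the VAE decoder to obtain a
molecule that blends features of the two endpoints.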

## Project structure

```text
molgen/
├── chem.py              # validity / canonicalization / randomization (RDKit)
├── tokenizers.py        # atom-level regex SMILES tokenizer
├── selfies_tokenizer.py # SELFIES tokenizer (always-valid decoding)
├── data.py              # SmilesDataset, padding collate, augmentation, sample loader
├── synthetic.py         # synthetic dataset generators
├── vae.py               # Transformer β-TC-VAE
├── char_rnn.py          # GRU/LSTM language model
├── molgpt.py            # decoder-only Transformer
├── trainer.py           # AMP training loop
├── sampling.py          # temperature / top-k / top-p decoding
├── metrics.py           # validity, novelty, diversity, scaffolds, SNN, report
├── properties.py        # QED / logP / MW / SA score
├── checkpoint.py        # save & load model + tokenizer
├── cli.py               # `molgen` command-line interface
└── datasets/            # bundled sample SMILES
```

## Notes

The bundled `load_sample_smiles()` set is **synthetic** (assembled from
fragments) and intended for examples and tests; for real results, train on a
dataset such as MOSES, QM9, or ZINC. SELFIES mode guarantees 100% validity;
SMILES mode tends to learn the data distribution more faithfully.

## Contributing

Contributions are welcome — see [CONTRIBUTING.md](CONTRIBUTING.md). Please run
`ruff check .`, `ruff format .`, and `pytest` before opening a pull request.

## Citation

If you use this toolkit in your work, please cite it via the **Cite this
repository** button on GitHub (metadata in [`CITATION.cff`](CITATION.cff)).

## License

This project is licensed under the MIT License. See [LICENSE](LICENSE).
