Metadata-Version: 2.4
Name: protein_mood
Version: 1.0.2
Summary: Multi-objective optimization for protein design
Author-email: Albert Cañellas <acanella@bsc.es>
License: GPL-3.0-or-later
Project-URL: Homepage, https://github.com/AlbertCS/multiObjectiveOptimizationDesign
Project-URL: Repository, https://github.com/AlbertCS/multiObjectiveOptimizationDesign
Keywords: protein design,multi-objective optimization,genetic algorithm,bioinformatics
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: GNU General Public License v3 or later (GPLv3+)
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy
Requires-Dist: pandas
Requires-Dist: scipy>=1.10
Requires-Dist: scikit-learn>=1.3
Requires-Dist: matplotlib>=3.7
Requires-Dist: seaborn>=0.12
Requires-Dist: biopython>=1.81
Requires-Dist: overrides>=7.4
Requires-Dist: tqdm>=4.66
Requires-Dist: icecream>=2.1
Provides-Extra: esm
Requires-Dist: torch>=2.0; extra == "esm"
Requires-Dist: transformers>=4.37; extra == "esm"
Provides-Extra: rosetta
Requires-Dist: pyrosetta-installer>=0.1.2; extra == "rosetta"
Provides-Extra: dev
Requires-Dist: pytest>=7.4; extra == "dev"
Requires-Dist: ipykernel>=6.26; extra == "dev"
Requires-Dist: jupyterlab>=4.0; extra == "dev"
Provides-Extra: all
Requires-Dist: protein_mood[dev,esm,rosetta]; extra == "all"
Dynamic: license-file

# multiObjectiveDesign
MultiObjective Design for Protein Engineering

## Overview

`multiObjectiveDesign` is a Python toolkit for running iterative, multi-objective
optimisation loops on protein sequences.  At its core it combines a genetic
algorithm with a pluggable catalogue of metrics (e.g. ProteinMPNN, PyRosetta,
FrustraR), allowing you to trade off stability, designability, frustration and
custom objectives while keeping full visibility into each iteration.

## Installation

The core package is a standard Python project defined by `pyproject.toml` and
can be installed with either `pip` or `mamba`/`conda`.

### With pip

```bash
# from PyPI (once published)
pip install protein_mood

# from a clone of the repository
pip install .

# editable / development install
pip install -e ".[dev]"

# straight from GitHub
pip install "git+https://github.com/AlbertCS/multiObjectiveOptimizationDesign.git"
```

A plain install pulls in only the lightweight scientific stack (numpy, pandas,
scipy, scikit-learn, matplotlib, seaborn, biopython, overrides, tqdm,
icecream). The heavier predictors are grouped into optional extras:

| Extra      | Enables                                              |
| ---------- | ---------------------------------------------------- |
| `esm`      | Deep-learning metrics — `torch`, `transformers` (ESM2, ESMC, ESMFold2, LigandMPNN) |
| `rosetta`  | `pyrosetta-installer` for the PyRosetta-based metrics |
| `dev`      | `pytest`, `ipykernel`, `jupyterlab`                  |
| `all`      | `esm` + `rosetta` + `dev`                            |

```bash
pip install ".[esm]"      # deep-learning metrics
pip install ".[all]"      # everything
```

**PyRosetta** is not distributed on PyPI. After installing the `rosetta`
extra, download the wheel with:

```bash
python -m pyrosetta_installer  # or: pyrosetta-installer
```

### With mamba / conda

An `environment.yml` is provided that builds an env named `mood` with the core
stack (plus notebook tooling and PyTorch) and pip-installs the package in
editable mode:

```bash
mamba env create -f environment.yml   # or: conda env create -f environment.yml
mamba activate mood
```

Once the conda-forge package is published you can also install it directly:

```bash
mamba install -c conda-forge protein_mood
```

For fully-pinned, reproducible environments used on HPC, see the lock files in
[`configs/`](configs/) (`mood-dev.yml`, `mood-esmc.yml`, `mood-esmfold2.yml`).

### ProteinMPNN model weights

The ProteinMPNN weights (~70 MB) are **not** bundled with the package. The
first time a ProteinMPNN metric runs, the required `.pt` file is downloaded,
checksum-verified and cached under `~/.cache/mood/` (override with
`$MOOD_CACHE_DIR`). The following environment variables control this behaviour:

| Variable | Effect |
| -------- | ------ |
| `MOOD_PROTEINMPNN_WEIGHTS` | Directory of pre-staged weights to use as-is — **no download**. Use this on offline/air-gapped clusters (e.g. MareNostrum). Layout: `<type>_model_weights/<name>.pt`. |
| `MOOD_CACHE_DIR` | Where downloaded weights are cached (default `~/.cache/mood`). |
| `MOOD_PROTEINMPNN_BASE_URL` | Base URL to download from (default: upstream `dauparas/ProteinMPNN`). |

Passing `path_to_model_weights` to the ProteinMPNN metric bypasses all of the
above and uses that directory directly.
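
The precedence described above can be summarised in a few lines. This is a minimal sketch of that lookup order, not the package's actual code: the helper name `resolve_weights_root` and its return shape are hypothetical, and only the environment variable names, the default cache location and the `path_to_model_weights` argument come from the text above.

```python
import os
from pathlib import Path
from typing import Mapping, Optional, Tuple


def resolve_weights_root(
    path_to_model_weights: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Tuple[Path, bool]:
    """Return (directory, may_download) following the documented precedence.

    Hypothetical helper for illustration; it is not part of the mood API.
    """
    env = os.environ if env is None else env

    # 1. An explicit argument bypasses every environment variable.
    if path_to_model_weights:
        return Path(path_to_model_weights), False

    # 2. Pre-staged weights are used as-is and never trigger a download.
    staged = env.get("MOOD_PROTEINMPNN_WEIGHTS")
    if staged:
        return Path(staged), False

    # 3. Otherwise fall back to the cache directory, where a download may happen.
    cache = env.get("MOOD_CACHE_DIR")
    root = Path(cache) if cache else Path.home() / ".cache" / "mood"
    return root, True


if __name__ == "__main__":
    root, may_download = resolve_weights_root()
    print(root, "(download allowed)" if may_download else "(offline)")
```

Under these assumptions, exporting `MOOD_PROTEINMPNN_WEIGHTS` before launching a job on an air-gapped cluster is enough for the sketch to report an offline directory.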

### Verify the install

```bash
mood --help                 # console entry point
python -c "import mood; print('ok')"
```
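
The same check can be run from Python, which is convenient inside a notebook or a job script. The sketch below uses only the standard library; `protein_mood` is the distribution name from the package metadata, `mood` is the import name used above, and the two helper names are made up for illustration.

```python
from importlib.metadata import PackageNotFoundError, version
from importlib.util import find_spec
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


def has_module(module_name: str) -> bool:
    """Report whether a top-level module can be found, without importing it."""
    return find_spec(module_name) is not None


if __name__ == "__main__":
    print("protein_mood:", installed_version("protein_mood"))
    # torch and transformers are only present when the `esm` extra is installed.
    for name in ("mood", "torch", "transformers"):
        print(f"{name}: {'available' if has_module(name) else 'missing'}")
```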

## Versioning & releases

The version is **derived from git tags** by
[`setuptools-scm`](https://setuptools-scm.readthedocs.io), so there is no version
number to edit by hand. A release is cut by tagging `vX.Y.Z`; the
[`.github/workflows/release.yml`](.github/workflows/release.yml) workflow does
this automatically on every push to `main`, choosing the bump as follows:

- default for any merge / commit → **patch** (`1.0.0 → 1.0.1`)
- PR labelled `minor` → **minor** (`1.0.0 → 1.1.0`)
- PR labelled `major` → **major** (`1.9.0 → 2.0.0`)
- manual runs (`workflow_dispatch`) let you pick the bump

The workflow creates the tag and a GitHub Release; the version is also embedded
in the built sdist so PyPI/conda-forge builds resolve it without git metadata.
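
The bump rules listed above follow ordinary semantic-versioning arithmetic. The following sketch reproduces the three examples; `bump_version` is a hypothetical helper written for illustration, and the real tagging is done by the release workflow rather than by any function in this package.

```python
def bump_version(current: str, bump: str = "patch") -> str:
    """Return the next X.Y.Z version for a 'major', 'minor' or 'patch' bump."""
    # Accept both "1.0.0" and the tag form "v1.0.0".
    major, minor, patch = (int(part) for part in current.lstrip("v").split("."))
    if bump == "major":
        return f"{major + 1}.0.0"
    if bump == "minor":
        return f"{major}.{minor + 1}.0"
    if bump == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown bump type: {bump!r}")


if __name__ == "__main__":
    print(bump_version("1.0.0"))            # default bump is patch
    print(bump_version("1.0.0", "minor"))
    print(bump_version("1.9.0", "major"))
```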

## Project layout

- `mood/multiObjectiveOptimization.py` – high-level orchestration that prepares
  folders, restores previous runs, evaluates metrics and persists artefacts.
- `mood/optimizers/` – optimisation strategies.  Currently the genetic
  algorithm is implemented with modular crossover/mutation helpers and a rich
  mutation biasing subsystem.
- `mood/metrics/` – collection of metric classes with a shared interface.  Each
  metric computes a dataframe of scores for the candidate sequences and exposes
  selection orientation (min/max) metadata used during ranking.
- `mood/base/` – lightweight data structures (sequences, logging, state) shared
  across the codebase.
- `mood/utils/` – utilities for structure handling, plotting, ProteinMPNN
  wrappers and misc helpers required by the optimiser/metrics.
- `configs/` – ready-to-run JSON configurations demonstrating typical setups.
- `tests/` – pytest/unittest suites covering the core pieces (GA, selection
  strategies, CLI integration, metrics) plus example notebooks for exploratory
  runs.

For a detailed catalogue of available metrics, their objectives, and example
configurations, see the [Metrics overview](docs/METRICS.md).

## CLI Generator

The CLI *generates* ready-to-run replica scripts rather than running an optimisation itself. Given a JSON or YAML config, it produces:

- `setUp_<folder_name>_<replica>.py` scripts mirroring our manual setup style.
- A SLURM array runner that dispatches the correct setup per `SLURM_ARRAY_TASK_ID`.
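
To make the dispatch idea concrete, here is a small Python sketch of how an array task could be mapped to its setup script. It is an illustration only: the generated runner is a shell script, the helper name is hypothetical, and the assumption that the replica suffix equals `SLURM_ARRAY_TASK_ID` is a guess rather than documented behaviour.

```python
import os
from typing import Mapping, Optional


def setup_script_for_task(
    folder_name: str, env: Optional[Mapping[str, str]] = None
) -> str:
    """Pick the setUp_<folder_name>_<replica>.py script for the current array task.

    Assumes (hypothetically) that the replica index is the SLURM array task id.
    """
    env = os.environ if env is None else env
    task_id = env.get("SLURM_ARRAY_TASK_ID")
    if task_id is None:
        raise RuntimeError("SLURM_ARRAY_TASK_ID is not set; run inside a SLURM array job")
    return f"setUp_{folder_name}_{int(task_id)}.py"


if __name__ == "__main__":
    # Simulate array task 3 without needing a scheduler.
    print(setup_script_for_task("toy_example", {"SLURM_ARRAY_TASK_ID": "3"}))
```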

### Quick start

```bash
python3 -m mood.cli \
  --config configs/toy_example.json \
  --replicas 2 \
  --seed-start 1234 \
  --seed-step 1
```

Outputs are written to `folder_name/` (or to the location given by `--output-prefix`). Each replica inherits your config, with `{seed}` placeholders replaced by `seed-start + index * seed-step`.
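
The seed formula can be illustrated with a short sketch. It assumes the replica index starts at 0, which the text does not state, and the helper names and the `run_label` config key are invented for the example; only the `{seed}` placeholder and the formula itself come from the description above.

```python
import json
from typing import List


def replica_seeds(seed_start: int, seed_step: int, replicas: int) -> List[int]:
    """Compute seed-start + index * seed-step for index = 0 .. replicas - 1."""
    return [seed_start + index * seed_step for index in range(replicas)]


def render_config(template: str, seed: int) -> str:
    """Substitute every literal {seed} placeholder in a config template."""
    return template.replace("{seed}", str(seed))


if __name__ == "__main__":
    # Hypothetical config fragment; real configs will have different keys.
    template = '{"run_label": "toy_{seed}"}'
    for seed in replica_seeds(seed_start=1234, seed_step=1, replicas=2):
        print(json.loads(render_config(template, seed)))
```

Plain string replacement is used here instead of `str.format` so that the braces of the surrounding JSON do not need escaping.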

### HPC-friendly generation

Provide a preamble snippet and Python interpreter to match your cluster:

```bash
python3 -m mood.cli \
  --config configs/toy_example.json \
  --replicas 4 \
  --seed-start 1235 \
  --python-exec /path/to/conda/env/bin/python \
  --preamble-file configs/runner_preamble.sh \
  --ntasks 80 --cpus-per-task 1 --time 02-00:00:00
```

Submit the generated `folder_name/runner_array.sh` via `sbatch`. Re-run the CLI with `--overwrite` to refresh existing scripts. Check `python3 -m mood.cli --help` for the full option list.

## Typical workflow

1. **Prepare inputs** – provide a native PDB (or scaffold), choose metrics, and
   declare mutable/fixed positions in your config.  The metrics module exposes
   helpers to pre-compute ProteinMPNN priors or frustration files if needed.
2. **Tune optimisation knobs** – set population size, mutation/crossover cycle,
   parent-selection strategy (rank, crowding, objective bias) and iteration
   count in the configuration file.
3. **Generate runners or integrate directly** – use the CLI generator to emit
   replica setup scripts for HPC runs, or instantiate
   `MultiObjectiveOptimization` directly inside your own pipeline.
4. **Inspect outputs** – each iteration folder contains pickled sequences,
   per-chain dataframes, optional Pareto plots, and the metric-specific raw
   artefacts.  The most recent iteration can be resumed without losing
   progress.
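
The Pareto plots mentioned in step 4 rest on the notion of non-dominated candidates. The sketch below shows one way per-metric min/max orientations can be folded into a dominance check. It is a generic, standard-library illustration, not the selection code used by this project: the function names, the dictionary-based score layout and the example metric names are all assumptions.

```python
from typing import Dict, List

Scores = Dict[str, float]


def dominates(a: Scores, b: Scores, orientation: Dict[str, str]) -> bool:
    """True if `a` is no worse than `b` everywhere and strictly better somewhere."""
    strictly_better = False
    for name, direction in orientation.items():
        # Flip the sign of "max" objectives so every comparison is a minimisation.
        sign = 1.0 if direction == "min" else -1.0
        x, y = sign * a[name], sign * b[name]
        if x > y:
            return False
        if x < y:
            strictly_better = True
    return strictly_better


def pareto_front(candidates: List[Scores], orientation: Dict[str, str]) -> List[Scores]:
    """Keep the candidates that no other candidate dominates."""
    return [
        c for c in candidates
        if not any(dominates(other, c, orientation) for other in candidates)
    ]


if __name__ == "__main__":
    # Hypothetical objectives: lower energy is better, higher confidence is better.
    orientation = {"energy": "min", "confidence": "max"}
    population = [
        {"energy": -10.0, "confidence": 80.0},
        {"energy": -12.0, "confidence": 70.0},
        {"energy": -9.0, "confidence": 75.0},  # dominated by the first candidate
    ]
    for candidate in pareto_front(population, orientation):
        print(candidate)
```

This quadratic scan is fine for small populations; larger ones usually call for a dedicated non-dominated sorting routine.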

## Development notes

- All metrics inherit from `mood.metrics.metric.Metric`; to add a new metric,
  implement `compute`, `setup_iterations_inputs`, and `clean`, populating the
  `state` (orientation) and `objectives` lists.
- Optimisers rely on `AlgorithmDataSingleton` to store sequences; when
  implementing alternative algorithms, follow the contract exposed by
  `mood/optimizers/optimizer.py`.
- The repository includes high-level regression tests (`tests/test_top7.py`,
  `tests/test_mood.py`) and targeted unit tests for selection strategies,
  mutation handlers, and metrics.  Run `python -m pytest` before submitting
  changes.

## Further reading

- *Zitzler, Deb, Thiele (2000)* – origin of the ZDT benchmarks used in
  `mood/optimizers/benchmarks.py`.
- ProteinMPNN and PyRosetta official documentation for understanding the
  external predictors invoked by this project.
