Metadata-Version: 2.4
Name: mpp-lumping
Version: 0.9.1
Summary: Lumping tool for state trajectories
Project-URL: Homepage, https://github.com/moldyn/MPP
Project-URL: Issues, https://github.com/moldyn/MPP/issues
Author-email: Felix Guischard <felix0180@web.de>
License: MIT License
        
        Copyright (c) 2025 Biomolecular Dynamics
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
License-File: LICENSE
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Requires-Python: >=3.10
Requires-Dist: anytree>=2.12.1
Requires-Dist: bezier>=2024.6.20
Requires-Dist: fa2-modified>=0.3.10
Requires-Dist: mdtraj>=1.11.0
Requires-Dist: msmhelper>=1.1.1
Requires-Dist: networkx>=3.5
Requires-Dist: numpy>=2.2.5
Requires-Dist: pygpcca>=1.0.4
Requires-Dist: pyyaml>=6.0.2
Requires-Dist: scikit-learn>=1.7.2
Requires-Dist: seaborn>=0.13.2
Requires-Dist: tqdm>=4.67.1
Description-Content-Type: text/markdown

# The Most Probable Path Algorithm for Reducing the Number of States in a State Trajectory
This package implements the most probable path (MPP) algorithm, which is used to coarse-grain the number of discrete states of a Markov process. Based on a microstate trajectory, a Markov state model is estimated utilising [msmhelper](https://github.com/moldyn/msmhelper). The transition probabilities and optional other descriptors are used to combine states in such a way that the final macrostates exhibit a given minimum population and metastability.

In the most basic example, a lumping tree is generated by iteratively selecting the least stable state and merging it with the state to which its transition probability is highest. In a second step, the tree is parsed in reverse order (starting at the root) to identify the macrostates. For details, see publication tbd.
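
The first step can be illustrated with a minimal pure-Python sketch. It is not the package's implementation: it assumes that "least stable" means the lowest self-transition probability, that the input is a transition count matrix at the chosen lag time, and that merging two states adds up their counts. The function names `row_normalize` and `lump_tree` are hypothetical.

```python
def row_normalize(counts):
    """Turn a count matrix into a row-stochastic transition matrix."""
    T = []
    for row in counts:
        total = sum(row)
        T.append([c / total if total else 0.0 for c in row])
    return T


def lump_tree(counts):
    """Return the merge order [(source, target), ...] of a basic lumping.

    counts[i][j] is the number of observed transitions i -> j.
    """
    counts = [list(row) for row in counts]  # work on a copy
    active = list(range(len(counts)))
    merges = []
    while len(active) > 1:
        T = row_normalize(counts)
        # Assumption: the least stable state has the lowest T[i][i].
        src = min(active, key=lambda i: T[i][i])
        # Merge target: the other active state with the highest T[src][j].
        dst = max((j for j in active if j != src), key=lambda j: T[src][j])
        # Merge src into dst: add its row of counts, then its column.
        for j in range(len(counts)):
            counts[dst][j] += counts[src][j]
            counts[src][j] = 0
        for i in range(len(counts)):
            counts[i][dst] += counts[i][src]
            counts[i][src] = 0
        active.remove(src)
        merges.append((src, dst))
    return merges


example_counts = [
    [90, 10, 0],
    [10, 60, 30],
    [0, 5, 95],
]
merge_order = lump_tree(example_counts)
print(merge_order)
```

In this toy example, state 1 has the lowest self-transition probability (0.6) and is merged into state 2, after which state 0 is merged into the combined state. The recorded merge order is what the second step would traverse from the root.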

## Features
- Perform the most probable path (MPP) algorithm on a given microstate trajectory
- Variety of analysis plots
- Multi trajectory support
- Extensions to the basic algorithm
  - Similarity by Kullback-Leibler divergence of transition probabilities
  - Incorporation of the Jensen-Shannon divergence of a feature (e.g. contact distances)
  - Stochastic lumping
- Three levels of user interface
  - as integration in your Python code (use the `MPP.Lumping` or `MPP.run.Data` objects)
  - as Python module (`python -m MPP.run --help`)
  - in your Snakemake workflow (see `workflow`)
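
For orientation, the two divergences named in the extensions above can be written down in a few lines. This is a generic sketch of the textbook definitions, not the package's code: how MPP turns them into a similarity between states is not covered here, and the inputs (rows of a transition matrix, normalized feature histograms) are assumptions for illustration.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence D_KL(p || q) in nats.

    Terms with p_i == 0 contribute nothing; q_i must be > 0 wherever p_i > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by log(2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical transition probability rows of two states.
row_a = [0.7, 0.2, 0.1]
row_b = [0.6, 0.3, 0.1]
print(kl_divergence(row_a, row_b))
print(js_divergence(row_a, row_b))
```

Unlike the Kullback-Leibler divergence, the Jensen-Shannon divergence is symmetric and stays finite even when the two distributions do not overlap.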

## Installation
The package is available in the Python Package Index (PyPI) and can be installed via `pip`:

```bash
python -m pip install mpp-lumping
```

## Usage
Depending on your skills and needs, the module can be used at three different entry points:
- Integrate the central `MPP.Lumping` or `MPP.run.Data` object in your Python pipeline.
- Use the high-level Python interface (`python -m MPP.run ...`) via the command line.
- Run it in a [Snakemake](https://snakemake.github.io/) workflow, where you only need to provide the configuration of your system.

### Config File
Config files (YAML) specify where the input files are located and set the lumping parameters. Below is a reference config file with all possible parameters. Only the following fields are mandatory: `source`, `microstate trajectory`, `multi feature trajectory`, `frame length`, `lagtime`, `pop_thr`, `q_min`. Please refer to the wiki for a detailed description of the parameters (tbd).

```yaml
source: data/HP35/input/ # root directory of the other files

microstate trajectory: microstate_trajectory # the microstate trajectory
multi feature trajectory: contact_distances_trajectory # the feature trajectory, each line contains the feature values of the respective feature
cluster file: clusters # Defines the contact clusters, each row contains the contact indices of a contact cluster
contact index file: contacts.ndx # residue indices for contacts.
limits: null # contains the lengths of each trajectory when multiple trajectories are concatenated

topology file: structure.pdb # topology file used with the xtc trajectory
xtc file: trajectory.xtc # the xtc trajectory file
helices: helices # definition of secondary structure elements

contact threshold: null # Threshold in the feature space below which e.g. a contact is considered to be formed.
frame length: 0.2 # in ns / frame
lagtime: 50 # in frames
pop_thr: 0.005 # population threshold for macrostates
q_min: 0.5 # minimum metastability of macrostates

n timescales: 3 # number of timescales to plot in the implied timescales plot. 3 is the default.

# For stochastic lumping
stochastic:
  method: n
  param: 2
  n: 10

# PyMol rendering
view: view # contains the view information for PyMol
width: 500 # width of the image in px
height: 500 # height of the image in px
```
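
Before starting a run, it can help to check a config for the mandatory fields listed above. The following is a small sketch under assumptions: the helper names are hypothetical and not part of the package, and the dictionary stands in for what `yaml.safe_load` would return for the reference config.

```python
MANDATORY_FIELDS = (
    "source",
    "microstate trajectory",
    "multi feature trajectory",
    "frame length",
    "lagtime",
    "pop_thr",
    "q_min",
)


def missing_mandatory_fields(config):
    """Return the mandatory keys that are absent or set to null."""
    return [key for key in MANDATORY_FIELDS if config.get(key) is None]


def lagtime_in_ns(config):
    """Convert the lag time from frames to ns via the frame length (ns / frame)."""
    return config["lagtime"] * config["frame length"]


# Stand-in for the parsed reference config (mandatory fields only).
config = {
    "source": "data/HP35/input/",
    "microstate trajectory": "microstate_trajectory",
    "multi feature trajectory": "contact_distances_trajectory",
    "frame length": 0.2,
    "lagtime": 50,
    "pop_thr": 0.005,
    "q_min": 0.5,
}

print(missing_mandatory_fields(config))
print(lagtime_in_ns(config))
```

With the reference values, a lag time of 50 frames at 0.2 ns per frame corresponds to 10 ns.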

### Python Module
Running the module requires a configuration file, as described above, as the first parameter. The following two positional arguments define the similarity between two states: first the dynamic similarity (`T`, `KL`, `none`), then the geometric similarity (`JS`, `none`). For reference, G-PCCA can be performed by passing `gpcca` first and then the number of macrostates to create. Pass `ref` instead of a number to take the number of macrostates from the `T none` lumping (in which the similarity between states corresponds to the transition probability between them).

Provide the target file (where to store the plot) with the option `-o`. The lumping tree (the result of the first, potentially computationally intensive step) is defined by a Z matrix, which can be stored and loaded with the `-Z` option: if the provided path exists, the file is loaded as the Z matrix. The same holds for the `--rmsd` option, as the RMSD is expensive to calculate. With the `--rmsd-feature` option, you can select whether the RMSD is calculated for the C-alpha atoms in Cartesian coordinate space or in the space of your feature, e.g. contact distances. `-r` draws N random frames from each macrostate and `-p` produces the desired plots. `--get-least-moving-residues` saves the indices in the state trajectory of the least moving residues per macrostate to a text file. This makes it possible to find the residues that participate in the most stable contact distances of each macrostate.

```bash
~$ python -m MPP.run --help
usage: Perform MPP on MD simulation data [-h] [-o OUT] [-Z Z] [--rmsd RMSD]
                                         [--rmsd-feature RMSD_FEATURE] [--xtc-stride XTC_STRIDE] [-r N]
                                         [-p PLOT]
                                         [--get-least-moving-residues GET_LEAST_MOVING_RESIDUES]
                                         data_specification d g

This program allows for the analysis of MD data utilizing the most probable path algorithm. It allows
for easy plotting of different quality measures.

positional arguments:
  data_specification    yaml file containing specification of files and parameters of the simulation
  d                     dij to be used.
  g                     gij to be used.

options:
  -h, --help            show this help message and exit
  -o OUT, --out OUT     Where to store the plot
  -Z Z                  Perform MPP and write the Z matrix
  --rmsd RMSD           Generate and write RMSD to file
  --rmsd-feature RMSD_FEATURE
                        'CA' for C-alpha RMSD or 'feature' for feature RMSD (default: CA)
  -r N, --draw-random N
                        Draw N random frames for each macrostate
  -p PLOT, --plot PLOT  Generate listed plots. Possible arguments include dendrogram, timescales,
                        sankey, contacts, macrotraj, ck_test, rmsd, delta_rmsd, state_network,
                        macro_feature, stochastic_state_similarity, relative_implied_timescales,
                        transition_matrix, transition_time and macrostate_trajectory. The latter writes
                        the macrostate trajectory to a txt file.
  --get-least-moving-residues GET_LEAST_MOVING_RESIDUES
                        Write least moving residues for each macrostate to a file
```
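
When the module is called from a script, the command line can be assembled programmatically. The sketch below only uses the positional arguments and options shown in the help text above. The helper `build_mpp_command` and all file names are hypothetical, and the command is printed rather than executed.

```python
import shlex
import sys


def build_mpp_command(config, d, g, plot=None, out=None, z_matrix=None):
    """Assemble the argument list for `python -m MPP.run`."""
    cmd = [sys.executable, "-m", "MPP.run", config, d, g]
    if plot is not None:
        cmd += ["-p", plot]
    if out is not None:
        cmd += ["-o", out]
    if z_matrix is not None:
        cmd += ["-Z", z_matrix]
    return cmd


# Hypothetical file names; adapt them to your own data layout.
cmd = build_mpp_command(
    "config.yml", "KL", "JS", plot="dendrogram", out="dendrogram.pdf"
)
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting issues when paths contain spaces.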

### The Snakemake Workflow
Snakemake is a workflow management tool, used here to provide a high-level user interface. In general, you only need to tell Snakemake which file you would like to have, e.g.

```bash
snakemake --cores 'all' --sdm conda -p data/HP35/results/{t,t_js,kl,kl_js}/dendrogram.pdf --cache
```

Explanation of some flags:

- `--cores` Number of cores to utilize. 'all' for all cores.
- `--software-deployment-method, --sdm` Use conda to deploy software environment.
- `--snakefile, -s` Use a non-local snakefile. Use e.g. `-s /data/evaluation/MPP/stochastic_MPP_Felix/tools/MPP/workflow/Snakefile`
- `--dry-run, --dryrun, -n` Do not execute anything, just print out the jobs that would be run.
- `--cache` Enable caching of output files for rules that are eligible for caching.
- `--force, -f` Force recreation of the given file(s).
- `--printshellcmds, -p` Print out the shell commands that will be executed.

More information can be found here: [Snakemake Documentation](https://snakemake.readthedocs.io/en/stable/executing/cli.html)

Note that bash brace expansion (the use of `{` and `}`) can be used to create e.g. several diagrams at once for multiple systems and/or setups.
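
To make explicit which targets the expansion of `{` and `}` in the command above requests, the same list can be produced in Python. This is an illustrative sketch: the helper `expand_targets` is hypothetical, and the path pattern is taken from the example command.

```python
from itertools import product


def expand_targets(systems, setups, filename):
    """List result paths in the order in which bash expands the braces."""
    return [
        f"data/{system}/results/{setup}/{filename}"
        for system, setup in product(systems, setups)
    ]


targets = expand_targets(["HP35"], ["t", "t_js", "kl", "kl_js"], "dendrogram.pdf")
for target in targets:
    print(target)
```

Each of the resulting paths is passed to Snakemake as a separate target, so one call builds the dendrograms for all four setups.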

## Data Directory Structure

```bash
data/
├── System1
│   ├── input
│   │   ├── clusters
│   │   ├── config.yml
│   │   ├── contact_distances_trajectory
│   │   ├── contacts.ndx
│   │   ├── microstate_trajectory
│   │   ├── README.md
│   │   ├── topology.pdb
│   │   ├── trajectory.xtc
│   │   └── view
│   └── results
│       ├── t
│       │   ├── output_file1
│       │   ├── output_file2
│       │   └── ...
│       ├── kl
│
├── System2
```
