Metadata-Version: 2.4
Name: microstateledger
Version: 0.1.1
Summary: Microstate version-control system for drug discovery workflows
Author: MicrostateLedger Team
License-Expression: LicenseRef-Proprietary
Project-URL: Homepage, https://example.invalid/microstateledger
Keywords: cheminformatics,microstate,provenance,rdkit,docking,md
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Operating System :: OS Independent
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: typer>=0.9
Requires-Dist: pyyaml>=6
Requires-Dist: tomli; python_version < "3.11"
Provides-Extra: cheminformatics
Requires-Dist: rdkit-pypi>=2022.9.5; extra == "cheminformatics"
Requires-Dist: dimorphite-dl>=1.2; extra == "cheminformatics"
Provides-Extra: docking
Requires-Dist: meeko>=0.5; extra == "docking"
Provides-Extra: md
Requires-Dist: openmm>=8.0; extra == "md"
Provides-Extra: charges
Requires-Dist: numpy>=1.23; extra == "charges"
Provides-Extra: repro
Requires-Dist: dvc>=3.0; extra == "repro"
Requires-Dist: datalad>=0.18; extra == "repro"
Provides-Extra: dev
Requires-Dist: pytest>=7; extra == "dev"
Requires-Dist: pytest-cov>=4; extra == "dev"
Requires-Dist: build>=1.2; extra == "dev"
Dynamic: license-file

# MicrostateLedger

MicrostateLedger is a microstate version-control system for drug discovery workflows.
It turns protonation, tautomer, stereochemistry, conformation, and mapping decisions into first-class tracked objects.

The project is designed for pipelines that span multiple tools (RDKit -> Docking -> MD -> QM) and need traceability, reproducibility, and team-safe collaboration.

## Table of Contents

- [1. Why MicrostateLedger](#1-why-microstateledger)
- [2. Core Capabilities](#2-core-capabilities)
- [3. System Architecture](#3-system-architecture)
- [4. Repository Layout](#4-repository-layout)
- [5. Requirements](#5-requirements)
- [6. Installation](#6-installation)
- [7. Quick Start](#7-quick-start)
- [8. Configuration](#8-configuration)
- [9. CLI Command Guide](#9-cli-command-guide)
- [10. Data Model and Stable IDs](#10-data-model-and-stable-ids)
- [11. Atom Mapping and Reversibility](#11-atom-mapping-and-reversibility)
- [12. Reproducibility and Auditability](#12-reproducibility-and-auditability)
- [13. Optional Engines and Graceful Degradation](#13-optional-engines-and-graceful-degradation)
- [14. License](#14-license)

## 1. Why MicrostateLedger

In practical molecular workflows, the same compound can appear in many chemically distinct states:

- Protonation states
- Tautomers
- Stereochemical variants (including undefined centers)
- Multiple conformers
- Tool-specific atom indexing or topology representations

Without strict tracking, it is easy to lose track of which state was actually docked, parameterized, or simulated at each stage.
MicrostateLedger addresses this by recording object lineage in a ledger and assigning stable IDs that persist across the full workflow.

## 2. Core Capabilities

- Stable IDs for compounds, microstates, conformers, poses, receptors, and artifacts
- Full provenance in SQLite (`runs`, `decisions`, `anomalies`, `edges`, `artifacts`)
- Policy-driven enumeration and pruning (`Top-k`, caps, stage-specific rules)
- Sidecar atom mapping for cross-software consistency
- Diff utilities for chemistry, geometry, charge, and pipeline execution
- Batch execution with resume support and crash-safe state files
- Optional DVC/DataLad/lwreg integrations for data governance

## 3. System Architecture

MicrostateLedger is organized in three layers:

1. Ledger Core (`msl/`)
   - CLI, DB schema access, config handling, ID generation, and run/audit recording.

2. Tool Drivers (`scripts/`)
   - RDKit ingest/enumeration, conformer generation, docking prep/ingest helpers, charge tools, MD feedback checks.

3. Object Store (`objects/`, `runs/`, `reports/`)
   - All generated artifacts, run outputs, and report files.

Each workflow action writes both files and ledger records, so every output can be traced to inputs, policy, tool, and run context.
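
As a minimal sketch of this "file plus ledger record" pattern, the example below writes a payload to disk and records its path and hash in an in-memory SQLite table. The `artifacts` column names, the `write_artifact` helper, and the `run-001` identifier are assumptions made for illustration; the real schema lives in `schemas/` and may differ.

```python
import hashlib
import sqlite3
import tempfile
from pathlib import Path


def write_artifact(conn, objects_dir, run_id, name, payload):
    """Write payload to disk and record its path and SHA-256 in the ledger."""
    digest = hashlib.sha256(payload).hexdigest()
    path = Path(objects_dir) / name
    path.write_bytes(payload)
    with conn:  # commit on success, roll back on error
        conn.execute(
            "INSERT INTO artifacts (run_id, path, sha256) VALUES (?, ?, ?)",
            (run_id, str(path), digest),
        )
    return digest


# Hypothetical, simplified table; not the actual MicrostateLedger schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE artifacts (run_id TEXT, path TEXT, sha256 TEXT)")

with tempfile.TemporaryDirectory() as objects_dir:
    digest = write_artifact(conn, objects_dir, "run-001", "ligand.sdf", b"demo payload")

rows = conn.execute("SELECT run_id, sha256 FROM artifacts").fetchall()
```

Because the hash is computed from the bytes that were written, the ledger row can later be used to check that a file on disk is still the one the run produced.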

## 4. Repository Layout

- `msl/`: Python package source
- `scripts/`: stage driver scripts
- `schemas/`: SQL schema definitions
- `policies/`: policy templates
- `tests/`: unit and regression tests
- `bin/`: wrapper scripts for environment-isolated execution
- `msl.toml`: default project config
- `README.md`: user guide

## 5. Requirements

Minimum:

- Linux (recommended)
- Python 3.10+
- SQLite 3

Optional tooling by stage:

- RDKit, Dimorphite (enumeration/perception)
- Meeko, AutoDock Vina (docking)
- OpenFF Interchange, OpenMM (MD parameterization/smoke runs)
- PyPE_RESP / external QM tools (charge workflows)
- DVC, DataLad, lwreg (governance integrations)

SQLite operational notes:

- Prefer local SSD paths for `ledger.sqlite`; on NFS and other network mounts, SQLite file locking can be slow or unreliable.
- For high parallelism, use one ledger per job and merge results later via exported artifacts/provenance.
- If you hit `database is locked`, retry with serialized writers (or separate ledgers) instead of forcing shared writes.
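
The serialized-writer advice can be sketched with Python's standard `sqlite3` module: set a busy timeout on the connection and retry with backoff when SQLite still reports a lock. The `execute_with_retry` helper, the `notes` table, and the timing values are illustrative assumptions, not part of the `msl` API.

```python
import os
import sqlite3
import tempfile
import time


def execute_with_retry(db_path, sql, params=(), attempts=5, base_delay=0.1):
    """Run one write statement, retrying with backoff while the database is locked."""
    for attempt in range(attempts):
        conn = sqlite3.connect(db_path, timeout=5.0)  # wait up to 5 s for locks
        try:
            with conn:  # commits on success, rolls back on error
                conn.execute(sql, params)
            return attempt + 1  # number of attempts used
        except sqlite3.OperationalError as exc:
            if "locked" not in str(exc) or attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff
        finally:
            conn.close()


# Demo against a throwaway database file (not a real ledger).
with tempfile.TemporaryDirectory() as tmp:
    db_path = os.path.join(tmp, "ledger.sqlite")
    execute_with_retry(db_path, "CREATE TABLE notes (body TEXT)")
    tries = execute_with_retry(db_path, "INSERT INTO notes VALUES (?)", ("ok",))
    check = sqlite3.connect(db_path)
    stored = check.execute("SELECT body FROM notes").fetchall()
    check.close()
```

Retrying only helps with short-lived contention. With many concurrent writers, separate ledgers per job remain the more robust option.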

## 6. Installation

### 6.1 Install from PyPI (recommended for users)

```bash
python -m pip install MicrostateLedger
```

CLI entrypoint after install:

```bash
msl --help
```

### 6.2 Install from source (development)

```bash
git clone https://github.com/woshuizhaol/MicrostateLedger.git
cd MicrostateLedger
python -m pip install -e .
```

### 6.3 Optional extras

```bash
python -m pip install "MicrostateLedger[cheminformatics]"
python -m pip install "MicrostateLedger[docking]"
python -m pip install "MicrostateLedger[md]"
python -m pip install "MicrostateLedger[charges]"
python -m pip install "MicrostateLedger[repro]"
```

### 6.4 Shared-server wrapper mode

If your team uses per-tool isolated environments, use wrappers in `bin/`:

```bash
./bin/msl --help
```

This mode is useful on shared compute servers where each tool lives in its own environment.

## 7. Quick Start

### 7.1 Initialize the ledger

```bash
msl init
```

### 7.2 Ingest and perceive

```bash
msl ingest "CCO"
msl perceive <compound_id>
```

### 7.3 Enumerate microstates

```bash
msl enumerate <compound_id>
```

### 7.4 Generate conformers

```bash
msl conformers <microstate_id> --n 50
```

### 7.5 Docking preparation and ingest

```bash
msl dock-prep <microstate_id> <conformer_id>
msl dock-ingest <conformer_id> <pose.pdbqt> --score -7.5
msl dock-select <microstate_id> --k 20
msl select-final <microstate_id> --pose-id <pose_id>
```

### 7.6 Charge and MD setup

```bash
msl charge <microstate_id> --method pype_resp --auto-qm --conformer-id <conformer_id>
msl md-param <microstate_id> --charges-json <charges.json>
```

### 7.7 MD feedback and transition tracking

```bash
msl md-feedback <microstate_id> <probe.sdf>
msl md-transition <microstate_id> <probe.sdf>
```

### 7.8 One-command demo

```bash
bash scripts/demo_full_pipeline.sh "CCO" demo_ethanol
```

## 8. Configuration

The default project config file is `msl.toml`.

Typical keys include:

- `ledger_db`: SQLite path
- `objects_dir`: artifact root
- `runs_dir`: run output root
- `reports_dir`: reports root
- `policy_path`: active policy YAML
- `envs.<stage>`: per-stage environment selection
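
To make the key list concrete, here is a hedged sketch that parses an example config with the standard TOML reader. Every value is hypothetical, and writing `envs.<stage>` as an `[envs]` table is an assumption about the file layout; check the `msl.toml` shipped with the project for the real defaults.

```python
try:
    import tomllib  # standard library on Python 3.11+
except ModuleNotFoundError:
    import tomli as tomllib  # backport used on Python 3.10

# Hypothetical example values; not the shipped defaults.
EXAMPLE_CONFIG = """
ledger_db = "ledger.sqlite"
objects_dir = "objects"
runs_dir = "runs"
reports_dir = "reports"
policy_path = "policies/default.yaml"

[envs]
docking = "vina-env"
md = "openmm-env"
"""

config = tomllib.loads(EXAMPLE_CONFIG)

# Per-stage environment lookup with an explicit fallback for unlisted stages.
docking_env = config["envs"].get("docking", "default")
charges_env = config["envs"].get("charges", "default")
```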

Policy behavior is controlled through `policies/default.yaml`, including:

- pH range and enumeration limits
- top-k/cap constraints
- stage-level keep/drop behavior
- optional fallback behavior when tools are missing
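
A top-k constraint with recorded keep/drop outcomes can be sketched as follows. This is an illustration of the idea rather than the project's pruning code: the `apply_top_k` function, the decision fields, and the convention that lower scores are better (as with docking energies) are all assumptions.

```python
def apply_top_k(scored, k):
    """Rank objects by score (lower is better) and record a keep/drop decision for each."""
    ranked = sorted(scored.items(), key=lambda item: item[1])
    decisions = []
    for rank, (object_id, score) in enumerate(ranked, start=1):
        decisions.append(
            {
                "object_id": object_id,
                "score": score,
                "action": "keep" if rank <= k else "drop",
                "reason": f"rank {rank} with top-k={k}",
            }
        )
    return decisions


# Hypothetical pose scores.
scores = {"pose-a": -7.5, "pose-b": -6.1, "pose-c": -8.2}
decisions = apply_top_k(scores, k=2)
kept = [d["object_id"] for d in decisions if d["action"] == "keep"]
dropped = [d["object_id"] for d in decisions if d["action"] == "drop"]
```

Recording a reason for dropped objects as well as kept ones is what makes the pruning auditable later.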

## 9. CLI Command Guide

### 9.1 Core lifecycle

- `msl init`: initialize ledger database
- `msl migrate`: apply schema migrations
- `msl ingest`: register a standardized compound
- `msl perceive`: generate risk signals before expansion
- `msl enumerate`: generate microstates
- `msl conformers`: create conformers

### 9.2 Docking and selection

- `msl receptor-add`: register receptor structure
- `msl dock-prep`: export docking-ready ligand + mapping sidecar
- `msl dock-ingest`: register docking poses and scores
- `msl dock-select`: rank/select top poses
- `msl select-final`: mark the final microstate candidate for downstream stages

### 9.3 Charges and MD

- `msl charge`: generate/import per-atom charges
- `msl md-param`: build MD-ready system artifacts
- `msl md-feedback`: detect MD anomalies and record suggestions
- `msl md-transition`: record observed state transition

### 9.4 Diff and provenance

- `msl diff-microstate`: compare chemistry-level state definitions
- `msl diff-conformer`: compare geometry (RMSD/torsions)
- `msl diff-charge`: compare charge vectors by canonical atom ids
- `msl diff-pipeline`: compare run-level stage/tool/params
- `msl prov-export`: export provenance as PROV-JSON
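
To illustrate what comparing charge vectors by canonical atom id involves, the sketch below keys both charge sets by atom id instead of by position, so tool-specific atom ordering does not affect the result. The `diff_charges` function, the tolerance, and the atom ids are assumptions; the output of `msl diff-charge` itself is not shown here.

```python
def diff_charges(charges_a, charges_b, tol=1e-3):
    """Compare two {canonical_atom_id: charge} maps, independent of atom order."""
    ids_a, ids_b = set(charges_a), set(charges_b)
    deltas = {atom: charges_b[atom] - charges_a[atom] for atom in ids_a & ids_b}
    return {
        "changed": {atom: d for atom, d in sorted(deltas.items()) if abs(d) > tol},
        "only_in_a": sorted(ids_a - ids_b),
        "only_in_b": sorted(ids_b - ids_a),
    }


# Hypothetical charge sets keyed by canonical atom id.
charges_a = {"C1": -0.18, "O1": -0.65, "H1": 0.40}
charges_b = {"O1": -0.60, "C1": -0.18, "H2": 0.41}
report = diff_charges(charges_a, charges_b)
```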

### 9.5 Maintenance and ecosystem

- `msl clean-invalid-microstates`: remove microstate records that failed sanitization
- `msl batch`: resumable batch execution
- `msl dvc-track`, `msl dvc-init`: DVC utilities
- `msl datalad-init`, `msl datalad-save`: DataLad utilities
- `msl lwreg-init`, `msl lwreg-register`: lwreg integration
- `msl demo`: full demonstration pipeline

## 10. Data Model and Stable IDs

Core entities:

- `Compound`
- `Microstate`
- `Conformer`
- `Pose`
- `System`
- `Decision`
- `Anomaly`
- `Edge`
- `Artifact`

ID strategy highlights:

- `CompoundID`: derived from registration hash
- `MicrostateID`: derived from `fixedH_inchi + charge + stereo + coordination signature`
- `ConformerID/PoseID/ArtifactID`: derived from content hash

Under a fixed policy and toolchain, identical inputs yield identical IDs, so intended variability (for example, a different protonation state or conformer) shows up as a distinct, explicitly tracked ID.
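
The ID strategy can be sketched with plain hashing. The `MS-` and `CF-` prefixes, the 16-character truncation, the `|` field separator, and the example field values are assumptions for illustration, not the project's actual ID format; the standard InChI of ethanol stands in for a fixed-H InChI.

```python
import hashlib


def microstate_id(fixedh_inchi, charge, stereo, coordination):
    """Derive a deterministic ID from the fields that define a microstate."""
    key = "|".join([fixedh_inchi, str(charge), stereo, coordination])
    return "MS-" + hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]


def content_id(prefix, payload):
    """Derive a deterministic ID from raw file content (conformer, pose, artifact)."""
    return f"{prefix}-" + hashlib.sha256(payload).hexdigest()[:16]


# Stand-in InChI (standard InChI of ethanol) and hypothetical field values.
inchi = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
neutral = microstate_id(inchi, 0, "none", "none")
neutral_again = microstate_id(inchi, 0, "none", "none")
anion = microstate_id(inchi, -1, "none", "none")
conformer = content_id("CF", b"coordinates block")
```

Here `neutral` and `neutral_again` are identical, while `anion` differs because the charge is part of the hashed key.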

## 11. Atom Mapping and Reversibility

MicrostateLedger uses sidecar mapping files to preserve canonical atom identity across tool conversions.

Typical artifacts:

- Docking: `ligand.map.json`
- MD/topology: `atommap.json`

Goal:

- A canonical atom id can be traced from RDKit representation to docking and MD representations.
- Mapping evidence remains attached to run/artifact records in the ledger.
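
A sidecar mapping and its reverse lookup might look like the sketch below. The JSON layout, the `canonical_to_tool` key, and the atom ids are hypothetical; the actual structure of `ligand.map.json` and `atommap.json` is not specified here.

```python
import json


def invert_mapping(canonical_to_tool):
    """Build the tool-index -> canonical-id map, rejecting ambiguous mappings."""
    inverse = {}
    for canonical_id, tool_index in canonical_to_tool.items():
        if tool_index in inverse:
            raise ValueError(f"tool index {tool_index} is mapped twice")
        inverse[tool_index] = canonical_id
    return inverse


# Hypothetical sidecar content: canonical atom id -> index in the tool's file.
sidecar = {"tool": "docking", "canonical_to_tool": {"a0": 2, "a1": 0, "a2": 1}}

# Round-trip through JSON, as a sidecar file on disk would.
loaded = json.loads(json.dumps(sidecar, sort_keys=True))
tool_to_canonical = invert_mapping(loaded["canonical_to_tool"])

# Reversibility check: every tool index maps back to the atom it came from.
reversible = all(
    loaded["canonical_to_tool"][canonical_id] == tool_index
    for tool_index, canonical_id in tool_to_canonical.items()
)
```

Rejecting duplicate tool indices matters because a many-to-one mapping cannot be reversed, which would break traceability back to the canonical atom.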

## 12. Reproducibility and Auditability

MicrostateLedger records provenance at each stage:

- `runs`: stage, tool, params, status
- `artifacts`: path + hash linkage
- `decisions`: why objects were kept/dropped
- `anomalies`: what failed/drifted and suggested next actions
- `edges`: graph lineage between objects

For reproducibility analysis:

- Use repeated runs under fixed seed/policy and compare IDs/artifact hashes
- Export pipeline diffs and PROV-JSON for external review
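
As a rough sketch of the repeated-run comparison, the function below takes two `{artifact_name: hash}` maps and sorts artifacts into identical, different, and missing. The function name, artifact names, and payloads are made up for the example; in practice the hashes would come from the `artifacts` table.

```python
import hashlib


def compare_runs(hashes_a, hashes_b):
    """Classify artifacts from two runs by comparing their content hashes."""
    report = {"identical": [], "different": [], "missing": []}
    for name in sorted(set(hashes_a) | set(hashes_b)):
        if name not in hashes_a or name not in hashes_b:
            report["missing"].append(name)
        elif hashes_a[name] == hashes_b[name]:
            report["identical"].append(name)
        else:
            report["different"].append(name)
    return report


def sha256_hex(payload):
    return hashlib.sha256(payload).hexdigest()


# Hypothetical artifact hashes from two runs under the same seed and policy.
run_a = {"ligand.pdbqt": sha256_hex(b"pose 1"), "charges.json": sha256_hex(b"q=0.1")}
run_b = {"ligand.pdbqt": sha256_hex(b"pose 1"), "charges.json": sha256_hex(b"q=0.2"),
         "atommap.json": sha256_hex(b"{}")}
report = compare_runs(run_a, run_b)
```

Any entry under `different` or `missing` points to a stage whose output is not reproducible under the fixed settings and is worth inspecting with the pipeline diff.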

## 13. Optional Engines and Graceful Degradation

Some stages are optional and may be unavailable in minimal installs.

Expected behavior:

- Core stages still run where dependencies exist.
- Missing optional tools should produce explicit decisions/logs instead of silent failure.
- You can combine minimal core usage with selectively enabled advanced stages.
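
One way to get explicit records instead of silent failures is to probe for optional modules before a stage runs. The sketch below uses the standard `importlib.util.find_spec`; the stage-to-module table and the decision format are assumptions, not the project's actual detection logic.

```python
import importlib.util

# Assumed mapping from stage to the top-level module it needs.
OPTIONAL_ENGINES = {
    "cheminformatics": "rdkit",
    "docking": "meeko",
    "md": "openmm",
}


def probe_engines(engines):
    """Return one explicit decision per stage, whether or not its module is installed."""
    decisions = []
    for stage, module in engines.items():
        available = importlib.util.find_spec(module) is not None
        decisions.append(
            {
                "stage": stage,
                "module": module,
                "action": "run" if available else "skip",
                "reason": "module found" if available else f"module '{module}' not installed",
            }
        )
    return decisions


for decision in probe_engines(OPTIONAL_ENGINES):
    print(decision["stage"], decision["action"], "-", decision["reason"])
```

`find_spec` checks availability without importing the package, so the probe stays cheap even for heavy dependencies.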

## 14. License

See `LICENSE`.
