Metadata-Version: 2.4
Name: ghostfold
Version: 0.1.3
Summary: Accurate, database-free protein folding from single sequences using structure-aware synthetic MSAs
License-Expression: MIT
License-File: LICENSE
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Requires-Python: >=3.10
Requires-Dist: biopython>=1.80
Requires-Dist: matplotlib>=3.5
Requires-Dist: numpy>=1.21
Requires-Dist: pyyaml>=6.0
Requires-Dist: rich>=13.0
Requires-Dist: scikit-learn>=1.0
Requires-Dist: sentencepiece>=0.1.96
Requires-Dist: torch>=2.0
Requires-Dist: transformers>=4.30
Requires-Dist: typer>=0.12
Provides-Extra: dev
Requires-Dist: pytest-cov; extra == 'dev'
Requires-Dist: pytest>=7.0; extra == 'dev'
Requires-Dist: ruff; extra == 'dev'
Description-Content-Type: text/markdown

![](https://img.shields.io/pypi/v/ghostfold.svg?colorB=blue)
[![tests](https://github.com/brineylab/ghostfold/actions/workflows/pytest.yaml/badge.svg)](https://github.com/brineylab/ghostfold/actions/workflows/pytest.yaml)
![](https://img.shields.io/pypi/pyversions/ghostfold.svg)
![](https://img.shields.io/badge/license-MIT-blue.svg)

# GhostFold

**Accurate, database-free protein folding from single sequences using structure-aware synthetic MSAs**

---

## Overview

**GhostFold** is a protein folding framework that predicts 3D structures directly from single sequences, without relying on large evolutionary databases. By generating **synthetic, structure-aware multiple sequence alignments (MSAs)**, referred to below as pseudoMSAs, GhostFold achieves high accuracy while remaining lightweight and portable.

---

## Installation

### 1. Install PyTorch with CUDA

GhostFold requires PyTorch with CUDA support. Install the appropriate version for your system **before** installing GhostFold:

```bash
# Example for CUDA 12.1 (adjust for your CUDA version)
pip install torch --index-url https://download.pytorch.org/whl/cu121
```

Refer to the [PyTorch installation guide](https://pytorch.org/get-started/locally/) for platform-specific instructions.
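
Once PyTorch is installed, a quick check like the one below can confirm that the CUDA build is visible from Python. This is a minimal sketch: the helper name `cuda_status` is illustrative and not part of GhostFold, and it only relies on `torch.cuda.is_available()` and `torch.cuda.device_count()`.

```python
import importlib.util


def cuda_status():
    """Report whether PyTorch is importable and whether it can see a CUDA device.

    Hypothetical helper for illustration; GhostFold does not ship it.
    """
    if importlib.util.find_spec("torch") is None:
        # PyTorch is not installed in this environment.
        return {"torch_installed": False, "cuda_available": False, "device_count": 0}

    import torch

    available = torch.cuda.is_available()
    return {
        "torch_installed": True,
        "cuda_available": available,
        # Only query the device count when CUDA is actually usable.
        "device_count": torch.cuda.device_count() if available else 0,
    }


if __name__ == "__main__":
    print(cuda_status())
```

If `cuda_available` comes back `False`, the installed wheel is probably a CPU-only build or does not match the local CUDA driver.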


### 2. Install localcolabfold

The subcommands that predict structures (`ghostfold run` and `ghostfold fold`) require a working local ColabFold runtime. To install it, run the following from the GhostFold repository root:

```bash
chmod +x scripts/install_localcolabfold.sh
./scripts/install_localcolabfold.sh
```

If you prefer cloud-based structure prediction, you can use the generated pseudoMSAs directly in [ColabFold](https://colab.research.google.com/github/sokrypton/ColabFold/blob/main/AlphaFold2.ipynb) by selecting **"custom_msa"** under MSA settings and uploading the pseudoMSA generated by GhostFold.
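
Before uploading a pseudoMSA, it can help to confirm that the file is a well-formed A3M alignment. The sketch below assumes the usual A3M conventions (the first record is the query, lowercase letters are insertions that do not occupy an alignment column, and lines starting with `#` are comments). The function names are illustrative and not part of GhostFold.

```python
def read_a3m(text):
    """Parse A3M/FASTA-style text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def aligned_length(sequence):
    """Count match columns: lowercase letters are insertions and are skipped."""
    return sum(1 for ch in sequence if not ch.islower())


def check_a3m(text):
    """Summarise an alignment and list rows whose length differs from the query."""
    records = read_a3m(text)
    if not records:
        raise ValueError("no sequences found")
    query_len = aligned_length(records[0][1])  # assumption: first record is the query
    mismatched = [h for h, s in records if aligned_length(s) != query_len]
    return {
        "num_sequences": len(records),
        "query_length": query_len,
        "mismatched": mismatched,
    }


example = ">query\nMKTAYIAK\n>synthetic_1\nMKTaAYI-K\n>synthetic_2\nMKT-YIA\n"
report = check_a3m(example)
print(report)
```

In this toy input, `synthetic_2` is one column short of the query, so it is reported under `mismatched`.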


### 3. Install GhostFold

```bash
pip install ghostfold
```

For development:

```bash
git clone https://github.com/brineylab/ghostfold.git
cd ghostfold
pip install -e ".[dev]"
```

---

## Hugging Face Authentication

GhostFold uses ProstT5 from the Hugging Face Hub. You may need to configure a Hugging Face access token:

```bash
huggingface-cli login
```

See the [Hugging Face documentation](https://huggingface.co/docs/hub/security-tokens) for details.

---

## CLI Usage

GhostFold provides a single command-line tool, `ghostfold`, with five subcommands: `msa`, `fold`, `run`, `mask`, and `neff`.

### Generate pseudoMSAs

```bash
ghostfold msa --project-name my_project --fasta-path query.fasta
```

Options:
- `--config PATH` — Custom YAML config (overrides bundled defaults)
- `--recursive` — Recursively search directories for FASTA files
- `--coverage FLOAT` — Coverage values (repeatable, default: 1.0)
- `--num-runs INT` — Independent runs per sequence (default: 1)
- `--evolve-msa` — Enable MSA evolution with substitution matrices
- `--mutation-rates JSON` — Mutation rates per matrix
- `--sample-percentage FLOAT` — Fraction of sequences to evolve (default: 1.0)
- `--plot-msa-coverage` — Generate MSA coverage heatmaps
- `--no-coevolution-maps` — Skip coevolution map generation
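
When scripting several runs, the options above can be assembled programmatically. The following is a sketch only: `build_msa_command` is a hypothetical helper, the matrix names are borrowed from the Python API example in this README, and pairing `--mutation-rates` with `--evolve-msa` is an assumption based on that same example. It builds the argument list without executing anything.

```python
import json
import shlex


def build_msa_command(project_name, fasta_path, coverages=(1.0,),
                      mutation_rates=None, num_runs=1):
    """Assemble a `ghostfold msa` argument list (hypothetical helper)."""
    cmd = ["ghostfold", "msa",
           "--project-name", project_name,
           "--fasta-path", fasta_path]
    # --coverage is repeatable, so emit the flag once per value.
    for value in coverages:
        cmd += ["--coverage", str(value)]
    cmd += ["--num-runs", str(num_runs)]
    if mutation_rates is not None:
        # Assumption: mutation rates only matter when MSA evolution is enabled.
        cmd += ["--evolve-msa", "--mutation-rates", json.dumps(mutation_rates)]
    return cmd


cmd = build_msa_command(
    "my_project",
    "query.fasta",
    coverages=(1.0, 0.5),
    mutation_rates={"BLOSUM62": 10, "PAM250": 20},
)
# shlex.join quotes the JSON argument so the line can be pasted into a shell.
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`, which avoids shell quoting problems with the JSON argument.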

### Run structure prediction

```bash
ghostfold fold --project-name my_project
```

Options:
- `--subsample` — Enable MSA subsampling (multiple depth levels)
- `--mask-fraction FLOAT` — Mask a fraction of MSA residues (0.0-1.0)
- `--num-gpus INT` — Override auto-detected GPU count
- `--localcolabfold-dir PATH` — Path to localcolabfold pixi checkout (default: `./localcolabfold`)
- `--colabfold-env TEXT` — Legacy mamba env name for ColabFold fallback (default: `colabfold`)

### Full pipeline (MSA + folding)

```bash
ghostfold run --project-name my_project --fasta-path query.fasta
```

Runs MSA generation followed by structure prediction, and accepts all options from the `msa` and `fold` commands.


### Mask MSA files

```bash
ghostfold mask --input-path input.a3m --output-path masked.a3m --mask-fraction 0.15
```
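
To make the idea of a mask fraction concrete, here is a toy illustration of masking residues in one aligned sequence. It is not GhostFold's implementation: the mask character `X`, the choice to leave gaps and lowercase insertions untouched, and the rounding rule are all assumptions made for this sketch.

```python
import random


def mask_sequence(sequence, mask_fraction, rng, mask_char="X"):
    """Replace a random fraction of uppercase residues with `mask_char`.

    Illustrative only; the mask character and the selection rules are assumptions.
    """
    if not 0.0 <= mask_fraction <= 1.0:
        raise ValueError("mask_fraction must be between 0.0 and 1.0")
    # Only match-state residues (uppercase letters) are candidates for masking;
    # gaps ('-') and insertions (lowercase) are left as they are.
    positions = [i for i, ch in enumerate(sequence) if ch.isalpha() and ch.isupper()]
    n_mask = round(len(positions) * mask_fraction)
    chosen = set(rng.sample(positions, n_mask))
    return "".join(mask_char if i in chosen else ch for i, ch in enumerate(sequence))


rng = random.Random(0)  # fixed seed so repeated runs mask the same positions
masked = mask_sequence("MK-TaAYIAKQR", 0.2, rng)
print(masked)
```

With ten maskable residues and a fraction of 0.2, two positions are replaced while the gap and the lowercase insertion keep their places.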

### Calculate Neff scores

```bash
ghostfold neff my_project/
```
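
For readers unfamiliar with Neff (the effective number of sequences), the sketch below shows one common definition: each sequence is weighted by the inverse of the number of sequences at or above an identity threshold to it, and Neff is the sum of those weights. The 0.8 threshold, the treatment of gaps as ordinary characters, and the lack of length normalisation are assumptions here; the formula used by `ghostfold neff` may differ.

```python
def sequence_identity(a, b):
    """Fraction of identical columns between two equal-length aligned sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


def neff(sequences, threshold=0.8):
    """Sum of inverse neighbour counts (one common Neff definition).

    The threshold is an assumption; other tools use different cutoffs.
    """
    total = 0.0
    for seq in sequences:
        # Every sequence counts itself as a neighbour, so the count is at least 1.
        neighbours = sum(
            1 for other in sequences if sequence_identity(seq, other) >= threshold
        )
        total += 1.0 / neighbours
    return total


msa = [
    "MKTAYIAKQR",
    "MKTAYIAKQR",  # identical to the first row
    "MKTAYIAKQK",  # 90% identical to the first two rows
    "GGGGGGGGGG",  # unrelated to the others
]
print(neff(msa))
```

The first three rows form one cluster and the last row stands alone, so this alignment behaves like roughly two independent sequences even though it has four rows.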

### Version

```bash
ghostfold --version
```

---

## Python API

GhostFold can also be used as a Python library:

```python
from ghostfold import run_pipeline, mask_a3m_file, calculate_neff, MSA_Mutator
from ghostfold.core.config import load_config

# Load a custom YAML config; its values override the bundled defaults
config = load_config("my_config.yaml")

# Run MSA generation pipeline
run_pipeline(
    project="my_project",
    fasta_path="query.fasta",
    config=config,
    coverage_list=[1.0],
    evolve_msa=True,
    mutation_rates_str='{"MEGABLAST": 5, "PAM250": 20, "BLOSUM62": 10}',
    sample_percentage=1.0,
    plot_msa=False,
    plot_coevolution=False,
)
```

---

## References

* [ProstT5: Protein Language Modeling](https://github.com/mheinzinger/ProstT5)
* [ColabFold: AlphaFold Simplified](https://github.com/sokrypton/ColabFold)
