Metadata-Version: 2.4
Name: instanexus
Version: 0.2.1
Summary: End-to-end workflow for de novo protein sequencing based on InstaNovo
Author-email: Marco Reverenna <marcor@dtu.dk>
License: MIT
Project-URL: Homepage, https://github.com/Multiomics-Analytics-Group/InstaNexus
Project-URL: Issues, https://github.com/Multiomics-Analytics-Group/InstaNexus/issues
Keywords: proteomics,bioinformatics,protein sequencing,de novo,assembly,mass spectrometry
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: biopython>=1.85
Requires-Dist: pandas>=2.3.1
Requires-Dist: tqdm>=4.67.1
Requires-Dist: seaborn>=0.13.2
Requires-Dist: matplotlib>=3.8.0
Requires-Dist: plotly>=6.2.0
Requires-Dist: logomaker>=0.8
Requires-Dist: networkx>=3.3
Requires-Dist: scikit-learn>=1.3
Requires-Dist: upsetplot
Provides-Extra: docs
Requires-Dist: sphinx; extra == "docs"
Requires-Dist: sphinx-book-theme; extra == "docs"
Requires-Dist: myst-nb; extra == "docs"
Requires-Dist: ipywidgets; extra == "docs"
Requires-Dist: sphinx-new-tab-link!=0.2.2; extra == "docs"
Requires-Dist: jupytext; extra == "docs"
Requires-Dist: sphinx-copybutton; extra == "docs"
Provides-Extra: lint
Requires-Dist: mypy; extra == "lint"
Requires-Dist: ruff; extra == "lint"
Requires-Dist: codespell; extra == "lint"
Provides-Extra: dev
Requires-Dist: ruff; extra == "dev"
Requires-Dist: pytest; extra == "dev"
Requires-Dist: jupytext; extra == "dev"
Requires-Dist: pre-commit; extra == "dev"
Dynamic: license-file

<p align="center">
  <img src="docs/source/assets/instanexus_logo 2.svg" width="600" alt="InstaNexus logo">
</p>

<p align="center"><em>A de novo protein sequencing workflow</em></p>

<p align="center">
  <a href="https://github.com/pre-commit/pre-commit"><img src="https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit" alt="pre-commit"></a>
  <a href="https://github.com/astral-sh/ruff"><img src="https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json" alt="Ruff"></a>
  <img src="https://img.shields.io/badge/license-MIT-green" alt="License">
  <img src="https://img.shields.io/badge/python-3.10+-blue" alt="Python">
</p>

---

## Table of Contents
- [Introduction](#introduction)
- [Features](#features)
- [Workflow Diagram](#workflow-diagram)
- [Repository Structure](#repository-structure)
- [Installation](#installation)
- [Getting Started](#getting-started)
- [Command-Line Usage](#command-line-usage)
- [License](#license)
- [Acknowledgments](#acknowledgments)
- [References](#references)
- [Citation](#citation)

---

## Introduction

InstaNexus is a generalizable, end-to-end workflow for direct protein sequencing, tailored to reconstruct full-length protein therapeutics such as antibodies and nanobodies. It integrates AI-driven de novo peptide sequencing with optimized assembly and scoring strategies to maximize accuracy, coverage, and functional relevance.

This pipeline enables robust reconstruction of critical protein regions, advancing applications in therapeutic discovery, immune profiling, and protein engineering.

---

## Features

- 🧬 Supports De Bruijn graph and greedy assembly
- ⚗️ Handles multiple protease digestions (Trypsin, LysC, GluC, etc.)
- 🧹 Integrated contaminant removal and confidence filtering
- 🧩 Clustering, alignment, and consensus sequence reconstruction
- 🔗 Integrates with external tools:
  - [MMseqs2](https://github.com/soedinglab/MMseqs2) for fast clustering
  - [Clustal Omega](https://www.ebi.ac.uk/Tools/msa/clustalo/) for high-quality alignment
- 📊 Output-ready for downstream analysis and visualization
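
To give a feel for what greedy assembly with a minimum overlap means, here is a minimal sketch. It is an illustration of the general technique only, not the InstaNexus implementation: the function names `overlap` and `greedy_merge` and the toy peptides are hypothetical, and real pipelines also handle scoring, errors, and ambiguity.

```python
def overlap(a: str, b: str, min_overlap: int) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`.

    Returns 0 when the longest match is shorter than `min_overlap`.
    """
    for size in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_merge(peptides: list[str], min_overlap: int = 3) -> list[str]:
    """Repeatedly merge the pair with the largest suffix/prefix overlap."""
    seqs = list(dict.fromkeys(peptides))  # drop duplicates, keep order
    while True:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(seqs):
            for j, b in enumerate(seqs):
                if i == j:
                    continue
                size = overlap(a, b, min_overlap)
                if size > best_size:
                    best_size, best_i, best_j = size, i, j
        if best_size == 0:
            return seqs  # nothing left to merge
        merged = seqs[best_i] + seqs[best_j][best_size:]
        seqs = [s for k, s in enumerate(seqs) if k not in (best_i, best_j)]
        seqs.append(merged)


if __name__ == "__main__":
    # Toy peptides with 4-residue overlaps between neighbours.
    print(greedy_merge(["DTHKSEIAH", "EIAHRFKDL", "FKDLGEEHF"], min_overlap=3))
```

Raising `min_overlap` makes merges more conservative: fewer spurious joins, but shorter scaffolds when true overlaps are short.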

---

## Workflow Diagram

<p align="center">
  <img src="images/instanexus_panel.png" width="900" alt="InstaNexus Workflow">
</p>

---

## Repository Structure


| Folder / File | Description |
|----------------|-------------|
| `docs/` | Sphinx documentation, tutorials, and images |
| `fasta/` | FASTA reference and contaminant sequences |
| `inputs/` | Example input CSV files |
| `json/` | Metadata and parameter configuration files |
| `outputs/` | Generated results (created during execution) |
| `src/instanexus/` | Core InstaNexus package |
| `src/instanexus/main.py` | Runs the full pipeline |
| `src/instanexus/preprocessing.py` | Module for data cleaning |
| `src/instanexus/assembly.py` | Module for sequence assembly |
| `src/instanexus/clustering.py` | Module for clustering (mmseqs2) |
| `src/instanexus/alignment.py` | Module for alignment (clustalo) |
| `src/instanexus/consensus.py` | Module for consensus generation |
| `src/instanexus/opt/` | Grid search and optimization workflows |
| `tests/` | Pytest unit and integration tests |
| `pyproject.toml` | Package metadata, dependencies, and entry point |
| `.pre-commit-config.yaml` | Pre-commit hook configuration |

---

## Installation

InstaNexus requires Python 3.10+ and the following tools:

- [uv](https://docs.astral.sh/uv/): fast Python package manager, used for the source install
- [MMseqs2](https://github.com/soedinglab/MMseqs2): clustering
- [Clustal Omega](https://www.ebi.ac.uk/Tools/msa/clustalo/): alignment
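
Before running the pipeline, it can help to confirm that the external tools are reachable. The following is a minimal sketch, not part of InstaNexus: the helper names are hypothetical, and the executable names `mmseqs` and `clustalo` are assumptions based on how MMseqs2 and Clustal Omega are commonly installed.

```python
import shutil
import sys

# Assumed executable names; adjust them if your installation differs.
REQUIRED_TOOLS = {"MMseqs2": "mmseqs", "Clustal Omega": "clustalo"}


def find_missing_tools(tools: dict[str, str]) -> list[str]:
    """Return the display names of tools whose executable is not on PATH."""
    return [name for name, exe in tools.items() if shutil.which(exe) is None]


def python_is_supported(minimum: tuple[int, int] = (3, 10)) -> bool:
    """Check the running interpreter against the minimum Python version."""
    return sys.version_info[:2] >= minimum


if __name__ == "__main__":
    if not python_is_supported():
        print("Python 3.10 or newer is required.")
    missing = find_missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All external tools found.")
```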

---

## Getting Started

### Option 1: Install from PyPI

```bash
pip install instanexus
```

### Option 2: Install from Source (for Developers)

#### Clone the repository:

```bash
git clone git@github.com:Multiomics-Analytics-Group/InstaNexus.git
cd InstaNexus
```

#### Install uv (if not already installed):
```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```

#### Sync the environment:
```bash
uv sync --all-extras
```

#### Set up pre-commit hooks:
```bash
uv run pre-commit install --hook-type pre-commit --hook-type commit-msg
```

#### Verify the installation:
```bash
uv run instanexus --help
```

---

## Command-Line Usage

After installation, the `instanexus` command (the entry point declared under `[project.scripts]` in `pyproject.toml`) runs the entire InstaNexus pipeline.

All parameters for preprocessing, assembly, clustering, and consensus are provided in a single call. The pipeline automatically creates a unique, timestamped output folder for each combination of parameters.

```bash
instanexus --help
```

### Example: Run the full pipeline

This command runs the complete workflow:

1. Preprocesses the input CSV.
2. Assembles using `dbg` (De Bruijn graph).
3. Clusters the resulting scaffolds.
4. Aligns the clusters.
5. Generates consensus sequences.

```bash
instanexus \
    --input-csv inputs/bsa.csv \
    --folder-outputs outputs \
    --metadata-json-path json/sample_metadata.json \
    --contaminants-fasta-path fasta/contaminants.fasta \
    --assembly-mode dbg \
    --conf 0.9 \
    --kmer-size 7 \
    --size-threshold 12 \
    --min-overlap 3 \
    --min-seq-id 0.85 \
    --coverage 0.8
```

The results for this run are saved in a unique directory, such as `outputs/bsa/dbg_c0.9_ks7_mo3_ts12/`.
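
When scripting several runs, it can be convenient to build the command from a parameter mapping. The sketch below reuses the flags and values from the example command; the helper names are hypothetical. The folder naming in `expected_run_dir` is an assumption inferred from the single example path (`c` for `--conf`, `ks` for `--kmer-size`, `mo` for `--min-overlap`, `ts` for `--size-threshold`), so treat it as a guess rather than documented behavior.

```python
from pathlib import Path

# Flags and values copied from the full-pipeline example command.
PARAMS = {
    "input-csv": "inputs/bsa.csv",
    "folder-outputs": "outputs",
    "metadata-json-path": "json/sample_metadata.json",
    "contaminants-fasta-path": "fasta/contaminants.fasta",
    "assembly-mode": "dbg",
    "conf": 0.9,
    "kmer-size": 7,
    "size-threshold": 12,
    "min-overlap": 3,
    "min-seq-id": 0.85,
    "coverage": 0.8,
}


def build_command(params: dict) -> list[str]:
    """Turn a parameter mapping into an argv list for the `instanexus` CLI."""
    cmd = ["instanexus"]
    for flag, value in params.items():
        cmd += [f"--{flag}", str(value)]
    return cmd


def expected_run_dir(params: dict) -> Path:
    """Guess the run folder; the naming scheme is inferred, not documented."""
    sample = Path(params["input-csv"]).stem  # "inputs/bsa.csv" -> "bsa"
    name = (
        f"{params['assembly-mode']}_c{params['conf']}"
        f"_ks{params['kmer-size']}_mo{params['min-overlap']}"
        f"_ts{params['size-threshold']}"
    )
    return Path(params["folder-outputs"]) / sample / name


if __name__ == "__main__":
    print(" ".join(build_command(PARAMS)))
    print(expected_run_dir(PARAMS))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` to launch the pipeline from a script.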



---

## License

This project is licensed under the [MIT License](LICENSE).

---

## Acknowledgments

InstaNexus was developed at **DTU Biosustain** and **DTU Bioengineering**.

We are grateful to the **DTU Bioengineering Proteomics Core Facility** for maintenance and operation of mass spectrometry instrumentation.

We also thank the **Informatics Platform at DTU Biosustain** for their support during the development and optimization of InstaNexus.

Special thanks to the users and developers of:
- [MMseqs2](https://github.com/soedinglab/MMseqs2)
- [Clustal Omega](https://www.ebi.ac.uk/Tools/msa/clustalo/)

---

## References

1. Steinegger, M. & Söding, J. **MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets**. *Nature Biotechnology* 35, 1026–1028 (2017). https://doi.org/10.1038/nbt.3988
2. Sievers, F., et al. **Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega**. *Molecular Systems Biology* 7, 539 (2011). https://doi.org/10.1038/msb.2011.75
3. Eloff, K., Kalogeropoulos, K., Mabona, A., Morell, O., Catzel, R., Rivera-de-Torre, E., ... & Jenkins, T. P. **InstaNovo enables diffusion-powered de novo peptide sequencing in large-scale proteomics experiments**. *Nature Machine Intelligence*, 1-15 (2025).

---

## Citation

If you find this project useful in your research or work, please cite it as:

Reverenna M., Nielsen M. W., Wolff D. S., Lytra E., Colaianni P. D., Ljungars A., Laustsen A. H., Schoof E. M., Van Goey J., Jenkins T. P., Lukassen M. V., Santos A., Kalogeropoulos K. (2025). *Generalizable direct protein sequencing with InstaNexus* [Preprint]. bioRxiv. https://doi.org/10.1101/2025.07.25.666861
