Metadata-Version: 2.4
Name: clustrX
Version: 1.0.0
Summary: clustrX: Highly Robust and Sensitive Protein Clustering Using Similarity Networks and Leiden Community Detection
Author-email: Mario Benítez-Prián <mario.benitezprian@gmail.com>
License: MIT
Project-URL: Homepage, https://github.com/mario-benitez-prian/clustrX
Project-URL: Repository, https://github.com/mario-benitez-prian/clustrX
Project-URL: Bug Tracker, https://github.com/mario-benitez-prian/clustrX/issues
Keywords: bioinformatics,clustering,blast,hmmer,sequences,graphs,protein-families,similarity-search,sequence-clustering
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.8
Description-Content-Type: text/markdown
Requires-Dist: polars>=0.19.0
Requires-Dist: igraph>=0.10.0
Requires-Dist: numpy>=1.20.0
Requires-Dist: psutil>=5.8.0
Provides-Extra: test
Requires-Dist: pytest>=7.0; extra == "test"

# clustrX: Highly Robust and Sensitive Protein Clustering

[![Version](https://img.shields.io/badge/version-1.0.0-blue.svg)](https://pypi.org/project/clustrX/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)

**clustrX** is a high-performance framework that turns sequence similarity search results into biologically coherent protein families. By modeling homology as a weighted network and applying the **Leiden community detection algorithm**, `clustrX` provides a sensitive and robust way to cluster sequences, especially in difficult cases involving remote homology and short peptides.

---

## 🚀 Key Features

*   **Leiden Community Detection**: Instead of relying on pairwise links alone, `clustrX` identifies densely connected communities, which keeps clusters internally cohesive and prevents artificial merging of families (e.g., through domain bridges).
*   **Agnostic Input**: Works with results from **BLAST**, **Diamond**, **MMseqs2**, and **HMMER**, and with other tools through the custom input option.
*   **Dynamic Coverage Filter**: The recommended way to handle sequences of varying lengths: the coverage threshold adapts to sequence length, giving more reliable and biologically sound clusters.
*   **Ultra-Fast Performance**: Powered by `Polars` (Rust-based) for data processing and `igraph` (C-based) for network analysis.
*   **Integrated Workflow**: From similarity hits to Multiple Sequence Alignments (MSAs) in a single command.

---

## 📦 Installation

`clustrX` can be installed with Conda or Pip. The two methods differ in how external dependencies are handled:

### Option A: Via Conda (Recommended)
This is the easiest way as it automatically installs all external dependencies, including **MAFFT** for alignments.
```bash
conda install -c bioconda clustrx
```

### Option B: Via Pip (Using a Virtual Environment)
To avoid conflicts with other packages and ensure the `clustrx` command is correctly recognized by your system (avoiding PATH issues), we highly recommend using a virtual environment:

1.  **Create a new environment**:
    ```bash
    python -m venv clustrx_env
    ```
2.  **Activate it**:
    *   **Windows**: `clustrx_env\Scripts\activate`
    *   **Linux/macOS**: `source clustrx_env/bin/activate`
3.  **Install**:
    ```bash
    pip install clustrX
    ```

> [!TIP]
> **If the `clustrx` command is not recognized** after installation (common on Windows), it is likely because the installation directory is not in your system's PATH. You can either add it manually or use the following foolproof method:
> `python -m clustrx [arguments]`

*Note: Pip does not install MAFFT. To use the `--mafft` option, you must **install MAFFT manually** on your system.*

---

## ⚙️ Input Formats & Requirements

`clustrX` is designed to be a post-processing layer. It requires two main inputs:
1.  **Similarity Hits**: A tabular file (BLAST-like or HMMER).
2.  **Sequences**: A FASTA file containing the sequences referenced in the hits.

### Using BLAST
`clustrX` works natively with the default tabular output of BLAST (`-outfmt 6`).
```bash
blastp -query sequences.fasta -db database -out hits.tsv -outfmt 6
```
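
For readers who want to inspect the hits before clustering, the sketch below parses the 12 default `-outfmt 6` columns with the Python standard library. The `Hit` and `read_outfmt6` names are illustrative only and are not part of the `clustrX` API.

```python
import csv
import io
from typing import Iterator, NamedTuple


class Hit(NamedTuple):
    """One row of BLAST tabular output, in the default -outfmt 6 column order."""
    qseqid: str
    sseqid: str
    pident: float
    length: int
    mismatch: int
    gapopen: int
    qstart: int
    qend: int
    sstart: int
    send: int
    evalue: float
    bitscore: float


def read_outfmt6(handle) -> Iterator[Hit]:
    """Yield one Hit per non-empty, non-comment line (hypothetical helper)."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        yield Hit(
            row[0], row[1], float(row[2]),
            int(row[3]), int(row[4]), int(row[5]),
            int(row[6]), int(row[7]), int(row[8]), int(row[9]),
            float(row[10]), float(row[11]),
        )


# A single made-up hit line, used here instead of a real hits.tsv file.
example = "seqA\tseqB\t45.2\t120\t60\t3\t1\t118\t5\t124\t1e-30\t110.5\n"
hits = list(read_outfmt6(io.StringIO(example)))
print(hits[0].sseqid, hits[0].evalue, hits[0].bitscore)
```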

### Using Diamond or MMseqs2
If you use these tools, you **must** ensure the output is in **BLAST tabular format (outfmt 6)**:

*   **Diamond**:
    ```bash
    diamond blastp -q query.fasta -d db.dmnd -o hits.tsv --outfmt 6
    ```
*   **MMseqs2**:
    ```bash
    mmseqs easy-search query.fasta target.fasta hits.tsv tmp --format-mode 0
    ```

### Using HMMER
HMMER outputs require specific flags depending on the filtering level you need:

*   **`domtblout` (Recommended)**: Use the `--domtblout` flag in `hmmsearch` or `phmmer`. This format provides alignment coordinates, which are **required** for using the **Dynamic Coverage** filter.
    ```bash
    hmmsearch --domtblout hits.domtblout profile.hmm database.fasta
    ```
*   **`tblout`**: Use the `--tblout` flag. Note that this format lacks coordinate information; therefore, **Dynamic Coverage cannot be applied** (only E-value and Bitscore filters will be used).
    ```bash
    hmmsearch --tblout hits.tblout profile.hmm database.fasta
    ```
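
To illustrate why alignment coordinates matter, the sketch below derives coverage values from one `domtblout` row. It relies on the standard `domtblout` column order (22 whitespace-separated fields followed by a free-text description). How `clustrX` defines coverage internally is not stated here, so the two ratios and the `domain_coverage` helper are assumptions for illustration.

```python
def domain_coverage(line: str) -> dict:
    """Compute alignment coverage from one domtblout row (hypothetical helper).

    Assumes hmmsearch-style output, where the target is a sequence and the
    query is a profile; 'ali' coordinates refer to the target, 'hmm'
    coordinates to the query.
    """
    # Split on whitespace, keeping the trailing description as one field.
    fields = line.split(None, 22)
    tlen, qlen = int(fields[2]), int(fields[5])
    hmm_from, hmm_to = int(fields[15]), int(fields[16])
    ali_from, ali_to = int(fields[17]), int(fields[18])
    return {
        "target": fields[0],
        "query": fields[3],
        "i_evalue": float(fields[12]),
        "score": float(fields[13]),
        # Fraction of each partner spanned by the aligned region.
        "target_cov": (ali_to - ali_from + 1) / tlen,
        "query_cov": (hmm_to - hmm_from + 1) / qlen,
    }


# Made-up row: a 200-residue sequence aligned at positions 21-120
# against the full length of a 100-position profile.
example = (
    "seq1 - 200 PF00001 - 100 1e-40 140.2 0.1 1 1 2e-42 1e-40 139.8 0.1 "
    "1 100 21 120 18 125 0.95 example protein"
)
cov = domain_coverage(example)
print(cov["target_cov"], cov["query_cov"])
```

A `tblout` row has no `hmm from`/`ali from` columns, so neither ratio can be computed from it.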

---

## 🧬 The Power of Dynamic Coverage

We strongly recommend using the **Dynamic Coverage** mode (`--coverage dynamic`) for most scientific applications. See the paper for details.

Standard clustering methods often use fixed thresholds that fail to resolve relationships between sequences of very different sizes. Our dynamic filter uses a **hyperbolic decay function** (calibrated with a 50-residue scale factor) that:
1.  Increases stringency for **short peptides** (up to 0.8 coverage) to filter out statistical noise.
2.  Gradually relaxes for **larger proteins** (down to 0.4 coverage) to maximize sensitivity in detecting remote homology.
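
The behavior described above can be sketched with a simple hyperbolic decay. The exact formula used by `clustrX` is given in the paper, not here, so the functional form below is an assumption chosen only to match the stated limits (0.8 for very short sequences, 0.4 for long ones) and the 50-residue scale factor. Which length enters the formula (query, subject, or the shorter of the two) is also left open.

```python
def dynamic_coverage_threshold(length, max_cov=0.8, min_cov=0.4, scale=50.0):
    """Illustrative hyperbolic decay of the coverage threshold with length.

    Assumed form (not taken from the clustrX source):
        threshold = min_cov + (max_cov - min_cov) / (1 + length / scale)
    It starts at max_cov for length 0 and approaches min_cov as length grows.
    """
    return min_cov + (max_cov - min_cov) / (1.0 + length / scale)


# Print the required coverage for a short peptide and for larger proteins.
for length in (10, 50, 200, 1000):
    print(length, round(dynamic_coverage_threshold(length), 3))
```

Under this assumed form, a sequence of exactly 50 residues sits halfway between the two limits, at a required coverage of 0.6.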

---

## 🛠️ Workflow & Usage

The `clustrX` pipeline follows a clear 3-step logic:

1.  **Filter**: Hits are filtered based on E-value, Bitscore, and (recommended) Dynamic Coverage.
2.  **Cluster**: A similarity network is built where edges are weighted by Bitscore, then partitioned using the Leiden algorithm.
3.  **Output**: Results are exported. **Note: FASTA generation and alignments are optional.**
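
As a rough illustration of the Filter step and the edge weighting in the Cluster step, the sketch below turns hits into a bitscore-weighted, undirected edge list. The thresholds, the `build_edges` helper, and the choice to keep the highest bitscore when a pair is hit in both directions are all assumptions for illustration; they are not taken from the `clustrX` source.

```python
from collections import defaultdict


def build_edges(hits, max_evalue=1e-5, min_bitscore=50.0):
    """Filter hits and collapse them into weighted undirected edges.

    hits: iterable of (query, subject, evalue, bitscore) tuples.
    Returns a sorted list of (node_a, node_b, weight) tuples.
    """
    best = defaultdict(float)
    for query, subject, evalue, bitscore in hits:
        if query == subject:
            continue  # self-hits carry no clustering information
        if evalue > max_evalue or bitscore < min_bitscore:
            continue  # fails the E-value or bitscore filter
        pair = tuple(sorted((query, subject)))
        # Assumption: reciprocal hits are merged by keeping the best bitscore.
        best[pair] = max(best[pair], bitscore)
    return [(a, b, w) for (a, b), w in sorted(best.items())]


# Made-up hits: one self-hit, one reciprocal pair, one weak hit, one good hit.
hits = [
    ("A", "A", 0.0, 300.0),
    ("A", "B", 1e-40, 150.0),
    ("B", "A", 1e-38, 142.0),
    ("B", "C", 1e-2, 30.0),
    ("C", "D", 1e-20, 90.0),
]
edges = build_edges(hits)
print(edges)
```

An edge list of this shape is what a Leiden implementation consumes. With `igraph`, for instance, it could be loaded with `Graph.TupleList(edges, weights=True)` and partitioned with `community_leiden(weights="weight")`.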

### Example: Recommended Scientific Run
```bash
clustrx -i hits.tsv -f sequences.fasta --coverage dynamic --write-fasta --mafft --outdir results_full
```
*   `--write-fasta`: (Optional) Creates a FASTA file for each generated cluster.
*   `--mafft`: (Optional) Automatically performs Multiple Sequence Alignment for each cluster.

---

## 💡 Use Cases

*   **Protein Family Discovery**: Organizing large proteomes into evolutionarily related groups.
*   **Short Peptide Classification**: Specifically tuned for the discovery of **Antimicrobial Peptides (AMPs)**, toxins, signaling peptides, and other short peptides.
*   **Remote Homology Exploration**: Identifying relationships in the "twilight zone" (identity < 30%) where traditional greedy methods fragment families.
*   **Domain-Aware Clustering**: Using HMMER `domtblout` inputs to cluster sequences based on specific functional domains.

---

## 📝 Citation
If you use **clustrX** in your research, please cite:
> Benítez-Prián, M. & San Mauro, D. (2026). clustrX: Highly Robust and Sensitive Protein Clustering Using Similarity Networks and Leiden Community Detection.

## 👤 Authors
**Mario Benítez-Prián** & **Diego San Mauro**

Contact: [mario.benitezprian@gmail.com](mailto:mario.benitezprian@gmail.com) | [GitHub](https://github.com/mario-benitez-prian)
