Metadata-Version: 2.4
Name: uht-discovery
Version: 0.2.11
Summary: Semi-automated protein discovery pipeline using BLAST, quality control, and language model clustering
Home-page: https://github.com/Matt115A/uht-discovery
Author: Matthew Penner
Author-email: mp957@cam.ac.uk
License: MIT
Project-URL: Bug Reports, https://github.com/Matt115A/uht-discovery/issues
Project-URL: Source, https://github.com/Matt115A/uht-discovery
Project-URL: Documentation, https://github.com/Matt115A/uht-discovery#readme
Keywords: bioinformatics protein discovery BLAST clustering language models ESM2
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Requires-Python: ~=3.10.0
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy==2.2.6
Requires-Dist: pandas==2.3.3
Requires-Dist: scipy==1.15.3
Requires-Dist: scikit-learn==1.7.1
Requires-Dist: pyyaml==6.0.3
Requires-Dist: tqdm==4.67.1
Requires-Dist: biopython==1.85
Requires-Dist: pysam==0.23.3
Requires-Dist: torch==2.7.1
Requires-Dist: fair-esm==2.0.0
Requires-Dist: matplotlib==3.10.7
Requires-Dist: seaborn==0.13.2
Requires-Dist: plotly==6.2.0
Requires-Dist: nicegui==2.12.1
Requires-Dist: umap-learn==0.5.9.post2
Requires-Dist: statsmodels==0.14.5
Requires-Dist: threadpoolctl==3.6.0
Requires-Dist: bcrypt==4.2.0
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: keywords
Dynamic: license
Dynamic: license-file
Dynamic: project-url
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# UHT Discovery

UHT Discovery is a protein discovery pipeline centered on three core steps:

1. `blaster` (`uht-blast`): retrieve homologous sequences from NCBI via BLAST
2. `trim` (`uht-trim`): quality-control and length-filter FASTA sequences
3. `clust` (`uht-clust`, PLMCLUSTV2): cluster sequences using ESM2 embeddings and negative log-likelihood (NLL) scoring

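## Installation

UHT Discovery requires Python 3.10 (`Requires-Python: ~=3.10.0`) and pins exact dependency versions, including `torch==2.7.1`, so installing into a dedicated virtual environment helps avoid conflicts with other packages.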

### From Source

```bash
git clone https://github.com/Matt115A/uht-discovery
cd uht-discovery
pip install -e .
```

### From PyPI

```bash
pip install uht-discovery
```

## Quick Start

### 1. BLASTER

Put one or more query FASTA files in:

```text
inputs/blaster/PROJECT/
```

Run:

```bash
uht-blast --project PROJECT --email your@email.com --hits 500
```

Key options:

- `--project`: project directory name (required)
- `--email`: email for NCBI API usage (required)
- `--hits`: number of hits to retrieve (default: `100`)
- `--db`: `nr`, `swissprot`, or `refseq_protein` (default: `nr`)
- `--evalue`: BLAST E-value cutoff (default: `1e-5`)
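
The options above map directly onto a command line. As a minimal sketch for scripted runs, the helper below assembles the argument list from the documented flags and defaults. `build_blast_command` is a hypothetical name and is not part of uht-discovery; it only builds the command and does not run it.

```python
from typing import List

# Databases listed for --db in this README.
ALLOWED_DBS = ("nr", "swissprot", "refseq_protein")


def build_blast_command(project: str, email: str, hits: int = 100,
                        db: str = "nr", evalue: float = 1e-5) -> List[str]:
    """Assemble a uht-blast argument list (hypothetical helper, not part of the package)."""
    if db not in ALLOWED_DBS:
        raise ValueError(f"--db must be one of {ALLOWED_DBS}, got {db!r}")
    if hits < 1:
        raise ValueError("--hits must be a positive integer")
    return [
        "uht-blast",
        "--project", project,
        "--email", email,
        "--hits", str(hits),
        "--db", db,
        "--evalue", str(evalue),
    ]


# Same invocation as the "Run" example above, with defaults spelled out.
cmd = build_blast_command("PROJECT", "your@email.com", hits=500)
print(" ".join(cmd))
```

Once the query FASTA files are in `inputs/blaster/PROJECT/`, the list can be passed to `subprocess.run(cmd, check=True)`.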

Outputs:

- `results/blaster/PROJECT/` containing the combined FASTA file and the BLAST report

### 2. TRIM

Put FASTA files in:

```text
inputs/trim/PROJECT/
```

Run (automatic thresholds):

```bash
uht-trim --project PROJECT --auto
```

Run (manual thresholds):

```bash
uht-trim --project PROJECT --low 100 --high 500
```

Key options:

- `--project`: project directory name (required)
- `--auto`: infer length thresholds automatically
- `--low` / `--high`: manual minimum and maximum sequence lengths (inclusive)
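
To make the inclusive thresholds concrete, here is a minimal sketch of a length filter over FASTA records. It illustrates the idea only and is not the `uht-trim` implementation; `read_fasta` and `filter_by_length` are hypothetical names, and the real tool also applies quality-control checks that this sketch omits.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header without ">", sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequences may wrap over several lines
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records: Iterable[Record], low: int,
                     high: int) -> Tuple[List[Record], List[Record]]:
    """Split records into kept and removed; both bounds are inclusive."""
    kept, removed = [], []
    for header, seq in records:
        (kept if low <= len(seq) <= high else removed).append((header, seq))
    return kept, removed


example = [">a", "MKT", ">b", "MKTAY", "LV", ">c", "M"]
kept, removed = filter_by_length(read_fasta(example), low=3, high=7)
print([h for h, _ in kept], [h for h, _ in removed])
```

Sequences whose length equals `low` or `high` are kept, which is what "inclusive" means here.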

Outputs:

- `results/trim/PROJECT/` with filtered FASTA files, removed-sequence logs, and QC plots

### 3. PLMCLUSTV2

Put FASTA files in:

```text
inputs/plmclustv2/PROJECT/
```

Run:

```bash
uht-clust --project PROJECT --clusters auto
```

Or with a fixed cluster count:

```bash
uht-clust --project PROJECT --clusters 6
```

Key options:

- `--project`: project directory name (required)
- `--clusters`: integer cluster count or `auto`
- `--sil-min` / `--sil-max`: silhouette search bounds for auto mode
- `--keep-separate`: process each FASTA file independently
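
This README does not spell out how `auto` mode uses the silhouette bounds. One plausible reading, stated here as an assumption, is that `--sil-min` and `--sil-max` bound the candidate cluster counts and the count with the highest mean silhouette score wins. The sketch below shows that selection idea in plain Python on toy 2-D points; `silhouette_score` and `pick_cluster_count` are hypothetical helpers, the candidate labelings would come from whatever clustering algorithm is in use, and none of this is taken from the PLMCLUSTV2 code.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Sequence[float]


def silhouette_score(points: List[Point], labels: List[int]) -> float:
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters: Dict[int, List[int]] = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    total = 0.0
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            continue  # a singleton cluster contributes 0 by convention
        # a: mean distance to the rest of the point's own cluster
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: mean distance to the nearest other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items() if other != label
        )
        denom = max(a, b)
        total += (b - a) / denom if denom > 0 else 0.0
    return total / len(points)


def pick_cluster_count(points: List[Point], labelings: Dict[int, List[int]],
                       k_min: int, k_max: int) -> Tuple[int, Dict[int, float]]:
    """Score each candidate k in [k_min, k_max] and return the best one.

    Assumes k_min/k_max play the role of --sil-min/--sil-max (unconfirmed).
    """
    scores = {k: silhouette_score(points, labels)
              for k, labels in labelings.items() if k_min <= k <= k_max}
    if not scores:
        raise ValueError("no candidate cluster count inside the bounds")
    return max(scores, key=scores.get), scores


# Three well-separated pairs of points, labeled with k=2 and k=3.
points = [(0, 0), (0, 1), (10, 0), (10, 1), (20, 0), (20, 1)]
labelings = {2: [0, 0, 0, 0, 1, 1], 3: [0, 0, 1, 1, 2, 2]}
best_k, scores = pick_cluster_count(points, labelings, k_min=2, k_max=3)
print(best_k)
```

In practice the points would be sequence embeddings rather than 2-D coordinates, and scikit-learn (already a dependency) provides `sklearn.metrics.silhouette_score` for the scoring step.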

Outputs:

- `results/plmclustv2/PROJECT/` including cluster FASTAs, metrics CSVs, representative sequences, and visualization files

## Recommended Workflow

Run the pipeline in this order:

1. `uht-blast`
2. `uht-trim`
3. `uht-clust`
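
Each step reads from `inputs/<step>/PROJECT/` and writes to `results/<step>/PROJECT/`, so the output of one step has to reach the input directory of the next. The sketch below shows one way to do that hand-off by copying FASTA files. It rests on several assumptions: that the hand-off is manual, that FASTA outputs sit at the top level of the results directory, and that they use one of the listed extensions. `stage_fastas` is a hypothetical helper, not part of uht-discovery.

```python
import shutil
from pathlib import Path
from typing import List

# Assumed FASTA extensions; adjust to match the files actually produced.
FASTA_SUFFIXES = {".fasta", ".fa", ".faa"}


def stage_fastas(root: str, project: str, src_stage: str,
                 dst_stage: str) -> List[str]:
    """Copy FASTA files from results/<src_stage>/<project>/ to
    inputs/<dst_stage>/<project>/ and return the copied file names."""
    src = Path(root) / "results" / src_stage / project
    dst = Path(root) / "inputs" / dst_stage / project
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(src.iterdir()):  # top level only, no recursion
        if path.is_file() and path.suffix.lower() in FASTA_SUFFIXES:
            shutil.copy2(path, dst / path.name)
            copied.append(path.name)
    return copied


# Typical hand-offs between the three steps (run from the project root):
#   stage_fastas(".", "PROJECT", "blaster", "trim")      # after uht-blast
#   stage_fastas(".", "PROJECT", "trim", "plmclustv2")   # after uht-trim
```

Check the copied list before running the next step; a results directory may contain FASTA files, such as removed sequences, that should not be carried forward.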

## Help

```bash
uht-blast --help
uht-trim --help
uht-clust --help
```
