Metadata-Version: 2.4
Name: BunyaminMSA
Version: 1.0.0
Summary: Pure Python Clustal Omega Multiple Sequence Alignment implementation
Home-page: https://github.com/bunyaminarpc/BunyaminMSA
Author: Bunyamin Arpc
Keywords: bioinformatics multiple sequence alignment clustal omega MSA
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Requires-Python: >=3.8
Description-Content-Type: text/markdown
Dynamic: author
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: keywords
Dynamic: requires-python
Dynamic: summary

# BunyaminMSA

**Pure Python implementation of the Clustal Omega Multiple Sequence Alignment (MSA) algorithm.**

> Bioinformatics Final Project — Bunyamin Arpc

---

## Installation

```bash
pip install BunyaminMSA
```

Or from source:

```bash
git clone https://github.com/bunyaminarpc/BunyaminMSA.git
cd BunyaminMSA
pip install -e .
```

---

## Quick Start

```python
from bunyaminmsa import ClustalOmega

msa = ClustalOmega()

sequences = ["ACGTACGT", "ACGGACGT", "TTTTACGT"]
names     = ["Human", "Mouse", "Zebrafish"]

result = msa.align(sequences, names=names)
print(result["alignment_str"])
```

### FASTA Input

```python
fasta = """
>seq1
ACGTACGTACGT
>seq2
ACGGACGTACGG
>seq3
TTTTACGTATTT
"""
result = msa.align_from_fasta(fasta)
```

### Command Line

```bash
bunyaminmsa --fasta input.fasta
bunyaminmsa --seqs ACGT ACGG TTTT --names s1 s2 s3
bunyaminmsa --fasta input.fasta --output alignment.aln
```

---

## Algorithm Overview

This implementation performs MSA in three main stages:

### 1. Pairwise Distance Calculation (k-mer based)
All sequence pairs are compared using k-mer frequency profiles and cosine distance. This avoids running a full dynamic-programming alignment for every pair, which keeps the distance step fast even for long sequences.
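
As an illustration of the idea, here is a minimal, self-contained sketch of a k-mer cosine distance. The function names (`kmer_profile`, `cosine_distance`, `distance_matrix`) and the default `k=3` are assumptions made for this example and are not necessarily what the package uses internally.

```python
from collections import Counter
from math import sqrt


def kmer_profile(seq, k=3):
    """Count every overlapping k-mer in seq (k=3 is an assumed default)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def cosine_distance(p, q):
    """Return 1 - cosine similarity of two k-mer count vectors."""
    dot = sum(count * q[kmer] for kmer, count in p.items())
    norm_p = sqrt(sum(c * c for c in p.values()))
    norm_q = sqrt(sum(c * c for c in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 1.0  # a sequence shorter than k has no k-mers to compare
    return 1.0 - dot / (norm_p * norm_q)


def distance_matrix(sequences, k=3):
    """Build a symmetric n x n matrix of pairwise k-mer distances."""
    profiles = [kmer_profile(s, k) for s in sequences]
    n = len(profiles)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = cosine_distance(profiles[i], profiles[j])
            matrix[i][j] = matrix[j][i] = d
    return matrix
```

With the Quick Start sequences, this sketch places `"ACGTACGT"` closer to `"ACGGACGT"` than to `"TTTTACGT"`, because the first pair shares more 3-mers.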

### 2. Guide Tree Construction (UPGMA)
A binary guide tree is built from the pairwise distance matrix with **UPGMA** (Unweighted Pair Group Method with Arithmetic Mean). The two closest clusters are merged at each step, so closely related sequences are joined first.
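
The merging procedure can be sketched in a few lines. This is an illustrative version only: the function name `upgma` and the nested-tuple tree representation are assumptions for this example, and the package's own tree format may differ.

```python
def upgma(dist, names):
    """Build a nested-tuple guide tree from a symmetric distance matrix."""
    # Active clusters: id -> (subtree, number of leaves in it)
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    # Distances between active clusters, keyed by (smaller id, larger id)
    d = {(i, j): dist[i][j] for i in clusters for j in clusters if i < j}
    next_id = len(names)

    while len(clusters) > 1:
        # Pick the closest pair of clusters and remove both from the active set
        a, b = min(d, key=d.get)
        (tree_a, size_a), (tree_b, size_b) = clusters.pop(a), clusters.pop(b)
        del d[(a, b)]

        # Distance from the merged cluster to each remaining cluster is the
        # size-weighted mean, i.e. the average over all leaf pairs
        for c in clusters:
            d_ac = d.pop((min(a, c), max(a, c)))
            d_bc = d.pop((min(b, c), max(b, c)))
            d[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)

        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1

    (tree, _), = clusters.values()
    return tree
```

For a three-sequence matrix where `A` and `B` are closest, the sketch first joins `A` and `B`, then attaches `C` to that pair.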

### 3. Progressive Alignment
Sequences are aligned following the guide tree (post-order traversal):
- **Leaf–Leaf**: Needleman-Wunsch global alignment with affine gap penalties
- **Profile–Profile**: Frequency profiles are built for each aligned group; alignment proceeds between profiles column-by-column
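
To make the Leaf–Leaf step concrete, here is a minimal Needleman-Wunsch sketch. Note the simplification: it uses a single linear gap penalty rather than the affine gap penalties described above, and the scoring values (`match=1`, `mismatch=-1`, `gap=-2`) and the function name are assumptions chosen for the example. The Profile–Profile step is not shown.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of two sequences with a linear (not affine) gap penalty."""
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]
```

An affine scheme would charge separate gap-open and gap-extend costs, which typically requires tracking three score matrices instead of one.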

---

## API Reference

### `ClustalOmega`

| Method | Description |
|--------|-------------|
| `align(sequences, names=None)` | Align a list of sequences |
| `align_from_fasta(fasta_text)` | Parse FASTA string and align |
| `get_distance_matrix()` | Return last computed distance matrix |
| `get_guide_tree()` | Return last computed guide tree |

### Result Dictionary

| Key | Type | Description |
|-----|------|-------------|
| `names` | list[str] | Sequence names |
| `aligned` | list[str] | Aligned sequences (with gaps) |
| `alignment_str` | str | CLUSTAL-format alignment |
| `distance_matrix` | list[list[float]] | n×n pairwise distances |
| `sequence_type` | str | `'dna'` or `'protein'` |
| `guide_tree` | str | String representation of UPGMA tree |
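
The table above can also be written down as a type, which is handy for editor hints or quick sanity checks. The `AlignmentResult` class and the `validate_result` helper below are hypothetical and not part of the package; they only restate the documented keys and check a few shape properties that follow from the descriptions.

```python
from typing import List, TypedDict


class AlignmentResult(TypedDict):
    """Hypothetical type mirroring the documented result keys."""
    names: List[str]
    aligned: List[str]
    alignment_str: str
    distance_matrix: List[List[float]]
    sequence_type: str
    guide_tree: str


def validate_result(result):
    """Raise if a result dict does not match the documented shape."""
    missing = set(AlignmentResult.__annotations__) - set(result)
    if missing:
        raise KeyError(f"missing keys: {sorted(missing)}")
    n = len(result["names"])
    if len(result["aligned"]) != n:
        raise ValueError("one aligned row is expected per name")
    if len({len(row) for row in result["aligned"]}) > 1:
        raise ValueError("aligned rows must all have the same length")
    matrix = result["distance_matrix"]
    if len(matrix) != n or any(len(row) != n for row in matrix):
        raise ValueError("distance matrix must be n x n")
    return True
```

Calling `validate_result(msa.align(sequences, names=names))` would then flag any mismatch between a returned dictionary and the table.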

---

## Running Tests

```bash
python tests/test_clustal_omega.py
# or
pytest tests/
```

---

## License

MIT License — Bunyamin Arpc
