Metadata-Version: 2.4
Name: EmreTasdemirClustalOmega
Version: 0.1.0
Summary: Pure-Python Clustal Omega multiple sequence alignment implementation
Author-email: Emre Tasdemir <emre1.tasdemir.58@gmail.com>
License-Expression: MIT
Keywords: bioinformatics,multiple sequence alignment,MSA,clustal,HMM,DNA
Classifier: Programming Language :: Python :: 3
Classifier: Operating System :: OS Independent
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Dynamic: license-file

# EmreTasdemirClustalOmega

A pure-Python implementation of the Clustal Omega multiple sequence alignment (MSA) algorithm for DNA sequences. It has no external dependencies and uses only the Python standard library.

---

## Features

- **k-tuple distance matrix** — fast pairwise sequence comparison using shared k-mers
- **mBed embedding** — projects sequences into Euclidean space via reference-sequence distances
- **Bisecting k-means clustering** — groups sequences before tree construction to reduce complexity
- **UPGMA guide tree** — builds a hierarchical guide tree both with and without k-means pre-clustering
- **Profile HMM alignment** — progressive alignment along the guide tree using profile Hidden Markov Models and Viterbi decoding
- **Iterative refinement** — improves the MSA score via repeated leave-one-out realignment (HHAlign style)
- **FASTA input** — reads standard `.fasta` / `.fa` files; also supports interactive manual input

---

## Installation

```bash
pip install EmreTasdemirClustalOmega
```

Requires Python 3.8 or later.

---

## Quick Start

### As a library

```python
from clustalomega import align

sequences = [
    ("seq1", "ATGCTAGCTAGCT"),
    ("seq2", "ATGCTAGCTAGCC"),
    ("seq3", "ATGCTTGCTAGCT"),
    ("seq4", "TTGCTAGCTATCT"),
]

aligned_blocks, names = align(sequences, k=2)

for name, block in zip(names, aligned_blocks):
    print(f"{name:<10} {block}")
```

**Output:**
```
seq1       ATGCTAGCTAGCT
seq2       ATGCTAGCTAGCC
seq3       ATGCTTGCTAGCT
seq4       TTGCTAGCTATCT
```

### Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `sequences` | `list[tuple[str, str]]` | — | List of `(name, sequence)` tuples |
| `k` | `int` | `3` | k-tuple length for distance calculation |
| `seed` | `int` | `42` | Random seed for reproducibility |
| `print_ile_yazdirma` | `bool` | `False` | Print step-by-step output to stdout |

**Returns:** `(aligned_blocks, names)`, two lists of strings in matching order: `aligned_blocks[i]` is the aligned row for `names[i]`.
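
Because the result is just two parallel lists, it is easy to post-process. The sketch below writes such a result as FASTA text. The helper name `to_fasta` and the `width` parameter are hypothetical (they are not part of this package), and the sample values are stand-ins with the documented shape rather than real `align()` output.

```python
from typing import List


def to_fasta(names: List[str], aligned_blocks: List[str], width: int = 60) -> str:
    """Render aligned rows as FASTA text, wrapping each row at `width` columns."""
    if len(names) != len(aligned_blocks):
        raise ValueError("names and aligned_blocks must have the same length")
    lines = []
    for name, block in zip(names, aligned_blocks):
        lines.append(f">{name}")
        # Wrap long rows so the output stays readable in a terminal.
        for start in range(0, len(block), width):
            lines.append(block[start:start + width])
    return "\n".join(lines) + "\n"


# Stand-in values with the documented return shape (not real align() output).
names = ["seq1", "seq2"]
aligned_blocks = ["ATG-CTAG", "ATGCCTAG"]
fasta_text = to_fasta(names, aligned_blocks)
print(fasta_text, end="")
```

In real use, the two lists returned by `align(...)` would be passed in place of the stand-in values.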

---

### Verbose mode

```python
aligned_blocks, names = align(sequences, k=2, print_ile_yazdirma=True)
```

This prints the full pipeline output: distance matrix, embedding vectors, clustering steps, guide tree, initial alignment, and refinement progress.

---

### From a FASTA file

```python
from clustalomega._io_6_9 import fasta_oku
from clustalomega import align

sequences = fasta_oku("my_sequences.fasta")
aligned_blocks, names = align(sequences, k=3)

for name, block in zip(names, aligned_blocks):
    print(f"{name:<12} {block}")
```
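
If the sequences come from somewhere other than a file on disk, the `(name, sequence)` list can be built by hand. The following is a minimal stand-alone sketch of a FASTA reader that produces that shape. It is not the package's `fasta_oku`; the function name `read_fasta`, the choice to take the first header token as the name, and the upper-casing of residues are all assumptions made for illustration.

```python
import io
from typing import Iterable, List, Tuple


def read_fasta(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Parse FASTA text into a list of (name, sequence) tuples."""
    records: List[Tuple[str, str]] = []
    name = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            # Assumption: the record name is the first token after '>'.
            header = line[1:].split()
            name = header[0] if header else ""
            chunks = []
        elif name is not None:
            # Sequence data may span several lines; join them later.
            chunks.append(line.upper())
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


sample = io.StringIO(">seq1 demo\nATGC\ntagc\n\n>seq2\nATGG\n")
records = read_fasta(sample)
print(records)
```

Any iterable of lines works as input, so an open file handle can be passed instead of the `io.StringIO` sample.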

---

## Command-Line Interface

After installation, run the interactive CLI:

```bash
clustalomega
```

It prompts you to:
1. Choose an input method (manual entry or FASTA file)
2. Enter the k-tuple length

It then runs the full pipeline and prints all intermediate results.

---

## Algorithm Overview

The pipeline mirrors the original Clustal Omega algorithm:

```
1. k-tuple distance matrix
        ↓
2. mBed embedding (reference-based Euclidean projection)
        ↓
3. Bisecting k-means clustering  (⌈√N⌉ clusters)
        ↓
4. UPGMA guide tree
        ├─ per-cluster sub-trees (k-means UPGMA)
        └─ centroid-level super-tree
        ↓
5. Progressive alignment  (Profile HMM + Viterbi)
        ↓
6. Iterative refinement   (HHAlign-style, max 3 rounds)
        ↓
   Final MSA
```
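
To make step 1 concrete, here is a minimal sketch of one common k-tuple distance: one minus the fraction of k-mers two sequences share. The exact formula used in `_distance_10_15.py` is not stated here, so the normalization by the shorter sequence and the function names `kmer_counts`, `ktuple_distance`, and `distance_matrix` are assumptions.

```python
from collections import Counter
from typing import List


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def ktuple_distance(a: str, b: str, k: int = 3) -> float:
    """Distance in [0, 1]: 1 minus the fraction of shared k-mers."""
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    shared = sum((ca & cb).values())  # multiset intersection of k-mers
    # Assumption: normalize by the k-mer count of the shorter sequence.
    shortest = min(len(a), len(b)) - k + 1
    if shortest <= 0:
        return 1.0  # too short to contain any k-mer
    return 1.0 - shared / shortest


def distance_matrix(seqs: List[str], k: int = 3) -> List[List[float]]:
    """Symmetric matrix of pairwise k-tuple distances."""
    n = len(seqs)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = ktuple_distance(seqs[i], seqs[j], k)
    return matrix


seqs = ["ATGCTAGCTAGCT", "ATGCTAGCTAGCC", "TTGCTAGCTATCT"]
for row in distance_matrix(seqs, k=2):
    print(" ".join(f"{d:.3f}" for d in row))
```

Counting k-mers avoids a full pairwise alignment, which is why this step is cheap compared with the profile alignment later in the pipeline.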

---

## Example: 30-sequence dataset

```
SP score before refinement : -16487
SP score after  refinement : -12210   (gain: +4277)
Alignment length           : 37 columns
```
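
The SP (sum-of-pairs) score adds up a score for every pair of residues in every alignment column, so a higher (less negative) value indicates a better alignment. A minimal sketch follows. The scoring values (`match=1`, `mismatch=-1`, `gap=-2`), the rule that gap-gap pairs score zero, and the function name `sp_score` are assumptions for illustration and may differ from what `_refinement_54_60.py` uses.

```python
from itertools import combinations
from typing import List


def sp_score(rows: List[str], match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Sum-of-pairs score of an alignment given as equal-length rows."""
    if len({len(r) for r in rows}) > 1:
        raise ValueError("all rows must have the same length")
    total = 0
    for column in zip(*rows):
        # Score every unordered pair of symbols in this column.
        for x, y in combinations(column, 2):
            if x == "-" and y == "-":
                continue  # assumption: gap-gap pairs contribute nothing
            if x == "-" or y == "-":
                total += gap
            elif x == y:
                total += match
            else:
                total += mismatch
    return total


rows = ["ATG-C", "ATGAC", "TTG-C"]
print(sp_score(rows))
```

Comparing this score before and after a refinement round is one way to decide whether to keep the realigned result.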

---

## Project Structure

```
clustalomega/
├── __init__.py              # Public API: align()
├── cli.py                   # Interactive command-line entry point
├── _math_utils_1_5.py       # Math helpers (rounding, logarithm, padding)
├── _io_6_9.py               # FASTA parser and manual input
├── _distance_10_15.py       # k-tuple distance matrix
├── _embedding_16_22.py      # mBed embedding
├── _clustering_23_32.py     # Bisecting k-means
├── _guide_tree_33_43.py     # UPGMA guide tree
├── _alignment_44_53.py      # Profile HMM + Viterbi alignment
└── _refinement_54_60.py     # Iterative refinement + SP scoring
```

---

## License

MIT License — see [LICENSE](LICENSE) for details.

---

## Author

**Emre Taşdemir** — emre1.tasdemir.58@gmail.com
