Metadata-Version: 2.4
Name: Huseyin_Cakir
Version: 0.1.0
Summary: Multiple Sequence Alignment using Dynamic Programming
Author-email: Hüseyin Çakır <huseyincakirr1196@gmail.com>
License-Expression: MIT
Project-URL: Homepage, https://github.com/huseyincakir/Huseyin_Cakir
Keywords: bioinformatics,sequence alignment,dynamic programming,MSA
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Requires-Python: >=3.8
Description-Content-Type: text/markdown

# huseyinmsa

A Python library for **Multiple Sequence Alignment (MSA)** implemented from scratch using **Dynamic Programming**.

## Installation

```bash
pip install huseyinmsa
```

## Quick start

```python
from huseyinmsa import MSA, print_alignment

sequences = {
    "Seq1": "ACGTACGT",
    "Seq2": "ACGACGT",
    "Seq3": "ACGT",
}

msa = MSA()
alignment = msa.align(sequences)
print_alignment(alignment)
print("SP Score:", msa.sp_score(alignment))
```

## Reading / writing FASTA files

```python
from huseyinmsa import MSA, read_fasta, write_fasta, print_alignment

sequences = read_fasta("sequences.fasta")
msa = MSA(match=1, mismatch=-1, gap=-2)
alignment = msa.align(sequences)

print_alignment(alignment)
write_fasta(alignment, "aligned.fasta")
```

## Algorithm

The library implements **progressive multiple sequence alignment** driven entirely by dynamic programming.

### Stage 1 — Pairwise distances (Needleman-Wunsch)

Every pair of input sequences is globally aligned using the classic
**Needleman-Wunsch** O(n·m) DP algorithm.

```
dp[0][j] = j × gap
dp[i][0] = i × gap

dp[i][j] = max(
    dp[i-1][j-1] + sub(seq1[i], seq2[j]),   # match or mismatch
    dp[i-1][j]   + gap,                      # delete from seq1
    dp[i][j-1]   + gap,                      # insert into seq1
)
```

After alignment, **sequence identity** (identical columns / total columns)
is used to convert the score into a distance: `distance = 1 − identity`.

### Stage 2 — Guide tree (UPGMA)

The n×n distance matrix is clustered with **UPGMA**:

1. Each sequence starts as its own cluster.  
2. The two closest clusters are merged; their distance to every other
   cluster is recalculated as the size-weighted average.  
3. Repeat until a single root cluster remains.

The result is a binary tree (e.g. `(('A','B'),'C')`) that tells the aligner
which sequences to merge first — always the most similar ones.

### Stage 3 — Progressive alignment (profile–profile DP)

Sequences are merged bottom-up along the guide tree.  At each merge step,
two blocks of already-aligned sequences (**profiles**) are aligned with a
second DP pass.  The score between two columns is the *average substitution
score* over all non-gap character pairs; gap penalties are scaled by the
fraction of real characters in a column.

## Parameters

| Parameter  | Default | Meaning                           |
|------------|---------|-----------------------------------|
| `match`    | `1`     | Score for identical characters    |
| `mismatch` | `-1`    | Penalty for mismatched characters |
| `gap`      | `-2`    | Linear gap penalty                |

## Scoring

After alignment, the **Sum-of-Pairs (SP) score** is available:

```python
score = msa.sp_score(alignment)
```

It sums the pairwise substitution / gap score across every aligned column
for every pair of sequences.  Higher values indicate a better alignment.

## License

MIT
