Metadata-Version: 2.4
Name: hapnet
Version: 0.2.0
Summary: Population-aware, MST-based haplotype graph analysis and visualization in Python.
Author: Andrew A. Davinack
License-Expression: MIT
Project-URL: Homepage, https://github.com/parasiteguy/hapnet
Project-URL: Repository, https://github.com/parasiteguy/hapnet
Project-URL: Issues, https://github.com/parasiteguy/hapnet/issues
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Requires-Python: >=3.9
Description-Content-Type: text/markdown
Requires-Dist: biopython>=1.80
Requires-Dist: networkx>=3.0
Requires-Dist: matplotlib>=3.7
Requires-Dist: numpy>=1.24
Requires-Dist: scipy>=1.9

# HapNet

HapNet is a lightweight Python command-line package for constructing population-aware, minimum-spanning-tree-based haplotype graphs from aligned FASTA files.

HapNet is designed for reproducible single-locus workflows. It collapses identical aligned sequences into haplotypes, calculates pairwise Hamming distances among haplotypes, constructs a minimum spanning tree (MST), plots a population-aware haplotype graph, and writes machine-readable TSV summaries.

**Important scope note:** HapNet constructs an MST-based haplotype graph. It does not infer median-joining, statistical-parsimony, or reticulate haplotype networks.
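
The collapse, distance, and MST steps described above can be illustrated with a minimal standard-library sketch. This is not HapNet's implementation: the function names and the `H1`, `H2`, ... labels are hypothetical, the MST is built with Kruskal's algorithm as one possible choice, and ties between equal-distance edges are broken by sort order.

```python
from collections import defaultdict
from itertools import combinations


def collapse_haplotypes(records):
    """Group sequence IDs that share an identical aligned sequence."""
    groups = defaultdict(list)
    for seq_id, seq in records:
        groups[seq].append(seq_id)
    # Labels H1, H2, ... are illustrative, assigned in first-seen order.
    return {f"H{i}": (seq, ids) for i, (seq, ids) in enumerate(groups.items(), start=1)}


def hamming(a, b):
    """Count positions at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    return sum(x != y for x, y in zip(a, b))


def minimum_spanning_tree(haplotypes):
    """Kruskal's algorithm over all haplotype pairs, using union-find."""
    parent = {h: h for h in haplotypes}

    def find(h):
        while parent[h] != h:
            parent[h] = parent[parent[h]]  # path compression
            h = parent[h]
        return h

    edges = sorted(
        (hamming(haplotypes[u][0], haplotypes[v][0]), u, v)
        for u, v in combinations(haplotypes, 2)
    )
    tree = []
    for dist, u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # edge joins two components, so keep it
            parent[root_u] = root_v
            tree.append((u, v, dist))
    return tree


records = [
    ("Ind01_NK", "ACTGACTG"),
    ("Ind02_RI", "ACTGATTA"),
    ("Ind03_NK", "ACTGACTG"),
    ("Ind04_RI", "ACTGACTA"),
]
haplotypes = collapse_haplotypes(records)
mst = minimum_spanning_tree(haplotypes)
```

Here the four sequences collapse into three haplotypes, and the tree connects them with two edges. Because an MST keeps exactly one path between any two haplotypes, it cannot represent the alternative connections (reticulations) that the scope note excludes.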

## Installation

```bash
pip install hapnet
```

## Standard input format

By default, the population label is taken from the last underscore-delimited token of each FASTA header, so `Ind01_NK` is assigned to population `NK`:

```text
>Ind01_NK
ACTGACTG
>Ind02_RI
ACTGATTA
```
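
The header rule can be sketched in a few lines of Python. This is an illustration of the convention, not HapNet's own parser; `population_from_header` is a hypothetical name, and rejecting headers without an underscore is an assumption about sensible behavior.

```python
def population_from_header(header):
    """Return the text after the last underscore in a FASTA header."""
    name = header.lstrip(">").strip()
    _, sep, population = name.rpartition("_")
    if not sep or not population:
        # Assumption: a header with no population token is treated as an error.
        raise ValueError(f"no population token in header: {header!r}")
    return population


print(population_from_header(">Ind01_NK"))
print(population_from_header(">Site_A_07_RI"))
```

Splitting from the right means sample names may themselves contain underscores, as long as the population label does not.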

Run:

```bash
hapnet examples/basic/basic.fasta --out basic.svg --log-prefix basic
```

## Phased diploid input

HapNet v0.2.0 adds optional support for diploid sequences that have already been phased. HapNet does **not** infer phase: phasing must be done beforehand with external software, and HapNet only preserves each individual's identity across its phased allele copies.

Header format in phased mode:

```text
>Ind01_a_NK
ACTGACTG
>Ind01_b_NK
ACTGATTA
```
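
A minimal sketch of how such headers could be split and regrouped by individual is shown below. It assumes the header tokens are `individual_allele_population`, read from the right; the names `PhasedHeader` and `parse_phased_header` are hypothetical and do not come from HapNet.

```python
from collections import defaultdict
from typing import NamedTuple


class PhasedHeader(NamedTuple):
    individual: str
    allele: str
    population: str


def parse_phased_header(header):
    """Split 'Ind01_a_NK' into individual, allele, and population."""
    # Splitting from the right lets the individual ID contain underscores.
    parts = header.lstrip(">").strip().rsplit("_", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected individual_allele_population, got {header!r}")
    return PhasedHeader(*parts)


records = [(">Ind01_a_NK", "ACTGACTG"), (">Ind01_b_NK", "ACTGATTA")]

# Regroup allele copies under the individual they belong to.
genotypes = defaultdict(dict)
for header, seq in records:
    parsed = parse_phased_header(header)
    genotypes[parsed.individual][parsed.allele] = seq
```

After the loop, `genotypes["Ind01"]` holds both allele copies keyed by `a` and `b`.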

Run:

```bash
hapnet examples/phased/phased_example.fasta --phased --out phased.svg --log-prefix phased
```

In addition to the standard outputs, phased mode writes an individual-level genotype table named after the log prefix:

```text
phased_individual_genotypes.tsv
```

## Metadata file option

Instead of encoding population and allele information in headers, users can provide a tab-delimited metadata file:

```text
sequence_id	individual_id	allele	population
Ind01_a	Ind01	a	NK
Ind01_b	Ind01	b	NK
```
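
Before a run, a metadata file in this layout can be checked with a short script such as the sketch below. It uses only the standard `csv` module; `read_metadata` is a hypothetical helper, and treating all four columns as required and duplicate `sequence_id` values as errors are assumptions, not documented HapNet behavior.

```python
import csv
import io

REQUIRED_COLUMNS = ("sequence_id", "individual_id", "allele", "population")


def read_metadata(handle):
    """Read a tab-delimited metadata table keyed by sequence_id."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing metadata columns: {missing}")
    table = {}
    for row in reader:
        seq_id = row["sequence_id"]
        if seq_id in table:
            # Assumption: each sequence should appear exactly once.
            raise ValueError(f"duplicate sequence_id: {seq_id}")
        table[seq_id] = row
    return table


example = (
    "sequence_id\tindividual_id\tallele\tpopulation\n"
    "Ind01_a\tInd01\ta\tNK\n"
    "Ind01_b\tInd01\tb\tNK\n"
)
metadata = read_metadata(io.StringIO(example))
```

For a file on disk, pass `open("metadata.tsv", newline="")` instead of the `io.StringIO` object.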

Run:

```bash
hapnet input.fasta --metadata metadata.tsv --phased --out network.svg --log-prefix run1
```

## Main output files

For `--log-prefix run1`, HapNet writes:

- `run1_haplotypes.tsv`: haplotype IDs, frequencies, populations, and sequences
- `run1_membership.tsv`: sequence-to-haplotype membership
- `run1_shared_haplotypes.tsv`: haplotypes shared among populations
- `run1_haplotype_individuals.tsv`: individuals represented in each haplotype
- `run1_summary.tsv`: summary statistics
- `run1_run_metadata.tsv`: run metadata for reproducibility
- `run1_individual_genotypes.tsv`: phased individual genotype table, written only with `--phased`
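
The file names listed above follow directly from the log prefix, so a downstream script can check that a run completed. The sketch below is a convenience built from that list; `expected_outputs` and `missing_outputs` are hypothetical helpers, not part of HapNet.

```python
from pathlib import Path

STANDARD_SUFFIXES = (
    "_haplotypes.tsv",
    "_membership.tsv",
    "_shared_haplotypes.tsv",
    "_haplotype_individuals.tsv",
    "_summary.tsv",
    "_run_metadata.tsv",
)
PHASED_SUFFIX = "_individual_genotypes.tsv"  # written only with --phased


def expected_outputs(log_prefix, phased=False):
    """List the TSV paths a run with this prefix should produce."""
    suffixes = STANDARD_SUFFIXES + ((PHASED_SUFFIX,) if phased else ())
    return [Path(f"{log_prefix}{suffix}") for suffix in suffixes]


def missing_outputs(log_prefix, phased=False):
    """Return the expected paths that do not exist on disk."""
    return [p for p in expected_outputs(log_prefix, phased) if not p.is_file()]
```

For example, `missing_outputs("run1", phased=True)` returns an empty list only if all seven tables are present.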

## Ambiguous and missing data

By default, ambiguous characters and gaps are treated as literal character states during Hamming-distance calculation. To ignore positions containing `N`, `?`, `-`, or `.` in either sequence during pairwise comparisons, use:

```bash
hapnet input.fasta --ignore-ambiguous --out network.svg --log-prefix run1
```
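
The difference between the two modes can be seen in a small sketch of a Hamming distance with an optional ignore set. This illustrates the behavior described above and is not HapNet's code; the `hamming` function and its `ignore_ambiguous` parameter are hypothetical, and matching only the four listed characters exactly as written is an assumption.

```python
IGNORED = frozenset("N?-.")


def hamming(a, b, ignore_ambiguous=False):
    """Count differing positions between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    distance = 0
    for x, y in zip(a, b):
        if ignore_ambiguous and (x in IGNORED or y in IGNORED):
            continue  # skip the position if either sequence is ambiguous
        if x != y:
            distance += 1
    return distance


literal = hamming("AC-GN", "ATTGA")                          # 3
ignored = hamming("AC-GN", "ATTGA", ignore_ambiguous=True)   # 1
```

In literal mode the gap and the `N` each count as a mismatch, so sequences that differ only in missing data become separate haplotypes. With ambiguous positions ignored, only the `C`/`T` difference remains.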

## Scripted workflow example

HapNet can be integrated into a shell workflow that processes many aligned FASTA files:

```bash
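# Create the output directories up front, in case hapnet does not create them.
mkdir -p networks results
for fasta in alignments/*.fasta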
do
    base=$(basename "$fasta" .fasta)
    hapnet "$fasta" \
      --out "networks/${base}.svg" \
      --log-prefix "results/${base}"
done
```
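
The same batch run can be driven from Python when the results feed into further analysis. The sketch below uses only the command-line options shown in this README; `build_command` and `run_batch` are hypothetical helpers, and creating the output directories first is a precaution, since it is not stated whether HapNet creates them.

```python
import subprocess
from pathlib import Path


def build_command(fasta, network_dir="networks", result_dir="results"):
    """Build the hapnet command line for one aligned FASTA file."""
    base = Path(fasta).stem
    return [
        "hapnet",
        str(fasta),
        "--out", str(Path(network_dir) / f"{base}.svg"),
        "--log-prefix", str(Path(result_dir) / base),
    ]


def run_batch(alignment_dir="alignments"):
    """Run hapnet on every .fasta file in a directory, stopping on failure."""
    Path("networks").mkdir(parents=True, exist_ok=True)
    Path("results").mkdir(parents=True, exist_ok=True)
    for fasta in sorted(Path(alignment_dir).glob("*.fasta")):
        # check=True raises CalledProcessError if hapnet exits non-zero.
        subprocess.run(build_command(fasta), check=True)
```

Calling `run_batch()` from the project directory mirrors the shell loop, with the difference that it stops at the first file for which `hapnet` fails.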

## Citation

If you use HapNet, please cite the associated manuscript once it is available.
