Metadata-Version: 2.4
Name: hapnet
Version: 0.2.1
Summary: Population-aware, MST-based haplotype graph analysis and visualization in Python.
Author: Andrew A. Davinack
License-Expression: MIT
Project-URL: Homepage, https://github.com/parasiteguy/hapnet
Project-URL: Repository, https://github.com/parasiteguy/hapnet
Project-URL: Issues, https://github.com/parasiteguy/hapnet/issues
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: biopython>=1.80
Requires-Dist: networkx>=3.0
Requires-Dist: matplotlib>=3.7
Requires-Dist: numpy>=1.24
Requires-Dist: scipy>=1.9
Dynamic: license-file

# HapNet

HapNet is a lightweight Python command-line package for constructing population-aware, minimum-spanning-tree-based haplotype graphs from aligned FASTA files.

HapNet is designed for reproducible single-locus workflows. It collapses identical aligned sequences into haplotypes, calculates pairwise Hamming distances among haplotypes, constructs a minimum spanning tree (MST), plots a population-aware haplotype graph, and writes machine-readable TSV summaries describing haplotype composition, population membership, individual membership, shared/private haplotypes, and run metadata.
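The first three steps above (collapsing, distance calculation, MST construction) can be sketched in plain Python. This is an illustrative sketch only, not HapNet's actual implementation: the function names are hypothetical, the MST is built here with Kruskal's algorithm (the README does not say which algorithm HapNet uses), and `Ind04_NK` is an invented record added so that the tree has more than one edge.

```python
from collections import defaultdict
from itertools import combinations


def collapse_haplotypes(records):
    """Group identical aligned sequences; returns {sequence: [header, ...]}."""
    groups = defaultdict(list)
    for header, seq in records:
        groups[seq].append(header)
    return dict(groups)


def hamming(a, b):
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def minimum_spanning_tree(seqs):
    """Kruskal's algorithm over the complete Hamming-distance graph.

    Returns a list of (i, j, distance) edges, where i and j index into seqs.
    """
    edges = sorted(
        (hamming(seqs[i], seqs[j]), i, j)
        for i, j in combinations(range(len(seqs)), 2)
    )
    parent = list(range(len(seqs)))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for dist, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:  # the edge joins two separate components
            parent[root_i] = root_j
            tree.append((i, j, dist))
    return tree


records = [
    ("Ind01_NK", "ACTGACTG"),
    ("Ind02_RI", "ACTGATTA"),
    ("Ind03_RI", "ACTGATTA"),
    ("Ind04_NK", "ACTGATTG"),  # hypothetical extra record
]
haplotypes = collapse_haplotypes(records)
seqs = list(haplotypes)
mst_edges = minimum_spanning_tree(seqs)
```

Here the four sequences collapse into three haplotypes, and the resulting tree connects them with two edges.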

**Important scope note:** HapNet constructs an MST-based haplotype graph. It does **not** infer median-joining, statistical-parsimony, or reticulate haplotype networks.

## Installation

```bash
pip install hapnet
```

To check the installed version:

```bash
hapnet --version
```

## Basic usage

For a standard aligned FASTA file in which population identity is encoded in the sequence header, run:

```bash
hapnet input.fasta --out network.svg --log-prefix run1
```

This writes the network figure to `network.svg` and a set of TSV output files whose names begin with the prefix `run1`.

For a cleaner publication-style figure without haplotype labels inside nodes, use:

```bash
hapnet input.fasta --out network_unlabeled.svg --log-prefix run1 --hide-labels
```

Haplotype identities remain documented in the output tables even when labels are hidden from the figure.

## Standard input format

By default, HapNet parses population identity from the final underscore-delimited token in each FASTA header.

Example:

```text
>Ind01_NK
ACTGACTG
>Ind02_RI
ACTGATTA
>Ind03_RI
ACTGATTA
```

In this example, `Ind01_NK` is interpreted as individual or sequence `Ind01` from population `NK`.

Run:

```bash
hapnet examples/basic/basic.fasta --out basic.svg --log-prefix basic
```
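The default header rule can be sketched as follows. This is a minimal illustration of "final underscore-delimited token", not HapNet's own parser; `parse_header` is a hypothetical name, and raising an error for headers without an underscore is an assumption about how such input might be handled.

```python
def parse_header(header):
    """Split a FASTA header into (name, population).

    The population is the final underscore-delimited token; everything
    before it is kept as the individual or sequence name.
    """
    name, sep, population = header.lstrip(">").rpartition("_")
    if not sep or not name or not population:
        # Assumed behaviour: reject headers that carry no population token.
        raise ValueError(f"no population token in header: {header!r}")
    return name, population


print(parse_header(">Ind01_NK"))     # ('Ind01', 'NK')
print(parse_header("Sample_07_RI"))  # ('Sample_07', 'RI')
```

Because only the last token is split off, names that themselves contain underscores stay intact.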

## Phased diploid input

Since v0.2.0, HapNet optionally supports already-phased diploid sequences. HapNet does **not** infer phase; it only preserves individual identity across the phased allele copies generated by external software.

Header format in phased mode:

```text
>Ind01_a_NK
ACTGACTG
>Ind01_b_NK
ACTGATTA
>Ind02_a_RI
ACTGACTG
>Ind02_b_RI
ACTGACTG
```

Run:

```bash
hapnet examples/phased/phased_example.fasta --phased --out phased.svg --log-prefix phased
```

This writes all standard HapNet output files plus an individual-level genotype table:

```text
phased_individual_genotypes.tsv
```

This table summarizes the haplotypes carried by each phased individual and classifies each individual as homozygous or heterozygous for the haplotypes HapNet identifies.
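The homozygous/heterozygous classification can be sketched as below. This is an assumption-laden illustration, not HapNet's code: `classify_genotypes` is a hypothetical name, and headers are assumed to follow the `<individual>_<allele>_<population>` pattern seen in the phased example above.

```python
from collections import defaultdict


def classify_genotypes(records):
    """Return {individual: 'homozygous' | 'heterozygous'}.

    records is an iterable of (header, sequence) pairs whose headers are
    assumed to look like '<individual>_<allele>_<population>'.
    """
    alleles = defaultdict(list)
    for header, seq in records:
        individual, _allele, _population = header.lstrip(">").rsplit("_", 2)
        alleles[individual].append(seq)

    genotypes = {}
    for individual, seqs in alleles.items():
        if len(seqs) != 2:
            raise ValueError(
                f"{individual}: expected 2 allele copies, got {len(seqs)}"
            )
        # Identical allele copies collapse to the same haplotype.
        same = seqs[0] == seqs[1]
        genotypes[individual] = "homozygous" if same else "heterozygous"
    return genotypes


phased_records = [
    ("Ind01_a_NK", "ACTGACTG"),
    ("Ind01_b_NK", "ACTGATTA"),
    ("Ind02_a_RI", "ACTGACTG"),
    ("Ind02_b_RI", "ACTGACTG"),
]
genotypes = classify_genotypes(phased_records)
```

With the example records, `Ind01` carries two different sequences and `Ind02` carries two identical copies.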

## Metadata file option

Instead of encoding population and allele information directly in FASTA headers, users can provide a tab-delimited metadata file.

Example metadata file:

```text
sequence_id	individual_id	allele	population
Ind01_a	Ind01	a	NK
Ind01_b	Ind01	b	NK
Ind02_a	Ind02	a	RI
Ind02_b	Ind02	b	RI
```

Run:

```bash
hapnet input.fasta --metadata metadata.tsv --phased --out network.svg --log-prefix run1
```

Metadata values override header parsing. This option is useful when FASTA headers do not follow the default underscore-delimited format.
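A metadata table in this layout can be read with Python's standard `csv` module. The sketch below is illustrative only: `read_metadata` is a hypothetical helper, and treating all four columns as required is an assumption rather than documented HapNet behaviour.

```python
import csv
import io

REQUIRED_COLUMNS = {"sequence_id", "individual_id", "allele", "population"}


def read_metadata(handle):
    """Read a tab-delimited metadata table into {sequence_id: row_dict}."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        # Assumption: all four columns from the example are mandatory.
        raise ValueError(f"metadata is missing columns: {sorted(missing)}")
    return {row["sequence_id"]: row for row in reader}


example = io.StringIO(
    "sequence_id\tindividual_id\tallele\tpopulation\n"
    "Ind01_a\tInd01\ta\tNK\n"
    "Ind01_b\tInd01\tb\tNK\n"
    "Ind02_a\tInd02\ta\tRI\n"
    "Ind02_b\tInd02\tb\tRI\n"
)
metadata = read_metadata(example)
print(metadata["Ind02_b"]["population"])  # RI
```

To read a real file, pass `open("metadata.tsv", newline="")` in place of the `StringIO` object.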

## Ambiguous and missing data

By default, ambiguous characters and gaps are treated as literal character states during Hamming-distance calculation. To ignore positions containing `N`, `?`, `-`, or `.` in either sequence during pairwise comparisons, use:

```bash
hapnet input.fasta --ignore-ambiguous --out network.svg --log-prefix run1
```

This option may be useful for alignments containing missing data, ambiguous base calls, or gaps.
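The difference between the two modes can be sketched as a Hamming distance with an optional skip rule. This is an illustration of the behaviour described above, not HapNet's implementation; the function and parameter names are hypothetical, and only the four characters listed above are skipped.

```python
IGNORED = frozenset("N?-.")


def hamming(a, b, ignore_ambiguous=False):
    """Count differing positions between two aligned sequences.

    With ignore_ambiguous=True, any position where either sequence has
    one of the characters in IGNORED is left out of the comparison.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    differences = 0
    for x, y in zip(a, b):
        if ignore_ambiguous and (x in IGNORED or y in IGNORED):
            continue
        if x != y:
            differences += 1
    return differences


print(hamming("ACTGNCTG", "ACTGACT-"))                         # 2
print(hamming("ACTGNCTG", "ACTGACT-", ignore_ambiguous=True))  # 0
```

In the default mode the `N` and the gap each count as a mismatch, so two sequences that differ only in missing data are treated as distinct haplotypes.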

## Figure-label options

By default, HapNet displays haplotype labels such as `H1`, `H2`, and `H3` inside network nodes.

To hide haplotype labels for a cleaner publication figure:

```bash
hapnet input.fasta --out network_unlabeled.svg --log-prefix run1 --hide-labels
```

To include haplotype counts inside node labels:

```bash
hapnet input.fasta --out network_counts.svg --log-prefix run1 --show-counts-in-label
```

The figure-label options affect only the network image. Haplotype identities, sequence membership, and population membership are still recorded in the TSV output files.

## Main output files

For `--log-prefix run1`, HapNet writes:

* `run1_haplotypes.tsv`: haplotype IDs, frequencies, population composition, and sequences
* `run1_membership.tsv`: sequence-to-haplotype membership
* `run1_shared_haplotypes.tsv`: haplotypes shared among two or more populations
* `run1_haplotype_individuals.tsv`: individuals represented in each haplotype
* `run1_summary.tsv`: summary statistics, including number of sequences, haplotypes, shared haplotypes, private haplotypes, and largest haplotype size
* `run1_run_metadata.tsv`: run metadata for reproducibility, including HapNet version, input file, output file, algorithm, distance metric, and parameter settings
* `run1_individual_genotypes.tsv`: phased individual genotype table, written only when `--phased` is used
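When HapNet is run from a larger pipeline, it can help to confirm that the files listed above were written. The helper below is a hypothetical sketch that only rebuilds the file names from the list above; it makes no assumption about the columns inside each table.

```python
from pathlib import Path

STANDARD_TABLES = [
    "haplotypes",
    "membership",
    "shared_haplotypes",
    "haplotype_individuals",
    "summary",
    "run_metadata",
]


def expected_outputs(log_prefix, phased=False):
    """List the TSV paths expected for a given --log-prefix value."""
    names = [f"{log_prefix}_{table}.tsv" for table in STANDARD_TABLES]
    if phased:
        # Written only when --phased is used.
        names.append(f"{log_prefix}_individual_genotypes.tsv")
    return [Path(name) for name in names]


def missing_outputs(log_prefix, phased=False):
    """Return the expected TSV paths that do not exist on disk."""
    return [p for p in expected_outputs(log_prefix, phased) if not p.exists()]
```

For example, `missing_outputs("run1", phased=True)` returns an empty list once all seven tables are present.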

## Scripted workflow example

HapNet can be integrated into a shell workflow that processes many aligned FASTA files. For example, the following loop generates one network figure and one set of output tables for every FASTA file in an `alignments/` directory:

```bash
mkdir -p networks results

for fasta in alignments/*.fasta
do
    base=$(basename "$fasta" .fasta)

    hapnet "$fasta" \
      --out "networks/${base}_network.svg" \
      --log-prefix "results/${base}" \
      --hide-labels
done
```

This workflow is useful for comparative barcoding, phylogeographic, taxonomic, or teaching datasets where many aligned single-locus FASTA files need to be processed consistently.
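The same loop can be driven from Python with `subprocess`, which may be convenient when HapNet is one step in a larger script. This is a sketch under assumptions: `build_command` and `run_all` are hypothetical helpers, only flags shown in this README are used, and the code assumes that `hapnet` exits with a non-zero status on failure.

```python
import subprocess
from pathlib import Path


def build_command(fasta, network_dir="networks", results_dir="results"):
    """Build the hapnet argument list for one aligned FASTA file."""
    base = Path(fasta).stem
    return [
        "hapnet", str(fasta),
        "--out", str(Path(network_dir) / f"{base}_network.svg"),
        "--log-prefix", str(Path(results_dir) / base),
        "--hide-labels",
    ]


def run_all(alignment_dir="alignments"):
    """Run hapnet on every *.fasta file; return names of files that failed."""
    Path("networks").mkdir(exist_ok=True)
    Path("results").mkdir(exist_ok=True)
    failures = []
    for fasta in sorted(Path(alignment_dir).glob("*.fasta")):
        result = subprocess.run(build_command(fasta))
        if result.returncode != 0:  # assumed to signal a failed run
            failures.append(fasta.name)
    return failures
```

Calling `run_all()` from the project directory mirrors the shell loop above and additionally reports which inputs did not complete.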

## Example: standard mitochondrial workflow

A typical mitochondrial COI workflow uses an aligned FASTA file with population identity encoded in the sequence headers.

```bash
hapnet examples/polydora_websteri/Pwebsteri_revised.fasta \
  --out examples/polydora_websteri/polydorawebsteri_network.svg \
  --log-prefix examples/polydora_websteri/polydorawebsteri
```

For a cleaner publication-style figure:

```bash
hapnet examples/polydora_websteri/Pwebsteri_revised.fasta \
  --out examples/polydora_websteri/polydorawebsteri_network_unlabeled.svg \
  --log-prefix examples/polydora_websteri/polydorawebsteri \
  --hide-labels
```

## Command-line help

To see all available options:

```bash
hapnet --help
```

## Citation

If you use HapNet, please cite the associated manuscript when available.
