Metadata-Version: 2.4
Name: midsplit
Version: 0.1.0
Summary: Demultiplex multi-reference BAM files into per-reference buckets and call consensus
Requires-Python: >=3.10
Requires-Dist: dark-matter>=7.1.19
Requires-Dist: numpy>=2.0
Requires-Dist: pysam>=0.22
Description-Content-Type: text/markdown

# midsplit

Demultiplex a multi-reference BAM file into per-reference buckets and call a
consensus sequence for each non-empty bucket.

## What it does

When reads are aligned to multiple reference sequences in a single BAM file,
midsplit assigns each read (or read pair) to the reference(s) it matches best,
then produces a separate BAM, consensus FASTA, and per-site statistics file for
each reference that received at least one read.

The classification uses the NM (edit-distance) tag to compute a percent
identity for each alignment.  A read is assigned to a reference if its percent
identity on that reference is at least `--threshold` (default 0.95) times the
best percent identity the read achieves across all references.  Paired-end
reads are treated as a unit using an overlap-aware combined percent identity,
so both mates always land in the same bucket(s).
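
The threshold rule can be sketched in a few lines of Python. This is a minimal illustration, not midsplit's actual code: the function names are hypothetical, identities are expressed as fractions rather than percentages, and `percent_identity` assumes a simple `1 - NM / aligned_length` definition that may differ from the one midsplit uses (the overlap-aware handling of read pairs is not shown).

```python
def percent_identity(nm: int, aligned_length: int) -> float:
    """Assumed definition: fraction of aligned bases not counted as edits by NM."""
    if aligned_length <= 0:
        raise ValueError("aligned_length must be positive")
    return 1.0 - nm / aligned_length


def assign_references(pids: dict[str, float], threshold: float = 0.95) -> set[str]:
    """Return every reference whose identity is at least threshold * best identity."""
    if not pids:
        return set()
    best = max(pids.values())
    return {ref for ref, pid in pids.items() if pid >= threshold * best}


# Hypothetical read with one alignment per reference.
pids = {
    "refA": percent_identity(nm=2, aligned_length=100),
    "refB": percent_identity(nm=5, aligned_length=100),
    "refC": percent_identity(nm=20, aligned_length=100),
}
buckets = assign_references(pids)
```

With the default threshold, `refA` and `refB` both pass the cutoff and `refC` does not, so the read lands in two buckets. A threshold of `1.0` would keep only the best-matching reference(s).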

## Output

For each reference that receives reads, midsplit writes the `{ID}` files below,
where `{ID}` is the reference name; `summary.txt` holds run-level results:

| File | Contents |
|------|----------|
| `{ID}.bam` / `{ID}.bam.bai` | Sorted, indexed per-reference BAM |
| `{ID}-consensus.fasta` | Consensus sequence called by ivar |
| `{ID}-per-site.tsv` | Per-position depth, A/C/G/T counts, ref base, and consensus base |
| `summary.txt` | Run-level statistics and consensus-vs-reference comparison |

The per-site TSV has columns: `site`, `ref_base`, `consensus_base`, `depth`,
`A`, `C`, `G`, `T`.  When `--align` is used, the consensus base is mapped back
to the correct reference position even when ivar has inserted or deleted bases
relative to the reference.
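
As an illustration of consuming the per-site file, the sketch below lists positions where the consensus differs from the reference. It assumes the file is tab-separated with a header row containing the column names above, and it treats a consensus `N` as masked rather than as a difference; both are assumptions, and `consensus_differences` is a hypothetical helper, not part of midsplit.

```python
import csv
import io


def consensus_differences(tsv_text: str) -> list[tuple[int, str, str]]:
    """Return (site, ref_base, consensus_base) for rows where the two bases differ."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    diffs = []
    for row in reader:
        ref = row["ref_base"].upper()
        cons = row["consensus_base"].upper()
        # Assumption: an N in the consensus is a masked site, not a real difference.
        if cons != "N" and ref != cons:
            diffs.append((int(row["site"]), ref, cons))
    return diffs


# Made-up example content in the documented column order.
example = (
    "site\tref_base\tconsensus_base\tdepth\tA\tC\tG\tT\n"
    "1\tA\tA\t30\t30\t0\t0\t0\n"
    "2\tC\tT\t28\t0\t3\t0\t25\n"
    "3\tG\tN\t0\t0\t0\t0\t0\n"
)
differences = consensus_differences(example)
```

For a real run, the text of a `{ID}-per-site.tsv` file from the output directory can be passed in place of `example`.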

## Usage

```
midsplit [options] INPUT_BAM
```

### Options

| Option | Default | Description |
|--------|---------|-------------|
| `--output-dir DIR` | `.` | Directory for all output files (created if absent) |
| `--threshold FLOAT` | `0.95` | Minimum fraction of a read's best percent identity (PID) required to assign it to a reference |
| `--reference FASTA` | — | Multi-reference FASTA; enables `ref_base` column and consensus comparison |
| `--align` | off | Align consensus to reference before comparison (recommended when lengths differ) |
| `--aligner` | `mafft` | Aligner for `--align`: `mafft`, `needle`, or `edlib` |
| `--aligner-options OPTIONS` | — | Extra options forwarded to the aligner (implies `--align`) |
| `--consensus-quality INT` | `20` | Minimum base quality passed to ivar (`-q`) |
| `--consensus-frequency-threshold FLOAT` | `0.0` | Minimum frequency for ivar to call a base (`-t`) |
| `--consensus-low-coverage INT` | `0` | Depth below which ivar masks with N (`-m`) |
| `--consensus-id TEMPLATE` | — | ID for the consensus sequence; use `{ID}` to embed the reference name |

### Example

```bash
midsplit \
  --reference references/multi.fasta \
  --output-dir results/ \
  --align \
  --threshold 0.95 \
  alignments/reads-vs-multi.bam
```

## Requirements

- Python 3.10+
- [samtools](https://www.htslib.org/) and [ivar](https://andersen-lab.github.io/ivar/html/) on `PATH`
- Python dependencies are managed with [uv](https://github.com/astral-sh/uv); run `uv sync` to install them

## Notes

- Only primary and secondary alignments are used; supplementary alignments
  (chimeric/split reads) are skipped.
- When run with `--all` or `-k N`, Bowtie2 reports non-best hits as secondary
  alignments; midsplit includes these in classification so that all alignment
  evidence is used.
- For circular genomes (e.g. HBV) mapped against a linearised reference,
  local alignment (`bowtie2 --very-sensitive-local`) is strongly recommended
  over end-to-end alignment.  End-to-end mode cannot soft-clip reads that span
  the linearisation junction, which introduces artefactual bases near position 1
  of the reference and can corrupt the consensus at those positions.
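
The alignment filter described in the first note can be expressed with the standard SAM flag bits. This is a sketch only: the function name is hypothetical, and also excluding unmapped records is an assumption not stated above.

```python
# Flag bits defined by the SAM specification.
UNMAPPED = 0x4
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def is_used_for_classification(flag: int) -> bool:
    """True for mapped primary or secondary alignments, False otherwise."""
    if flag & UNMAPPED:
        # Assumption: unmapped records carry no alignment evidence and are skipped.
        return False
    # Supplementary (chimeric/split) alignments are skipped; secondary ones are kept.
    return not flag & SUPPLEMENTARY


flags = {
    "primary": 0,
    "secondary": SECONDARY,
    "supplementary": SUPPLEMENTARY,
    "unmapped": UNMAPPED,
}
kept = [name for name, flag in flags.items() if is_used_for_classification(flag)]
```

With pysam, the same test could be applied to the `flag` attribute of each record returned when iterating over an `AlignmentFile`.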
