midsplit — session transcript (high-level)
==========================================

SESSION 1  (2026-04-16)
-----------------------

GOAL
  Build a Python CLI tool called midsplit that reads a multi-reference BAM
  file, classifies reads into per-reference buckets by alignment quality,
  writes a per-reference BAM for each non-empty bucket, calls consensus for
  each bucket, and produces coverage/soft-clip statistics.

DESIGN DISCUSSION
  Two rounds of clarifying questions settled the following decisions:

  - Thresholding: use the XM tag (mismatch count) to compute percent identity
    as (query_length - XM) / query_length. Assign a read to a reference if its
    pid >= threshold * best_pid_for_that_read across all its alignments.
    Single-reference reads are always assigned regardless of score.

  - Skip non-primary alignments (SAM flags 256/2048).

  - Missing XM tag: print an error message and exit.

  - Soft-clip stats:
      "before" = all primary alignments to a reference (including reads that
                 end up assigned elsewhere)
      "after"  = only the reads actually placed in that reference's bucket

  - Consensus caller: ivar by default, via the pipeline:
      samtools mpileup -d 0 -aa -A -Q 0 {bam} | ivar consensus -p {prefix} \
        -q {quality} -t {freq_threshold} -m {min_depth}
    ivar's .fa output is renamed to .fasta and placed in the output directory
    as {accession}-consensus.fasta.

  - CLI options: --threshold (0.95), --consensus-caller (ivar),
    --consensus-quality (20), --consensus-frequency-threshold (0),
    --consensus-low-coverage (0).

  - Output: per-reference sorted+indexed BAM, {accession}-consensus.fasta,
    summary.txt (plain text), summary.xlsx (one sheet per reference).

  - Coverage stats computed in-process during the second BAM pass using
    per-position depth arrays.
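  The thresholding rule above can be sketched in a few lines (a minimal
  sketch with illustrative names; the real logic lives in
  src/midsplit/classify.py and its exact signatures may differ):

```python
def pid(query_length: int, xm: int) -> float:
    """Percent identity from the XM mismatch count."""
    return (query_length - xm) / query_length


def assign_refs(alignments: dict[str, float], threshold: float = 0.95) -> set[str]:
    """Return the references a read is assigned to.

    `alignments` maps reference name -> pid for each of the read's
    primary alignments.  A single-reference read is always assigned;
    otherwise the read goes to every reference whose pid is within
    `threshold` of the read's best pid.
    """
    if len(alignments) == 1:
        return set(alignments)
    best = max(alignments.values())
    return {ref for ref, p in alignments.items() if p >= threshold * best}
```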

IMPLEMENTATION
  Files created:
    pyproject.toml                  updated with pysam, numpy, openpyxl deps
                                    and midsplit entry point
    src/midsplit/__init__.py
    src/midsplit/classify.py        pure classification logic (no pysam)
    src/midsplit/stats.py           ReferenceAccumulator + compute_stats
    src/midsplit/consensus.py       ivar subprocess wrapper
    src/midsplit/writer.py          two-pass BAM processing pipeline
    src/midsplit/report.py          summary.txt and summary.xlsx writers
    src/midsplit/cli/__init__.py
    src/midsplit/cli/main.py        argparse entry point
    tests/__init__.py
    tests/conftest.py               make_bam() helper + two_ref_bam fixture
    tests/test_classify.py          19 unit tests for classify module
    tests/test_stats.py             12 unit tests for stats module
    tests/test_writer.py            12 integration tests for process_bam()

  All 43 tests pass.

SESSION 2  (2026-04-16) — paired-end / overlapping reads
---------------------------------------------------------

DESIGN DISCUSSION
  Considered how to handle paired-end reads. Decided on Option C:
  treat the pair as a single unit using a combined percent identity that
  correctly accounts for overlapping reads.

  Formula for compute_combined_pid:
    - Compute the union of reference positions from both reads.
    - For unique-to-r1 positions: use r1's per-position mismatch rate.
    - For unique-to-r2 positions: use r2's per-position mismatch rate.
    - For overlap positions: use the average of both mismatch rates.
    - combined_pid = 1 - expected_mismatches / len(union).
  For single-end reads (or when one mate is unmapped), this reduces to
  (n_aligned - XM) / n_aligned.
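  A minimal sketch of the formula, assuming each read's per-position
  mismatch rate is XM divided by its aligned-position count (position sets
  stand in for pysam's get_reference_positions() output; the real
  compute_combined_pid in classify.py may differ in detail):

```python
def combined_pid(pos1: set[int], xm1: int, pos2: set[int], xm2: int) -> float:
    """Combined percent identity for an r1+r2 pair over their position union."""
    rate1 = xm1 / len(pos1)  # r1's per-position mismatch rate
    rate2 = xm2 / len(pos2)  # r2's per-position mismatch rate
    union = pos1 | pos2
    overlap = pos1 & pos2
    expected = (
        len(pos1 - pos2) * rate1        # unique-to-r1 positions
        + len(pos2 - pos1) * rate2      # unique-to-r2 positions
        + len(overlap) * (rate1 + rate2) / 2.0  # overlap: average the rates
    )
    return 1.0 - expected / len(union)
```

  Passing the same positions and XM for both mates reproduces the
  single-end form 1 - XM / n_aligned.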

IMPLEMENTATION
  Files changed:
    src/midsplit/classify.py   removed ReadAlignmentSummary; added
                               PairAlignmentSummary and compute_combined_pid;
                               updated classify_reads signature.
    src/midsplit/writer.py     Pass 1 now collects _AlignmentRecord objects
                               (including get_reference_positions()); new
                               _build_pair_summaries() groups by
                               (query_name, ref_name) and computes combined PIDs.
                               Pass 2 unchanged.
    tests/conftest.py          Added paired_two_ref_bam and overlapping_pairs_bam
                               fixtures.
    tests/test_classify.py     Added TestComputeCombinedPid (13 tests);
                               updated classify_reads tests to use
                               PairAlignmentSummary.
    tests/test_writer.py       Added TestProcessBamPaired and
                               TestProcessBamOverlapping (9 new tests).

  All 66 tests pass.

KNOWN LIMITATION
  For paired-end reads, mate reference IDs are not remapped in the
  per-reference output BAMs. This is harmless for consensus calling but
  the BAMs are not fully spec-compliant for paired data.

SESSION 3  (2026-04-16) — transcript maintenance
-------------------------------------------------
  User asked to update both transcript files. Files were already current
  through Session 2; this exchange appended to both.

SESSION 4  (2026-04-18) — next_reference_id bug on real BAM
------------------------------------------------------------

PROBLEM
  Running midsplit on a real paired-end BAM (VK211-against-ADG.bam) caused
  samtools sort to report "truncated file" / "Result too large" (htslib -3 /
  ERANGE) on the per-reference output BAMs.

ROOT CAUSE
  The per-reference output BAMs have a single @SQ entry (n_targets = 1).
  htslib validates both reference_id and next_reference_id against n_targets.
  The code correctly set reference_id = 0 but left next_reference_id unchanged
  from the multi-reference input BAM (values such as 1, 2, …), which are >= 1
  and thus out of range for a single-target BAM.

FIX
  In writer.py Pass 2, after setting segment.reference_id = 0:
    - If next_reference_id == original reference_id (mate on same ref): set 0.
    - If next_reference_id >= 0 but different ref: set -1 (unmapped / unknown).
    - Otherwise (already -1): leave unchanged.
  Original IDs are saved before mutation and restored after writing so that
  later iterations over the same segment object are unaffected.
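  The remap rule can be expressed as a small pure function (a hypothetical
  helper for illustration; the real fix mutates the pysam segment in place
  and restores the original IDs after writing):

```python
def remap_mate_ref(original_ref_id: int, next_ref_id: int) -> int:
    """Map a mate reference ID into a single-target (one @SQ entry) BAM."""
    if next_ref_id == original_ref_id:
        return 0   # mate on the same reference -> the sole target, ID 0
    if next_ref_id >= 0:
        return -1  # mate on a different reference -> unmapped/unknown
    return next_ref_id  # already -1: leave unchanged
```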

  No new tests were added (the existing 66 tests continued to pass;
  the failure was only observable with real multi-reference paired-end data).

SESSION 5  (2026-04-18) — assignment breakdown + extended depth stats
----------------------------------------------------------------------

GOAL
  Report how many reads were assigned solely to one reference versus assigned
  to multiple references. Add median, min, max, and std-dev depth columns.

DESIGN
  reads_sole / reads_multi are counted in query-name (pair) units to match the
  classification logic, while read_count remains individual alignment records.
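  A sketch of the counting, mirroring the described return shape of
  compute_assignment_stats() (names and input shape are illustrative):

```python
from collections import defaultdict


def assignment_stats(assignments: dict[str, set[str]]) -> dict[str, tuple[int, int]]:
    """`assignments` maps query name -> set of references it was assigned to.

    Returns {ref: (sole, multi)}: sole counts query names assigned only to
    that reference, multi counts those assigned to it plus at least one other.
    """
    sole: dict[str, int] = defaultdict(int)
    multi: dict[str, int] = defaultdict(int)
    for refs in assignments.values():
        for ref in refs:
            (sole if len(refs) == 1 else multi)[ref] += 1
    return {ref: (sole[ref], multi[ref]) for ref in sole.keys() | multi.keys()}
```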

IMPLEMENTATION
  Files changed:
    src/midsplit/classify.py   Added compute_assignment_stats() returning a
                               dict[str, tuple[int, int]] (sole, multi) per ref.
    src/midsplit/stats.py      Added reads_sole and reads_multi fields to both
                               ReferenceAccumulator and ReferenceStats; added
                               median_depth, min_depth, max_depth, depth_std to
                               ReferenceStats and compute_stats().
    src/midsplit/writer.py     Calls compute_assignment_stats() after
                               classification and passes sole/multi counts to
                               each ReferenceAccumulator.
    src/midsplit/report.py     Text report: added assignment breakdown section
                               and all four new depth stats.
                               Excel report: expanded from 9 to 13 columns
                               (added Sole Assignments, Multi Assignments,
                               Median Depth, Depth Std Dev, Min Depth,
                               Max Depth).
    tests/test_classify.py     Added TestComputeAssignmentStats (5 tests).
    tests/test_stats.py        Added 6 tests for new accumulator and stats
                               fields (median, min, max, std, sole, multi).
    tests/test_writer.py       Added test_assignment_breakdown and related
                               checks.

  All 78 tests pass.

SESSION 6  (2026-04-18) — per-site nucleotide output
-----------------------------------------------------

GOAL
  For each reference, write a TSV file with one row per genome site containing:
  site number (1-based), reference base, consensus base, total coverage depth,
  and individual A / C / G / T counts.

DESIGN
  - depth counts all reads spanning a position (including deletions), matching
    samtools mpileup behaviour.
  - A/C/G/T counts exclude deletion positions.
  - Reference bases come from an optional --reference FASTA (multi-reference).
    Without it, the column is filled with 'N'.
  - Consensus bases are read from the ivar consensus FASTA produced earlier.
    Without it, the column is filled with 'N'.
  - Output file: {output_dir}/{accession}-per-site.tsv.
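  The depth vs A/C/G/T counting rules can be sketched without pysam
  (a pure-Python stand-in: None marks a read that spans the site with a
  deletion; the real compute_per_site() derives this from a pileup):

```python
def site_counts(bases: list["str | None"]) -> tuple[int, dict[str, int]]:
    """Depth counts every read spanning the site, deletions included
    (matching samtools mpileup); A/C/G/T counts exclude deletions."""
    depth = len(bases)
    counts = {nt: 0 for nt in "ACGT"}
    for b in bases:
        if b is not None and b.upper() in counts:
            counts[b.upper()] += 1
    return depth, counts
```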

IMPLEMENTATION
  Files created:
    src/midsplit/persite.py    SiteInfo dataclass; read_fasta_sequence();
                               compute_per_site() using pysam pileup
                               (max_depth=1_000_000, min_base_quality=0,
                               ignore_overlaps=False); write_per_site_tsv().
    tests/test_persite.py      26 tests across TestReadFastaSequence (9),
                               TestComputePerSite (12), TestWritePerSiteTsv (5).

  Files changed:
    src/midsplit/cli/main.py   Added --reference FASTA optional argument;
                               consensus FASTA paths now collected in a dict
                               during the consensus loop; new per-site block
                               reads ref length from the output BAM header and
                               calls compute_per_site() + write_per_site_tsv()
                               for each reference.

  All 104 tests pass.

SESSION 7  (2026-04-19) — housekeeping: remove Excel, optional output dir
--------------------------------------------------------------------------

CHANGES
  - Removed Excel (summary.xlsx) output entirely; openpyxl dependency dropped.
  - --output-dir changed from a positional argument to an optional flag
    defaulting to '.'.

  Files changed: cli/main.py, report.py, pyproject.toml.
  All 104 tests pass.

SESSION 8  (2026-04-19) — run-level summary stats + units mismatch fix
------------------------------------------------------------------------

GOAL
  Add a table to summary.txt showing how many reads matched 1 reference,
  how many matched 2 references, etc. Also fix a confusing unit mismatch
  where the per-reference "Reads" count was alignment segments while the
  run-level total was query IDs.

DESIGN
  RunSummary dataclass added to stats.py:
    total_alignments  — number of alignment segments (rows in the BAM)
    total_reads       — number of unique query IDs
    refs_per_read     — Counter: n_refs → read count (sorted)
  Per-reference "Read IDs assigned" = reads_sole + reads_multi (pair-unit).
  Summary header now shows both totals with clear labels.
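  The RunSummary shape described above, as a sketch (field names come from
  the design notes; the construction helper is illustrative):

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class RunSummary:
    total_alignments: int = 0  # alignment segments (rows in the BAM)
    total_reads: int = 0       # unique query IDs
    refs_per_read: Counter = field(default_factory=Counter)  # n_refs -> reads


def summarize(assignments: dict[str, set[str]], n_alignments: int) -> RunSummary:
    """Build a RunSummary from query-name -> assigned-references data."""
    summary = RunSummary(total_alignments=n_alignments,
                         total_reads=len(assignments))
    for refs in assignments.values():
        summary.refs_per_read[len(refs)] += 1
    return summary
```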

  Files changed: stats.py, writer.py, report.py, cli/main.py.
  process_bam() now returns a 3-tuple (stats, ref_bam_paths, run_summary).
  All 104 tests pass (test_writer.py updated to unpack 3-tuple).

SESSION 9  (2026-04-19) — rename "read name" → "read ID"
----------------------------------------------------------

  Replaced all occurrences of "read name" / "read names" with "read ID" /
  "read IDs" throughout report.py, writer.py, and test files.

SESSION 10  (2026-04-19) — consensus vs reference comparison
-------------------------------------------------------------

GOAL
  Include a summary of how well the consensus sequence matches the reference,
  using dark-matter's compareDNAReads / matchToString.

DESIGN
  report.py gains _write_consensus_comparison() which:
    - Reads the reference sequence (from --reference FASTA) and the consensus
      FASTA produced by ivar.
    - Calls compareDNAReads(ref_read, cons_read) then matchToString(...,
      includeGapLocations=False, includeNoCoverageLocations=False).
    - Emits a *** WARNING *** (to file and stderr) if sequences have different
      lengths and --align was not used.
  write_text_report() accepts optional reference_fasta, consensus_fastas,
  align, aligner, aligner_options keyword arguments.

SESSION 11  (2026-04-19) — --align / --aligner / --aligner-options
-------------------------------------------------------------------

GOAL
  Allow the user to align the consensus to the reference before comparison
  (required when they differ in length, and useful in any case).

DESIGN
  Mirrored the compare-sequences.py interface from dark-matter:
    --align            store_true; run aligner before comparison
    --aligner          choices: edlib, mafft, needle (default mafft)
    --aligner-options  string of extra aligner flags; implies --align
  Prints sequence lengths before and after alignment.
  Uses dark.aligners.align(); imports MAFFT_DEFAULT_ARGS, NEEDLE_DEFAULT_ARGS.

  Files changed: cli/main.py, report.py.
  All 104 tests pass.

SESSION 12  (2026-04-19) — Bowtie2 secondary alignment + NM tag
----------------------------------------------------------------

PROBLEM
  Bowtie2 uses the secondary alignment flag (0x100) for every non-best hit
  when run with -k N or -a.  The original code skipped secondary alignments,
  which meant per-reference hits beyond the top-1 were silently discarded.

FIX
  Removed is_secondary from the skip guard in writer.py Pass 1 and Pass 2.
  Only supplementary alignments (0x800, chimeric / split reads) are now
  skipped.

NM TAG SWITCH
  Switched from the XM tag (Bowtie2-specific mismatch count) to the NM tag
  (standard edit distance, produced by Bowtie2, BWA, and others).
  All references to XM renamed to NM throughout classify.py, writer.py,
  conftest.py, and all test files.

  TestRunSummary and TestSecondaryAlignments test classes added to
  test_writer.py covering:
    - total_reads / total_alignments counts
    - refs_per_read distribution
    - secondary alignments are classified (not skipped)
    - supplementary alignments are still skipped

  All 110 tests pass.

SESSION 13  (2026-04-20) — multi-mapping within reference (fred.bam bug)
-------------------------------------------------------------------------

PROBLEM
  Running midsplit on fred.bam produced a warning that read
  A00706:1017:HNH5NDSXF:2:2337:28809:28463 had "multiple primary matches"
  against a reference, but on inspection only secondary alignments existed for
  that read against that reference (Bowtie2 -k mode writes all non-best hits
  as secondary).  The real issue: a single read had two secondary alignments
  to different positions within the same reference, and _build_pair_summaries
  was treating the two same-mate records as an r1+r2 pair, computing a
  nonsensical combined PID.

ROOT CAUSE
  The by_mate deduplication block in _build_pair_summaries was guarded by
  `if len(records) > 2:`, so exactly-2 same-mate cases fell through to the
  paired-PID branch incorrectly.

FIX
  Removed the guard; the by_mate grouping (keep lowest-NM record per mate
  strand) is now unconditional, so groups of 1 or 2 same-mate records are
  reduced to a single record before the paired-vs-single branch.
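  The unconditional by-mate reduction can be sketched as follows (record
  tuples are illustrative stand-ins for _AlignmentRecord objects; only the
  keep-lowest-NM-per-mate-strand rule is shown):

```python
def dedup_by_mate(records: list[tuple[bool, int]]) -> list[tuple[bool, int]]:
    """`records` are (is_read1, nm) pairs for one (query_name, ref) group.

    Multi-mapped same-mate records collapse to the best (lowest-NM)
    alignment, leaving at most one record per mate strand, so groups of
    1 or 2 same-mate records can no longer masquerade as an r1+r2 pair.
    """
    best: dict[bool, tuple[bool, int]] = {}
    for rec in records:
        mate = rec[0]
        if mate not in best or rec[1] < best[mate][1]:
            best[mate] = rec
    return list(best.values())
```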

  test_multi_mapping_within_reference_uses_best_alignment added to
  TestSecondaryAlignments.

  All 114 tests pass.

SESSION 14  (2026-04-20) — per-site consensus_base offset bug
-------------------------------------------------------------

PROBLEM
  In VK211/OUT-e2e/KJ470898.1-per-site.tsv, site 1516 showed consensus_base=K
  (IUPAC G/T) despite a depth of 13 with all reads showing A.

ROOT CAUSE
  ivar incorporates insertions from reads into the consensus FASTA output,
  making it longer than the reference (3183 bp vs 3182 bp for KJ470898.1).
  compute_per_site() was naively mapping cons_seq[i] → sites[i], so every
  reference position after the first insertion in the consensus was shifted by
  the number of preceding insertions.

  At site 1516 specifically: there was a net +1 insertion somewhere earlier in
  the consensus, so cons_seq[1515] actually held K — the correct consensus for
  reference position 1515 (reads show 5 G + 5 T + 2 A → K after quality
  filtering), not position 1516.  The true consensus for position 1516 (all A)
  sat at cons_seq[1516] and was incorrectly displayed for site 1517.

FIX
  Added align, aligner, and aligner_options parameters to compute_per_site().
  When align=True (and reference_fasta is provided), a new helper
  _consensus_bases_via_alignment() aligns the consensus FASTA to the reference
  using dark.aligners.align(), then walks the aligned pair skipping columns
  where the reference is '-' (insertions).  This produces a correctly-mapped
  list of one consensus base per reference position.
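  The core of that walk is small enough to sketch (a simplified stand-in
  for _consensus_bases_via_alignment(), taking the gapped reference and
  consensus strings the aligner returns):

```python
def consensus_per_ref_position(ref_aln: str, cons_aln: str) -> list[str]:
    """One consensus base per reference position from an aligned pair.

    Columns where the reference has '-' are consensus insertions with no
    reference coordinate, so they are skipped; the result therefore lines
    up with 1-based reference sites regardless of insertions.
    """
    bases = []
    for r, c in zip(ref_aln, cons_aln):
        if r == "-":
            continue          # insertion in the consensus: no ref position
        bases.append(c)       # may itself be '-' for a consensus deletion
    return bases
```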

  cli/main.py updated to forward args.align, args.aligner, and
  args.aligner_options to compute_per_site().

  One new test added to TestComputePerSite:
    test_consensus_bases_via_alignment_corrects_insertion_offset — patches
    _consensus_bases_via_alignment to verify that align=True calls the helper
    and uses its result instead of the naive positional mapping.

  All 115 tests pass.

SESSION 15  (2026-04-20) — ivar insertion investigation
--------------------------------------------------------

INVESTIGATION
  User asked why ivar incorporated only a single net insertion given that the
  pileup showed insertion markers at many positions, what evidence supports a
  real insertion, and whether ivar is behaving correctly.

FINDINGS
  The consensus is 3183 bp vs the 3182 bp reference — net +1.  The extra base
  lands at/before position 1 of the alignment (mafft places the gap there).

  Examining positions 3–6 of the pileup (the only early positions with
  insertions):
    - Position 3 (depth 10): 2 reads have +1C, 2 reads have +11CAGTGGAACTC
      (40% of reads carry some insertion).
    - Position 4 (depth 10): 1 read has +2CC, 3 reads have +6AACTCC
      (40% of reads carry some insertion).
  The inserted sequences are inconsistent across reads (different sizes and
  sequences at the same position), which is strong evidence against a real
  biological insertion.  A genuine insertion relative to the reference would
  produce the same inserted sequence in all supporting reads.

  The most likely explanation is a mapping artefact from the circular HBV
  genome: reads that span the arbitrary linearisation junction appear near
  position 1 with dangling bases that the aligner represents as insertions.

  Testing with -t 0.5 produces a consensus of 3185 bp (longer, not shorter),
  showing that ivar's insertion/deletion calling at these low-coverage,
  noisy positions is inconsistent regardless of threshold.

CONCLUSION
  ivar is not doing the right thing with -t 0.0.  The threshold means "include
  any insertion seen in any read", which picks up noise.  For HBV analysis the
  low-coverage start/end of the linearised genome should be masked with a higher
  -m value, and a majority-rule threshold (-t 0.5) should be used.  The
  alignment-based fix to compute_per_site() is still correct and necessary
  independent of this issue.

SESSION 16  (2026-04-21) — VSL vs e2e: insertion absence and consensus comparison
----------------------------------------------------------------------------------

INVESTIGATION
  User asked for a position-by-position comparison of the VSL and e2e consensus
  sequences and an assessment of which makes the better call at each difference.

FINDINGS
  Aligned the two consensuses with mafft; found 17 differences (1 insertion in
  e2e + 16 SNPs).  Examined quality-filtered (q≥20) base counts from both BAMs
  at each differing position.

  Positions 1–2: VSL has 62–64 reads, all calling the reference base.  e2e has
  4–9 reads with artefact junction reads flipping the plurality.  VSL correct.

  Positions 29, 36, 37, 40, 43: Most striking pattern.  VSL and e2e calls are
  opposite at all five positions and VSL agrees with the reference each time.
  In VSL the reference base is the clear plurality (63–71%); in e2e the wrong
  base is the plurality (63–65%).  Junction-spanning reads that e2e cannot
  soft-clip carry bases from a different part of the circular HBV genome and
  overwhelm the real signal in the shallower e2e pileup.

  Positions 132, 190, 1437, 1896: Same pattern at lower magnitude; VSL agrees
  with reference, e2e does not.

  Position 987: VSL 24A/17G → A (matches reference); e2e 3A/13G → G.  e2e
  almost certainly wrong at very low depth.

  Position 1124: Reference G; VSL calls A (25A/22G, marginal), e2e calls G
  (17G/6A).  Genuine ambiguity; neither consensus is clearly right.

  Position 1258: VSL 20C/16T → C (matches reference); e2e 10C/13T → T.  VSL
  wins by having more reads.

  Position 1515: VSL 16G/2T/1A → G (84%, matches reference); e2e 3G/3T/1A →
  K (G/T tie on 7 reads).  This is the position immediately before site 1516;
  the K here was the source of the offset bug investigated in Session 14.

  Position 3182: VSL 59A → unambiguous A; e2e 1A/1C → M (near-zero coverage
  at the end of the reference in e2e).

CONCLUSION
  VSL is the clearly better consensus at 15 of the 16 differing positions;
  position 1124 is a genuine draw.  Every e2e error traces back to the same
  cause: junction-spanning artefact reads that local alignment soft-clips are
  forced into the pileup by end-to-end mode, polluting a relatively small pool
  of real reads and flipping plurality votes at positions scattered across the
  first ~200 bp of the reference.

SESSION 17  (2026-04-21) — README.md
-------------------------------------

  Wrote README.md describing the project: what midsplit does, the
  classification and paired-read logic, all output files and their columns,
  full CLI options table with an example command, requirements, and practical
  notes including the circular-genome/local-alignment warning.

SESSION 18  (2026-04-21) — restoring verbatim transcript
---------------------------------------------------------

  User complained that transcript-verbose.txt contained summaries rather than
  verbatim assistant responses.  Read the JSONL conversation log to extract
  actual verbatim text from Sessions 14–17 and rewrote those sessions in
  transcript-verbose.txt with the real phrases used (including "The data tells
  a very clear story. Let me summarize position by position." and the full
  position table at positions 29/36/37/40/43).
