Timepiece VII: Threads

Threading Instructions for Ancestral Recombination Graphs

“Threads takes as input a set of phased genotypes and outputs a set of threading instructions for each sample.”

—Brandt, Chiang, Guo et al. (2024)

The Mechanism at a Glance

Threads is a deterministic method for inferring Ancestral Recombination Graphs from phased genotype data. Where SINGER (Timepiece VII: SINGER) samples ARGs from a posterior distribution using Bayesian MCMC, Threads finds the single most likely threading path for each haplotype using a three-step pipeline: pre-filter candidate matches, run a memory-efficient Viterbi algorithm, and date the resulting segments.

Primary Reference

[Gunnarsson et al., 2024]

The three gears of Threads:

  1. PBWT Haplotype Matching (the pre-filter) – Uses the positional Burrows-Wheeler transform to identify a small set of candidate haplotype matches for each sample, reducing the search space from \(O(N)\) to \(O(L)\) candidates per sample (\(L \ll N\)).

  2. Memory-Efficient Viterbi (the inference engine) – A branch-and-bound implementation of the Viterbi algorithm under the Li-Stephens model that finds the optimal threading path in \(O(NM)\) time and \(O(N)\) average memory, avoiding the classical \(O(NM)\) memory requirement.

  3. Segment Dating (the calibration) – Assigns coalescence times to each Viterbi segment using likelihood-based and Bayesian estimators that model segments as pairwise IBD regions.

These gears connect in a simple linear pipeline:

Phased genotypes (N haplotypes, M sites)
                 |
                 v
+-------------------------------+
|   PBWT HAPLOTYPE MATCHING     |
|                               |
|  Chunk genome into 0.5 cM     |
|  segments, sort with PBWT,    |
|  query L-neighbourhood,       |
|  filter by match count        |
|                               |
|  -> L candidates per sample   |
+-------------------------------+
                 |
                 v
+-------------------------------+
|   MEMORY-EFFICIENT VITERBI    |
|                               |
|  For each sample n = 1..N:    |
|    Run Li-Stephens Viterbi    |
|    on L x M candidate panel   |
|    using branch-and-bound     |
|                               |
|  -> Threading targets per     |
|     sample at each site       |
+-------------------------------+
                 |
                 v
+-------------------------------+
|   SEGMENT DATING              |
|                               |
|  For each Viterbi segment:    |
|    Model as IBD region        |
|    Estimate age from length   |
|    + mutations + demography   |
|                               |
|  -> Coalescence times         |
+-------------------------------+
                 |
                 v
      Threading instructions
      (= the inferred ARG)

Prerequisites for this Timepiece

Threads builds on several earlier concepts and Timepieces:

  • Li & Stephens HMM – the copying model that Threads optimizes with its memory-efficient Viterbi

  • Coalescent Theory – coalescence times and rates used in the dating step

  • The SMC – the sequentially Markov coalescent model underlying the segment length distribution

If you have worked through the Li & Stephens Timepiece, you already have the main conceptual foundation. Threads extends that foundation to biobank-scale data through algorithmic innovation rather than model complexity.

Chapters