Overview of the Li & Stephens HMM

Before assembling the watch, lay out every part and understand what it does.

../../_images/fig_mini_lshmm.png

Li & Stephens HMM at a glance. The haplotype copying model in action: forward and backward probabilities along a sequence of variant sites, Viterbi path recovery identifying which reference haplotype is being copied, the emission and transition structure of the copying HMM, and posterior decoding revealing copying state uncertainty at each position.

The Li & Stephens Hidden Markov Model is a versatile gear that appears in many movements across population genetics. Just as a single escapement mechanism can drive a simple desk clock or a grand complication, the LS-HMM underlies tools as diverse as genotype imputation, haplotype phasing, ancestry painting, and ancestral recombination graph inference. In this chapter we lay out the blueprint – what the model does, why it matters, and how the remaining chapters fit together.

Note

Prerequisites. This chapter assumes you have read the HMM chapter, where we introduced hidden states, observations, transition matrices, emission probabilities, and the forward, backward, and Viterbi algorithms in their general form. We also assume familiarity with the basics of coalescent theory – the idea that a sample of sequences shares a common ancestor, and that recombination causes different genomic positions to have different genealogical trees. If these concepts are new, a good starting point is Wakeley’s Introduction to Coalescent Theory or the coalescent chapter of this book (if available).

What Does the Li & Stephens HMM Do?

Given a reference panel of \(n\) haplotypes and a query haplotype, the Li & Stephens (LS) model finds the most likely way the query was assembled as a mosaic of the reference haplotypes.

\[\text{Input: } H = \{h_1, h_2, \ldots, h_n\} \quad \text{(reference haplotypes)}\]
\[\text{Input: } s = (s_1, s_2, \ldots, s_m) \quad \text{(query haplotype, } m \text{ sites)}\]
\[\text{Output: } Z^* = (Z_1^*, Z_2^*, \ldots, Z_m^*) \quad \text{(copying path through the panel)}\]

Each \(Z_\ell^*\) tells you which reference haplotype was being “copied” at site \(\ell\). Where \(Z\) changes value, a recombination breakpoint occurred. Where the query allele differs from the copied haplotype’s allele, a mutation occurred.

Terminology: Haplotype vs. Genotype

A haplotype is the sequence of alleles on a single chromosome. A genotype is the combination of alleles across both chromosomes at a given site (for diploid organisms). In the haploid LS model, we work directly with haplotypes. In the diploid extension (Chapter 4), we work with genotypes – allele dosages of 0, 1, or 2.

From this model you can compute:

  • The most likely copying path (Viterbi algorithm) – used in haplotype imputation and phasing

  • Posterior copying probabilities (forward-backward algorithm) – the probability that any given reference haplotype is the copying source at any given site

  • The data likelihood – how well the model explains the query, useful for model comparison and parameter estimation

  • Local ancestry – which population the query haplotype was copied from at each position, when reference haplotypes come from labeled populations

Think of the Viterbi algorithm as finding the most likely gear sequence – the single best explanation for how the mechanism ticked through each position along the genome. The forward-backward algorithm, by contrast, gives you a probability distribution over all possible gear configurations at each position.

With this high-level picture in mind, let us turn to the biological motivation that makes this model so natural.

The Mosaic Model

The biological intuition behind the LS model comes from the coalescent. When you trace the ancestry of a set of haplotypes backwards in time, each genomic position has a genealogical tree. Due to recombination, the tree changes along the genome. At each site, the query haplotype is most closely related to some reference haplotype – and that closest relative changes at recombination breakpoints.

Coalescent Theory in Brief

The coalescent is a stochastic model of how a sample of gene copies traces back to a common ancestor. Two key processes shape these ancestral histories:

  • Coalescence: two lineages merge into a common ancestor (looking backwards in time). The rate depends on the effective population size \(N_e\).

  • Recombination: a single ancestral chromosome is assembled from two parental chromosomes, so the left and right portions of the genome may have different genealogical trees.

The result is an ancestral recombination graph (ARG) – a collection of local trees that change along the genome at recombination breakpoints. The LS model is a practical approximation of this ARG structure: it captures the idea that the query’s closest relative in the reference panel changes at recombination events, without explicitly building the full ARG.

Li and Stephens (2003) turned this into a practical model:

A new haplotype is an imperfect mosaic of existing haplotypes.

“Imperfect” because mutations introduce differences between the query and its copying source. “Mosaic” because recombination switches the source. This is the template mechanism at the heart of the model – the query is stamped out from a sequence of templates drawn from the reference panel, with occasional imperfections (mutations) in the stamping process.

Reference panel (n=4 haplotypes, m=12 sites):

h_1: 0 0 1 0 0 1 1 0 0 1 0 0
h_2: 0 1 0 0 1 1 0 0 1 0 0 1
h_3: 1 0 0 1 0 0 1 1 0 0 1 0
h_4: 0 0 1 0 1 0 0 1 1 0 0 0

Query haplotype:
s:   0 0 1 0 1 1 0 0 1 0 0 1
           ^--- mutation        ^---------- mutation?
     |--------|---------|------|----------|
     copy h_1   copy h_4  h_2    copy h_2
           recomb    recomb         (match)

The query copies from \(h_1\) for the first few sites, then a recombination switches the source to \(h_4\), then to \(h_2\). At one site, the query differs from its source – that’s a mutation.

Now that we have the biological picture, let us formalize it as an HMM.

The HMM Formulation

This copying process maps directly to a Hidden Markov Model. If you have worked through the HMM chapter, the following table will feel familiar – it is the same framework, with the states and observations filled in by the biology:

HMM Component

LS Model Meaning

Notation

Hidden states

Which reference haplotype is being copied

\(Z_\ell \in \{1, 2, \ldots, n\}\)

Observations

The query allele at each site

\(s_\ell \in \{0, 1, \ldots\}\)

Initial distribution

Uniform: any haplotype equally likely

\(\pi_j = 1/n\)

Transitions

Recombination switches the copying source

\(A_{ij}\) (derived in The Copying Model)

Emissions

Mutation changes the allele relative to the source

\(e_j(s_\ell)\) (derived in The Copying Model)

Probability Aside: Why Uniform Initial Distribution?

Under the coalescent, exchangeability of lineages means that, before observing any data, the query is equally likely to be most closely related to any reference haplotype. This justifies \(\pi_j = 1/n\). A non-uniform prior could incorporate population structure (e.g., giving higher weight to reference haplotypes from the same population as the query), but the uniform prior is standard and works well in practice.

The key parameters are:

Parameter

Symbol

Meaning

Recombination probability

\(r\)

Probability of switching copying source between adjacent sites

Mutation probability

\(\mu\)

Probability that the query allele differs from the copied allele

Number of ref. haplotypes

\(n\)

Size of the reference panel

Number of sites

\(m\)

Number of genomic positions

Terminology: Site vs. Variant vs. Locus

In this book, a site (or locus) refers to a specific genomic position included in the analysis. In practice, many implementations consider only variant sites (positions where at least one individual in the reference panel differs from the others), but the LS model is defined over any set of positions. The recombination probability \(r_\ell\) between adjacent sites depends on their physical or genetic distance, so the spacing of sites matters.

With the HMM formulation in hand, we can now appreciate why the LS model is so widely used.

Why the LS Model Matters

The LS model is not just an academic exercise. It’s a foundational piece of machinery in modern genetics – a versatile gear that appears in many movements:

  1. Haplotype imputation (e.g., IMPUTE, Beagle, Minimac): Fill in missing genotypes by finding which reference haplotypes the query is copying from. The posterior copying probabilities weight the “votes” of different reference alleles.

  2. Phasing (e.g., SHAPEIT, Eagle): Resolve which alleles are on the same chromosome. The diploid LS model finds the most likely pair of copying paths.

  3. Ancestry painting (e.g., RFMix, MOSAIC): When reference haplotypes come from labeled populations, the copying path reveals local ancestry.

  4. ARG inference (e.g., tsinfer, SINGER): The LS model forms the backbone of threading algorithms that build ancestral recombination graphs.

  5. Selection scans: Unusually long copying tracts suggest reduced recombination or recent shared ancestry – potential signatures of selection.

Probability Aside: Data Likelihood and Model Comparison

The forward algorithm (introduced in the HMM chapter and detailed in Haploid LS HMM Algorithms) produces the data likelihood \(P(X_1, \ldots, X_m)\) as a byproduct. This quantity tells you how well the model explains the observed query, given the reference panel and model parameters. It is essential for:

  • Parameter estimation: finding the \(r\) and \(\mu\) that maximize the likelihood (or using EM to iterate).

  • Model comparison: comparing different reference panels or different model configurations via their likelihoods.

  • Composite likelihoods: products of per-individual likelihoods are used in population-level inference.

Having seen why the model matters, let us now map out how the remaining chapters build it up piece by piece.

The Pieces of the Mechanism

Here’s how the chapters connect. Think of this as an exploded diagram of a watch movement – each box is a sub-assembly, and the arrows show how they fit together:

Reference panel H + Query s
        |
        v
+---------------------------+
|   THE COPYING MODEL       |
|   (template mechanism)    |
|                           |
|  Transition: P(switch     |
|    copying source)        |
|  Emission: P(query allele |
|    | copying source)      |
|  Mutation rate estimation  |
|                           |
|  -> The O(K) trick        |
+---------------------------+
        |
        v
+---------------------------+
|   HAPLOID ALGORITHMS      |
|   (the gear train)        |
|                           |
|  Forward algorithm        |
|    -> P(data, state)      |
|  Backward algorithm       |
|    -> posterior decoding   |
|  Viterbi algorithm        |
|    -> best path (most     |
|       likely gear         |
|       sequence)           |
|  Scaling + memory tricks  |
+---------------------------+
        |
        v
+---------------------------+
|   DIPLOID EXTENSION       |
|   (two watches ticking    |
|    together)              |
|                           |
|  State space: n x n       |
|  Four transition types:   |
|    no switch, single,     |
|    double                 |
|  Genotype emissions       |
|  Forward-backward         |
|  Viterbi                  |
+---------------------------+
        |
        v
    Copying path / posteriors / likelihood

Each chapter builds on the previous one. The copying model defines the individual gears (transition and emission probabilities). The haploid algorithms assemble those gears into a working movement. The diploid extension doubles the mechanism to handle the reality that most organisms carry two copies of each chromosome.

Ready to Build

You now have the high-level blueprint. In the following chapters, we’ll build each gear from scratch:

  1. The Copying Model – The transition and emission probabilities (the template mechanism that defines how haplotypes copy from each other)

  2. Haploid LS HMM Algorithms – Forward, backward, and Viterbi for haploid data (the gear train that turns the model into answers)

  3. The Diploid Extension – Extending to diploid genotypes (two watches ticking together)

Each chapter derives the math, explains the intuition, implements the code, and verifies it works. By the end, you will own every gear in this mechanism completely – able to inspect it, modify it, or slot it into a larger complication.

Let’s start with the core of the model: the copying process.