Timepiece II: SMC++
From two sequences to many – demographic inference with the distinguished lineage.
The Mechanism at a Glance
SMC++ (Terhorst, Kamm & Song, 2017) extends PSMC from a single diploid genome to multiple unphased diploid genomes. Where PSMC reads population size history from two haplotypes – one simple watch with two hands – SMC++ adds more hands to the dial without requiring phased data or full ARG inference. The result is sharper resolution in the recent past, exactly where PSMC’s two-sequence approach runs out of steam.
The key insight is the distinguished lineage. Rather than tracking the full genealogy of all samples (which would require exponentially many states), SMC++ singles out one lineage and tracks how it relates to a demographic background of \(n - 1\) undistinguished lineages. The coalescence time of the distinguished lineage is the hidden variable; the number of undistinguished lineages still present at a given time changes how quickly the distinguished lineage coalesces, and that provides additional signal about population size. This trick keeps the state space manageable while extracting far more information than PSMC’s two-haplotype approach.
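As a rough sketch of why the background matters (a simplification assumed here for intuition, not the full SMC++ derivation): measure time in coalescent units, let \(\lambda(t)\) be the relative population size, and write \(J(t)\) for the number of undistinguished lineages still present at time \(t\). The symbol \(J(t)\) is notation introduced only for this illustration. Each pair of lineages coalesces at rate \(1/\lambda(t)\), so the distinguished lineage meets the background at a rate proportional to how many background lineages remain:

\[
\Pr\bigl(T \in [t, t + dt) \,\big|\, T \ge t,\ J(t) = j\bigr) \approx \frac{j}{\lambda(t)}\, dt .
\]

With \(n = 2\) there is a single undistinguished lineage, so \(j = 1\) and the rate reduces to \(1/\lambda(t)\), which is the two-haplotype setting of PSMC.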
If PSMC is a two-hand watch, SMC++ is a chronograph – a complication that adds sub-dials tracking multiple time measurements simultaneously. Each additional sample genome is another sub-dial, providing a further reading of the same demographic history. The distinguished lineage is the central seconds hand, and the undistinguished lineages sweep around their own sub-dials, all driven by the same escapement (the coalescent process under variable population size).
Primary Reference
Terhorst, Kamm & Song (2017)
The four gears of SMC++:
The Distinguished Lineage (the escapement) – The setup: one lineage is singled out, and its coalescence time \(T\) is tracked as a hidden variable. The remaining \(n - 1\) lineages form a demographic background that modifies the coalescence rate. This is where PSMC’s two-lineage framework generalizes to many.
The ODE System (the gear train) – A system of ordinary differential equations that tracks the probability \(p_j(t)\) that \(j\) undistinguished lineages remain at time \(t\). The matrix exponential of the rate matrix gives exact transition probabilities. This replaces PSMC’s simple exponential coalescence with a richer model.
The Continuous HMM (the mainspring) – A modified transition matrix built from the ODE rates, with the likelihoods of the individual lineage pairs combined into a composite likelihood. Numerical optimization (gradient-based L-BFGS, or alternatively EM) estimates the piecewise-constant population size function \(\lambda(t)\). This is the inference engine.
Population Splits (a complication) – Cross-population analysis: modified ODEs that track lineage counts before and after a population split, enabling joint estimation of \(N_A(t)\), \(N_B(t)\), and the split time \(T_{\text{split}}\).
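The ODE gear can be illustrated with a deliberately simplified model. The sketch below assumes that the undistinguished lineages coalesce only among themselves, at rate \(\binom{j}{2} / \lambda\) in coalescent units, and that \(\lambda\) is constant over the interval of interest. Under those assumptions \(p_j(t)\) is a row of the matrix exponential of the rate matrix. The function names (`death_rate_matrix`, `expm`, `lineage_probs`) are hypothetical helpers written for this example; they are not part of the SMC++ software, and the full SMC++ system carries additional terms.

```python
import math


def death_rate_matrix(m, lam):
    """Rate matrix for the number of undistinguished lineages, states j = 1..m.

    Assumption: lineages coalesce only among themselves, at rate C(j, 2) / lam.
    Row and column index i corresponds to j = i + 1 lineages.
    """
    Q = [[0.0] * m for _ in range(m)]
    for j in range(2, m + 1):
        rate = j * (j - 1) / 2.0 / lam
        Q[j - 1][j - 1] = -rate  # leave state j
        Q[j - 1][j - 2] = rate   # arrive in state j - 1
    return Q


def matmul(A, B):
    """Plain-Python product of two square matrices."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def expm(A, terms=20):
    """Matrix exponential via scaling and squaring with a truncated Taylor series."""
    n = len(A)
    norm = max(sum(abs(x) for x in row) for row in A)
    squarings = max(0, math.ceil(math.log2(norm)) + 1) if norm > 0 else 0
    scale = 2.0 ** squarings
    S = [[x / scale for x in row] for row in A]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        # term becomes S^k / k!
        term = [[x / k for x in row] for row in matmul(term, S)]
        result = [[r + t for r, t in zip(r_row, t_row)]
                  for r_row, t_row in zip(result, term)]
    for _ in range(squarings):
        result = matmul(result, result)
    return result


def lineage_probs(m, lam, t):
    """Return [p_1(t), ..., p_m(t)], starting from m undistinguished lineages."""
    Qt = [[q * t for q in row] for row in death_rate_matrix(m, lam)]
    return expm(Qt)[m - 1]  # row of the starting state j = m


if __name__ == "__main__":
    # Four undistinguished lineages, constant relative size 1, half a time unit.
    for j, p in enumerate(lineage_probs(4, 1.0, 0.5), start=1):
        print(f"p_{j}(0.5) = {p:.4f}")
```

For a piecewise-constant \(\lambda(t)\), the same idea can be applied interval by interval, multiplying the per-interval transition matrices together.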
These gears mesh together into a complete inference machine:
      Multiple unphased diploid genomes
                    |
                    v
      +---------------------------+
      |   CHOOSE DISTINGUISHED    |
      |         LINEAGE           |
      |                           |
      |  Pair it with each of     |
      |  the n-1 undistinguished  |
      |  lineages                 |
      +---------------------------+
                    |
                    v
      +-------> SOLVE ODE SYSTEM
      |         p_j(t): probability j
      |         undistinguished lineages
      |         remain at time t
      |             |
      |             v
      |         BUILD HMM
      |         States: discretized T
      |         Emissions: P(data | T)
      |         Transitions: from ODE
      |             |
      |             v
      |         COMPOSITE LIKELIHOOD
      |         across all pairs
      |             |
      |             v
      |         OPTIMIZE (L-BFGS)
      |         update lambda_k
      |             |
      |         Converged?
      +--- NO ------+
                    |
                   YES
                    |
                    v
      Output: lambda_0, ..., lambda_n
                    |
                    v
      Scale to real units: N(t)
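The likelihood steps of this pipeline can be sketched on a toy scale. The example below is a minimal illustration, not the SMC++ implementation: it assumes a small discrete HMM with hand-picked parameters, computes each sequence's log-likelihood with the scaled forward algorithm, and forms the composite log-likelihood by summing over sequences as if they were independent. The names `forward_loglik` and `composite_loglik`, and the two-symbol emission alphabet, are assumptions made for this sketch.

```python
import math


def forward_loglik(obs, init, trans, emit):
    """Log-likelihood of one observation sequence under a discrete HMM.

    Scaled forward algorithm: the forward vector is renormalised at every
    site and the logs of the scaling factors are accumulated.
    """
    n_states = len(init)
    alpha = list(init)  # state distribution before any site is observed
    loglik = 0.0
    for site, symbol in enumerate(obs):
        if site > 0:
            # Propagate one step along the sequence with the transition matrix.
            alpha = [sum(alpha[i] * trans[i][k] for i in range(n_states))
                     for k in range(n_states)]
        # Weight each hidden state by the probability of the observed symbol.
        alpha = [alpha[k] * emit[k][symbol] for k in range(n_states)]
        scale = sum(alpha)
        loglik += math.log(scale)
        alpha = [a / scale for a in alpha]
    return loglik


def composite_loglik(sequences, init, trans, emit):
    """Composite log-likelihood: sum of per-sequence log-likelihoods."""
    return sum(forward_loglik(obs, init, trans, emit) for obs in sequences)


if __name__ == "__main__":
    # Toy parameters (assumed): two discretized time states, two symbols.
    init = [0.5, 0.5]
    trans = [[0.9, 0.1], [0.2, 0.8]]
    emit = [[0.99, 0.01], [0.90, 0.10]]
    pairs = [[0, 0, 1, 0], [0, 1, 1, 0]]  # one observation sequence per pair
    print(composite_loglik(pairs, init, trans, emit))
```

In an optimization loop, a function like `composite_loglik` would be re-evaluated after each update of the population size parameters, since those parameters determine the transition and emission probabilities.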
Prerequisites for this Timepiece
SMC++ builds directly on PSMC. Before starting, you should have worked through:
PSMC – the transition density, discretization, and HMM inference for two sequences. SMC++ generalizes every gear in PSMC.
Coalescent Theory – coalescence rates with multiple lineages, and the relationship between population size and coalescence time.
The SMC – the sequential Markov coalescent approximation.
If you have built PSMC, you have most of the tools you need. SMC++ adds the multi-lineage generalization, but the underlying mathematical framework is the same.
Chapters
Each chapter derives the math, explains the intuition, implements the code, and verifies it works. By the end, you’ll have built a complete multi-sample demographic inference engine – and you’ll see how PSMC’s simple watch grows into a chronograph.