The Continuous HMM
From ODE rates to a working inference engine.
Building the Transition Matrix
In Timepiece I, PSMC’s transition matrix encodes how the coalescence time \(T\) changes between adjacent genomic positions. The same structure applies in SMC++, but with a modified transition density that accounts for the undistinguished lineages.
Recall the SMC transition mechanism from the SMC prerequisite:
At position \(a\), the coalescence time is \(T_a = s\) (the current state).
With probability \(1 - e^{-\rho s}\), a recombination occurs on the two branches (total length \(2s\)).
If recombination occurs, one lineage detaches at a time \(u \in [0, s]\) and must re-coalesce with the remaining lineage.
The new coalescence time \(T_{a+1}\) depends on when re-coalescence happens.
In PSMC, re-coalescence involves just two lineages: the detached lineage and the one it was separated from. In SMC++, the detached lineage may re-coalesce with any of the \(j\) lineages present at time \(u\) – the undistinguished lineages plus the one remaining from the original pair. The re-coalescence rate is therefore \(h(u)\) rather than \(1/\lambda(u)\).
This modifies the transition density. Where PSMC has:
SMC++ has:
The structure is identical – both are hazard-function formulas from survival analysis: the probability of re-coalescence at time \(t\) equals the instantaneous rate at \(t\) (first factor) times the probability of not having re-coalesced before \(t\) (exponential factor). The only difference is the rate function:
PSMC (\(1/\lambda(t)\)): In the pairwise case, the detached lineage has exactly one partner to coalesce with, and the coalescence rate with that partner is \(1/\lambda(t)\) (the inverse population size at time \(t\)).
SMC++ (\(h(t)\)): The detached lineage can coalesce with any of the \(j\) lineages present at time \(t\) – the \(j - 1\) undistinguished lineages plus the one remaining from the original pair. The rate is \(h(t) = E[J(t)] / \lambda(t)\), where \(E[J(t)]\) is the expected number of available lineages from the ODE system. When \(n = 1\) (no undistinguished lineages), \(E[J(t)] = 1\) and \(h(t)\) reduces to \(1/\lambda(t)\) – recovering PSMC exactly.
Biology Aside – Why more lineages help with recent history
In PSMC, the single pair of lineages coalesces quickly, so the HMM spends most of its time in deep coalescence-time states and has little power to resolve recent population size changes. In SMC++, the undistinguished lineages dramatically increase the coalescence rate in the recent past (many lineages means rapid coalescence), giving the HMM much more data about recent history. This is why SMC++ can resolve population size changes down to ~1,000 years ago while PSMC cannot see below ~20,000 years.
Discretization
Following PSMC, we discretize the coalescence time into \(K\) intervals \([t_0, t_1), [t_1, t_2), \ldots, [t_{K-1}, t_K)\) with log-spacing. The transition matrix \(P_{kl}\) gives the probability of transitioning from interval \(k\) to interval \(l\) between adjacent positions.
The discretized transition probability is:
where \(r_k\) is the probability of recombination given coalescence time in interval \(k\), and \(q_{kl}\) is the probability that re-coalescence lands in interval \(l\).
The key difference from PSMC is that \(q_{kl}\) is computed using \(h(t)\) instead of \(1/\lambda(t)\):
import numpy as np
from scipy.linalg import expm
def compute_transition_matrix(time_breaks, lambdas, rho, n_undist):
"""Build the SMC++ transition matrix.
Parameters
----------
time_breaks : array-like
K+1 time boundaries [t_0, ..., t_K].
lambdas : array-like
Relative population sizes in each time interval.
rho : float
Scaled recombination rate per bin.
n_undist : int
Number of undistinguished lineages.
Returns
-------
P : ndarray of shape (K, K)
Transition matrix.
"""
K = len(time_breaks) - 1
# First solve the ODE to get h(t) at each time break
Q_rate = build_rate_matrix(n_undist)
p0 = np.zeros(n_undist)
p0[-1] = 1.0
# Compute h(t) at midpoints of each interval
h = np.zeros(K)
p_current = p0.copy()
for k in range(K):
dt = time_breaks[k + 1] - time_breaks[k]
lam = lambdas[k]
# Expected undistinguished lineages at midpoint
M = expm(dt / (2 * lam) * Q_rate)
p_mid = M @ p_current
j_values = np.arange(1, n_undist + 1)
h[k] = np.dot(j_values, p_mid) / lam
# Advance to end of interval
M_full = expm(dt / lam * Q_rate)
p_current = M_full @ p_current
# Build transition matrix
P = np.zeros((K, K))
for k in range(K):
t_mid = (time_breaks[k] + time_breaks[k + 1]) / 2
lam_k = lambdas[k]
# Recombination probability: 1 - exp(-rho * t_mid)
r_k = 1 - np.exp(-rho * t_mid)
# No recombination: stay in same state
P[k, k] += 1 - r_k
# With recombination: transition to new state via h(t)
for l in range(K):
dt_l = time_breaks[l + 1] - time_breaks[l]
# Approximate: probability of landing in interval l
# proportional to h * exp(-integral of h) * interval width
q_kl = h[l] * np.exp(-sum(
h[m] * (time_breaks[m + 1] - time_breaks[m])
for m in range(l)
)) * dt_l
P[k, l] += r_k * q_kl
# Normalize rows
P = P / P.sum(axis=1, keepdims=True)
return P
# Import from ode_system chapter
from smcpp_ode_helpers import build_rate_matrix # conceptual import
Emission Probabilities
The emission probabilities are essentially the same as in PSMC. For a bin with coalescence time in interval \(k\), with midpoint time \(t_k^*\):
where \(\theta\) is the scaled mutation rate per bin.
For unphased diploid genotypes at a segregating site, the emission probabilities are modified to account for the ambiguity of which allele belongs to the distinguished lineage. The emission for a genotype \(g \in \{0, 1, 2\}\) (count of derived alleles) at the distinguished individual, given that the derived allele is present in some undistinguished lineages:
def emission_probability(genotype, t, theta, allele_count, n_undist):
"""Emission probability for unphased diploid data at the distinguished individual.
Parameters
----------
genotype : int
0, 1, or 2 (count of derived alleles at the focal individual).
t : float
Coalescence time of the distinguished lineage.
theta : float
Scaled mutation rate per bin.
allele_count : int
Number of derived alleles observed in the undistinguished panel.
n_undist : int
Total number of undistinguished haplotypes.
Returns
-------
float
P(genotype, allele_count | T = t).
"""
p_mut = 1 - np.exp(-theta * t)
# For the distinguished haplotype:
# If genotype = 0: both alleles ancestral -> (1 - p_mut)
# If genotype = 1: one derived, one ancestral -> p_mut (approximately)
# If genotype = 2: both derived -> p_mut (rare case)
# Simplified emission (full model integrates over allele assignments)
if genotype == 0:
return np.exp(-theta * t)
elif genotype == 1:
return 1 - np.exp(-theta * t)
else:
return (1 - np.exp(-theta * t)) ** 2
Composite Likelihood
SMC++ does not compute a single joint likelihood over all samples. Instead, it forms a composite likelihood: the product of marginal likelihoods computed for each choice of distinguished lineage.
For \(n\) diploid samples, let \(\mathbf{X}^{(i)}\) denote the data when individual \(i\) is the distinguished sample and the remaining \(n - 1\) individuals form the undistinguished panel. The composite log-likelihood is:
Each term \(\ell_i\) is computed using a standard HMM forward algorithm, with the transition matrix and emission probabilities specific to the \((n - 1)\)-lineage background.
def composite_log_likelihood(data, time_breaks, lambdas, theta, rho):
"""Compute the composite log-likelihood for SMC++.
Parameters
----------
data : list of ndarray
data[i] is the observation sequence for the i-th distinguished sample.
time_breaks : array-like
Time interval boundaries.
lambdas : array-like
Piecewise-constant population sizes.
theta : float
Scaled mutation rate.
rho : float
Scaled recombination rate.
Returns
-------
float
Composite log-likelihood.
"""
n_samples = len(data)
n_undist = 2 * n_samples - 1 # Haploid lineages minus the distinguished one
K = len(time_breaks) - 1
total_ll = 0.0
for i in range(n_samples):
# Build HMM for this distinguished sample
P = compute_transition_matrix(time_breaks, lambdas, rho, n_undist)
# Initial distribution (stationary)
pi = np.ones(K) / K # Simplified; true stationary from h(t)
# Forward algorithm
obs = data[i]
L = len(obs)
alpha = np.zeros((L, K))
# Initialize
for k in range(K):
t_mid = (time_breaks[k] + time_breaks[k + 1]) / 2
alpha[0, k] = pi[k] * emission_probability(obs[0], t_mid, theta, 0, n_undist)
# Scale for numerical stability
scale = np.zeros(L)
scale[0] = alpha[0].sum()
alpha[0] /= scale[0]
# Forward recursion
for a in range(1, L):
for l in range(K):
alpha[a, l] = sum(alpha[a - 1, k] * P[k, l] for k in range(K))
t_mid = (time_breaks[l] + time_breaks[l + 1]) / 2
alpha[a, l] *= emission_probability(obs[a], t_mid, theta, 0, n_undist)
scale[a] = alpha[a].sum()
if scale[a] > 0:
alpha[a] /= scale[a]
total_ll += np.sum(np.log(scale[scale > 0]))
return total_ll
Gradient-Based Optimization
Unlike PSMC, which uses the EM algorithm, SMC++ uses gradient-based optimization (L-BFGS-B or similar). The gradient of the composite log-likelihood with respect to the population size parameters \(\lambda_k\) can be computed efficiently using:
Automatic differentiation through the matrix exponential (using the eigendecomposition from the previous chapter).
The forward-backward algorithm to compute expected sufficient statistics.
The optimization proceeds as:
Initialize lambda_k = 1 for all k
|
v
+-------> Solve ODE system for current lambda
| |
| v
| Build HMM (transitions from h(t), emissions from theta)
| |
| v
| Forward algorithm -> composite log-likelihood
| |
| v
| Compute gradient d(ell_C) / d(lambda_k)
| |
| v
| L-BFGS-B update: lambda <- lambda + step * direction
| |
| Converged?
| NO ---+
|
YES
|
v
Output: optimized lambda_0, ..., lambda_K
The use of L-BFGS-B instead of EM has two advantages:
Faster convergence: L-BFGS-B typically converges in fewer iterations than EM, especially near the optimum where EM’s linear convergence is slow.
Regularization: It is straightforward to add penalty terms (e.g., smoothness penalties on \(\lambda_k\)) to the objective function, which helps prevent overfitting.
from scipy.optimize import minimize
def fit_smcpp(data, time_breaks, theta, rho, max_iter=100):
"""Fit SMC++ model using L-BFGS-B optimization.
Parameters
----------
data : list of ndarray
Observation sequences for each distinguished sample.
time_breaks : array-like
Time interval boundaries.
theta : float
Scaled mutation rate (fixed).
rho : float
Scaled recombination rate (fixed).
max_iter : int
Maximum optimization iterations.
Returns
-------
lambdas : ndarray
Estimated piecewise-constant population sizes.
"""
K = len(time_breaks) - 1
# Optimize in log-space for positivity
def objective(log_lambdas):
lambdas = np.exp(log_lambdas)
# Negative log-likelihood (minimize)
return -composite_log_likelihood(data, time_breaks, lambdas, theta, rho)
# Initial guess: constant population
x0 = np.zeros(K)
result = minimize(
objective,
x0,
method='L-BFGS-B',
options={'maxiter': max_iter, 'disp': True},
)
return np.exp(result.x)
Comparison with PSMC
The following table summarizes how SMC++ differs from PSMC:
Component |
PSMC |
SMC++ |
|---|---|---|
Input |
1 diploid genome (phased) |
1–hundreds of diploids (unphased) |
Hidden state |
Coalescence time \(T\) |
Same: coalescence time \(T\) |
Coalescence rate |
\(1/\lambda(t)\) |
\(h(t) = E[J(t)]/\lambda(t)\) |
Extra machinery |
None |
ODE system for \(p_j(t)\) |
Optimization |
EM algorithm |
L-BFGS-B with gradients |
Recent resolution |
Poor (\(> 20{,}000\) years) |
Good (down to \(\sim 1{,}000\) years) |
Phasing required |
Yes |
No |
The fundamental architecture is the same: an HMM whose hidden states are discretized coalescence times. SMC++ adds one new gear (the ODE system) and replaces one (EM with L-BFGS-B), but the overall structure is recognizably PSMC’s.
Summary
The continuous HMM is the mainspring of SMC++. It takes the coalescence rate \(h(t)\) from the ODE system and builds a complete inference engine: transition matrix, emission probabilities, composite likelihood across samples, and gradient-based optimization. The result is a demographic inference method that inherits PSMC’s elegant structure while dramatically improving resolution in the recent past.
In the final chapter, we extend SMC++ to handle population splits – estimating divergence times and population-specific size histories from cross-population comparisons.