MCMC Sampling
The mainspring: wind it up, let it tick, and each tick produces a new sample from the posterior — a complete genealogical history of your data.
This chapter describes how ARGweaver’s gears — time discretization, transitions, and emissions — mesh together into a complete MCMC (Markov Chain Monte Carlo) sampler. The MCMC explores the space of ARGs by repeatedly removing one chromosome’s thread and re-sampling it from the conditional posterior.
We have now forged every individual gear:
Time Discretization — the tick marks on the dial
Transition Probabilities — the gear train that advances the hidden state
Emission Probabilities — the escapement that couples the mechanism to data
In this chapter, we assemble them into a working watch and wind the mainspring.
The Gibbs Sampling Strategy
ARGweaver uses Gibbs sampling (a special case of MCMC) rather than the Metropolis-Hastings framework used by SINGER’s SGPR. The difference is fundamental:
Metropolis-Hastings (SINGER): propose a new state, compute an acceptance ratio, accept or reject.
Gibbs sampling (ARGweaver): sample the new state exactly from the conditional posterior. No accept/reject step needed — every proposal is accepted.
This is possible because the discrete-time HMM allows exact computation of the conditional posterior \(P(\text{thread}_k \mid \text{ARG}_{-k}, \mathbf{D})\) via the forward–backward algorithm.
Closing the confusion gap — What is Gibbs sampling?
For a thorough introduction to MCMC and Gibbs sampling, see the prerequisite chapter Markov Chain Monte Carlo.
Gibbs sampling is a strategy for sampling from a joint probability distribution \(P(x_1, x_2, \ldots, x_n)\) when the joint distribution is too complex to sample from directly, but the conditional distributions \(P(x_k \mid x_1, \ldots, x_{k-1}, x_{k+1}, \ldots, x_n)\) are tractable.
The algorithm is:
Start with some initial values \((x_1^{(0)}, x_2^{(0)}, \ldots, x_n^{(0)})\)
For each iteration \(t\):
Sample \(x_1^{(t)} \sim P(x_1 \mid x_2^{(t-1)}, x_3^{(t-1)}, \ldots, x_n^{(t-1)})\)
Sample \(x_2^{(t)} \sim P(x_2 \mid x_1^{(t)}, x_3^{(t-1)}, \ldots, x_n^{(t-1)})\)
…and so on for all variables.
After many iterations, the samples \((x_1^{(t)}, \ldots, x_n^{(t)})\) are drawn from (approximately) the joint distribution \(P(x_1, \ldots, x_n)\).
In ARGweaver, the “variables” are the threads of the \(n\) chromosomes. At each iteration, one thread \(x_k\) is removed and re-sampled from its conditional posterior \(P(\text{thread}_k \mid \text{all other threads}, \text{data})\). This conditional is exactly the posterior of an HMM — the forward algorithm computes it, and stochastic traceback produces a sample.
The watch metaphor captures this perfectly: Gibbs sampling is systematically removing and replacing each gear. You pull out one gear (remove a chromosome’s thread), examine the space it left (the partial ARG), manufacture a new gear that fits exactly (sample from the conditional posterior via the HMM), and insert it. After cycling through all gears, the watch is in a new valid configuration.
Why Gibbs works here
The conditional distribution of one chromosome’s thread, given the rest of the ARG and the data, is exactly the posterior of an HMM with known parameters. The forward algorithm computes this posterior in \(O(L \cdot S^2)\) time, and stochastic traceback produces an exact sample. This is the same as “sampling from the full conditional” in Gibbs sampling theory.
The guarantee: if you cycle through all chromosomes and re-sample each one’s thread, the stationary distribution of the Markov chain is the joint posterior \(P(\mathcal{G} \mid \mathbf{D})\).
Probability Aside — Why Gibbs converges to the correct distribution
Gibbs sampling satisfies detailed balance with respect to the target distribution \(\pi(\mathbf{x}) = P(\mathcal{G} \mid \mathbf{D})\). For a single-variable update of \(x_k\), the transition probability from \(\mathbf{x}\) to \(\mathbf{x}'\) (which differs only in coordinate \(k\)) is \(T(\mathbf{x} \to \mathbf{x}') = P(x_k' \mid \mathbf{x}_{-k})\). Detailed balance requires \(\pi(\mathbf{x}) T(\mathbf{x} \to \mathbf{x}') = \pi(\mathbf{x}') T(\mathbf{x}' \to \mathbf{x})\). Since \(T(\mathbf{x} \to \mathbf{x}') = P(x_k' \mid \mathbf{x}_{-k}) = \pi(\mathbf{x}') / \pi(\mathbf{x}_{-k})\) and \(T(\mathbf{x}' \to \mathbf{x}) = P(x_k \mid \mathbf{x}_{-k}) = \pi(\mathbf{x}) / \pi(\mathbf{x}_{-k})\), both sides equal \(\pi(\mathbf{x}) \pi(\mathbf{x}') / \pi(\mathbf{x}_{-k})\). Detailed balance holds, so \(\pi\) is stationary.
Sampling the Initial Tree
Before the MCMC begins, ARGweaver needs an initial ARG. It builds one by threading haplotypes sequentially, starting from a coalescent tree for the first pair.
The initial tree is sampled from a coalescent with variable population sizes:
import random
from math import exp
def sample_tree(k, popsizes, times):
"""
Sample a coalescent tree using a discrete-time coalescent.
Starting with k lineages at time 0, simulate coalescence events
through the time grid. At each time interval, the coalescence rate
depends on the number of lineages and the local population size.
Parameters
----------
k : int
Number of lineages (chromosomes).
popsizes : list of float
Effective population size at each time interval.
times : list of float
Discretized time points.
Returns
-------
list of float
Coalescence times (one per coalescence event).
"""
ntimes = len(times)
coal_times = []
timei = 0
n = popsizes[timei]
t = 0.0
k2 = k # current number of lineages
while k2 > 1:
# Coalescent rate: k2 choose 2 pairs, each coalescing at rate 1/(2N).
# This is the standard coalescent rate from :ref:`coalescent_theory`.
coal_rate = (k2 * (k2 - 1) / 2) / float(n)
# Sample waiting time to next coalescence (exponential distribution)
t2 = random.expovariate(coal_rate)
if timei < ntimes - 2 and t + t2 > times[timei + 1]:
# Crossed into next time interval; update population size.
# Do NOT record a coalescence --- the lineage survived this
# interval and moves on to the next one with a new N_e.
timei += 1
t = times[timei]
n = popsizes[timei]
continue
t += t2
coal_times.append(t)
k2 -= 1 # one fewer lineage after coalescence
return coal_times
After discretizing the initial tree to the time grid, the algorithm threads additional haplotypes one at a time using the HMM.
Closing the confusion gap — Why start with a pairwise coalescence?
The simplest possible ARG has two haplotypes: just a single tree with one coalescence event. ARGweaver starts here because (a) sampling a two-haplotype coalescence requires no HMM at all (just draw a coalescence time from the coalescent prior), and (b) once you have an ARG for two haplotypes, you can thread a third using the full HMM machinery. The ARG grows from 2 to 3 to 4 to \(n\) haplotypes, each step using the HMM to find where the new lineage best fits. This initial ARG is not a posterior sample — it is just a starting point for the MCMC. The burn-in phase (below) lets the chain forget this initial configuration.
Sampling SPRs from the DSMC
Once the initial tree is built, the ARG is extended along the genome by sampling recombination events (SPRs) from the Discrete SMC. At each position, there’s a chance of recombination; if it occurs, the tree changes via an SPR.
The five steps of sampling an SPR mirror the transition probability derivation in Transition Probabilities: first determine whether a recombination occurs, then where on the tree and when in time, and finally where the lineage re-coalesces.
Step 1: Sample Recombination Position
Recombination events are a Poisson process along the genome with rate \(\rho \cdot L\) per base pair, where \(L\) is the total tree length.
def sample_next_recomb(treelen, rho):
"""
Sample the distance to the next recombination event.
The waiting time (in base pairs) is exponential with rate
rho * treelen.
Parameters
----------
treelen : float
Total tree length.
rho : float
Per-base recombination rate.
Returns
-------
float
Distance in base pairs to the next recombination.
"""
rate = max(treelen * rho, rho) # guard against zero tree length
# Exponential waiting time: the mean distance between recombination
# events is 1 / (rho * treelen) base pairs.
return random.expovariate(rate)
Step 2: Sample Recombination Time
Given that recombination occurs, the time is weighted by the amount of branch material at each time interval:
def sample_recomb_time(nbranches, time_steps, root_age_index):
"""
Sample which time interval the recombination falls in.
Probability is proportional to the amount of branch material
at each time interval: nbranches[k] * time_steps[k].
Parameters
----------
nbranches : list of int
Number of branches at each time interval.
time_steps : list of float
Time step sizes.
root_age_index : int
Time index of the root (recombination can only happen below).
Returns
-------
int
Time index of the recombination.
"""
# Weight each interval by total branch material = count * duration.
# More branches and longer intervals mean more opportunity for
# recombination to land in that interval.
weights = [nbranches[i] * time_steps[i]
for i in range(root_age_index + 1)]
total = sum(weights)
probs = [w / total for w in weights]
# Sample from categorical distribution using inverse CDF.
r = random.random()
cumsum = 0.0
for i, p in enumerate(probs):
cumsum += p
if r < cumsum:
return i
return len(probs) - 1
Step 3: Sample Recombination Node
Given the time, the recombination branch is chosen uniformly among all branches that exist at that time index (excluding the root):
def sample_recomb_node(states, recomb_time_index, root_name):
"""
Sample which branch the recombination falls on.
Parameters
----------
states : set of (str, int)
Valid coalescent states.
recomb_time_index : int
Time index of the recombination.
root_name : str
Name of the root node (excluded from recombination).
Returns
-------
str
Name of the node below the recombination point.
"""
# Filter to branches that exist at this time, excluding the root
# (recombination above the root has no effect since there is only
# one lineage there).
branches = [name for name, timei in states
if timei == recomb_time_index and name != root_name]
return random.choice(branches)
Step 4: Sample Coalescence Time
After the recombination, the detached lineage must re-coalesce. It floats upward through the time grid, with a chance to coalesce at each interval — the same discrete coalescent process used in the transition probabilities.
Probability Aside — The coalescent race
Re-coalescence is a “race” between the detached lineage and the existing tree. At each time interval, the detached lineage has \(n_{\text{branches}}[j]\) potential partners. The probability of coalescing with any one of them in a small time interval \(\Delta t\) is approximately \(n_{\text{branches}}[j] \cdot \Delta t / (2N_j)\). If the lineage fails to coalesce, it moves to the next interval and tries again.
This is the same discrete-coalescent process that generates the re-coalescence distribution in the transition probability derivation (Transition Probabilities). The only difference is that here we are sampling from the distribution (using random numbers) rather than computing it (enumerating all possibilities).
def sample_coal_time(recomb_time_index, nbranches, popsizes,
coal_times, ntimes, recomb_node, states):
"""
Sample the re-coalescence time after a recombination.
The lineage starts at recomb_time_index and moves upward,
with a hazard of coalescing at each time interval proportional
to the number of available lineages / (2 * Ne).
Parameters
----------
recomb_time_index : int
Time index where recombination occurred.
nbranches : list of int
Number of branches at each time interval.
popsizes : list of float
Population sizes at each time interval.
coal_times : list of float
Interleaved time points and midpoints.
ntimes : int
Number of time intervals.
recomb_node : object
The recombination node (used to adjust lineage count).
states : set of (str, int)
Valid coalescent states.
Returns
-------
int
Time index of re-coalescence.
"""
j = recomb_time_index
last_kj = nbranches[max(j - 1, 0)]
while j < ntimes - 1:
kj = nbranches[j]
# Adjust: if the recomb node passes through this interval,
# it shouldn't count as an available coalescence partner.
# (A lineage cannot coalesce with itself.)
if ((recomb_node.name, j) in states and
recomb_node.parents[0].age > times[j]):
kj -= 1
assert kj > 0
# Compute exposure in this interval using the interleaved
# coal_times structure (see :ref:`argweaver_time_discretization`).
A = (coal_times[2*j + 1] - coal_times[2*j]) * kj
if j > recomb_time_index:
A += (coal_times[2*j] - coal_times[2*j - 1]) * last_kj
# Trial: coalesce in this interval?
# Draw a Bernoulli trial with success probability = coal_prob.
coal_prob = 1.0 - exp(-A / float(popsizes[j]))
if random.random() < coal_prob:
break
# Survived this interval; move to the next tick mark on the dial.
j += 1
last_kj = kj
return j
Step 5: Sample Coalescence Node
Given the coalescence time, the branch is chosen uniformly among valid branches at that time, excluding the recombination node itself and certain relatives:
def sample_coal_node(states, coal_time_index, recomb_node, tree, times):
"""
Sample which branch the re-coalescing lineage joins.
Parameters
----------
states : set of (str, int)
Valid coalescent states.
coal_time_index : int
Time index of the re-coalescence.
recomb_node : object
The node where recombination occurred.
tree : tree object
The local tree.
times : list of float
Discretized time points.
Returns
-------
str
Name of the node below the coalescence point.
"""
coal_time = times[coal_time_index]
# Build exclusion set: the recomb node and its descendants
# at the same time (since coal points collapse).
# The lineage cannot re-coalesce with the subtree it just
# detached from --- that would create a trivial recombination.
exclude = set()
def walk(node):
exclude.add(node.name)
if node.age == coal_time:
for child in node.children:
walk(child)
walk(recomb_node)
# Also exclude the recomb node's parent at its time
exclude_parent = (recomb_node.parents[0].name,
times.index(recomb_node.parents[0].age))
# Filter valid branches: must be at the right time index
# and not in the exclusion set.
branches = [(name, timei) for name, timei in states
if timei == coal_time_index
and name not in exclude
and (name, timei) != exclude_parent]
chosen = random.choice(branches)
return chosen[0]
Recap of the five sampling steps: We sample (1) the genomic position of the recombination, (2) the time interval, (3) the specific branch, (4) the re-coalescence time, and (5) the re-coalescence branch. Together, these define one SPR that transforms the local tree into a new local tree at the next recombination breakpoint.
The Full MCMC Loop
With all the pieces in place, the full MCMC loop is:
def argweaver_mcmc(sequences, times, popsizes, rho, mu,
num_iters=1000, burn_in=200):
"""
Run the ARGweaver MCMC sampler.
Parameters
----------
sequences : dict of {str: str}
Aligned haplotype sequences.
times : list of float
Discretized time points.
popsizes : list of float
Population sizes at each time interval.
rho : float
Per-base recombination rate.
mu : float
Per-base mutation rate.
num_iters : int
Number of MCMC iterations.
burn_in : int
Number of burn-in iterations to discard.
Yields
------
arg : ARG object
Sampled ARG (one per iteration after burn-in).
"""
names = list(sequences.keys())
n = len(names)
# ---- Step 1: Build initial ARG ----
# Thread haplotypes one at a time, starting from a pairwise
# coalescence. This is NOT a posterior sample --- just a
# starting point for the Markov chain.
arg = build_initial_arg(sequences, times, popsizes, rho, mu)
# ---- Step 2: MCMC iterations ----
for iteration in range(num_iters):
# Choose a random chromosome to re-thread.
# This is the "remove a gear" step in the watch metaphor.
remove_idx = random.randint(0, n - 1)
remove_name = names[remove_idx]
# Remove this chromosome's thread from the ARG.
# This yields a partial ARG for n-1 haplotypes ---
# the "space left by the removed gear."
partial_arg = remove_thread(arg, remove_name)
# Build the HMM for re-threading.
# States: (branch, time_index) at each genomic position
# Transitions: normal (within same tree) or switch (at breakpoints)
# Emissions: parsimony-based likelihood
# This assembles the time grid, transitions, and emissions
# from the previous three chapters.
hmm = build_threading_hmm(
partial_arg, sequences[remove_name],
times, popsizes, rho, mu
)
# Run forward algorithm (see :ref:`argweaver_transitions`).
forward_probs = forward_algorithm(hmm)
# Stochastic traceback: sample a thread from the posterior.
# This is the "manufacture a new gear" step --- the new
# thread is drawn from the exact conditional distribution.
new_thread = stochastic_traceback(forward_probs, hmm)
# Add the new thread back into the ARG.
# The "insert the new gear" step.
arg = add_thread(partial_arg, remove_name,
sequences[remove_name], new_thread)
# Yield sample (after burn-in).
# The first burn_in iterations are discarded because the
# chain has not yet converged to the stationary distribution.
if iteration >= burn_in:
yield arg
MCMC Loop Diagram:
Iteration i:
+---------+ +---------+ +---------+ +---------+
| Current | | Remove | | Build | | Sample |
| ARG | --> | thread | --> | HMM for | --> | new |
| (n haps)| | for k | | thread k| | thread |
+---------+ +---------+ +---------+ +---------+
|
+-------------------------------+
|
v
+---------+
| Add new |
| thread | --> ARG for iteration i+1
| to ARG |
+---------+
Comparison with SINGER’s SGPR
ARGweaver and SINGER both use iterative re-threading, but their MCMC strategies differ in important ways:
Property |
ARGweaver |
SINGER (SGPR) |
|---|---|---|
Update unit |
Single chromosome (leaf) |
Sub-graph (can include internal nodes) |
Proposal mechanism |
Exact conditional (Gibbs) |
Data-informed proposal (MH) |
Accept/reject |
Always accepted |
Metropolis-Hastings ratio |
Acceptance rate |
100% (by construction) |
High (~90%+) due to informed proposal |
Time model |
Discrete (finite grid) |
Continuous |
HMM structure |
Single HMM (branch + time) |
Two HMMs (branch, then time) |
Per-iteration cost |
\(O(L \cdot S^2)\) where \(S \sim k \cdot n_t\) |
\(O(L \cdot k)\) per HMM |
Scaling with \(k\) |
\(O(k^2 n_t^2)\) per site |
\(O(k)\) per site (approximate) |
Mixing |
Slower (one leaf at a time) |
Faster (sub-graphs span multiple nodes) |
The mixing tradeoff
ARGweaver re-threads one chromosome at a time, which means it takes \(n\) iterations to give every chromosome a chance to move. SINGER’s SGPR can modify multiple nodes simultaneously, potentially exploring the posterior more efficiently.
However, ARGweaver’s exact conditional sampling means every update is statistically “perfect” given the rest of the ARG — there’s no wasted computation from rejected proposals. SINGER compensates for its approximate proposals with very high acceptance rates from data-informed sampling.
In practice, both methods produce good posterior samples. ARGweaver is better suited for smaller sample sizes (\(n < 50\)) with high accuracy requirements. SINGER scales to much larger sample sizes.
Probability Aside — Metropolis-Hastings vs. Gibbs acceptance rates
In Metropolis-Hastings, the acceptance probability for a proposal \(x' \sim q(x' \mid x)\) is
If the proposal distribution \(q\) equals the conditional posterior \(\pi(x' \mid x_{-k})\), then \(\alpha = 1\) always — this is exactly Gibbs sampling. The Gibbs sampler is the special case of MH where the proposal is so good that nothing is ever rejected.
SINGER’s MH proposals are data-informed but not exactly equal to the conditional posterior (because SINGER uses continuous time and approximate two-HMM decoupling). The resulting acceptance rates are high (~90%+) but not 100%, meaning some computation is “wasted” on rejected proposals. ARGweaver wastes no computation on rejections, but each proposal is more expensive to compute (larger state space).
Convergence and Diagnostics
Like any MCMC, ARGweaver needs monitoring to ensure convergence:
Burn-in: The initial ARG (built by sequential threading) may not be representative of the posterior. Discard the first several hundred iterations.
Thinning: Consecutive samples are correlated (they differ by only one chromosome’s thread). Thin by keeping every \(n\)-th sample (where \(n\) is the number of haplotypes) to reduce autocorrelation.
Diagnostics:
Joint log-likelihood \(\log P(\mathbf{D} \mid \mathcal{G})\): should stabilize
Total tree length: should fluctuate around an equilibrium
Pairwise TMRCA between specific pairs: check for convergence at specific loci
Closing the confusion gap — Why burn-in and thinning?
Burn-in addresses the fact that the MCMC starts from an arbitrary initial ARG (the one built by sequential threading). This initial ARG may be far from the high-probability region of the posterior. The chain needs time to “forget” its starting point and reach the stationary distribution. Discarding early samples (the burn-in period) avoids contaminating your estimates with these unrepresentative samples.
Thinning addresses autocorrelation: consecutive MCMC samples differ by only one chromosome’s thread, so they are highly correlated. If you keep every sample, your effective sample size (the number of independent samples) is much smaller than the total number of samples. By keeping only every \(n\)-th sample (one per “sweep” through all chromosomes), you reduce this correlation. A sweep is like one full rotation of the watch’s second hand — every gear has been replaced once.
def compute_log_likelihood(arg, sequences, mu, times):
"""
Compute the joint log-likelihood of the data given the ARG.
This sums the log emission probability at every site for the
actual topology encoded in the ARG. Useful as an MCMC diagnostic.
Parameters
----------
arg : ARG object
The current ARG sample.
sequences : dict of {str: str}
Observed sequences.
mu : float
Mutation rate.
times : list of float
Discretized time points.
Returns
-------
float
Joint log-likelihood.
"""
total_ll = 0.0
for (start, end), tree in iter_local_trees(arg):
for pos in range(start, end):
for node in tree:
if not node.parents:
continue
# Branch length: distance from node to parent.
# Floored to avoid log(0).
blen = max(node.get_dist(), times[1] * 0.1)
# Under Jukes-Cantor: P(no mut) = exp(-mu*blen)
# P(specific mut) = 1/3 * (1 - exp(-mu*blen))
parent_base = sequences.get(node.parents[0].name,
'N') # internal
child_base = sequences.get(node.name, 'N')
if parent_base == child_base:
total_ll += -mu * blen
else:
total_ll += log(1.0/3 * (1 - exp(-mu * blen)))
return total_ll
Final recap: ARGweaver is a digital watch. Its time grid (Time Discretization) provides the tick marks; its transition probabilities (Transition Probabilities) are the gear train; its emission probabilities (Emission Probabilities) are the escapement; and its Gibbs sampler (this chapter) is the mainspring. Each MCMC iteration removes one gear (thread), manufactures an exact replacement via the HMM, and inserts it. After many ticks, the watch reads the correct time — posterior samples of the ARG.
For the continuous-time alternative, see SINGER. For the simpler case of a single diploid genome (no ARG, just a sequence of coalescence times), see PSMC in Coalescent Theory. For the shared theoretical foundations, see The Sequentially Markov Coalescent and Ancestral Recombination Graphs.
Exercises
Exercise 1: Gibbs vs. Metropolis-Hastings
Prove that Gibbs sampling satisfies detailed balance. That is, show that if the update for variable \(x_k\) samples from \(P(x_k \mid x_{-k})\), then the joint distribution \(P(x_1, \ldots, x_n)\) is a stationary distribution of the resulting Markov chain.
Exercise 2: Mixing time analysis
Consider an ARG with \(n = 10\) chromosomes and \(L = 10{,}000\) sites. How many MCMC iterations are needed for every chromosome to be re-threaded at least once (in expectation)? This is the coupon collector problem — the expected number is \(n \cdot H_n\) where \(H_n\) is the \(n\)-th harmonic number.
Exercise 3: Implement the full loop
Using the building blocks from previous chapters (time discretization, transitions, emissions), implement a simplified version of the MCMC loop for a small example (4 haplotypes, 100 sites). Run 1000 iterations and plot the total tree length at a specific site over iterations. Does it converge?
Exercise 4: SINGER comparison
For the same dataset, run both an ARGweaver-style Gibbs sampler and a SINGER-style MH sampler (using simplified transition/emission models). Compare: (a) wallclock time per iteration, (b) effective sample size after 1000 iterations, (c) autocorrelation of pairwise TMRCA at a fixed site. Which sampler is more efficient per iteration? Per unit of wallclock time?
Solutions
Solution 1: Gibbs vs. Metropolis-Hastings
Claim: If the update for variable \(x_k\) samples from the full conditional \(P(x_k \mid \mathbf{x}_{-k})\), then the joint distribution \(\pi(\mathbf{x}) = P(x_1, \ldots, x_n)\) is stationary.
Proof via detailed balance: Consider a Gibbs update of coordinate \(k\). The transition kernel from state \(\mathbf{x}\) to \(\mathbf{x}'\) (which differs only in coordinate \(k\)) is:
By the definition of conditional probability:
Similarly:
Now check detailed balance:
Both sides are equal to \(\pi(\mathbf{x}) \pi(\mathbf{x}') / \pi(\mathbf{x}_{-k})\). Therefore detailed balance holds, and \(\pi\) is a stationary distribution of the chain. \(\square\)
In ARGweaver, \(x_k\) is the thread of chromosome \(k\), and \(\mathbf{x}_{-k}\) is the partial ARG of all other chromosomes. The HMM forward-backward algorithm computes \(P(\text{thread}_k \mid \text{partial ARG}, \text{data})\) exactly, and stochastic traceback samples from it — satisfying the Gibbs requirement.
Solution 2: Mixing time analysis
This is the coupon collector problem: if at each iteration we choose one of \(n\) chromosomes uniformly at random to re-thread, the expected number of iterations until every chromosome has been re-threaded at least once is:
def harmonic(n):
return sum(1.0 / i for i in range(1, n + 1))
n = 10
Hn = harmonic(n)
expected_iters = n * Hn
print(f"n = {n}")
print(f"H_n = {Hn:.4f}")
print(f"Expected iterations for full coverage: {expected_iters:.1f}")
# n = 10
# H_n = 2.9290
# Expected iterations for full coverage: 29.3
For \(n = 10\): \(H_{10} = 1 + 1/2 + 1/3 + \cdots + 1/10 \approx 2.929\), so the expected number of iterations is \(10 \times 2.929 \approx 29.3\).
This means after about 30 iterations, every chromosome has been re-threaded at least once in expectation. This is one “sweep” through the chromosomes. For good mixing, ARGweaver typically runs many sweeps (hundreds to thousands of total iterations), with thinning every \(n\) iterations.
Note: this analysis only ensures every chromosome gets one chance to move. Actual mixing — convergence to the stationary distribution — typically requires many more sweeps, because changing one chromosome’s thread only partially decorrelates the chain from its previous state.
Solution 3: Implement the full loop
import random
from math import exp, log
def simplified_mcmc(n_haps=4, n_sites=100, n_iters=1000,
Ne=10000, mu=1.4e-8, rho=1e-8,
ntimes=20, maxtime=160000, delta=0.01):
"""
Simplified ARGweaver MCMC loop for demonstration.
This implementation tracks only the total tree length at a
fixed site across iterations, omitting the full ARG data
structure for clarity.
Returns
-------
tree_lengths : list of float
Total tree length at site 0, one per iteration.
"""
times = get_time_points(ntimes=ntimes, maxtime=maxtime, delta=delta)
popsizes = [Ne] * (ntimes - 1)
# Initialize: sample coalescence times from the prior
coal_events = sorted(sample_tree(n_haps, popsizes, times))
tree_lengths = []
for iteration in range(n_iters):
# Pick a random chromosome to re-thread
remove_idx = random.randint(0, n_haps - 1)
# Simplified re-threading: re-sample one coalescence time
# from the coalescent prior (this is a gross simplification
# of the full HMM-based re-threading, but captures the
# spirit of the Gibbs update).
if coal_events:
# Remove one coalescence event
event_idx = min(remove_idx, len(coal_events) - 1)
coal_events.pop(event_idx)
# Re-sample a coalescence time from the prior
# (conditional on current number of lineages)
k_remaining = n_haps - len(coal_events)
if k_remaining >= 2:
new_coal = sample_tree(k_remaining, popsizes, times)
if new_coal:
coal_events.append(new_coal[0])
coal_events.sort()
# Compute total tree length at site 0
# (sum of all branch lengths in the tree)
total_length = 0.0
k = n_haps
prev_t = 0.0
for ct in coal_events:
total_length += k * (ct - prev_t)
prev_t = ct
k -= 1
# Add root branch (1 lineage above last coalescence)
if coal_events:
total_length += 1 * (times[-1] - coal_events[-1])
tree_lengths.append(total_length)
return tree_lengths
# Run and check convergence
tree_lengths = simplified_mcmc(n_iters=1000)
print(f"Tree length after burn-in (last 100 iterations):")
print(f" Mean: {sum(tree_lengths[-100:]) / 100:.0f}")
print(f" Stdev: {(sum((x - sum(tree_lengths[-100:])/100)**2 "
f"for x in tree_lengths[-100:]) / 100)**0.5:.0f}")
# The tree length should stabilize after burn-in.
# For n=4 haplotypes with Ne=10000, the expected total tree
# length is 2*Ne*(1 + 1/2 + 1/3) * 2 = approximately 73,000
# (the sum of expected coalescent waiting times times branch counts).
The total tree length should stabilize after an initial burn-in period (typically 100–200 iterations for this small example). The equilibrium value depends on \(N_e\) and the number of haplotypes. Plotting the tree length trace reveals the characteristic MCMC pattern: rapid initial changes (burn-in), followed by fluctuations around a stable mean (stationarity).
Solution 4: SINGER comparison
import time
def argweaver_style_iteration(states, transitions, emissions,
priors, forward_probs):
"""
One Gibbs iteration: run forward algorithm, then traceback.
Cost: O(L * S^2) for full matrix, O(L * S) with rank-1 trick.
Always accepted.
"""
# Forward pass: O(L * S)
for s in range(len(emissions)):
pass # placeholder for forward computation
# Stochastic traceback: O(L * S)
# Acceptance: 100%
return True # always accepted
def singer_style_iteration(branch_hmm_states, time_hmm_states,
emissions, data):
"""
One MH iteration: propose via two-HMM, accept/reject.
Cost: O(L * k) for branch HMM + O(L * n_t) for time HMM.
Accepted with probability alpha.
"""
# Branch HMM: O(L * k)
# Time HMM: O(L * n_t) per branch
# Acceptance ratio
import random
alpha = 0.92 # typical acceptance rate
return random.random() < alpha
# Comparison framework (conceptual)
n_haps = 10
n_sites = 10000
ntimes = 20
# ARGweaver cost per iteration
S_argweaver = n_haps * ntimes # ~200 states
cost_aw = n_sites * S_argweaver # O(L * S) with rank-1
# SINGER cost per iteration
cost_singer = n_sites * n_haps # O(L * k) for branch HMM
print(f"Cost comparison (n={n_haps}, L={n_sites}, nt={ntimes}):")
print(f" ARGweaver (with rank-1): O(L * k * nt) = "
f"O({n_sites} * {n_haps} * {ntimes}) = O({cost_aw})")
print(f" SINGER: O(L * k) = "
f"O({n_sites} * {n_haps}) = O({cost_singer})")
print(f" Ratio: ARGweaver / SINGER = {cost_aw / cost_singer:.0f}x")
# Effective sample size comparison
n_iters = 1000
acceptance_singer = 0.92
ess_aw = n_iters # all accepted
ess_singer = n_iters * acceptance_singer
print(f"\n ESS after {n_iters} iterations:")
print(f" ARGweaver: {ess_aw:.0f} (100% acceptance)")
print(f" SINGER: {ess_singer:.0f} "
f"({acceptance_singer*100:.0f}% acceptance)")
# Efficiency = ESS / cost
eff_aw = ess_aw / cost_aw
eff_singer = ess_singer / cost_singer
print(f"\n Efficiency (ESS / cost):")
print(f" ARGweaver: {eff_aw:.2e}")
print(f" SINGER: {eff_singer:.2e}")
print(f" SINGER / ARGweaver: {eff_singer / eff_aw:.1f}x")
Summary of the comparison:
(a) Wallclock time per iteration: ARGweaver is slower by a factor of \(\sim n_t\) (the number of time points, typically 20) because its state space is \(O(k \cdot n_t)\) vs. SINGER’s \(O(k)\). With the rank-1 optimization, the gap is \(n_t\)-fold rather than \((k \cdot n_t)^2 / k\)-fold.
(b) Effective sample size: ARGweaver’s 100% acceptance rate gives it a slight ESS advantage per iteration. SINGER’s ~92% acceptance rate means ~8% of iterations are wasted. However, SINGER’s faster iterations more than compensate.
(c) Autocorrelation: Both methods re-thread one chromosome per iteration, so the autocorrelation of pairwise TMRCA decays at a similar rate per sweep (one pass through all chromosomes). SINGER may mix faster if its sub-graph updates move multiple nodes simultaneously.
Conclusion: SINGER is more efficient per unit of wallclock time for \(k \gtrsim 10\) due to its \(O(k)\) scaling. ARGweaver is more efficient per iteration (no rejected proposals) and provides exact conditional samples, making it preferable for small \(k\) or when posterior accuracy is paramount.