Overview of Mainspring
The master watchmaker does not discard the tools of their apprenticeship. They build a single instrument that embodies every lesson learned at the bench.
Every Timepiece in this book makes a bargain. PSMC trades multi-sample information for the elegance of a two-haplotype HMM. SMC++ gains samples but discretizes the coalescent into an ODE system. ARGweaver is exact under the discrete sequential Markov coalescent but costs \(O(S^2)\) per site and hours of MCMC per kilobase. tsinfer scales to millions of samples but surrenders posterior inference entirely, producing a single point estimate with no uncertainty. tsdate adds dates to the tree sequence but treats the topology as fixed. dadi and moments collapse the genome to a frequency histogram, discarding linkage information. Gamma-SMC maintains full posterior uncertainty over coalescence times but processes only two haplotypes at a time. phlash achieves scalable Bayesian demography but relies on composite likelihood.
Each Timepiece sits at a different point on the Pareto frontier between accuracy, scalability, and biological realism. No classical method occupies the corner where all three are maximized – the computational cost of exact inference under the full coalescent is simply too high.
Mainspring is an attempt to break this frontier. Not by inventing new population genetics – every equation in this book remains valid – but by compiling the structural insights of every Timepiece into a single neural architecture that learns to perform approximate posterior inference in a single forward pass.
What Mainspring Does
Given a genotype matrix \(\mathbf{D} \in \{0,1\}^{n \times L}\) (where \(n\) is the number of haploid samples and \(L\) is the number of segregating sites), Mainspring produces:
- A full ancestral recombination graph (ARG) in tskit format – topology, breakpoints, and node times.
- A posterior distribution over effective population size trajectories \(N_e(t)\).
Both outputs emerge from a single forward pass through the network. No MCMC sampling, no EM iterations, no per-dataset optimization at inference time.
Mainspring is thus a function \(f_\theta\): a neural network with parameters \(\theta\), trained on millions of simulated datasets from msprime. At training time, we have access to the true ARG and true demography for each simulation. At inference time, we have only the genotype matrix.
Amortized inference
The term amortized means that the computational cost of inference is paid once, during training, and then amortized across all future datasets. Training Mainspring takes days on a GPU cluster. But once trained, inference on a new dataset takes seconds. This is the same economics as compiling a program: the compiler is slow, but the compiled binary is fast. The simulations are the source code; the trained network is the compiled binary.
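The economics can be made concrete with a toy amortized estimator. This is plain NumPy, not Mainspring: a least-squares line stands in for the neural network, and the Gaussian model and `amortized_estimate` function are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# One-time "compilation": simulate many (parameter, data) pairs from the
# prior and fit an estimator mapping a data summary to the parameter.
# Toy model: mu ~ Uniform(-5, 5), data ~ Normal(mu, 1), 20 observations.
mus = rng.uniform(-5, 5, size=5000)
data = rng.normal(loc=mus[:, None], scale=1.0, size=(5000, 20))
summaries = data.mean(axis=1)

# Fit summary -> mu by least squares; this is the slow, paid-once step.
A = np.column_stack([summaries, np.ones_like(summaries)])
coef, *_ = np.linalg.lstsq(A, mus, rcond=None)

def amortized_estimate(x):
    """Inference on a new dataset is one cheap evaluation --
    no MCMC, no per-dataset optimization."""
    return coef[0] * x.mean() + coef[1]

x_obs = rng.normal(2.0, 1.0, size=20)   # new "observed" data, true mu = 2
estimate = amortized_estimate(x_obs)    # lands near 2 with no fitting step
```

The fitted slope is close to 1 because the sample mean is already a near-optimal summary here; a real amortized network learns both the summary and the mapping.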
For a thorough introduction to amortized inference, see The Other Paradigm: Neural Networks and Amortized Inference.
Why “Structure-Aware” Matters
Neural approaches to population genetics are not new. Sheehan & Song (2016) trained a deep network on summary statistics for joint demographic and selection inference. pg-gan (Wang et al. 2021) trained a GAN discriminator to distinguish real from simulated genotype matrices. ImaGene (Torada et al. 2019) applied a CNN to genotype matrices for selection detection. ReLERNN (Adrion et al. 2020) used an RNN for recombination rate estimation. These methods largely treat their input as a generic image or sequence, applying off-the-shelf architectures without encoding coalescent structure.
Mainspring is different in a specific, measurable way: every architectural choice corresponds to a mathematical property of the coalescent.
| Property | Generic approach | Mainspring |
|---|---|---|
| Positional structure | Flat convolution or full attention | Sliding-window causal attention (sequential Markov property from PSMC) |
| Sample exchangeability | Fixed sample ordering | Set Transformer, permutation-equivariant (from SMC++) |
| Output format | Scalar summary statistics | Full ARG in tskit format (from ARGweaver / tsinfer) |
| Tree dating | Regression to point estimate | GNN message passing with gamma posteriors (from tsdate / Gamma-SMC) |
| Haplotype relationships | Implicit in convolution filters | Cross-attention as Li & Stephens copying (from tsinfer / lshmm) |
| Demographic output | Point estimate of \(N_e\) | Normalizing flow posterior \(q(N_e(t))\) (from phlash) |
| Physics regularization | None | SFS-based auxiliary loss (from dadi / moments) |
The result is not that Mainspring is merely “better” in some generic sense – it is that the network converges faster, generalizes better to out-of-distribution demographies, and produces calibrated uncertainty estimates, because the architecture encodes the right inductive biases.
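To make one of these inductive biases concrete, here is a minimal sketch of the mask behind sliding-window causal attention. The function name and window convention are illustrative, not Mainspring's actual encoder (which arrives in the Architecture chapter).

```python
import numpy as np

def sliding_window_causal_mask(seq_len, window):
    """Boolean mask where position i may attend only to positions j with
    i - window < j <= i: causality plus a bounded window. This mirrors
    the sequential Markov property exploited by PSMC -- the genealogy at
    a site depends on nearby sites, so distant sites can be masked out,
    giving linear rather than quadratic attention cost."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

mask = sliding_window_causal_mask(6, 3)
# Row i allows min(i + 1, 3) positions: no future sites, no sites
# farther back than the window.
```

In a PyTorch implementation this boolean array would be passed (suitably converted) as the attention mask of each encoder layer.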
The Key Insight: Population Genetics Compiled into Architecture
The central thesis of Mainspring can be stated in one sentence:
Core thesis
The mathematical structure of population genetics – the sequential Markov property, permutation invariance of exchangeable samples, the Li & Stephens copying model, message-passing on trees, gamma-distributed coalescence times, and the site frequency spectrum as a sufficient statistic for demography – can be compiled into neural network architecture rather than hard-coded into likelihood functions.
What does “compiled” mean here? Consider the analogy to a mechanical watch. Each Timepiece in this book hand-crafts a specific gear train: PSMC builds a transition matrix, tsdate implements belief propagation, ARGweaver constructs an MCMC sampler. These are interpreted approaches – they execute the population-genetic equations step by step at inference time.
Mainspring takes a different approach. It compiles the equations into a fixed neural circuit during training. The circuit does not execute the equations – it has learned to shortcut them. But the circuit’s wiring diagram (the architecture) mirrors the structure of the equations, so the shortcuts are faithful.
This is why Mainspring is a Complication, not a replacement. A complication in horology adds functionality to the basic movement without altering the movement itself. The Timepieces are the movement. Mainspring is a complication that adds speed – but only because the movement is sound.
What Each Timepiece Trades Away
To appreciate what Mainspring attempts to unify, we must be precise about what each Timepiece sacrifices.
| Timepiece | Core mechanism | What it trades away | Mainspring’s response |
|---|---|---|---|
| PSMC | HMM on two haplotypes | Multi-sample information; piecewise-constant \(N_e(t)\) | Process all samples jointly; continuous \(N_e(t)\) via normalizing flow |
| SMC++ | ODE-based HMM with distinguished lineage | ARG topology; limited to \(\sim 200\) undistinguished samples | Output full ARG; permutation-equivariant encoder handles arbitrary \(n\) |
| ARGweaver | MCMC sampling of full ARG | Speed (\(O(S^2 K)\) per site); limited scalability | Single forward pass; linear-time sliding-window attention |
| SINGER | Gibbs sampling of ARG with GP prior | Speed (hours for megabase regions); sequential processing | Parallel across genomic windows; batched inference |
| tsinfer | Li & Stephens ancestor matching | Posterior uncertainty; dates; demographic inference | Posterior via dropout/ensemble; GNN dating; demographic decoder |
| tsdate | Inside-outside on fixed topology | Topology uncertainty; demographic inference | Joint topology and dating; demographic decoder |
| dadi | Diffusion equation for SFS | Linkage information; limited to \(\sim 3\) populations | Full sequence input; population structure as latent variable |
| moments | ODE system for SFS moments | Same as dadi (different numerical approach) | SFS used as auxiliary loss, not sole input |
| Gamma-SMC | Gamma-distributed coalescence-time posteriors | Two haplotypes only; no ARG topology | Gamma output heads on multi-sample GNN |
| phlash | SVGD over \(N_e(t)\) with composite likelihood | Composite likelihood approximation; pre-computed pairs | End-to-end training; full likelihood via simulation |
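To make one row concrete: gamma output heads in the style of Gamma-SMC amount to predicting a shape and rate per node and scoring the true simulated node times under the implied density. A minimal sketch of that loss, assuming the shape/rate parameterization (`gamma_nll` is illustrative, not Mainspring's actual dating head):

```python
import math

def gamma_nll(t, alpha, beta):
    """Negative log-density of a node time t under Gamma(alpha, beta),
    shape/rate parameterization. A dating head emitting (alpha, beta)
    per ARG node would be trained to minimize this against the true
    node times available from the msprime simulations."""
    return -(alpha * math.log(beta) - math.lgamma(alpha)
             + (alpha - 1.0) * math.log(t) - beta * t)

# The head carries its uncertainty with it: a Gamma(4, 2) posterior
# encodes both a point estimate and a spread.
alpha, beta = 4.0, 2.0
posterior_mean = alpha / beta       # 2.0
posterior_var = alpha / beta ** 2   # 1.0
```

Times near the posterior mean score a lower loss than times far from it, which is exactly the gradient signal that teaches the head to localize coalescence times.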
Honest Limitations
Mainspring is not a universal solution. It has fundamental limitations that no amount of engineering can fully resolve.
1. The simulation fidelity gap. Mainspring is only as good as its training simulations. If the real data-generating process includes features absent from msprime – gene conversion, structural variants, sequencing error, population structure not captured by the demographic model – the network may produce confidently wrong answers. This is the Achilles’ heel of all simulation-based inference: the posterior is conditioned on the simulator being correct.
2. No statistical guarantees. Unlike MCMC methods (which are asymptotically exact given enough iterations) or variational methods (which provide a lower bound on the evidence), Mainspring’s posterior approximation has no formal guarantees. The network may be miscalibrated, especially in regions of parameter space poorly represented in the training set.
3. Extrapolation. Neural networks extrapolate poorly. If the true demography lies outside the prior used to generate training data – a bottleneck more severe than any in the training set, a population size larger than any simulated – the network will struggle. The training prior must be chosen carefully and validated against held-out scenarios.
4. Interpretability. A Timepiece’s likelihood function can be inspected term by term. The PSMC transition matrix tells you exactly how recombination and population size interact. Mainspring’s learned representations are opaque. We can probe them with attention maps and ablation studies, but we cannot point to a specific neuron and say “this computes the coalescence rate.”
5. Training cost. Training Mainspring requires millions of msprime simulations (each producing a full ARG), hundreds of GPU-hours, and careful hyperparameter tuning. This is a one-time cost, but it is substantial. A lab without GPU resources may find classical methods more practical.
6. Recombination map dependency. Mainspring assumes a known recombination map. Errors in the recombination map propagate into errors in the inferred ARG and demography. This limitation is shared with ARGweaver and SINGER but not with methods that operate on summary statistics (dadi, moments).
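Limitation 2 can at least be measured. A standard simulation-based coverage check counts how often nominal credible intervals contain the truth; here an exact conjugate Gaussian toy stands in for the network's approximate posterior, so the model and numbers are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)

# Draw truths from the prior, simulate data, form each dataset's 90%
# credible interval, and count how often the truth falls inside.
# Toy model: theta ~ N(0, 1), x ~ N(theta, 1), exact posterior N(x/2, 1/2).
n_trials = 2000
theta = rng.normal(0.0, 1.0, n_trials)   # prior draws
x = rng.normal(theta, 1.0)               # one observation per trial

post_mean, post_sd = x / 2.0, np.sqrt(0.5)
z90 = 1.6449                             # 95th percentile of N(0, 1)
lo = post_mean - z90 * post_sd
hi = post_mean + z90 * post_sd
coverage = np.mean((theta >= lo) & (theta <= hi))
# For a calibrated posterior, coverage sits near 0.90; systematic
# deviation on held-out simulations flags miscalibration.
```

Running the same check with Mainspring's \(q(N_e(t))\) on held-out simulated demographies is the empirical substitute for the guarantees the network lacks.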
When to use Mainspring vs. a Timepiece
Use a Timepiece when you need interpretable, guaranteed inference on a well-characterized problem (e.g., estimating \(N_e(t)\) from a single diploid genome with PSMC). Use Mainspring when you need fast, approximate inference on many datasets and are willing to validate against classical methods on a subset.
The best practice is to use both: Mainspring for rapid screening, a Timepiece for careful analysis of interesting cases. This is the hybrid pipeline described in Comparison and Limitations.
The Road Ahead
The remaining chapters of this Complication build Mainspring from the ground up:
- Design Principles – One Per Timepiece – Ten design principles, one from each Timepiece. These are the architectural decisions that distinguish Mainspring from a generic neural network.
- Architecture – The four-stage architecture in full detail: genomic encoder, topology decoder, dating GNN, and demographic decoder. PyTorch pseudocode for every component.
- Training – The simulation engine, the composite loss function, the curriculum training strategy, and the complete training loop.
- Comparison and Limitations – A systematic comparison against every Timepiece, honest limitations revisited, and the hybrid pipeline that combines Mainspring with Escapement for principled refinement.
Each chapter follows the book’s rhythm: motivation, math, code, verification. But the “math” here is architecture design – the translation of population-genetic structure into neural network components. And the “verification” is not analytical but empirical: we check that the network recovers known truths from simulated data.
A minimal example showing Mainspring’s interface (assuming a trained model):
```python
import torch
import msprime

# Simulate a test dataset
ts = msprime.sim_ancestry(20, sequence_length=1e5, recombination_rate=1e-8,
                          population_size=10_000, random_seed=42)
ts = msprime.sim_mutations(ts, rate=1.25e-8, random_seed=42)
D = torch.tensor(ts.genotype_matrix().T, dtype=torch.float32).unsqueeze(0)

# Run inference (single forward pass)
model = Mainspring.load_pretrained("mainspring_v1.pt")
model.eval()
with torch.no_grad():
    results = model(D, hard=True)

predicted_arg = results['topology']
ne_posterior = results['ne_posterior']  # samples from q(N_e(t))
node_times = results['times']           # gamma means
```
Let us begin with the design principles that make Mainspring more than a black box.