Comparison and Limitations

The finest complication is worthless if the watchmaker cannot tell you, honestly, when it keeps perfect time and when it does not.

This chapter puts Mainspring in context. We compare it systematically against every Timepiece, enumerate its fundamental limitations without euphemism, sketch possible extensions, and describe the hybrid pipeline where Mainspring’s speed meets Escapement’s statistical rigor.

Detailed Comparison

The following table compares Mainspring against each Timepiece along four axes: what the classical method trades away, how Mainspring addresses that limitation, what Mainspring sacrifices in return, and the practical consequence for a user choosing between them.

Mainspring vs. every Timepiece

| Timepiece | Classical limitation | How Mainspring addresses it | What Mainspring sacrifices | When to prefer the Timepiece |
|---|---|---|---|---|
| PSMC | Two haplotypes only; piecewise-constant \(N_e(t)\); slow EM convergence | Processes \(n\) samples jointly; continuous \(N_e(t)\) via normalizing flow; single forward pass | No convergence guarantee; opaque learned representation | When you have a single diploid genome and need interpretable, reproducible inference with well-understood error properties |
| SMC++ | Distinguished lineage assumption; ODE discretization artifacts; limited to ~200 undistinguished samples | Permutation-equivariant encoder; no distinguished lineage; arbitrary \(n\) | Trained on coalescent simulations; may not generalize to non-coalescent data | When you need multi-population split-time estimation (SMC++’s core strength) |
| ARGweaver | \(O(S^2 K)\) per site; hours of MCMC for kilobase regions; limited to ~10 samples | Single forward pass; linear-time sliding-window attention; handles 50–100 samples | No asymptotic exactness; posterior may be miscalibrated | When you need provably correct posterior samples and can afford the compute |
| SINGER | Gibbs sampling is slow; sequential processing; limited scalability | Parallel across genomic windows; batched GPU inference | No GP prior on branch lengths; less smooth posteriors | When the GP prior on branch lengths is scientifically important (e.g., detecting rate variation) |
| tsinfer | No posterior uncertainty; no node times; no demographic inference | Full posterior via gamma output heads; joint topology and dating; demographic decoder | Cannot scale to millions of samples (tsinfer’s key advantage) | When you have biobank-scale data (>10,000 samples) and need only topology |
| tsdate | Requires fixed topology; factored posterior ignores cross-tree correlations | Joint topology and dating; GNN captures cross-tree consistency | tsdate’s variational gamma is mathematically principled; Mainspring’s GNN is a black box | When you have a high-quality tree sequence from tsinfer and need calibrated date posteriors |
| dadi | Discards linkage; limited to ~3 populations; diffusion PDE is slow for high-dimensional SFS | Full sequence input retains linkage; SFS used as auxiliary loss | Cannot model arbitrary numbers of populations (dadi is more flexible for multi-population models) | When you need multi-population demographic inference with >3 populations |
| moments | Same as dadi (ODE rather than PDE); moment closure approximation | Same as for dadi; SFS loss provides physics-informed regularization | Same as for dadi | When you need fast SFS-based inference with well-characterized approximation error |
| Gamma-SMC | Two haplotypes only; no ARG topology; forward-only (no smoothing across the full chromosome) | Multi-sample; full ARG topology; bidirectional attention | Gamma-SMC’s posterior is analytically motivated; Mainspring’s gamma heads are learned | When you need analytically grounded pairwise coalescence-time posteriors |
| phlash | Composite likelihood from pre-computed pairs; SVGD is expensive | End-to-end training; normalizing flow is fast at inference time | phlash has stronger theoretical grounding (score function estimator, SVGD convergence) | When you need Bayesian demographic inference with well-characterized convergence properties |

No free lunch

Mainspring does not dominate any Timepiece on every axis. The pattern is consistent: Mainspring trades statistical guarantees and interpretability for speed and multi-output capability. This is the fundamental trade-off of amortized inference, and no architectural innovation can fully eliminate it.

What Mainspring Cannot Do

We identified six honest limitations in the overview. Here we expand on each with concrete failure modes.

1. The Simulation Fidelity Gap

Mainspring’s posterior is conditioned on the coalescent model implemented in msprime being correct. The real data-generating process may include:

  • Gene conversion (short-tract non-reciprocal recombination), which creates patterns that look like closely spaced double crossovers. msprime can simulate gene conversion, but it is not included in the default training prior.

  • Structural variants (inversions, duplications, translocations), which violate the sequential Markov property by creating non-local genealogical correlations.

  • Sequencing and phasing errors, which corrupt the genotype matrix in ways unrelated to the evolutionary process.

  • Background selection and selective sweeps, which distort the genealogy in ways not captured by the neutral coalescent.

Under any of these forces, the true data distribution no longer matches the marginal likelihood of the training model:

\[p_{\text{true}}(\mathbf{D}) \neq \int p(\mathbf{D} \mid \mathcal{A})\, p(\mathcal{A} \mid N_e)\, p(N_e)\, d\mathcal{A}\, dN_e\]

When the true generative process lies outside the model family, the network’s posterior \(q(N_e, \mathcal{A} \mid \mathbf{D})\) may be arbitrarily wrong – and, worse, it may be confidently wrong, because the training data never included examples where the posterior should be diffuse.

Mitigation. Phase 4 of curriculum training (SLiM simulations) partially addresses this, but cannot cover all possible model violations. Users should always validate Mainspring’s output against at least one classical method on a subset of their data.
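As a concrete starting point, that validation can be as simple as a log-space comparison of the two \(N_e(t)\) curves evaluated on a shared time grid. The helper below is an illustrative sketch, not part of Mainspring or any Timepiece:

import numpy as np

def ne_curve_disagreement(ne_mainspring, ne_classical):
    """Log-space RMSE between two N_e(t) estimates evaluated on the
    same time grid (e.g., Mainspring vs. PSMC). Large values flag
    datasets where the network's output deserves extra scrutiny."""
    log_ratio = np.log(np.asarray(ne_mainspring)) - np.log(np.asarray(ne_classical))
    return float(np.sqrt(np.mean(log_ratio ** 2)))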

2. No Statistical Guarantees

MCMC methods (ARGweaver, SINGER) produce samples from the true posterior given enough iterations. Variational methods (tsdate’s variational gamma) provide a lower bound on the model evidence. Mainspring provides neither.

The network may produce:

  • Over-confident posteriors: gamma distributions that are too narrow, covering the true time less often than the nominal credible level.

  • Biased point estimates: systematically too young or too old node times in certain parts of the tree.

  • Poorly calibrated demographic posteriors: normalizing flow samples that do not represent the true posterior density.

Mitigation. Monitor calibration on simulated validation data. If the 90% credible interval covers the true value 85% of the time, the posteriors are under-dispersed and should be inflated by a calibration factor.
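Concretely, one can run the network on a few hundred held-out simulations and compute both the empirical coverage and a rough inflation factor. The sketch below assumes arrays of true node times and the network’s 90% interval endpoints; the Gaussian rescaling is a crude heuristic, not a calibrated procedure:

import numpy as np
from scipy.stats import norm

def coverage_and_inflation(true_times, lower_90, upper_90):
    """Empirical coverage of nominal 90% credible intervals on simulated
    validation data, plus a crude Gaussian-approximation factor by which
    to widen the intervals if they are under-dispersed."""
    covered = (true_times >= lower_90) & (true_times <= upper_90)
    coverage = float(np.mean(covered))
    # Half-width a Gaussian needs for 90% coverage, relative to the
    # half-width implied by the coverage we actually observed.
    clipped = min(max(coverage, 0.001), 0.999)  # keep ppf finite
    inflation = float(norm.ppf(0.95) / norm.ppf(0.5 + clipped / 2))
    return coverage, inflation

For the 85% coverage in the example above, this factor comes out to roughly 1.14: the intervals should be widened by about 14% around their midpoints.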

3. Extrapolation Failure

Neural networks interpolate well and extrapolate poorly. If the training prior covers \(N_e \in [100, 100{,}000]\) and the true population experienced a bottleneck of \(N_e = 10\), the network has never seen this regime and may produce nonsensical output.

Mitigation. Use a training prior that is deliberately wider than the expected range of real parameters. Validate on held-out simulations at the edges of the prior. Flag predictions that fall near the boundary of the training distribution.
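A minimal flagging rule, assuming the training prior on \(N_e\) is log-uniform on a known range (the bounds below echo this chapter’s example and are not fixed constants):

import numpy as np

def near_prior_boundary(ne_trajectory, prior_lo=100.0, prior_hi=100_000.0, tol=0.05):
    """Flag an inferred N_e(t) trajectory that comes within `tol`
    (as a fraction of the log-space prior range) of the training
    prior's edges -- a sign the network may be extrapolating."""
    log_lo, log_hi = np.log(prior_lo), np.log(prior_hi)
    margin = tol * (log_hi - log_lo)
    log_ne = np.log(np.asarray(ne_trajectory))
    return bool((log_ne < log_lo + margin).any() or (log_ne > log_hi - margin).any())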

4. Interpretability

A PSMC transition matrix can be inspected element by element: each entry has a clear physical meaning (the probability that the coalescence time moves from time interval \(k\) to time interval \(l\) between adjacent genomic bins). Mainspring’s attention weights and GNN messages have no such direct interpretation.

We can probe the network with:

  • Attention maps: which positions attend to which? Do breakpoints in attention correspond to true recombination breakpoints?

  • Ablation studies: how does performance degrade when each design principle is removed?

  • Gradient-based attribution: which input sites contribute most to the predicted time of a specific node?

But these are post-hoc analyses, not built-in interpretability.
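To make the first probe concrete, here is one way to extract candidate breakpoints from a dense, row-normalized site-by-site attention map. The map layout is an assumption for illustration; Mainspring’s sliding-window attention would first need to be materialized per window:

import numpy as np

def attention_change_points(attn, top_k=20):
    """Sites where the attention pattern shifts most sharply between
    adjacent positions -- candidates to compare against the true
    recombination breakpoints of a simulated ARG.

    attn: (L, L) array; row i is the attention distribution of site i.
    """
    a, b = attn[:-1], attn[1:]
    # Cosine similarity between attention rows of neighboring sites.
    cos = (a * b).sum(axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-12
    )
    change = 1.0 - cos
    return np.sort(np.argsort(change)[-top_k:] + 1)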

5. Training Cost

A representative training run:

| Resource | Requirement |
|---|---|
| GPU | 4 × A100 (80 GB) for 3 days |
| CPU (simulation) | 64 cores for on-the-fly msprime |
| Storage | ~500 GB for checkpoints and logs |
| Total GPU-hours | ~300 |
| Estimated cloud cost | ~$600–1,200 (depending on provider) |

This is a one-time cost, amortized across all future inference. But it places Mainspring out of reach for labs without GPU access. By contrast, PSMC runs on a laptop.

6. Recombination Map Dependency

Mainspring requires a recombination map as input (or assumes a uniform rate). Errors in the recombination map propagate into errors in the predicted ARG:

  • Under-estimated recombination rate → too few predicted breakpoints → trees that are too wide, with averaged-out coalescence times.

  • Over-estimated recombination rate → too many predicted breakpoints → fragmented trees with noisy time estimates.

Methods that operate on summary statistics (dadi, moments) are immune to this error because the expected SFS does not depend on the recombination rate, only on the distribution of branch lengths.
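A cheap sanity check for map dependence is to rescale the map and watch how the predicted breakpoint count responds. In the sketch below, `infer` is a placeholder callable (not a real Mainspring API) that returns a tree sequence with a `num_trees` attribute, as a tskit TreeSequence does:

def map_sensitivity(infer, genotypes, rec_map, scales=(0.5, 1.0, 2.0)):
    """Re-run inference with the recombination map globally rescaled
    and report the predicted breakpoint count (num_trees - 1) at each
    scale. If the count swings far more than the scale factor, the
    inferred ARG is being driven by the map rather than the data."""
    return {s: infer(genotypes, rec_map * s).num_trees - 1 for s in scales}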

Possible Extensions

Mainspring as described handles a single panmictic population under neutrality. Several extensions are natural:

Population structure. Replace the single-population demographic decoder with a multi-population version that infers migration rates and divergence times. The encoder and topology decoder are already population-agnostic (they process haplotypes without population labels). The demographic decoder would condition on population assignments (known or inferred) and output a structured demographic model.

Natural selection. Selection distorts the genealogy in characteristic ways: selective sweeps produce star-like trees, background selection reduces effective population size in low-recombination regions. A selection-aware Mainspring would add a selection decoder that predicts a selection coefficient \(s\) and a beneficial allele frequency trajectory from the local tree shape.
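As a sketch of what such a decoder head might look like (the architecture and output parameterization here are illustrative assumptions, not a committed design):

import torch.nn as nn

class SelectionDecoder(nn.Module):
    """Hypothetical selection head: maps a local tree embedding to a
    positive selection coefficient s and a coarse allele-frequency
    trajectory over n_time_bins epochs."""

    def __init__(self, d_model, n_time_bins=32):
        super().__init__()
        self.s_head = nn.Sequential(
            nn.Linear(d_model, d_model), nn.GELU(),
            nn.Linear(d_model, 1), nn.Softplus(),  # s > 0 for a sweep
        )
        self.traj_head = nn.Sequential(
            nn.Linear(d_model, d_model), nn.GELU(),
            nn.Linear(d_model, n_time_bins), nn.Sigmoid(),  # frequencies in [0, 1]
        )

    def forward(self, tree_embedding):
        return self.s_head(tree_embedding), self.traj_head(tree_embedding)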

Ancient DNA. Ancient samples are leaves at non-zero time in the tree. The encoder can accommodate this by adding a “sampling time” feature to each leaf embedding. The training simulations would include samples drawn from different time points.
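The embedding change is small. A sketch, assuming leaf embeddings of shape \((n, d)\) and per-sample ages in generations, with log scaling as one arbitrary choice:

import torch

def add_sampling_time_feature(leaf_embeddings, sampling_times):
    """Concatenate a log-scaled sampling time to each leaf embedding so
    the encoder can distinguish ancient from modern samples.

    leaf_embeddings: (n, d) tensor; sampling_times: (n,) tensor of ages
    in generations (0 for modern samples)."""
    t = torch.log1p(sampling_times.float()).unsqueeze(-1)  # (n, 1)
    return torch.cat([leaf_embeddings, t], dim=-1)         # (n, d + 1)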

Iterative refinement with MCMC. Mainspring’s output can serve as the initial state for a classical MCMC sampler (ARGweaver or SINGER). Instead of starting MCMC from a random ARG, start from Mainspring’s prediction. This can reduce burn-in time from hours to minutes.

\[\mathcal{A}^{(0)} = f_\theta(\mathbf{D}), \quad \mathcal{A}^{(t+1)} \sim \text{MCMC}\bigl(\mathcal{A}^{(t)} \mid \mathbf{D}\bigr)\]
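A generic Metropolis-Hastings skeleton makes the warm start concrete. Here `log_post` and `propose` stand in for a real sampler’s kernels (ARGweaver’s threading moves, say); this is a sketch of the idea, not either tool’s API:

import math
import random

def warm_started_mcmc(arg0, log_post, propose, n_steps=500):
    """Refine Mainspring's predicted ARG arg0 with Metropolis-Hastings.
    Starting from arg0 rather than a random ARG means most steps are
    spent refining near the posterior mode instead of burning in."""
    arg, lp = arg0, log_post(arg0)
    for _ in range(n_steps):
        candidate = propose(arg)           # assumed symmetric proposal
        lp_candidate = log_post(candidate)
        if math.log(random.random()) < lp_candidate - lp:
            arg, lp = candidate, lp_candidate
    return arg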

Self-supervised pre-training. Before training on labeled simulations, pre-train the encoder on unlabeled genotype matrices using a masked-site prediction objective (analogous to BERT’s masked language modeling). This teaches the encoder useful representations of haplotype structure without requiring expensive simulations.

import torch
import torch.nn.functional as F

def masked_site_pretraining_step(model, genotypes, mask_rate=0.15):
    """Self-supervised pre-training: predict masked sites.

    genotypes: (n_haplotypes, n_sites) binary 0/1 tensor.
    """
    # The Bernoulli probabilities must be a float tensor, so build the
    # mask explicitly rather than with full_like on an integer matrix.
    mask = torch.bernoulli(
        torch.full(genotypes.shape, mask_rate, device=genotypes.device)
    ).bool()
    masked_genotypes = genotypes.clone()
    masked_genotypes[mask] = 0  # mask token (shared with the ancestral allele)

    Z = model.encoder(masked_genotypes.unsqueeze(0))  # add batch dimension
    predictions = model.site_predictor(Z).squeeze(0)

    # Score only the masked positions, as in BERT's masked LM objective.
    loss = F.binary_cross_entropy_with_logits(
        predictions[mask], genotypes[mask].float()
    )
    return loss

The Hybrid Pipeline: Mainspring + Escapement

The most powerful use of Mainspring is not as a standalone method but as the first stage of a two-stage pipeline with Escapement.

Mainspring provides a fast, approximate posterior over ARGs and demography. Escapement provides a principled variational inference engine that refines any initial estimate using the coalescent likelihood itself rather than simulations. Together:

Genotype matrix D
      |
      v
┌──────────────────────┐
│      MAINSPRING      │  ~1 second
│   (amortized, fast)  │
└──────────────────────┘
      |
      v
Initial ARG + N_e(t)
      |
      v
┌──────────────────────┐
│      ESCAPEMENT      │  ~10 minutes
│    (variational,     │
│   likelihood-based)  │
└──────────────────────┘
      |
      v
Refined ARG + N_e(t)
with calibrated posteriors

The hybrid pipeline combines the strengths of both:

| Property | Mainspring alone | Escapement alone | Hybrid |
|---|---|---|---|
| Speed | Seconds | Hours (from random init) | Minutes (warm-started) |
| Statistical guarantees | None | ELBO bound | ELBO bound |
| Posterior calibration | Approximate | Principled | Principled |
| Simulation dependency | Yes (training) | No | Amortized (training only) |
| Scalability | 50–100 samples | 20–50 samples | 50–100 samples |
| Output | Full ARG + \(N_e(t)\) | Coalescent times + \(N_e(t)\) | Full ARG + \(N_e(t)\) (refined) |

Why warm-starting matters

Escapement’s variational inference must optimize a complex, multi-modal objective (the coalescent likelihood as a function of the genealogy). From a random initialization, this can take thousands of gradient steps to converge, and may settle in a local optimum far from the truth.

Mainspring’s output provides an initialization that is already close to the global optimum. Escapement then needs only a few hundred gradient steps to refine the estimate and calibrate the posterior. The wall-clock time drops from hours to minutes, and the risk of poor local optima is greatly reduced.
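In code, the two stages compose into a few lines. Here `predict` and `refine` are placeholders for the two tools’ entry points, passed in as callables rather than drawn from a real API:

def hybrid_pipeline(predict, refine, genotypes, rec_map):
    """Two-stage inference: amortized prediction, then variational
    refinement warm-started from it. `predict` returns an initial
    (arg, ne_trajectory) pair; `refine` polishes both under the
    coalescent likelihood and returns calibrated posteriors."""
    init_arg, init_ne = predict(genotypes, rec_map)  # seconds
    return refine(genotypes, init_arg, init_ne)      # minutes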

This is analogous to the role of a mainspring in a mechanical watch: the mainspring provides the initial burst of energy that sets the mechanism in motion. The escapement then regulates that energy into precise, calibrated motion. Neither is sufficient alone – but together they keep perfect time.

When to Use What

A practical decision guide:

| Scenario | Recommended approach |
|---|---|
| Screening 1,000 genomes for demographic events | Mainspring alone (speed is paramount) |
| Careful demographic inference from 50 samples | Hybrid: Mainspring → Escapement |
| Single diploid genome, well-characterized species | PSMC (interpretable, proven, fast enough) |
| Multi-population divergence times | SMC++ or dadi (specialized for this task) |
| Posterior samples from the full ARG, provably correct | ARGweaver (no shortcut to exactness) |
| Biobank-scale tree sequence (>10,000 samples) | tsinfer + tsdate (only methods that scale) |
| Teaching and understanding | The Timepieces, always (the whole point of this book) |

The watchmaker’s perspective

A grande complication is impressive, but the master watchmaker still keeps simple tools on the bench. The complication exists because the simpler mechanisms have been mastered first. If you have read this far, you have built every Timepiece by hand. You understand every gear. You can diagnose every failure mode.

That understanding is what makes Mainspring useful rather than dangerous. Without it, the neural network is a black box that occasionally tells the wrong time. With it, the neural network is a powerful tool whose outputs you can check, calibrate, and trust – because you know what the correct answer should look like.