.. _mainspring_comparison:

=============================
Comparison and Limitations
=============================

   *The finest complication is worthless if the watchmaker cannot tell you, honestly,
   when it keeps perfect time and when it does not.*

This chapter puts Mainspring in context. We compare it systematically against every
Timepiece, enumerate its fundamental limitations without euphemism, sketch possible
extensions, and describe the hybrid pipeline where Mainspring's speed meets
:ref:`Escapement's <escapement_complication>` statistical rigor.


Detailed Comparison
=====================

The following table compares Mainspring against each Timepiece along four axes: what
the classical method trades away, how Mainspring addresses that limitation, what
Mainspring sacrifices in return, and the practical consequence for a user choosing
between them.

.. list-table:: Mainspring vs. every Timepiece
   :header-rows: 1
   :widths: 12 22 22 22 22

   * - Timepiece
     - Classical limitation
     - How Mainspring addresses it
     - What Mainspring sacrifices
     - When to prefer the Timepiece
   * - :ref:`PSMC <psmc_timepiece>`
     - Two haplotypes only; piecewise-constant :math:`N_e(t)`; slow EM convergence
     - Processes :math:`n` samples jointly; continuous :math:`N_e(t)` via normalizing
       flow; single forward pass
     - No convergence guarantee; opaque learned representation
     - When you have a single diploid genome and need interpretable, reproducible
       inference with well-understood error properties
   * - :ref:`SMC++ <smcpp_timepiece>`
     - Distinguished lineage assumption; ODE discretization artifacts; limited to
       ~200 undistinguished samples
     - Permutation-equivariant encoder; no distinguished lineage; arbitrary :math:`n`
     - Trained on coalescent simulations; may not generalize to data that violates
       the coalescent model
     - When you need multi-population split-time estimation (SMC++'s core strength)
   * - :ref:`ARGweaver <argweaver_timepiece>`
     - :math:`O(S^2 K)` per site; hours of MCMC for kilobase regions; limited to
       ~10 samples
     - Single forward pass; linear-time sliding-window attention; handles 50--100
       samples
     - No asymptotic exactness; posterior may be miscalibrated
     - When you need provably correct posterior samples and can afford the compute
   * - :ref:`SINGER <singer_timepiece>`
     - Gibbs sampling is slow; sequential processing; limited scalability
     - Parallel across genomic windows; batched GPU inference
     - No GP prior on branch lengths; less smooth posteriors
     - When the GP prior on branch lengths is scientifically important (e.g.,
       detecting rate variation)
   * - :ref:`tsinfer <tsinfer_timepiece>`
     - No posterior uncertainty; no node times; no demographic inference
     - Full posterior via gamma output heads; joint topology and dating; demographic
       decoder
     - Cannot scale to millions of samples (tsinfer's key advantage)
     - When you have biobank-scale data (>10,000 samples) and need only topology
   * - :ref:`tsdate <tsdate_timepiece>`
     - Requires fixed topology; factored posterior ignores cross-tree correlations
     - Joint topology and dating; GNN captures cross-tree consistency
     - tsdate's variational gamma is mathematically principled; Mainspring's GNN
       is a black box
     - When you have a high-quality tree sequence from tsinfer and need calibrated
       date posteriors
   * - :ref:`dadi <dadi_timepiece>`
     - Discards linkage; limited to ~3 populations; diffusion PDE is slow for
       high-dimensional SFS
     - Full sequence input retains linkage; SFS used as auxiliary loss
     - Cannot model population structure (the demographic decoder assumes a single
       panmictic population, whereas dadi handles up to ~3 populations)
     - When you need multi-population demographic inference with two or three
       populations
   * - :ref:`moments <moments_timepiece>`
     - Same as dadi (ODE rather than PDE); moment closure approximation
     - Same as for dadi; SFS loss provides physics-informed regularization
     - Same as for dadi
     - When you need fast SFS-based inference with well-characterized approximation
       error
   * - :ref:`Gamma-SMC <gamma_smc_timepiece>`
     - Two haplotypes only; no ARG topology; forward-only (no smoothing across the
       full chromosome)
     - Multi-sample; full ARG topology; bidirectional attention
     - Gamma-SMC's posterior is analytically motivated; Mainspring's gamma heads
       are learned
     - When you need analytically grounded pairwise coalescence-time posteriors
   * - :ref:`phlash <phlash_timepiece>`
     - Composite likelihood from pre-computed pairs; SVGD is expensive
     - End-to-end training; normalizing flow is fast at inference time
     - phlash has stronger theoretical grounding (score function estimator, SVGD
       convergence)
     - When you need Bayesian demographic inference with well-characterized
       convergence properties

.. admonition:: No free lunch

   Mainspring does not dominate any Timepiece on every axis. The pattern is
   consistent: Mainspring trades **statistical guarantees and interpretability** for
   **speed and multi-output capability**. This is the fundamental trade-off of
   amortized inference, and no architectural innovation can fully eliminate it.


What Mainspring Cannot Do
===========================

We identified six honest limitations in the :ref:`overview <mainspring_overview>`.
Here we expand on each with concrete failure modes.

1. The Simulation Fidelity Gap
---------------------------------

Mainspring's posterior is conditioned on the coalescent model implemented in msprime
being correct. The real data-generating process may include:

- **Gene conversion** (short-tract non-reciprocal recombination), which creates
  patterns that look like closely spaced double crossovers. msprime can simulate gene
  conversion, but it is not included in the default training prior.
- **Structural variants** (inversions, duplications, translocations), which violate
  the sequential Markov property by creating non-local genealogical correlations.
- **Sequencing and phasing errors**, which corrupt the genotype matrix in ways
  unrelated to the evolutionary process.
- **Background selection and selective sweeps**, which distort the genealogy in ways
  not captured by the neutral coalescent.

.. math::

   p_{\text{true}}(\mathbf{D}) \neq \int p(\mathbf{D} \mid \mathcal{A})\,
   p(\mathcal{A} \mid N_e)\, p(N_e)\, d\mathcal{A}\, dN_e

When the true generative process lies outside the model family, the network's
posterior :math:`q(N_e, \mathcal{A} \mid \mathbf{D})` may be arbitrarily wrong --
and, worse, it may be *confidently* wrong, because the training data never included
examples where the posterior should be diffuse.

**Mitigation.** Phase 4 of curriculum training (SLiM simulations) partially
addresses this, but cannot cover all possible model violations. Users should always
validate Mainspring's output against at least one classical method on a subset of
their data.

2. No Statistical Guarantees
-------------------------------

MCMC methods (ARGweaver, SINGER) produce samples from the true posterior given
enough iterations. Variational methods (tsdate's variational gamma) provide a lower
bound on the model evidence. Mainspring provides neither.

The network may produce:

- **Over-confident posteriors**: gamma distributions that are too narrow, covering
  the true time less often than the nominal credible level.
- **Biased point estimates**: systematically too young or too old node times in
  certain parts of the tree.
- **Poorly calibrated demographic posteriors**: normalizing flow samples that do not
  represent the true posterior density.

**Mitigation.** Monitor calibration on simulated validation data. If the 90%
credible interval covers the true value 85% of the time, the posteriors are
under-dispersed and should be inflated by a calibration factor.
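
This check can be sketched in a few lines of plain Python. The interval arrays here
are hypothetical stand-ins for the endpoints of Mainspring's credible intervals on
simulated validation data; the inflation step widens each interval about its
midpoint by a common factor:

.. code-block:: python

   def empirical_coverage(lowers, uppers, truths):
       """Fraction of credible intervals that contain the true value."""
       hits = sum(lo <= t <= hi for lo, hi, t in zip(lowers, uppers, truths))
       return hits / len(truths)

   def inflate_intervals(lowers, uppers, factor):
       """Widen each interval about its midpoint by a calibration factor."""
       mids = [(lo + hi) / 2 for lo, hi in zip(lowers, uppers)]
       halves = [(hi - lo) / 2 * factor for lo, hi in zip(lowers, uppers)]
       return ([m - h for m, h in zip(mids, halves)],
               [m + h for m, h in zip(mids, halves)])

In practice the factor can be chosen by bisection until coverage on held-out
simulations matches the nominal credible level.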

3. Extrapolation Failure
--------------------------

Neural networks interpolate well and extrapolate poorly. If the training prior
covers :math:`N_e \in [100, 100{,}000]` and the true population experienced a
bottleneck of :math:`N_e = 10`, the network has never seen this regime and may
produce nonsensical output.

**Mitigation.** Use a training prior that is deliberately wider than the expected
range of real parameters. Validate on held-out simulations at the edges of the prior.
Flag predictions that fall near the boundary of the training distribution.
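
A minimal boundary flag might look as follows, assuming (hypothetically) a
log-uniform training prior on :math:`N_e` over :math:`[100, 100{,}000]`; estimates
landing in the outer fraction of the prior's log-range are flagged for manual
validation:

.. code-block:: python

   import math

   def near_prior_boundary(n_e_estimate, low=100.0, high=100_000.0, margin=0.1):
       """Flag estimates in the outer `margin` fraction of the prior's log-range."""
       frac = (math.log(n_e_estimate) - math.log(low)) / (math.log(high) - math.log(low))
       return frac < margin or frac > 1.0 - margin

Flagged predictions should be cross-checked with a classical method, since the
network may be extrapolating outside its training distribution.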

4. Interpretability
---------------------

A PSMC transition matrix can be inspected element by element: each entry has a clear
physical meaning (probability of coalescence time changing from interval :math:`k` to
interval :math:`l` between adjacent bins). Mainspring's attention weights and GNN
messages have no such direct interpretation.

We can probe the network with:

- **Attention maps**: which positions attend to which? Do breakpoints in attention
  correspond to true recombination breakpoints?
- **Ablation studies**: how does performance degrade when each design principle is
  removed?
- **Gradient-based attribution**: which input sites contribute most to the predicted
  time of a specific node?

But these are post-hoc analyses, not built-in interpretability.
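
To make the attribution idea concrete, here is a finite-difference stand-in for
gradient attribution (the callable ``predict`` is a hypothetical black box mapping
a haplotype to a predicted node time; it is not part of Mainspring's API):

.. code-block:: python

   def site_attribution(predict, haplotype):
       """Score each site by how much flipping its allele changes the prediction."""
       base = predict(haplotype)
       scores = []
       for j in range(len(haplotype)):
           flipped = list(haplotype)
           flipped[j] = 1 - flipped[j]  # flip the allele at site j
           scores.append(abs(predict(flipped) - base))
       return scores

For a differentiable network, the same scores are obtained far more cheaply from
input gradients, but the flip-based version makes the question being asked explicit.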

5. Training Cost
------------------

A representative training run:

.. list-table::
   :header-rows: 1
   :widths: 30 70

   * - Resource
     - Requirement
   * - GPU
     - 4 × A100 (80 GB) for 3 days
   * - CPU (simulation)
     - 64 cores for on-the-fly msprime
   * - Storage
     - ~500 GB for checkpoints and logs
   * - Total GPU-hours
     - ~300 GPU-hours
   * - Estimated cloud cost
     - ~$600--1,200 (depending on provider)

This is a one-time cost, amortized across all future inference. But it places
Mainspring out of reach for labs without GPU access. By contrast, PSMC runs on a
laptop.

6. Recombination Map Dependency
----------------------------------

Mainspring requires a recombination map as input (or assumes a uniform rate). Errors
in the recombination map propagate into errors in the predicted ARG:

- **Under-estimated recombination rate** → too few predicted breakpoints → marginal
  trees that span too many sites, with averaged-out coalescence times.
- **Over-estimated recombination rate** → too many predicted breakpoints → fragmented
  trees with noisy time estimates.

Methods that operate on summary statistics (dadi, moments) are immune to this because
the expected SFS does not depend on the recombination map; recombination affects only
the variance of the observed spectrum, not its mean.


Possible Extensions
=====================

Mainspring as described handles a single panmictic population under neutrality. Several
extensions are natural:

**Population structure.** Replace the single-population demographic decoder with a
multi-population version that infers migration rates and divergence times. The encoder
and topology decoder are already population-agnostic (they process haplotypes without
population labels). The demographic decoder would condition on population assignments
(known or inferred) and output a structured demographic model.

**Natural selection.** Selection distorts the genealogy in characteristic ways:
selective sweeps produce star-like trees, background selection reduces effective
population size in low-recombination regions. A selection-aware Mainspring would add
a selection decoder that predicts a selection coefficient :math:`s` and a beneficial
allele frequency trajectory from the local tree shape.

**Ancient DNA.** Ancient samples are leaves at non-zero time in the tree. The encoder
can accommodate this by adding a "sampling time" feature to each leaf embedding. The
training simulations would include samples drawn from different time points.

**Iterative refinement with MCMC.** Mainspring's output can serve as the **initial
state** for a classical MCMC sampler (ARGweaver or SINGER). Instead of starting MCMC
from a random ARG, start from Mainspring's prediction. This can reduce burn-in time
from hours to minutes.

.. math::

   \mathcal{A}^{(0)} = f_\theta(\mathbf{D}), \quad
   \mathcal{A}^{(t+1)} \sim \text{MCMC}\bigl(\mathcal{A}^{(t)} \mid \mathbf{D}\bigr)
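
A sketch of this warm-start loop, with ``amortized_predict`` and ``mcmc_step`` as
hypothetical stand-ins for Mainspring's forward pass and one sweep of the classical
sampler:

.. code-block:: python

   def warm_started_mcmc(amortized_predict, mcmc_step, data, n_sweeps=500):
       """Initialize the chain at the amortized prediction, not a random ARG."""
       state = amortized_predict(data)     # A^(0) = f_theta(D)
       samples = []
       for _ in range(n_sweeps):
           state = mcmc_step(state, data)  # A^(t+1) ~ MCMC(A^(t) | D)
           samples.append(state)
       return samples

The sampler's stationary distribution is unchanged; only the burn-in shortens,
because the chain starts near the posterior's bulk.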

**Self-supervised pre-training.** Before training on labeled simulations, pre-train
the encoder on unlabeled genotype matrices using a masked-site prediction objective
(analogous to BERT's masked language modeling). This teaches the encoder useful
representations of haplotype structure without requiring expensive simulations.

.. code-block:: python

   import torch
   import torch.nn.functional as F

   def masked_site_pretraining_step(model, genotypes, mask_rate=0.15):
       """Self-supervised pre-training: predict masked sites."""
       # Draw the Bernoulli mask in float space so that integer genotype
       # tensors do not silently truncate mask_rate to zero.
       mask = torch.rand(genotypes.shape, device=genotypes.device) < mask_rate
       masked_genotypes = genotypes.clone()
       masked_genotypes[mask] = 0  # mask token (shared with the ancestral allele)

       Z = model.encoder(masked_genotypes.unsqueeze(0))
       predictions = model.site_predictor(Z).squeeze(0)

       loss = F.binary_cross_entropy_with_logits(
           predictions[mask], genotypes[mask].float()
       )
       return loss


The Hybrid Pipeline: Mainspring + Escapement
===============================================

The most powerful use of Mainspring is not as a standalone method but as the first
stage of a two-stage pipeline with :ref:`Escapement <escapement_complication>`.

**Mainspring** provides a fast, approximate posterior over ARGs and demography.
**Escapement** provides a principled, simulation-free variational inference engine
that refines any initial estimate using the coalescent likelihood itself (not
simulations). Together:

.. code-block:: text

   Genotype matrix D
         |
         v
   ┌──────────────────────┐
   │      MAINSPRING      │  ~1 second
   │   (amortized, fast)  │
   └──────────────────────┘
         |
         v
   Initial ARG + N_e(t)
         |
         v
   ┌──────────────────────┐
   │      ESCAPEMENT      │  ~10 minutes
   │    (variational,     │
   │  likelihood-based)   │
   └──────────────────────┘
         |
         v
   Refined ARG + N_e(t)
   with calibrated posteriors

The hybrid pipeline combines the strengths of both:

.. list-table::
   :header-rows: 1
   :widths: 25 25 25 25

   * - Property
     - Mainspring alone
     - Escapement alone
     - Hybrid
   * - Speed
     - Seconds
     - Hours (from random init)
     - Minutes (warm-started)
   * - Statistical guarantees
     - None
     - ELBO bound
     - ELBO bound
   * - Posterior calibration
     - Approximate
     - Principled
     - Principled
   * - Simulation dependency
     - Yes (training)
     - No
     - Amortized (training only)
   * - Scalability
     - 50--100 samples
     - 20--50 samples
     - 50--100 samples
   * - Output
     - Full ARG + :math:`N_e(t)`
     - Coalescent times + :math:`N_e(t)`
     - Full ARG + :math:`N_e(t)` (refined)

.. admonition:: Why warm-starting matters

   Escapement's variational inference must optimize a complex, multi-modal objective
   (the coalescent likelihood as a function of the genealogy). From a random
   initialization, this can take thousands of gradient steps to converge, and may
   settle in a local optimum far from the truth.

   Mainspring's output provides an initialization that is already close to the global
   optimum. Escapement then needs only a few hundred gradient steps to refine the
   estimate and calibrate the posterior. The wall-clock time drops from hours to
   minutes, and the risk of poor local optima is greatly reduced.

   This is analogous to the role of a mainspring in a mechanical watch: the mainspring
   provides the initial burst of energy that sets the mechanism in motion. The
   escapement then regulates that energy into precise, calibrated motion. Neither is
   sufficient alone -- but together they keep perfect time.


When to Use What
==================

A practical decision guide:

.. list-table::
   :header-rows: 1
   :widths: 40 60

   * - Scenario
     - Recommended approach
   * - Screening 1,000 genomes for demographic events
     - Mainspring alone (speed is paramount)
   * - Careful demographic inference from 50 samples
     - Hybrid: Mainspring → Escapement
   * - Single diploid genome, well-characterized species
     - :ref:`PSMC <psmc_timepiece>` (interpretable, proven, fast enough)
   * - Multi-population divergence times
     - :ref:`SMC++ <smcpp_timepiece>` or :ref:`dadi <dadi_timepiece>` (specialized
       for this task)
   * - Posterior samples from the full ARG, provably correct
     - :ref:`ARGweaver <argweaver_timepiece>` (no shortcut to exactness)
   * - Biobank-scale tree sequence (>10,000 samples)
     - :ref:`tsinfer <tsinfer_timepiece>` + :ref:`tsdate <tsdate_timepiece>` (only
       methods that scale)
   * - Teaching and understanding
     - The Timepieces, always (the whole point of this book)

.. admonition:: The watchmaker's perspective

   A grande complication is impressive, but the master watchmaker still keeps simple
   tools on the bench. The complication exists because the simpler mechanisms have
   been mastered first. If you have read this far, you have built every Timepiece by
   hand. You understand every gear. You can diagnose every failure mode.

   That understanding is what makes Mainspring useful rather than dangerous. Without
   it, the neural network is a black box that occasionally tells the wrong time. With
   it, the neural network is a powerful tool whose outputs you can check, calibrate,
   and trust -- because you know what the correct answer should look like.
