Gear 3: Ancestor Matching

Build the scaffold from the top down. The oldest beams bear the weight of everything below.

Ancestor matching is where the tree sequence takes shape – where we assemble the movement. We take the putative ancestors generated in Gear 1 and thread them together using the copying model from Gear 2, building the genealogical scaffold from the root down to the most recent ancestors.

The principle is simple: each ancestor is a mosaic of older ancestors. By processing ancestors from oldest to youngest, we ensure that when we match an ancestor, all its potential parents are already in the tree. In a watchmaker’s workshop, this is the stage where the mainplate is laid down first, then the bridges, then the wheel train – each layer resting on the one below it.

Prerequisites

This chapter builds directly on:

  • Gear 1: Ancestor Generation – the putative ancestors (and the generate_ancestors / add_ultimate_ancestor functions) that we now thread together.

  • Gear 2: The Copying Model – the Li & Stephens HMM and its Viterbi engine (viterbi_ls_with_noncopy), which performs every match.

Step 1: The Matching Order

Ancestors are processed from oldest to youngest. Recall that “time” is the derived allele frequency:

\[t_a = \frac{\text{count of derived alleles at focal site of } a}{n}\]

The oldest ancestor (the ultimate ancestor at \(t = 1.0\)) is placed first. It carries the ancestral allele at every site and serves as the root of the tree.

Next, ancestors are processed in time groups: all ancestors at the same time are matched simultaneously against the already-placed ancestors.

Probability Aside – Why Oldest First?

The oldest-first ordering is not arbitrary. It follows from a fundamental property of genealogies: an ancestor at time \(t_1\) can only be the parent of an ancestor at time \(t_2 < t_1\) (you must exist before your descendants). By processing from oldest to youngest, we guarantee that when we run the Li & Stephens HMM for a given ancestor, every possible parent is already in the reference panel. This is a topological sort of the genealogical DAG. If we processed ancestors in random order, a young ancestor might be placed before its true parent, and the Viterbi algorithm would be forced to choose a suboptimal copying source.
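The ordering itself is just a sort by time, descending. A minimal sketch (using the same 'time'-keyed dicts as the examples below) that produces a valid processing order and checks the topological property:

```python
def processing_order(ancestors):
    """Sort ancestors oldest-first: a topological order of the
    genealogical DAG, since a parent is always strictly older."""
    return sorted(ancestors, key=lambda a: -a['time'])

def is_valid_order(ordered):
    """No ancestor may precede a strictly older one."""
    times = [a['time'] for a in ordered]
    return all(times[i] >= times[i + 1] for i in range(len(times) - 1))

ancs = [{'time': 0.4}, {'time': 1.0}, {'time': 0.8}, {'time': 0.4}]
ordered = processing_order(ancs)
print([a['time'] for a in ordered])  # [1.0, 0.8, 0.4, 0.4]
print(is_valid_order(ordered))       # True
```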

import numpy as np

def matching_order(ancestors):
    """Determine the order for ancestor matching.

    Parameters
    ----------
    ancestors : list of dict
        Ancestors with 'time' field, sorted oldest first.

    Returns
    -------
    groups : list of list of dict
        Groups of ancestors at the same time.
    """
    groups = []
    current_time = None
    current_group = []

    for anc in ancestors:
        if anc['time'] != current_time:
            # New time group -- flush the previous group
            if current_group:
                groups.append(current_group)
            current_group = [anc]
            current_time = anc['time']
        else:
            # Same time -- add to current group
            current_group.append(anc)

    if current_group:
        groups.append(current_group)

    return groups

# Example
ancestors_example = [
    {'time': 1.0, 'focal': -1},   # Ultimate ancestor
    {'time': 0.8, 'focal': 3},
    {'time': 0.8, 'focal': 7},
    {'time': 0.6, 'focal': 1},
    {'time': 0.4, 'focal': 5},
    {'time': 0.4, 'focal': 9},
    {'time': 0.4, 'focal': 12},
    {'time': 0.2, 'focal': 2},
]

groups = matching_order(ancestors_example)
for i, group in enumerate(groups):
    times = [a['time'] for a in group]
    focals = [a['focal'] for a in group]
    print(f"Group {i}: time={times[0]:.1f}, "
          f"{len(group)} ancestors, focals={focals}")

With the matching order established, the next step is to construct the reference panel that each ancestor will be matched against.

Step 2: Building the Reference Panel

When matching ancestors in time group \(g\), the reference panel consists of all ancestors from groups \(0, 1, \ldots, g-1\) (i.e., all strictly older ancestors).

The panel is stored as a matrix \(H\) of shape \((m, k)\) where \(m\) is the number of inference sites and \(k\) is the number of older ancestors. Entries where an ancestor doesn’t span a site are set to NONCOPY.

Confusion Buster – Why the Panel Grows Over Time

Notice that the reference panel gets larger with each time group. When matching the oldest real ancestors (those just below the ultimate ancestor), the panel contains only the ultimate ancestor – a single all-zeros haplotype. When matching the youngest ancestors, the panel contains every older ancestor. This means the HMM has more states to choose from for younger ancestors, which makes their matching both more accurate (more potential parents) and more expensive (larger state space). The \(O(k)\) trick from Gear 2 keeps the per-site cost linear in \(k\).
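The growth in cost can be sketched in a few lines. The group sizes here are made up for illustration; the per-ancestor cost model \(O(mk)\) is the one from Gear 2:

```python
# Hypothetical time-group sizes, oldest first; the panel starts with
# just the ultimate ancestor (k = 1).
group_sizes = [2, 3, 5, 10]
m = 200  # assumed number of inference sites

k = 1
for g, size in enumerate(group_sizes):
    cost = size * k * m  # O(m k) per ancestor, times the group size
    print(f"group {g}: panel k={k}, group cost ~ {cost}")
    k += size  # this group's ancestors join the panel
```

The final groups dominate: even with modest group sizes, the last group's cost dwarfs the first's because it faces the largest panel.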

NONCOPY = -2

def build_reference_panel(placed_ancestors, num_inference_sites):
    """Build the reference panel from already-placed ancestors.

    Parameters
    ----------
    placed_ancestors : list of dict
        Ancestors already in the tree sequence.
    num_inference_sites : int
        Total number of inference sites.

    Returns
    -------
    panel : ndarray of shape (num_inference_sites, k)
        Reference panel with NONCOPY entries.
    node_ids : list of int
        Node ID for each column of the panel.
    """
    k = len(placed_ancestors)
    # Initialize everything as NONCOPY (not available for copying)
    panel = np.full((num_inference_sites, k), NONCOPY, dtype=int)

    for col, anc in enumerate(placed_ancestors):
        start = anc['start']
        end = anc['end']
        # Fill in the ancestor's haplotype where it is defined
        panel[start:end, col] = anc['haplotype']

    node_ids = [anc.get('node_id', col) for col, anc in
                enumerate(placed_ancestors)]
    return panel, node_ids

# Example
placed = [
    {'haplotype': np.zeros(10, dtype=int), 'start': 0, 'end': 10,
     'node_id': 0},  # Ultimate ancestor
    {'haplotype': np.array([0, 0, 1, 1, 0, 0, 1, 0]),
     'start': 1, 'end': 9, 'node_id': 1},
]
panel, node_ids = build_reference_panel(placed, num_inference_sites=10)
print(f"Panel shape: {panel.shape}")
print("Panel (showing NONCOPY as '.'):")
for site in range(10):
    row = ''.join(str(x) if x >= 0 else '.' for x in panel[site])
    print(f"  Site {site}: {row}")

Now we have the pieces: ancestors to match, a reference panel to match against, and the Viterbi engine from Gear 2. Let’s put them together.

Step 3: Match and Add Edges

For each ancestor in the current time group, we:

  1. Run the Viterbi algorithm against the reference panel

  2. Convert the Viterbi path to edges

  3. Add the edges and the ancestor node to the growing tree sequence

Each edge represents a segment of the ancestor’s genome that was “copied” from a specific older ancestor – in genealogical terms, a parent-child relationship over a genomic interval.

class TreeSequenceBuilder:
    """Incrementally builds a tree sequence from matching results.

    This is a simplified version of tsinfer's internal builder.
    It accumulates nodes and edges as ancestors and samples are matched.
    """

    def __init__(self, sequence_length, num_inference_sites, positions):
        self.sequence_length = sequence_length
        self.positions = positions
        self.num_inference_sites = num_inference_sites
        self.nodes = []   # dicts: {'id', 'time', 'is_sample'}
        self.edges = []   # tuples: (left, right, parent, child)
        self.next_id = 0

    def add_node(self, time, is_sample=False):
        """Add a node and return its ID."""
        node_id = self.next_id
        self.nodes.append({'id': node_id, 'time': time,
                           'is_sample': is_sample})
        self.next_id += 1
        return node_id

    def add_edges_from_path(self, path, child_id, ref_node_ids):
        """Convert a Viterbi path to edges and add them.

        Parameters
        ----------
        path : ndarray of shape (m,)
            Viterbi path (index into reference panel).
        child_id : int
            Node ID of the child.
        ref_node_ids : list of int
            Node IDs for each reference index.
        """
        m = len(path)
        # Walk through the path, emitting one edge per contiguous segment
        seg_start = 0
        current_ref = path[0]

        for ell in range(1, m):
            if path[ell] != current_ref:
                # Copying source changed -- emit edge for old segment
                left = self.positions[seg_start]
                right = self.positions[ell]
                parent = ref_node_ids[current_ref]
                self.edges.append((left, right, parent, child_id))
                seg_start = ell
                current_ref = path[ell]

        # Final segment extends to the end of the sequence
        left = self.positions[seg_start]
        right = self.sequence_length
        parent = ref_node_ids[current_ref]
        self.edges.append((left, right, parent, child_id))

    def summary(self):
        """Print a summary of the tree sequence."""
        print(f"Nodes: {len(self.nodes)}")
        print(f"Edges: {len(self.edges)}")
        samples = sum(1 for n in self.nodes if n['is_sample'])
        print(f"Samples: {samples}")

# Example: build a small tree sequence
positions = np.arange(0, 10000, 1000, dtype=float)
builder = TreeSequenceBuilder(
    sequence_length=10000,
    num_inference_sites=10,
    positions=positions
)

# Add the ultimate ancestor (root)
root_id = builder.add_node(time=1.0)
print(f"Root node: {root_id}")

# Add an ancestor at time 0.8
anc_id = builder.add_node(time=0.8)
# Suppose Viterbi says it copies from the root everywhere
path = np.zeros(10, dtype=int)  # Always copying from ref 0 (the root)
builder.add_edges_from_path(path, anc_id, ref_node_ids=[root_id])

builder.summary()

With the individual matching step defined, we can now write the complete loop that processes all ancestors from oldest to youngest.

Step 4: The Complete Ancestor Matching Loop

Putting it all together:

def match_ancestors(ancestors, inference_sites, positions,
                    recombination_rate, mismatch_ratio,
                    sequence_length):
    """Run the complete ancestor matching phase.

    Parameters
    ----------
    ancestors : list of dict
        Ancestors sorted by time (oldest first). First is the ultimate
        ancestor.
    inference_sites : ndarray of int
        Inference site positions.
    positions : ndarray of float
        Genomic positions of inference sites.
    recombination_rate : float
        Per-unit recombination rate.
    mismatch_ratio : float
        Mismatch-to-recombination ratio.
    sequence_length : float
        Total sequence length.

    Returns
    -------
    builder : TreeSequenceBuilder
        The constructed tree sequence.
    """
    m = len(inference_sites)
    builder = TreeSequenceBuilder(sequence_length, m, positions)

    # Phase 1: Add the ultimate ancestor as root (the mainplate)
    ultimate = ancestors[0]
    root_id = builder.add_node(time=ultimate['time'])
    ultimate['node_id'] = root_id

    placed = [ultimate]

    # Phase 2: Process remaining ancestors by time groups
    groups = matching_order(ancestors[1:])

    for group_idx, group in enumerate(groups):
        # Build reference panel from all placed (older) ancestors
        panel, ref_node_ids = build_reference_panel(placed, m)
        k = len(ref_node_ids)

        # Compute HMM parameters for this panel size
        rho = np.zeros(m)
        mu = np.zeros(m)
        for ell in range(1, m):
            d = positions[ell] - positions[ell - 1]
            rho[ell] = 1 - np.exp(-d * recombination_rate / max(k, 1))
            mu[ell] = 1 - np.exp(-d * recombination_rate * mismatch_ratio
                                  / max(k, 1))
        mu[0] = mu[1] if m > 1 else 1e-6

        # Match each ancestor in this group against the panel
        for anc in group:
            node_id = builder.add_node(time=anc['time'])
            anc['node_id'] = node_id

            # Build the query (ancestor's haplotype over its interval)
            query = np.full(m, -1, dtype=int)  # -1 = undefined
            query[anc['start']:anc['end']] = anc['haplotype']

            # Run Viterbi (only over the ancestor's interval)
            start, end = anc['start'], anc['end']
            if end - start < 2:
                # Too short for HMM -- just parent to root
                left = positions[start]
                right = positions[end - 1] + 1
                builder.edges.append((left, right, root_id, node_id))
            else:
                sub_query = query[start:end]
                sub_panel = panel[start:end]
                # Copy so zeroing the first entry doesn't leak back
                # into rho for later ancestors (slicing returns a view)
                sub_rho = rho[start:end].copy()
                sub_mu = mu[start:end]
                sub_rho[0] = 0.0  # No recombination at first site

                # Only use columns that have copiable entries
                copiable_cols = [col for col in range(k)
                                 if np.any(sub_panel[:, col] != NONCOPY)]

                if len(copiable_cols) == 0:
                    # No references available -- parent to root
                    left = positions[start]
                    right = positions[end - 1] + 1
                    builder.edges.append((left, right, root_id, node_id))
                else:
                    sub_panel_c = sub_panel[:, copiable_cols]
                    # Use viterbi_ls_with_noncopy from Gear 2
                    path = viterbi_ls_with_noncopy(
                        sub_query, sub_panel_c, sub_rho, sub_mu)
                    # Map path entries back to full-panel columns
                    mapped = [copiable_cols[p] for p in path]
                    # Emit edges directly, offsetting local site indices
                    # by the ancestor's start (add_edges_from_path
                    # assumes a path over all m sites, so its positions
                    # would be wrong for a partial ancestor)
                    seg_start = 0
                    current_ref = mapped[0]
                    for j in range(1, len(mapped)):
                        if mapped[j] != current_ref:
                            builder.edges.append(
                                (positions[start + seg_start],
                                 positions[start + j],
                                 ref_node_ids[current_ref], node_id))
                            seg_start = j
                            current_ref = mapped[j]
                    builder.edges.append(
                        (positions[start + seg_start],
                         positions[end - 1] + 1,
                         ref_node_ids[current_ref], node_id))

            # This ancestor is now placed and available for future groups
            placed.append(anc)

        print(f"  Group {group_idx}: time={group[0]['time']:.2f}, "
              f"matched {len(group)} ancestors, "
              f"panel size={k}")

    return builder

After matching, the raw tree sequence may contain polytomies – multiple children sharing the same parent over the same interval. Path compression resolves these into a more refined topology.

Step 5: Path Compression

After matching, the tree sequence often contains many edges that share the same parent-child-interval pattern. Path compression merges these redundant edges through synthetic path compression (PC) nodes.

The problem

Consider two ancestors \(a_1\) and \(a_2\) that both copy from ancestor \(a_0\) over the same genomic interval \([l, r)\). In the raw tree sequence, we have two edges:

Edge: [l, r) parent=a0, child=a1
Edge: [l, r) parent=a0, child=a2

If \(a_1\) and \(a_2\) are siblings in the true tree (they coalesce before reaching \(a_0\)), we’re missing the intermediate coalescent node. The tree has a polytomy (multiple children sharing one parent) that should be resolved.

The solution

Path compression inserts a synthetic node \(c\) between \(a_0\) and its children:

Edge: [l, r) parent=a0, child=c
Edge: [l, r) parent=c,  child=a1
Edge: [l, r) parent=c,  child=a2

The PC node \(c\) is assigned a time slightly less than \(a_0\):

\[t_c = t_{a_0} - \frac{1}{2^{32}}\]

This epsilon offset ensures the node is strictly younger than its parent, which is required for a valid tree sequence.
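A quick sanity check, not part of tsinfer itself, confirms that this epsilon both preserves strict ordering and stays far above float64 resolution for times on the \([0, 1]\) scale:

```python
import sys

PC_TIME_EPSILON = 1.0 / (2**32)

parent_time = 1.0
pc_time = parent_time - PC_TIME_EPSILON

# Strictly younger than the parent...
print(pc_time < parent_time)                     # True
# ...and the offset (~2.3e-10) is well above the float64
# machine epsilon (~2.2e-16), so the subtraction is not lost
print(PC_TIME_EPSILON > sys.float_info.epsilon)  # True
```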

Probability Aside – Path Compression and Tree Topology

Path compression doesn’t change the data likelihood of the tree sequence – the mutation patterns are identical whether we have a polytomy or a resolved binary split. But it can change downstream inferences. A polytomy says “we don’t know the order of coalescence among these lineages,” while a resolved binary tree implies a specific order. Path compression is a heuristic: it assumes that if multiple children share the same parent over the same interval, they likely coalesced together just below that parent. This is often (but not always) correct. The epsilon time offset means the PC node has no meaningful biological time – it is a topological device.

PC_TIME_EPSILON = 1.0 / (2**32)

def path_compress(edges, nodes):
    """Apply path compression to a set of edges.

    Parameters
    ----------
    edges : list of (left, right, parent, child)
        Raw edges from matching.
    nodes : list of dict
        Node information.

    Returns
    -------
    new_edges : list of (left, right, parent, child)
        Compressed edges.
    new_nodes : list of dict
        Updated nodes (may include new PC nodes).
    """
    from collections import defaultdict

    # Group edges by (left, right, parent) to find shared patterns
    groups = defaultdict(list)
    for left, right, parent, child in edges:
        groups[(left, right, parent)].append(child)

    new_edges = []
    new_nodes = list(nodes)
    next_id = max(n['id'] for n in nodes) + 1

    for (left, right, parent), children in groups.items():
        if len(children) <= 1:
            # Only one child -- no compression needed
            for child in children:
                new_edges.append((left, right, parent, child))
        else:
            # Multiple children share the same parent and interval
            # Insert a PC node between parent and children
            parent_time = None
            for n in nodes:
                if n['id'] == parent:
                    parent_time = n['time']
                    break

            # PC node sits just below the parent in time
            pc_time = parent_time - PC_TIME_EPSILON
            pc_node = {'id': next_id, 'time': pc_time,
                       'is_sample': False}
            new_nodes.append(pc_node)

            # Parent -> PC node (single edge replaces multiple)
            new_edges.append((left, right, parent, next_id))

            # PC node -> each child
            for child in children:
                new_edges.append((left, right, next_id, child))

            next_id += 1

    return new_edges, new_nodes

# Example
edges_raw = [
    (0, 5000, 0, 1),
    (0, 5000, 0, 2),
    (0, 5000, 0, 3),
    (5000, 10000, 0, 1),
    (5000, 10000, 1, 4),
]
nodes_raw = [
    {'id': 0, 'time': 1.0, 'is_sample': False},
    {'id': 1, 'time': 0.8, 'is_sample': False},
    {'id': 2, 'time': 0.6, 'is_sample': False},
    {'id': 3, 'time': 0.4, 'is_sample': False},
    {'id': 4, 'time': 0.2, 'is_sample': True},
]

compressed_edges, compressed_nodes = path_compress(edges_raw, nodes_raw)
print("Original edges:")
for e in edges_raw:
    print(f"  [{e[0]}, {e[1]}): {e[2]} -> {e[3]}")
print(f"\nCompressed edges:")
for e in compressed_edges:
    print(f"  [{e[0]}, {e[1]}): {e[2]} -> {e[3]}")
print(f"\nNew PC nodes:")
for n in compressed_nodes[len(nodes_raw):]:
    print(f"  Node {n['id']}: time={n['time']:.10f}")

Step 6: Verification

After ancestor matching, we can verify the ancestor tree sequence:

def verify_ancestor_tree(builder):
    """Verify basic properties of the ancestor tree sequence."""
    print("Ancestor tree verification:")

    # 1. Every non-root node has at least one parent edge
    root_id = 0  # Ultimate ancestor
    children_seen = set()
    for left, right, parent, child in builder.edges:
        children_seen.add(child)
    non_root_nodes = {n['id'] for n in builder.nodes if n['id'] != root_id}
    orphans = non_root_nodes - children_seen
    print(f"  [{'ok' if len(orphans) == 0 else 'FAIL'}] "
          f"All non-root nodes have parent edges "
          f"(orphans: {len(orphans)})")

    # 2. No self-loops (a node cannot be its own parent)
    self_loops = [(l, r, p, c) for l, r, p, c in builder.edges
                  if p == c]
    print(f"  [{'ok' if len(self_loops) == 0 else 'FAIL'}] "
          f"No self-loops")

    # 3. Parent time > child time for all edges
    time_map = {n['id']: n['time'] for n in builder.nodes}
    bad_times = []
    for left, right, parent, child in builder.edges:
        if time_map.get(parent, 0) <= time_map.get(child, 0):
            bad_times.append((parent, child))
    print(f"  [{'ok' if len(bad_times) == 0 else 'FAIL'}] "
          f"Parent time > child time for all edges "
          f"(violations: {len(bad_times)})")

    # 4. Summary statistics
    print(f"\n  Nodes: {len(builder.nodes)}")
    print(f"  Edges: {len(builder.edges)}")
    print(f"  Time range: [{min(time_map.values()):.4f}, "
          f"{max(time_map.values()):.4f}]")

With the ancestor tree built and verified, we have the scaffold of the genealogy – the movement’s mainplate, bridges, and wheel train are in place. The final chapter, Gear 4, will thread the actual samples through this scaffold and polish the result into a finished tree sequence.

Exercises

Exercise 1: Panel growth analysis

As ancestors are matched, the reference panel grows. Plot the panel size \(k\) as a function of the time group index. How does this affect the computational cost per ancestor?

Exercise 2: Edge count analysis

For a simulated dataset (use msprime with \(n = 100\), \(L = 10^5\)), run ancestor generation and matching. How many edges are created? How does this compare to the number of edges in the true tree sequence from the simulation?

Exercise 3: Path compression impact

Compare the tree sequence before and after path compression. How many PC nodes are created? What fraction of edges are affected? Visualize a local tree before and after compression.

Next: Gear 4: Sample Matching & Post-Processing – threading the actual samples through the ancestor tree.

Solutions

Solution 1: Panel growth analysis

We simulate a dataset, generate ancestors, group them by time, and track how the reference panel size \(k\) grows as each time group is processed.

import numpy as np
import matplotlib.pyplot as plt

# Simulate a variant matrix
np.random.seed(42)
n, m = 100, 200
D = np.random.binomial(1, 0.3, size=(n, m))
ancestral_known = np.ones(m, dtype=bool)

# Generate ancestors (using functions from Gear 1)
ancestors, inference_sites = generate_ancestors(D, ancestral_known)
ancestors = add_ultimate_ancestor(ancestors, len(inference_sites))

# Group by time and track panel growth
groups = matching_order(ancestors[1:])  # Exclude the ultimate ancestor
panel_sizes = []
cumulative = 1  # Start with just the ultimate ancestor

for group_idx, group in enumerate(groups):
    panel_sizes.append(cumulative)
    cumulative += len(group)

group_indices = np.arange(len(panel_sizes))
group_times = [g[0]['time'] for g in groups]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

ax1.plot(group_indices, panel_sizes, 'o-')
ax1.set_xlabel("Time group index (oldest first)")
ax1.set_ylabel("Panel size k")
ax1.set_title("Reference Panel Growth During Matching")

ax2.plot(group_times, panel_sizes, 'o-')
ax2.set_xlabel("Time (derived allele frequency)")
ax2.set_ylabel("Panel size k")
ax2.set_title("Panel Size vs. Ancestor Time")
ax2.invert_xaxis()  # Oldest (highest freq) on the left

plt.tight_layout()
plt.show()

# Computational cost analysis
print("Panel size at each time group:")
for i, (k, t) in enumerate(zip(panel_sizes, group_times)):
    group_size = len(groups[i])
    cost = k * group_size  # O(k) per ancestor, times group size
    print(f"  Group {i}: time={t:.2f}, k={k}, "
          f"ancestors={group_size}, cost ~ {cost}")

Key observations:

  • The panel grows monotonically: each time group adds its ancestors to the panel for subsequent groups. The growth is roughly linear in the number of groups.

  • Computational cost per ancestor is \(O(mk)\) where \(m\) is the number of inference sites and \(k\) is the current panel size. Since \(k\) is smallest for the oldest ancestors and largest for the youngest, the youngest ancestors are the most expensive to match.

  • The total cost across all ancestors is \(\sum_{g=0}^{G-1} |g| \cdot k_g \cdot m\), where \(|g|\) is the group size and \(k_g\) is the panel size at group \(g\). This sums to roughly \(O(A^2 m / 2)\) where \(A\) is the total number of ancestors, dominated by the last (youngest) groups.
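The \(O(A^2 m / 2)\) approximation can be checked directly for an idealized case of \(A\) ancestors split into \(G\) equal groups (the numbers here are arbitrary):

```python
A, m, G = 1000, 200, 100   # assumed totals: ancestors, sites, groups
size = A // G              # ancestors per group
# Panel size when matching group g is 1 + g * size (the ultimate
# ancestor plus all older groups); each ancestor costs O(m k).
total = sum(size * (1 + g * size) * m for g in range(G))
approx = A**2 * m / 2
print(f"exact ~ {total:,}, approx ~ {approx:,.0f}, "
      f"ratio = {total / approx:.3f}")  # ratio close to 1
```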

Solution 2: Edge count analysis

We compare the number of edges produced by ancestor matching against the number of edges in the true (simulated) tree sequence.

import msprime
import numpy as np

# Simulate a tree sequence
ts = msprime.simulate(
    sample_size=100, Ne=10_000, length=1e5,
    mutation_rate=1e-8, recombination_rate=1e-8,
    random_seed=42)

print(f"True tree sequence:")
print(f"  Nodes: {ts.num_nodes}")
print(f"  Edges: {ts.num_edges}")
print(f"  Trees: {ts.num_trees}")
print(f"  Mutations: {ts.num_mutations}")

# Build the variant matrix
G = ts.genotype_matrix()  # (num_sites, num_samples)
D = G.T  # (n, m)
n, m = D.shape
positions = np.array([s.position for s in ts.sites()])
ancestral_known = np.ones(m, dtype=bool)

# Run ancestor generation
ancestors, inference_sites = generate_ancestors(D, ancestral_known)
ancestors = add_ultimate_ancestor(ancestors, len(inference_sites))
inf_positions = positions[inference_sites]

# Run ancestor matching
builder = match_ancestors(
    ancestors, inference_sites, inf_positions,
    recombination_rate=1e-8, mismatch_ratio=1.0,
    sequence_length=ts.sequence_length)

print(f"\nInferred ancestor tree:")
print(f"  Ancestor nodes: {len(builder.nodes)}")
print(f"  Edges: {len(builder.edges)}")

# Each switch of copying source in a Viterbi path starts a new edge,
# so the edge count is approximately:
#   num_ancestors + total number of recombination breakpoints
num_ancestors = len(ancestors) - 1  # Exclude ultimate
avg_edges_per_ancestor = len(builder.edges) / max(num_ancestors, 1)
print(f"\n  Ancestors matched: {num_ancestors}")
print(f"  Avg edges per ancestor: {avg_edges_per_ancestor:.1f}")
print(f"  Ratio inferred/true edges: "
      f"{len(builder.edges) / ts.num_edges:.2f}")

Key observations:

  • The inferred tree typically has more edges than the true tree sequence because: (a) every ancestor creates at least one edge, and (b) Viterbi recombination breakpoints may not align perfectly with the true breakpoints, introducing extra edge boundaries.

  • After simplification (which removes ancestors not ancestral to any sample), the edge count drops substantially and approaches the true edge count.

  • The ratio of inferred-to-true edges depends on the recombination rate and sample size. Higher recombination creates more breakpoints in both the true and inferred trees.

Solution 3: Path compression impact

We apply path compression to the raw ancestor tree edges and measure how many PC nodes are created and how many edges are affected.

import numpy as np
from collections import defaultdict

# Use the example from the chapter or build from a simulation
# Here we construct a scenario with deliberate polytomies
edges_raw = [
    (0, 5000, 0, 1),     # Three children of node 0 on [0, 5000)
    (0, 5000, 0, 2),
    (0, 5000, 0, 3),
    (5000, 10000, 0, 1), # Two children of node 0 on [5000, 10000)
    (5000, 10000, 0, 4),
    (0, 10000, 1, 5),    # Single child -- not compressed
    (0, 5000, 1, 6),     # Single child on a different interval
]
nodes_raw = [
    {'id': i, 'time': 1.0 - i * 0.1, 'is_sample': i >= 5}
    for i in range(7)
]

# Apply path compression
compressed_edges, compressed_nodes = path_compress(edges_raw, nodes_raw)

# Analysis
original_edge_count = len(edges_raw)
compressed_edge_count = len(compressed_edges)
pc_nodes = compressed_nodes[len(nodes_raw):]
num_pc_nodes = len(pc_nodes)

print(f"Before compression:")
print(f"  Edges: {original_edge_count}")
print(f"  Nodes: {len(nodes_raw)}")
for e in edges_raw:
    print(f"    [{e[0]}, {e[1]}): {e[2]} -> {e[3]}")

print(f"\nAfter compression:")
print(f"  Edges: {compressed_edge_count}")
print(f"  Nodes: {len(compressed_nodes)}")
print(f"  PC nodes added: {num_pc_nodes}")
for e in compressed_edges:
    print(f"    [{e[0]}, {e[1]}): {e[2]} -> {e[3]}")

# Count affected edges: edges in groups with 2+ children
groups = defaultdict(list)
for l, r, p, c in edges_raw:
    groups[(l, r, p)].append(c)

affected_original = sum(len(children) for children in groups.values()
                        if len(children) > 1)
print(f"\n  Edges in polytomies (before): {affected_original}")
print(f"  Fraction of edges affected: "
      f"{affected_original / original_edge_count:.1%}")

# Verify: total edges after = sum over groups of
# (1 parent->PC + |children| PC->child) for multi-child groups
# plus unchanged edges for single-child groups
expected = sum(
    1 + len(ch) if len(ch) > 1 else 1
    for ch in groups.values()
)
assert compressed_edge_count == expected, \
    f"Expected {expected}, got {compressed_edge_count}"
print(f"  Edge count verification: [ok]")

Key observations:

  • Each polytomy (a parent with \(c > 1\) children over the same interval) generates one PC node. The \(c\) original edges are replaced by \(c + 1\) edges (one parent-to-PC, \(c\) PC-to-children). So path compression increases the total edge count.

  • The number of PC nodes is exactly equal to the number of distinct (left, right, parent) triples that have more than one child.

  • Path compression resolves polytomies into binary-ish structure, which is important for downstream analysis (e.g., tsdate expects resolved tree topologies). The epsilon time offset \(1/2^{32}\) ensures valid parent-child time ordering without implying a biologically meaningful coalescence time for PC nodes.