This document is a reference for the ercs Python module, which provides a straightforward interface to coalescent simulations of the extinction/recolonisation model. See [E08], [BEV10], [BKE10] and [BEV12].
Simulating the coalescent for the extinction/recolonisation model using ercs follows a basic pattern:
In the following examples we look at the parameters of the simulation, the structure of the simulated genealogies (and how we can analyse them) and how we can use these tools to estimate values of interest.
To simulate the history of a set of individuals at a sample of locations on a 2D torus, we first allocate an instance of the ercs.Simulator class. This class has a number of attributes which can be set to describe the parameters of the desired simulation. Most of these parameters have sensible defaults, but we must specify at least three of these before we can run a simulation. Here is a simple example:
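import ercs

def first_example(seed):
    # Simulator on a torus of diameter 20.
    sim = ercs.Simulator(20)
    # The sample is a list of 2-tuples; lineage j + 1 starts at sample[j].
    sim.sample = [(0, 0), (0, 5), (0, 10)]
    sim.event_classes = [ercs.DiscEventClass(u=0.5, r=1)]
    return sim.run(seed)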
In this example we allocate a simulator on a torus of diameter 20, set up our sample and event classes, and then run the simulation, returning the resulting genealogy. The size of the torus is rather arbitrary, as the real size of the habitat that we imagine our population evolving on is determined by the scale of events relative to the size of the torus. Thus, the size of the torus can be any value you wish, once the spatial extent of events is scaled appropriately. For the following examples we’ll tend to use rather large events, as it’s useful to have examples that run quickly. These are very unrealistic evolutionary scenarios.
The initial locations of the lineages whose ancestry we wish to simulate are specified using the ercs.Simulator.sample attribute. These are 2-tuples describing locations in two dimensions on the torus. Here, we simulate the history of three locations, (0, 0), (0, 5) and (0, 10).
Before we can simulate the history of this sample, we must describe the model under which we imagine the population has evolved. This is done by allocating some objects that describe the type of events that we are interested in, and assigning these to the ercs.Simulator.event_classes attribute. In the example above, we state that all events in the simulation are from the Disc model, and they have radius r = 1 and impact u = 0.5. There can be any number of event classes happening at different rates: see Event Classes for details.
After we have completed setting up the parameters of the simulation we can then run the simulation by calling the ercs.Simulator.run() method for a given random seed. This returns the simulated history of the sample.
The most important part of ercs to understand is the way in which we encode genealogies. Running the example above, we get
>>> first_example(3)
([[-1, 4, 4, 5, 5, 0]], [[-1, 0.0, 0.0, 0.0, 30441.574004183603, 46750.11224375103]])
(Note there is nothing special about the seed 3 here—it is just a value which produced a neat example to discuss). This output completely describes the ancestry of the sample, although it’s not immediately obvious how. In ercs we use oriented trees to represent the genealogy of a sample. In an oriented tree, we are only interested in the parent-child relationships between nodes, and don’t care about the order of children at a node. Therefore, in an oriented tree pi, the parent of node j is pi[j]. If we map each node in the tree to a unique positive integer and adopt the convention that any node whose parent is the special “null node” 0 is a root, we can then represent an oriented tree very simply as a list of integers.
In our example above, we have a list of three locations as our sample, and so we map these to the integers 1, 2 and 3 (i.e., lineage 1 is sampled at location (0, 0) and so on). The ercs.Simulator.run() method returns a tuple, (pi, tau); pi is a list of oriented forests (one for each locus) and tau is a list of node times (one for each locus). In the example, we are dealing with a single locus only, so pi is a list consisting of one list, [-1, 4, 4, 5, 5, 0], that encodes the following tree:
It may be easier to see this if we explicitly map the nodes to their parents:
>>> pi, tau = first_example(3)
>>> [(node, pi[0][node]) for node in range(1, len(pi[0]))]
[(1, 4), (2, 4), (3, 5), (4, 5), (5, 0)]
Note
The zero’th element of an oriented forest and its associated node time list is not used and is set to -1 by convention, following Knuth (Algorithm O, section 7.2.1.6) [K11].
The times labelled on the tree are derived from the node times list for this locus, tau[0]. The node times list associated with an oriented tree records the time that the associated lineage entered the sample, looking backwards in time (hence, for each node in the sample the time is 0.0).
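As a minimal sketch of how this encoding can be read, the following pure-Python helpers invert the parent list from the example above into a map from each parent to its children, and use tau to compute the length of the branch above each node. The names children_of and branch_lengths are hypothetical and are not part of ercs.

```python
# Output of first_example(3), copied from the example above.
pi = [[-1, 4, 4, 5, 5, 0]]
tau = [[-1, 0.0, 0.0, 0.0, 30441.574004183603, 46750.11224375103]]

def children_of(forest):
    """Map each parent to the list of its children (hypothetical helper).

    Roots appear as children of the null node 0; element 0 is unused.
    """
    children = {}
    for node in range(1, len(forest)):
        children.setdefault(forest[node], []).append(node)
    return children

def branch_lengths(forest, times):
    """Return {node: time between the node and its parent}, skipping roots."""
    return {
        node: times[forest[node]] - times[node]
        for node in range(1, len(forest))
        if forest[node] != 0
    }

children = children_of(pi[0])
lengths = branch_lengths(pi[0], tau[0])
print(children)  # {4: [1, 2], 5: [3, 4], 0: [5]}
print(lengths)
```

Node 5 has no entry in lengths because it is the root, and the branches above the sampled nodes 1 and 2 both end at node 4, at the time recorded in tau[0][4].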
Oriented forests occur when there is more than one root in a list pi, and so we have a set of disconnected trees. This can happen when we specify the max_time attribute, potentially stopping the simulation before the sample has completely coalesced. Consider the following example:
def oriented_forest_example(seed):
    L = 20
    sim = ercs.Simulator(L)
    sim.event_classes = [ercs.DiscEventClass(u=0.5, r=1)]
    sim.sample = [(j, j) for j in range(10)]
    sim.max_time = 1e5
    pi, tau = sim.run(seed)
    return pi[0]
Here we allocate a Simulator on a torus of diameter 20 as before and use the usual event class. This time we allocate a sample of size 10, arranged regularly along a line, and stipulate that the simulation should continue for no more than 100000 time units (max_time = 1e5). As we’re only interested in the structure of the genealogy this time, we just return the oriented forest at the first locus. Running this, we get
>>> oriented_forest_example(5)
[-1, 0, 15, 0, 12, 12, 13, 11, 13, 11, 16, 16, 14, 14, 15, 0, 0]
This corresponds to the forest:
In this forest there are four roots: 1, 3, 15 and 16.
Note
This forest is not a correct representation of the node times; in any simulation, node n + 1 cannot be more recent than node n.
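The roots can also be recovered programmatically. The sketch below follows parent pointers in the forest shown above and groups the ten sampled nodes by the root of the tree they belong to. The helper names root_of and samples_by_root are hypothetical and are not part of ercs.

```python
# Output of oriented_forest_example(5), copied from the example above.
forest = [-1, 0, 15, 0, 12, 12, 13, 11, 13, 11, 16, 16, 14, 14, 15, 0, 0]

def root_of(forest, node):
    """Follow parent pointers from node until a root (parent 0) is reached."""
    while forest[node] != 0:
        node = forest[node]
    return node

def samples_by_root(forest, sample_size):
    """Group the sampled nodes 1..sample_size by the root of their tree."""
    groups = {}
    for node in range(1, sample_size + 1):
        groups.setdefault(root_of(forest, node), []).append(node)
    return groups

groups = samples_by_root(forest, 10)
print(sorted(groups))  # [1, 3, 15, 16]
print(groups[15])      # [2, 4, 5, 6, 8]
print(groups[16])      # [7, 9, 10]
```

Nodes 1 and 3 are roots of single-node trees: these lineages had not coalesced with any other lineage when the simulation stopped.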
The most important quantity in coalescent simulations is the coalescence time for a set of individuals, or the time back to their most recent common ancestor (MRCA). This is straightforward in ercs: we use the ercs.MRCACalculator class to find the most recent common ancestor of two nodes, and then look up the node times list to find the time at which this node entered the sample.
Suppose we wished to find the coalescence time for a set of lineages sampled at a regular set of distances. We can set our sample so that the lineages are located at the relevant distances, but it’s not immediately clear how we can get the pairwise coalescence times from the simulated history. The following example shows one way to do this:
def mrca_example(seed):
    L = 40
    sim = ercs.Simulator(L)
    # Nine lineages on a line; node j starts at sim.sample[j - 1] = (0, j).
    sim.sample = [(0, j) for j in range(1, 10)]
    sim.event_classes = [ercs.DiscEventClass(u=0.5, r=1)]
    pi, tau = sim.run(seed)
    sv = ercs.MRCACalculator(pi[0])
    for j in range(2, 10):
        # Coalescence time of node 1 with node j at the first locus.
        mrca = sv.get_mrca(1, j)
        coal_time = tau[0][mrca]
        distance = ercs.torus_distance(sim.sample[0], sim.sample[j - 1], L)
        print(distance, "\t", coal_time)
Running this, we get
>>> mrca_example(2)
1.0 293.516221072
2.0 1240.09792256
3.0 1505.42515816
4.0 247645.676128
5.0 247645.676128
6.0 263761.554043
7.0 263761.554043
8.0 263761.554043
Dealing with multiple loci.
The most common use of coalescent simulation is to estimate the distribution of some quantity by aggregating over many different replicates. This is done in ercs by calling the run method with different random seeds, one for each replicate. Since each replicate is then completely independent, we can easily parallelise the process. One possible way to do this is to use the multiprocessing module:
import ercs
import multiprocessing

def parallel_run(seed):
    sim = ercs.Simulator(50)
    sim.sample = [(1, 1), (2, 2)]
    sim.event_classes = [ercs.DiscEventClass(u=0.5, r=1)]
    pi, tau = sim.run(seed)
    # With two sampled lineages (nodes 1 and 2), node 3 is their MRCA.
    coal_time = tau[0][3]
    return coal_time

def parallel_example(num_replicates):
    pool = multiprocessing.Pool(processes=multiprocessing.cpu_count())
    coal_times = pool.map(parallel_run, range(1, num_replicates + 1))
    return sum(coal_times) / num_replicates
In this example we are working on a torus of diameter 50, so the simulations take much longer to run than the earlier examples. Most modern systems have many CPU cores, and so we use the multiprocessing module to distribute the work of many replicates across these cores.
>>> parallel_example(100)
2968953.9276501946
This is the mean coalescence time among 100 replicates. The multiprocessing module runs the parallel_run function for each of the seeds in a subprocess and collects the coalescence times into the list coal_times. We then take the mean of this list and return it. The random seeds are simply the integers from 1 to 100. This is a perfectly legitimate way to choose seeds for a single example, since the random sequences for adjacent seeds are not correlated. If, however, we are doing lots of simulations with different parameter values or are distributing our simulations over several machines, it would be better to spread our choice of seeds out more evenly across the possible space. One way to do this is:
import random
seeds = [random.randint(1, 2**31 - 1) for j in range(100)]
There is no issue with using the Python random generator within your code, as the ercs C library generates random numbers independently of Python (using the mt19937 random number generator from the GNU Scientific Library).
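If the list of seeds itself needs to be reproducible, or must not contain repeats, one possible variation is sketched below. It draws distinct seeds from a private random.Random instance, so the global Python generator is left untouched. The function name make_seeds and the master_seed parameter are hypothetical names chosen for illustration.

```python
import random

def make_seeds(num_replicates, master_seed):
    """Return num_replicates distinct seeds in [1, 2**31 - 1], reproducibly."""
    # A private generator: the same master_seed always gives the same list.
    rng = random.Random(master_seed)
    # sample() draws without replacement, so no seed is repeated.
    return rng.sample(range(1, 2**31), num_replicates)

seeds = make_seeds(100, master_seed=42)
print(len(seeds), len(set(seeds)))  # 100 100
```

Recording master_seed alongside the results is then enough to regenerate every replicate's seed.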
Simulate the coalescent in the extinction recolonisation model under a flexible set of parameters.
The classes of event in a given simulation are specified by providing a list of Event Class instances in the ercs.Simulator.event_classes attribute. Two classes of event are currently supported: the Disc model and the Gaussian model. See [E08], [BEV10], [BEV12] and several other articles for details of the Disc model, and see [BKE10] and [BEV12] for details of the Gaussian model.
Class representing a coalescent simulator for the extinction/recolonisation model on a torus of the specified diameter.
The class provides a convenient interface to the low-level _ercs module, and contains instance variables for each of the simulation parameters with sensible defaults. These parameters should be set before calling ercs.Simulator.run().
The location of lineages at the beginning of the simulation. This must be a non-empty list of two-tuples describing locations within the 2D space defined by the torus.
Default value: None.
The event classes to simulate. This must be a list of ercs.EventClass instances. There must be at least one event class specified.
Default value: None.
The diameter of the torus we are simulating on. This defines the size of the space that lineages can move around in.
Default value: Specified at instantiation time.
The number of parents in each event. For a single locus simulation there must be at least one parent and for multi-locus simulations at least two.
Default value: 1 if the simulation is single locus, otherwise 2.
The list of inter-locus recombination probabilities; the length of this list also determines the number of loci for each individual. At an event, the probability of loci j and j + 1 descending from different parents is recombination_probabilities[j]. The number of loci in the simulation is therefore len(recombination_probabilities) + 1.
Default value: The empty list [] (so, we have a single locus simulation by default).
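To illustrate how a list of recombination probabilities relates to the number of loci, here is a rough sketch of assigning a parent to each locus at a single event. It assumes exactly two parents, so that "different parents" simply means switching to the other one. The function assign_parents is hypothetical and is not how the ercs C library is implemented.

```python
import random

def assign_parents(recombination_probabilities, rng):
    """Pick parent 0 or 1 for each locus at one event (two-parent assumption)."""
    parent = rng.randrange(2)  # parent of the first locus
    parents = [parent]
    for rho in recombination_probabilities:
        if rng.random() < rho:
            parent = 1 - parent  # adjacent loci descend from different parents
        parents.append(parent)
    return parents

rng = random.Random(1)
# Three probabilities describe the gaps between four loci.
print(assign_parents([0.1, 0.5, 0.1], rng))
```

With all probabilities equal to 0 every locus descends from the same parent, and with the empty list there is a single locus, matching the default.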
The maximum amount of time (in simulation units) that we simulate. If this is set to 0.0 the simulation continues until all loci have coalesced.
Default value: 0.0
The maximum amount of memory used for tracking lineages in MiB (i.e., 2^20 bytes). If the number of lineages we are tracking grows so much that we exceed this limit, the simulation aborts and raises an _ercs.LibraryError. This is only an approximate limit on the total amount of memory used by the simulation.
Default value: 32 MiB
The number of locations in a leaf node of the kdtree; must be a power of two, greater than 0. The kdtree_bucket_size is an advanced parameter that may be useful in tuning performance when very large numbers of lineages are involved. Larger values will result in less time and memory spent indexing the lineages, but more lineages will need to be tested to see if they are within the critical radius of the event. Note: changing this parameter affects the outcome of simulations! That is, if we change the value of the bucket size, we cannot expect the outcome of two simulations with the same random seed to be the same. The reason for this is that, although we are guaranteed to end up with the same set of lineages in an event in any case, the order in which they die may be different, pushing the simulation onto a different stochastic trajectory.
Default value: 1
The maximum number of insertions into the kdtree before a rebuild, or 0 if the tree is not to be rebuilt. This parameter is useful for tuning the performance of the simulation when we have large numbers of loci, particularly if we begin with a relatively small sample. In this case, as the number of lineages increases over time and they spread outwards to cover more and more of the torus, we need to rebuild the index periodically. If we begin with a large sample uniformly distributed around the space then this can safely be set to 0.
Default value: 0
Runs the coalescent simulation for the specified random seed, and returns the simulated history, (pi, tau). The history consists of a list of oriented forests (one for each locus) and their corresponding node times (one for each locus).
Parameter: random_seed (integer) – the value to initialise the random number generator
Returns: the simulated history of the sample, (pi, tau)
Return type: a tuple (pi, tau); pi is a list of lists of integers, and tau is a list of lists of doubles
Raises: _ercs.InputError when the input is not correctly formed
Raises: _ercs.LibraryError when the C library encounters an error
Returns the Euclidean distance between two points x and y on a 2D square torus with diameter L.
Parameters: x and y – the two points; L – the diameter of the torus
Return type: floating point value
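For readers who want to see the calculation spelled out, the following is a pure-Python sketch of a distance on a square torus: in each dimension we take the shorter of the two ways round, then combine the results as usual for a Euclidean distance. The function torus_distance_sketch is a hypothetical stand-in written for illustration; it is not the ercs implementation.

```python
import math

def torus_distance_sketch(x, y, L):
    """Euclidean distance between 2D points x and y on a square torus of diameter L."""
    total = 0.0
    for a, b in zip(x, y):
        d = abs(a - b) % L
        d = min(d, L - d)  # go the short way round the torus
        total += d * d
    return math.sqrt(total)

print(torus_distance_sketch((0, 1), (0, 9), 40))   # 8.0
print(torus_distance_sketch((0, 1), (0, 39), 40))  # 2.0
```

The second call shows the wrap-around: the points are 38 units apart going one way, but only 2 units apart going the other.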
Class that allows us to compute the nearest common ancestor of arbitrary nodes in an oriented forest.
This is an implementation of Schieber and Vishkin’s nearest common ancestor algorithm from TAOCP volume 4A, pp. 164–167 [K11]. It preprocesses the input tree into a sideways heap in O(n) time and processes queries for the nearest common ancestor of an arbitrary pair of nodes in O(1) time.
Parameter: oriented_forest (list of integers) – the input oriented forest
Returns the most recent common ancestor of the nodes x and y, or 0 if the nodes belong to different trees.
Parameters: x and y – the nodes whose most recent common ancestor is required
Returns: the MRCA of nodes x and y
Return type: non-negative integer
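To make the meaning of the value returned by get_mrca concrete, here is a naive sketch that gives the same kind of answer by walking parent pointers. It is deliberately not the Schieber and Vishkin algorithm: each query costs time proportional to the depth of the tree rather than O(1). The function naive_mrca is hypothetical and is not part of ercs.

```python
def naive_mrca(forest, x, y):
    """Return the most recent common ancestor of x and y, or 0 if there is none."""
    # Collect x and all of its ancestors up to the root.
    ancestors = set()
    while x != 0:
        ancestors.add(x)
        x = forest[x]
    # Walk up from y; the first node also above x is the MRCA.
    while y != 0:
        if y in ancestors:
            return y
        y = forest[y]
    return 0  # x and y are in different trees

pi = [-1, 4, 4, 5, 5, 0]
print(naive_mrca(pi, 1, 2))  # 4
print(naive_mrca(pi, 1, 3))  # 5

forest = [-1, 0, 15, 0, 12, 12, 13, 11, 13, 11, 16, 16, 14, 14, 15, 0, 0]
print(naive_mrca(forest, 1, 3))  # 0
```

A walk like this is adequate for a handful of queries on small trees; the preprocessing approach pays off when many pairs are queried on the same forest.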
The ercs module delegates the majority of its work to the low-level _ercs extension module, which is written in C. It is not recommended to call this module directly: the ercs module provides all of the functionality with a much more straightforward interface. In the interests of completeness, however, the low-level module is documented here.
In the _ercs module, event classes are specified by dictionaries of key-value pairs describing the rate at which events of a particular class happen, the type of event and the parameters unique to each event class. Each dictionary must have two fields: rate and type. The rate field specifies the rate at which this class of events happens and is a float. The type field specifies the type of events. The supported event classes are:
Allocates an ercs object from the C library, calls the simulate function and returns the resulting genealogy. All arguments must be specified and be in the correct order.
Returns: the simulated history of the sample, (pi, tau)
Return type: a tuple (pi, tau); pi is a list of lists of integers, and tau is a list of lists of doubles
Raises: InputError when the input is not correctly formed
Raises: LibraryError when the C library encounters an error
[E08] A. Etheridge. Drift, draft and structure: some mathematical models of evolution, Banach Center Publications 80, pp 121–144, 2008.
[BEV10] N. H. Barton, A. M. Etheridge and A. Véber. A new model for evolution in a spatial continuum, Electronic Journal of Probability 15:7, 2010.
[BKE10] N. H. Barton, J. Kelleher and A. M. Etheridge. A new model for extinction and recolonisation in two dimensions: quantifying phylogeography, Evolution, 64(9), pp 2701–2715, 2010.
[BEV12] N. H. Barton, A. M. Etheridge and A. Véber. Modelling Evolution in a Spatial continuum, J. Stat. Mech., to appear, 2012.
[K11] D. E. Knuth, Combinatorial Algorithms, Part 1; Volume 4A of The Art of Computer Programming, 2011.