Tutorial

Sampling ARGs

As a simple example, we will first simulate sample data with msprime. We will then run arginfer on the simulated dataset.

The following code simulates a tree sequence and the sequences for a sample size of 10 and sequence length of 1e5.

import msprime
import os
ts_full = msprime.simulate(sample_size=10, Ne=5000,
                                        length=1e5,
                                        mutation_rate=1e-8,
                                        recombination_rate=1e-8,
                                        record_full_arg= True,
                                        random_seed=2)
os.makedirs(os.getcwd()+"/out")
ts_full.dump(os.getcwd()+"/out/"+"ts_full.args")

The output of this code is a tree sequence stored in "out/" directory under the name of ts_full.args.

Next, the following command can be used to run 200 MCMC iterations with burn-in 5 and retaining every 10 samples (thinning intervals = 10). Also sample_size = n = 10 is the number of sequences each seq_length = L = 1e5 in length evolving in a population of effective size Ne = 5000, with mutation rate 1e-8 mutations/generation/site and recombination rate 1e-8 recombinations/generation/site.

import arginfer
arginfer.infer_sim(
    ts_full = "out/ts_full.args",     # path to simulated ts
    sample_size =10,            # sample size
    iteration= 200,              # number of mcmc iterations
    thin= 10,                    # thinning interval, retaining everry kth sample
    burn=5,                     # burn-in period to discard
    Ne =5000,                   # effective population size
    seq_length= 1e5,            # sequence length in bases
    mutation_rate=1e-8,         # mutation rate per site per generation
    recombination_rate=1e-8,    # recombination rate per site per generation
    outpath = os.getcwd()+"/output",   # output path
    plot = True)                    # plot traces

or equivalently in terminal:

arginfer infer --tsfull "out/ts_full.args" \
    -I 200 --thin 10 -b 5 \
    -n 10 -L 1e5 --Ne 5000 \
    -r 1e-8 -mu 1e-8 \
    -O output \
    --plot

The output of the above command is as follows:

  • summary.h5: A summary of some ARG properties recorded in a pandas dataframe with columns:

pd.DataFrame(columns=('likelihood', 'prior', "posterior",
                                         'ancestral recomb', 'non ancestral recomb',
                                            'branch length'))
  • .arg file: The sampled ARGs, which are pickled ATS objects.

    • See here for more information on how manipulate these files (TODO).

  • arginfer*.pdf: if plot=True, this pdf file will be generated which contains trace plots for
    the log(posterior), ARG total branch length, number of ancestral recombinations,
    and number of non-ancestral recombinations.

Working with arginfer outputs