Pipeline

Note

This section describes how a standard phyddle pipeline analysis is configured and how settings determine its behavior. Visit Configuration to learn how to assign settings for a phyddle analysis. Visit Glossary to learn more about how phyddle defines different terms.

[Figure: overview of the five-step phyddle pipeline (_images/phyddle_pipeline.png)]

A phyddle pipeline analysis has five steps: Simulate, Format, Train, Estimate, and Plot. Standard analyses run all steps in order for a single batch of settings. That said, steps can be run multiple times, under different settings and in different orders, which is useful for exploratory and advanced analyses. Visit Tricks to learn how to use phyddle to its fullest potential.

All pipeline steps create output files. All pipeline steps (except Simulate) also require input files produced by at least one other pipeline step. A full phyddle analysis for a project automatically generates the input files for downstream pipeline steps and stores them in a predictable project directory.

Users may also elect to use phyddle for only some steps in their analysis and produce files for the remaining steps by other means. For example, Format expects to format and combine large numbers of simulated datasets into tensor formats that can be used for supervised learning with neural networks. These simulated files can be generated either through phyddle with the Simulate step or outside of phyddle entirely.

Below is the project directory structure that a standard phyddle analysis would use. In general, we assume the project name is example:

Simulate
- input:   None
- output:  workspace/simulate/example  # simulated datasets

Format
- input:   workspace/simulate/example  # simulated datasets
- output:  workspace/format/example    # formatted datasets

Train
- input:   workspace/format/example    # simulated training dataset
- output:  workspace/train/example     # trained network + results

Estimate
- input:   workspace/format/example    # simulated test dataset
           workspace/train/example     # trained network
           workspace/estimate/example  # new (empirical) dataset
- output:  workspace/estimate/example  # new (empirical) estimates

Plot
- input:   workspace/format/example    # simulated training dataset
           workspace/train/example     # trained network and output
           workspace/estimate/example  # new (empirical) dataset & estimates
- output:  workspace/plot/example      # analysis figures

Simulate

Simulate instructs phyddle to simulate your training dataset. Any simulator that can be called from the command line can be used to generate training datasets with phyddle. This allows researchers to use their favorite simulator with phyddle for phylogenetic modeling tasks.

As a worked example, suppose we have an R script called sim_one.R containing the following code:

# load library
library(ape)

# gather arguments
args = commandArgs(trailingOnly = TRUE)

# simulated file names
tmp_fn = args[1]
phy_fn = paste0(tmp_fn, ".tre")
dat_fn = paste0(tmp_fn, ".dat.nex")
lbl_fn = paste0(tmp_fn, ".param_row.csv")

# simulation parameters
birth = rexp(1)
death = birth * runif(1)
rate = rexp(1)
max_time = runif(1,0,10)

# simulate training data
phy = rbdtree(birth=birth, death=death, Tmax=max_time)
dat = rTraitDisc(phy, model="ER", k=2, rate=rate, states=c("0","1"))
dat = setNames(as.integer(dat) - 1, names(dat))  # re-index states from 1/2 to 0/1

# collect training labels
lbl_vec = c(birth=birth, death=death, rate=rate)
lbl = data.frame(t(lbl_vec))

# save training example
write.tree(phy, file=phy_fn)
write.nexus.data(as.list(dat), file=dat_fn, format="standard", datablock=TRUE)
write.csv(lbl, file=lbl_fn, row.names=FALSE, quote=FALSE)

# done!
quit()

This script has a few important features. First, the simulator is entirely responsible for simulating the dataset. Second, the script assumes it will be provided a runtime argument (args[1]) that it uses to generate filenames for the training example. Third, the Newick string is written to a .tre file, the character matrix to a .dat.nex Nexus file, and the training labels to a comma-separated .csv file.

Now that we understand the script, we need to configure phyddle to call it properly. This is done by setting the sim_command argument equal to a command string of the form MY_COMMAND [MY_COMMAND_ARGUMENTS]. During simulation, phyddle executes this command string against different filepath locations. More specifically, phyddle will execute the command MY_COMMAND [MY_COMMAND_ARGUMENTS] SIM_PREFIX, where SIM_PREFIX contains the beginning of the filepath for an individual simulated dataset. As part of the Simulate step, phyddle executes the command string against a range of SIM_PREFIX values, which generates the complete simulated dataset of replicated training examples.

The correct sim_command is:

'sim_command' : 'Rscript sim_one.R'

Assuming sim_dir = ../workspace/simulate and proj = my_project, phyddle will execute the following commands during simulation:

Rscript sim_one.R ../workspace/simulate/my_project/sim.0
Rscript sim_one.R ../workspace/simulate/my_project/sim.1
Rscript sim_one.R ../workspace/simulate/my_project/sim.2
...

for every replication index between start_idx and end_idx. In fact, executing Rscript sim_one.R ../workspace/simulate/my_project/sim.0 from the terminal is an excellent way to validate that your custom simulator is compatible with the phyddle requirements.
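
Putting the Simulate settings together, a configuration sketch might look like the following (setting names are those used on this page; the dictionary name my_config and all values are illustrative, not defaults):

# illustrative Simulate settings; my_config is a hypothetical name
my_config = {
    'sim_command' : 'Rscript sim_one.R',      # command run once per replicate
    'sim_dir'     : '../workspace/simulate',  # where simulated datasets are written
    'proj'        : 'my_project',             # project subdirectory name
    'start_idx'   : 0,                        # first replicate index
    'end_idx'     : 1000,                     # last replicate index
}

Given the suffixes hard-coded in sim_one.R, running the command for SIM_PREFIX sim.0 should leave behind:

../workspace/simulate/my_project/sim.0.tre
../workspace/simulate/my_project/sim.0.dat.nex
../workspace/simulate/my_project/sim.0.param_row.csv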

Format

Format converts the simulated data for a project into a tensor format that phyddle uses to train neural networks in the Train step. Format performs two main tasks:

  1. Encode all individual raw datasets in the simulate project directory into individual tensor representations

  2. Combine all the individual tensors into larger, singular tensors to act as the training dataset

For each simulated example, Format encodes the raw data into two input tensors and one output tensor:

  • One input tensor is the phylogenetic-state tensor. Loosely speaking, these tensors contain information about terminal taxa across columns, and information about relevant branch lengths and states per taxon across rows. The phylogenetic-state tensors used by phyddle are based on the compact bijective ladderized vector (CBLV) format of Voznica et al. (2022) and the compact diversity-reordered vector (CDV) format of Lambert et al. (2022), each extended to incorporate tip states (CBLV+S and CDV+S) using the technique described in Thompson et al. (2022).

  • The second input is the auxiliary data tensor. This tensor contains summary statistics for the phylogeny and character data matrix, along with any “known” parameters of the data-generating process.

  • The output tensor reports labels, which are generally unknown data-generating parameters to be estimated using the neural network. Depending on the estimation task, all or only some model parameters might be treated as labels for training and estimation.

For most purposes within phyddle, it is safe to think of a tensor as an n-dimensional array, such as a 1-d vector or a 2-d matrix. The tensor encoding ensures training examples share a standard shape (e.g. numbers of rows and columns), which helps the neural network detect predictable data patterns. Learn more about the formats of phyddle tensors on the Tensor Formats page.

During tensor encoding, Format processes the tree, data matrix, and model parameters for each replicate. This is done in parallel when use_parallel is set to True. Simulated data are encoded using the CBLV+S format if tree_encode is set to 'serial'. If tree_encode is set to 'extant', then all non-extant taxa are pruned, the pruned tree is saved as pruned.tre, and the result is encoded using CDV+S. Each tree is then encoded into a phylogenetic-state tensor with a maximum of tree_width sampled taxa; trees that contain more taxa are downsampled to tree_width taxa. The number of taxa in each original dataset is recorded in the summary statistics, regardless of its size.

Once the phylogenetic-state and auxiliary data tensors are created, a few settings control what else Format does. If save_phyenc_csv is set, then individual csv files are saved for each dataset, which is especially useful for formatting new empirical datasets into an accepted phyddle format. The param_est setting identifies which parameters in the labels tensor to treat as downstream estimation targets, while the param_data setting identifies which parameters to treat as “known” auxiliary data. Lastly, Format sets aside a proportion prop_test of examples as a test dataset and gathers all remaining examples into the training dataset.
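
Collected into one place, a Format configuration sketch might look like this (setting names are from this page; the values, and the list format for param_est and param_data, are illustrative assumptions):

# illustrative Format settings; values are examples, not defaults
my_config = {
    'fmt_dir'         : '../workspace/format',       # output location for tensors
    'tree_encode'     : 'serial',                    # 'serial' (CBLV+S) or 'extant' (CDV+S)
    'tree_width'      : 200,                         # maximum taxa per tensor
    'use_parallel'    : True,                        # encode replicates in parallel
    'save_phyenc_csv' : True,                        # also save per-dataset csv encodings
    'param_est'       : ['birth', 'death', 'rate'],  # labels to estimate (from sim_one.R)
    'param_data'      : [],                          # "known" params used as auxiliary data
    'prop_test'       : 0.05,                        # fraction held out as test examples
    'tensor_format'   : 'hdf5',                      # 'hdf5' or 'csv'
}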

Formatted tensors are then saved to disk, either in simple comma-separated value format or in compressed HDF5 format. For example, suppose we set fmt_dir to 'format', proj to 'example', tree_encode to 'serial', and tree_width to 200 (recorded as the nt200 infix in the filenames below). Setting tensor_format to 'hdf5' produces:

workspace/format/example/test.nt200.hdf5
workspace/format/example/train.nt200.hdf5

or if tensor_format == 'csv':

workspace/format/example/test.nt200.aux_data.csv
workspace/format/example/test.nt200.labels.csv
workspace/format/example/test.nt200.phy_data.csv
workspace/format/example/train.nt200.aux_data.csv
workspace/format/example/train.nt200.labels.csv
workspace/format/example/train.nt200.phy_data.csv

These files can then be processed by the Train step.
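
To sanity-check a formatted dataset before training, you can open the HDF5 file directly. Here is a minimal sketch with h5py, assuming the file stores one dataset per tensor with key names mirroring the csv outputs above (an assumption, so the sketch simply prints whatever keys it finds):

import h5py

# inspect a formatted training dataset
with h5py.File('workspace/format/example/train.nt200.hdf5', 'r') as f:
    for key in f.keys():
        print(key, f[key].shape)  # expected keys: phy_data, aux_data, labels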

Train

Train builds a neural network and trains it to make model-based estimates using the training example tensors compiled by the Format step.

The Train step performs six main tasks:

  1. Load the input training example tensors.

  2. Shuffle the examples and split them into training, validation, and calibration subsets.

  3. Build and configure the neural network.

  4. Use supervised learning to train the neural network to make accurate estimates (predictions).

  5. Record network training performance to file.

  6. Save the trained network to file.

When the training dataset is read in, its examples are randomly shuffled by replicate index. Some examples are then set aside for a validation dataset (prop_val) and others for a calibration dataset (prop_cal). Note that some examples were already set aside for the test dataset during the Format step (prop_test). All remaining examples are used for training. A network must be trained against a particular tree_width size (see above).
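
As a concrete example of how the splits interact, here is a small arithmetic sketch. It assumes prop_val and prop_cal are taken from the examples that remain after the Format test split (an assumption made for illustration; the proportions themselves are example values):

# hypothetical split sizes for 10,000 simulated examples
num_total = 10000
num_test  = int(0.05 * num_total)          # prop_test = 0.05, removed during Format
remaining = num_total - num_test           # examples available to Train
num_val   = int(0.05 * remaining)          # prop_val = 0.05, validation subset
num_cal   = int(0.20 * remaining)          # prop_cal = 0.20, calibration subset
num_train = remaining - num_val - num_cal  # everything else trains the network
print(num_test, num_val, num_cal, num_train)  # 500 475 1900 7125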

phyddle uses TensorFlow and Keras to build and train the network. The phylogenetic-state tensor is processed by convolutional and pooling layers, while the auxiliary data tensor is processed by dense layers. All input layers are concatenated, then pushed into three branches terminating in output layers that produce point estimates and lower and upper estimation intervals. Here is a simplified schematic of the network architecture:

                          ,--> Conv1D-normal + Pool --.
    Phylo. Data Tensor --+---> Conv1D-stride + Pool ---\                          ,--> Point estimate
                          `--> Conv1D-dilate + Pool ----+--> Concat + Output(s)--+---> Lower quantile
                                                       /                          `--> Upper quantile
    Aux. Data Tensor   ------> Dense -----------------'

Parameter point estimates are trained with a standard loss function (e.g. loss set to 'mse'; any TensorFlow-supported string or function), while lower and upper quantile estimates are trained with a pinball loss function (hard-coded).
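
For reference, the pinball (quantile) loss penalizes over- and under-estimates asymmetrically, so minimizing it pushes a network output toward a chosen quantile of the label distribution. A minimal NumPy sketch of the standard definition (not phyddle's internal implementation):

import numpy as np

def pinball_loss(y_true, y_pred, q):
    """Mean pinball loss at quantile level q (e.g. q=0.025 targets a lower bound)."""
    err = y_true - y_pred
    return np.mean(np.maximum(q * err, (q - 1.0) * err))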

Calibrated prediction intervals (CPIs) are estimated using the conformalized quantile regression technique of Romano et al. (2019). CPIs target a particular estimation interval; e.g. set cpi_coverage to 0.95 so that 95% of test estimates are expected to contain the true simulating value. More accurate CPIs can be obtained using two-sided conformalized quantile regression by setting cpi_asymmetric to True, though this often requires a larger number of calibration examples, determined through prop_cal.
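
The core idea behind conformalized quantile regression is easy to sketch: score how far each calibration truth falls outside its raw quantile interval, then widen all intervals by an empirical quantile of those scores. A symmetric-variant sketch in NumPy (illustrative only, not phyddle's code):

import numpy as np

def conformalize(y_cal, lo_cal, hi_cal, lo_new, hi_new, coverage=0.95):
    """Turn raw quantile estimates into calibrated prediction intervals (CPIs)."""
    # conformity score: distance each calibration truth lies outside its interval
    scores = np.maximum(lo_cal - y_cal, y_cal - hi_cal)
    # finite-sample-corrected empirical quantile of the scores
    n = len(y_cal)
    level = min(np.ceil((n + 1) * coverage) / n, 1.0)
    q = np.quantile(scores, level)
    # widen (or shrink, if q < 0) the intervals for new estimates
    return lo_new - q, hi_new + q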

The network is trained iteratively for num_epoch training cycles using batch stochastic gradient descent, with batch sizes given by batch_size. Different optimizers can be used to update network weight and bias parameters (e.g. optimizer set to 'adam'; any TensorFlow-supported string or function). Network performance is also evaluated against the validation examples set aside with prop_val, which are not used for minimizing the loss function.

Training is automatically parallelized across CPUs and GPUs, depending on how TensorFlow was installed and on the system hardware. Output files are stored in the directory assigned to trn_dir, in the subdirectory named by proj.
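
Gathered together, a Train configuration sketch might look like this (setting names are from this page; values are illustrative, not defaults):

# illustrative Train settings
my_config = {
    'trn_dir'        : '../workspace/train',  # output location for the trained network
    'num_epoch'      : 20,                    # number of training cycles
    'batch_size'     : 128,                   # batch size for stochastic gradient descent
    'loss'           : 'mse',                 # point-estimate loss (TensorFlow string/function)
    'optimizer'      : 'adam',                # optimizer (TensorFlow string/function)
    'prop_val'       : 0.05,                  # proportion set aside for validation
    'prop_cal'       : 0.20,                  # proportion set aside for CPI calibration
    'cpi_coverage'   : 0.95,                  # target CPI coverage
    'cpi_asymmetric' : True,                  # two-sided conformalized quantile regression
}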

Estimate

Estimate loads the simulated test dataset stored in <fmt_dir>/<fmt_proj> in the format indicated by tensor_format. Estimate also loads a new dataset stored in <est_dir>/<est_proj> with filenames <est_prefix>.tre and <est_prefix>.dat.nex, if such a dataset exists.

This step then loads a network pretrained for a given tree_width and uses it to estimate parameter values and calibrated prediction intervals (CPIs) for both the new (empirical) dataset and the test (simulated) dataset. Estimates are then stored as separate datasets in the original <est_dir>/<est_proj> directory.
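
In configuration terms, this might look like the sketch below, where the prefix value is hypothetical (with these settings, Estimate would look for new.1.tre and new.1.dat.nex):

# illustrative Estimate settings; the prefix is a hypothetical example
my_config = {
    'est_dir'    : '../workspace/estimate',  # location of the new (empirical) dataset
    'est_prefix' : 'new.1',                  # expects new.1.tre and new.1.dat.nex
}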

Plot

Plot collects all results from the Format, Train, and Estimate steps to compile a set of useful figures, listed below. When results from Estimate are available, this step integrates them into the other figures to contextualize where the input dataset and estimated labels fall with respect to the training dataset.

Plots are stored within <plot_dir> in the <plot_proj> subdirectory. Colors for plot elements can be modified with plot_train_color, plot_label_color, plot_test_color, plot_val_color, plot_aux_color, and plot_est_color using hex codes or common color names supported by Matplotlib.
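
For example, the color settings could be assigned as follows (the color choices are illustrative):

# illustrative Plot settings; any Matplotlib color name or hex code works
my_config = {
    'plot_dir'         : '../workspace/plot',
    'plot_train_color' : 'blue',      # training examples
    'plot_test_color'  : 'purple',    # test examples
    'plot_val_color'   : 'red',       # validation examples
    'plot_label_color' : 'orange',    # label densities
    'plot_aux_color'   : 'green',     # auxiliary data densities
    'plot_est_color'   : 'black',     # new (empirical) estimates
}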

  • summary.pdf - all figures combined in a single file

  • density_aux_data.pdf - densities of all values in the auxiliary dataset; red line for the estimated dataset

  • density_label.pdf - densities of all values in the label dataset; red line for the estimated dataset

  • pca_contour_aux_data.pdf - pairwise PCA of all values in the auxiliary dataset; red dot for the estimated dataset

  • pca_contour_label.pdf - pairwise PCA of all values in the label dataset; red dot for the estimated dataset

  • train_history.pdf - loss performance across epochs for training/validation datasets for the entire network

  • train_history_<stat_name>.pdf - loss, accuracy, and error performance across epochs for training/validation datasets for particular statistics (point est., lower CPI, upper CPI)

  • estimate_train_<label_name>.pdf - point estimates and calibrated estimation intervals for training dataset

  • estimate_test_<label_name>.pdf - point estimates and calibrated estimation intervals for test dataset

  • estimate_new.pdf - simple plot of point estimates and calibrated estimation intervals for the new (empirical) dataset

  • network_architecture.pdf - visualization of the TensorFlow network architecture