Configuration

Note

This section describes how to configure settings for a phyddle analysis. Visit Pipeline to learn more about how settings determine the behavior of a phyddle analysis. Visit Glossary to learn more about how phyddle defines different terms.

There are two ways to configure the settings of a phyddle analysis: through a config file or through the command line. Command line settings take precedence over config file settings.

By file

The phyddle config file is a Python script that defines a dictionary of analysis arguments (args), which configure how phyddle pipeline steps behave. Because it is a Python script, you can write code within the config file to specify your analysis, if you find that helpful. The example below groups settings into blocks based on which pipeline step first needs a given setting. However, any setting might be used by several pipeline steps, so all settings are stored in a single dictionary called args, which is then used by all pipeline steps. Settings configured by file can be adjusted through the command line, if desired.
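Because the config file is ordinary Python, values can be computed before they are stored in args. The fragment below is a minimal sketch and is not taken from the phyddle distribution: the variables num_rep and workspace are hypothetical helpers, while the dictionary keys are ones that appear in the default config file.

```python
# Hypothetical config fragment: plain Python computes values
# before they are stored in the args dictionary.
num_rep   = 5000              # hypothetical: number of simulated replicates
workspace = '../workspace'    # hypothetical: shared parent directory

args = {
    'proj'      : 'my_project',
    'start_idx' : 0,
    'end_idx'   : num_rep,                    # reuse the variable defined above
    'sim_dir'   : f'{workspace}/simulate',    # build paths from one base directory
    'fmt_dir'   : f'{workspace}/format',
    'trn_dir'   : f'{workspace}/train',
}
```

Defining the workspace path once means that moving the analysis to a different location requires changing a single line.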

Note

By default, phyddle assumes you want to use the config file called config.py. Use a different config file by calling, e.g., ./run_phyddle.py --cfg my_other_config.py

#==============================================================================#
# Config:       Default phyddle config file                                    #
# Authors:      Michael Landis and Ammon Thompson                              #
# Date:         230804                                                         #
# Description:  Simple birth-death and equal-rates CTMC model in R using ape   #
#==============================================================================#

args = {
    #-------------------------------#
    # Basic                         #
    #-------------------------------#
    'cfg'                : 'config.py',          # Config file name
    'proj'               : 'my_project',         # Project name(s) for pipeline step(s)
    'step'               : 'SFTEP',              # Pipeline step(s) defined with (S)imulate, (F)ormat,
                                                 #     (T)rain, (E)stimate, (P)lot, or (A)ll
    'verbose'            : 'T',                  # Verbose output to screen?
    'force'              : None,                 # Arguments override config file settings
    'make_cfg'           : None,                 # Write default config file to 'config_default.py'?

    #-------------------------------#
    # Analysis                      #
    #-------------------------------#
    'use_parallel'       : 'T',                  # Use parallelization? (recommended)
    'num_proc'           : -2,                   # Number of cores for multiprocessing (-N for all but N)

    #-------------------------------#
    # Workspace                     #
    #-------------------------------#
    'sim_dir'            : '../workspace/simulate',  # Directory for raw simulated data
    'fmt_dir'            : '../workspace/format',    # Directory for tensor-formatted simulated data
    'trn_dir'            : '../workspace/train',     # Directory for trained networks and training output
    'est_dir'            : '../workspace/estimate',  # Directory for new datasets and estimates
    'plt_dir'            : '../workspace/plot',      # Directory for plotted results
    'log_dir'            : '../workspace/log',       # Directory for logs of analysis metadata

    #-------------------------------#
    # Simulate                      #
    #-------------------------------#
    'sim_command'        : None,                 # Simulation command to run single job (see documentation)
    'sim_logging'        : 'clean',              # Simulation logging style
    'start_idx'          : 0,                    # Start replicate index for simulated training dataset
    'end_idx'            : 1000,                 # End replicate index for simulated training dataset
    'sim_more'           : 0,                    # Add more simulations with auto-generated indices
    'sim_batch_size'     : 1,                    # Number of replicates per simulation command

    #-------------------------------#
    # Format                        #
    #-------------------------------#
    'encode_all_sim'     : 'T',                  # Encode all simulated replicates into tensor?
    'num_char'           : None,                 # Number of characters
    'num_states'         : None,                 # Number of states per character
    'min_num_taxa'       : 10,                   # Minimum number of taxa allowed when formatting
    'max_num_taxa'       : 1000,                 # Maximum number of taxa allowed when formatting
    'downsample_taxa'    : 'uniform',            # Downsampling strategy taxon count
    'tree_width'         : 500,                  # Width of phylo-state tensor
    'tree_encode'        : 'extant',             # Encoding strategy for tree
    'brlen_encode'       : 'height_brlen',       # Encoding strategy for branch lengths
    'char_encode'        : 'one_hot',            # Encoding strategy for character data
    'param_est'          : None,                 # Model parameters to estimate
    'param_data'         : None,                 # Model parameters treated as data
    'char_format'        : 'nexus',              # File format for character data
    'tensor_format'      : 'hdf5',               # File format for training example tensors
    'save_phyenc_csv'    : 'F',                  # Save encoded phylogenetic tensor encoding to csv?

    #-------------------------------#
    # Train                         #
    #-------------------------------#
    'trn_objective'      : 'param_est',          # Objective of training procedure
    'num_epochs'         : 20,                   # Number of training epochs
    'trn_batch_size'     : 128,                  # Training batch sizes
    'prop_test'          : 0.05,                 # Proportion of data used as test examples (assess network performance)
    'prop_val'           : 0.05,                 # Proportion of data used as validation examples (diagnose overtraining)
    'prop_cal'           : 0.2,                  # Proportion of data used as calibration examples (calibrate CPIs)
    'cpi_coverage'       : 0.95,                 # Expected coverage percent for calibrated prediction intervals (CPIs)
    'cpi_asymmetric'     : 'T',                  # Use asymmetric (True) or symmetric (False) adjustments for CPIs?
    'loss'               : 'mse',                # Loss function for optimization
    'optimizer'          : 'adam',               # Method used for optimizing neural network
    'metrics'            : ['mae', 'acc'],       # Recorded training metrics

    #-------------------------------#
    # Estimate                      #
    #-------------------------------#
    'est_prefix'         : None,                 # Predict results for this dataset

    #-------------------------------#
    # Plot                          #
    #-------------------------------#
    'plot_train_color'   : 'blue',               # Plotting color for training data elements
    'plot_label_color'   : 'purple',             # Plotting color for training label elements
    'plot_test_color'    : 'red',                # Plotting color for test data elements
    'plot_val_color'     : 'green',              # Plotting color for validation data elements
    'plot_aux_color'     : 'orange',             # Plotting color for auxiliary data elements
    'plot_est_color'     : 'black',              # Plotting color for new estimation elements

    }
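The comment for num_proc states that a negative value -N means "all but N" cores. One plausible reading of that rule is sketched below. The function resolve_num_proc is hypothetical and is not phyddle's own code; keeping at least one core, and treating zero as "all cores", are added assumptions.

```python
import os

def resolve_num_proc(num_proc, total_cores=None):
    """Hypothetical helper: turn a num_proc setting into a core count."""
    if total_cores is None:
        # os.cpu_count() may return None, so fall back to a single core
        total_cores = os.cpu_count() or 1
    if num_proc > 0:
        # A positive value is used as given
        return num_proc
    # A value of -N means "all but N"; assumption: never go below one core
    return max(1, total_cores + num_proc)

print(resolve_num_proc(-2, total_cores=8))   # 6
print(resolve_num_proc(4, total_cores=8))    # 4
```

Under this reading, the default value of -2 leaves two cores free for other work on the machine.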

Via command line

Settings applied through a config file can be overridden by setting options when running phyddle from the command line. Setting names are identical in the config file and on the command line. Using command line options makes it easy to adjust the behavior of pipeline steps without editing the config file. List all settings that can be adjusted from the command line using the --help option:

    $ ./run_phyddle.py --help

usage: run_phyddle.py [-h] [-c] [-p] [-s] [-v] [-f] [--make_cfg] [--use_parallel] [--num_proc] [--sim_dir]
                  [--fmt_dir] [--trn_dir] [--est_dir] [--plt_dir] [--log_dir] [--sim_command] [--sim_logging]
                  [--start_idx] [--end_idx] [--sim_more] [--sim_batch_size] [--encode_all_sim] [--num_char]
                  [--num_states] [--min_num_taxa] [--max_num_taxa] [--downsample_taxa] [--tree_width]
                  [--tree_encode] [--brlen_encode] [--char_encode] [--param_est] [--param_data]
                  [--char_format] [--tensor_format] [--save_phyenc_csv] [--trn_objective] [--num_epochs]
                  [--trn_batch_size] [--prop_test] [--prop_val] [--prop_cal] [--cpi_coverage]
                  [--cpi_asymmetric] [--loss] [--optimizer] [--metrics] [--est_prefix] [--plot_train_color]
                  [--plot_label_color] [--plot_test_color] [--plot_val_color] [--plot_aux_color]
                  [--plot_est_color]

phyddle pipeline config

options:
-h, --help           show this help message and exit
-c , --cfg           Config file name
-p , --proj          Project name(s) for pipeline step(s)
-s , --step          Pipeline step(s) defined with (S)imulate, (F)ormat, (T)rain, (E)stimate, (P)lot, or (A)ll
-v , --verbose       Verbose output to screen?
-f, --force          Arguments override config file settings
--make_cfg           Write default config file to 'config_default.py'?'
--use_parallel       Use parallelization? (recommended)
--num_proc           Number of cores for multiprocessing (-N for all but N)
--sim_dir            Directory for raw simulated data
--fmt_dir            Directory for tensor-formatted simulated data
--trn_dir            Directory for trained networks and training output
--est_dir            Directory for new datasets and estimates
--plt_dir            Directory for plotted results
--log_dir            Directory for logs of analysis metadata
--sim_command        Simulation command to run single job (see documentation)
--sim_logging        Simulation logging style
--start_idx          Start replicate index for simulated training dataset
--end_idx            End replicate index for simulated training dataset
--sim_more           Add more simulations with auto-generated indices
--sim_batch_size     Number of replicates per simulation command
--encode_all_sim     Encode all simulated replicates into tensor?
--num_char           Number of characters
--num_states         Number of states per character
--min_num_taxa       Minimum number of taxa allowed when formatting
--max_num_taxa       Maximum number of taxa allowed when formatting
--downsample_taxa    Downsampling strategy taxon count
--tree_width         Width of phylo-state tensor
--tree_encode        Encoding strategy for tree
--brlen_encode       Encoding strategy for branch lengths
--char_encode        Encoding strategy for character data
--param_est          Model parameters to estimate
--param_data         Model parameters treated as data
--char_format        File format for character data
--tensor_format      File format for training example tensors
--save_phyenc_csv    Save encoded phylogenetic tensor encoding to csv?
--trn_objective      Objective of training procedure
--num_epochs         Number of training epochs
--trn_batch_size     Training batch sizes
--prop_test          Proportion of data used as test examples (assess trained network performance)
--prop_val           Proportion of data used as validation examples (diagnose network overtraining)
--prop_cal           Proportion of data used as calibration examples (calibrate CPIs)
--cpi_coverage       Expected coverage percent for calibrated prediction intervals (CPIs)
--cpi_asymmetric     Use asymmetric (True) or symmetric (False) adjustments for CPIs?
--loss               Loss function for optimization
--optimizer          Method used for optimizing neural network
--metrics            Recorded training metrics
--est_prefix         Predict results for this dataset
--plot_train_color   Plotting color for training data elements
--plot_label_color   Plotting color for training label elements
--plot_test_color    Plotting color for test data elements
--plot_val_color     Plotting color for validation data elements
--plot_aux_color     Plotting color for auxiliary data elements
--plot_est_color     Plotting color for new estimation elements
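The precedence rule described in this section, where command line options override config file settings, can be sketched as a dictionary merge. The helper merge_settings is hypothetical and only illustrates the idea; it declares three of the options listed above and does not reproduce how phyddle itself parses arguments.

```python
import argparse

def merge_settings(file_args, cli_argv):
    """Hypothetical helper: overlay command line options onto config file settings."""
    parser = argparse.ArgumentParser(description='precedence sketch')
    # Only three options are declared to keep the sketch short
    parser.add_argument('-p', '--proj', default=None)
    parser.add_argument('-s', '--step', default=None)
    parser.add_argument('--num_epochs', type=int, default=None)
    cli_args = vars(parser.parse_args(cli_argv))

    merged = dict(file_args)          # start from the config file settings
    for key, value in cli_args.items():
        if value is not None:         # only options actually given override
            merged[key] = value
    return merged

file_args = {'proj': 'my_project', 'step': 'SFTEP', 'num_epochs': 20}
merged = merge_settings(file_args, ['--step', 'TE', '--num_epochs', '40'])
print(merged)   # {'proj': 'my_project', 'step': 'TE', 'num_epochs': 40}
```

Options left unset on the command line keep their config file values, which is why proj is unchanged in the result.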

Table summary

This section summarizes available settings in phyddle. The Setting column is the exact name of the string that appears in the configuration file and command-line argument list. The Step(s) column identifies all steps that use the setting: [S]imulate, [F]ormat, [T]rain, [E]stimate, and [P]lot. The Type column is the Python variable type expected for the setting. The Description column gives a brief description of what the setting does. Visit Pipeline to learn more about how phyddle settings impact different pipeline analysis steps.

phyddle settings

Setting            Step(s)   Type           Description
-----------------  --------  -------------  -----------------------------------------------
proj               SFTEP     str            Name of the project directory(ies); see the detailed description under Special settings
step               SFTEP     str            Step(s) to run for analysis; see the detailed description under Special settings
verbose            SFTEP     str            Print verbose output? True or False
sim_dir            SF–––     str            Filepath to the Simulate directory
fmt_dir            –FT–P     str            Filepath to the Format directory
trn_dir            ––TEP     str            Filepath to the Train directory
est_dir            –––EP     str            Filepath to the Estimate directory
plt_dir            ––––P     str            Filepath to the Plot directory
log_dir            ––––P     str            Filepath to the Log directory
start_idx          SF–––     int            Start replicate index for simulated training dataset
end_idx            SF–––     int            End replicate index for simulated training dataset
use_parallel       SF–––     str            Use multiprocessing? True or False
num_proc           SF–––     int            Number of cores for multiprocessing (when use_parallel==True)
sim_logging        S––––     str            Method for handling simulation logs (clean: delete, compress: zip, verbose: keep)
num_char           –FTE–     int            Number of characters
num_states         –FTE–     int or int[]   Number of states per character
min_num_taxa       –F–––     int            Minimum number of taxa; smaller datasets are discarded
max_num_taxa       –F–––     int            Maximum number of taxa; larger datasets are discarded
tree_encode        –FTE–     str            How to encode tree (‘serial’: CBLV, ‘extant’: CDV)
brlen_encode       –FTE–     str            How to encode branch length info (‘height_only’: minimal, ‘height_brlen’: extra info)
char_encode        –FTE–     str            How to encode character data (‘integer’: ordered, ‘one_hot’: categorical)
param_est          –F–E–     str[]          List of parameters in labels to be estimated
param_data         –F–E–     str[]          List of parameters in labels to be treated as auxiliary data
char_format        –F–E–     str            Character data file format (‘nexus’ or ‘csv’)
tensor_format      –FTEP     str            Training tensor data file format (‘hdf5’ or ‘csv’)
save_phyenc_csv    –F–––     str            Save intermediate phylogenetic tensor encodings to file?
trn_objective      ––TEP     str            Training objective for network (‘param_est’)
tree_width         ––TEP     int            Number of columns in compact phylogenetic vector + states tensor
num_epochs         ––TEP     int            Number of training iterations
trn_batch_size     ––TEP     int            Number of training examples per batch, with all batches visited during each epoch
prop_test          ––T––     float          Proportion of training examples in test dataset
prop_val           ––T––     float          Proportion of training examples in validation dataset
prop_cal           ––T––     float          Proportion of training examples in calibration dataset
cpi_coverage       ––T––     float          Percent of training examples where CPI contains (covers) true parameter value
cpi_asymmetric     ––T––     str            Use symmetric one-sided (False) or asymmetric two-sided (True) calibration conformity scores
loss               ––T––     str            Loss function used to assess neural network fit
optimizer          ––T––     str            Optimizer used to minimize loss score of network during training
metrics            ––T––     str[]          Metrics to log during training history
est_prefix         –––EP     str            Name of prefix for new (biological) dataset to estimate
plot_train_color   ––––P     str            Plot color for training data elements
plot_test_color    ––––P     str            Plot color for test data elements
plot_val_color     ––––P     str            Plot color for validation data elements
plot_aux_color     ––––P     str            Plot color for auxiliary data elements
plot_label_color   ––––P     str            Plot color for label (parameter) elements
plot_est_color     ––––P     str            Plot color for estimates (for new dataset) elements
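The prop_test, prop_val, and prop_cal settings each set aside a proportion of the simulated examples. The arithmetic can be sketched as follows, assuming each proportion is taken from the full dataset and the remainder is used to train the network. The helper split_counts is hypothetical; how phyddle rounds counts or assigns individual examples is not described here.

```python
def split_counts(num_examples, prop_test=0.05, prop_val=0.05, prop_cal=0.2):
    """Hypothetical helper: count the examples assigned to each subset."""
    num_test = round(num_examples * prop_test)
    num_val  = round(num_examples * prop_val)
    num_cal  = round(num_examples * prop_cal)
    # Assumption: every remaining example is used to train the network
    num_train = num_examples - num_test - num_val - num_cal
    return {'test': num_test, 'val': num_val, 'cal': num_cal, 'train': num_train}

print(split_counts(1000))   # {'test': 50, 'val': 50, 'cal': 200, 'train': 700}
```

With the default proportions, 30% of the examples are held out, so a dataset of 1000 replicates would leave 700 for training under this assumption.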

Special settings

This section provides detailed descriptions for several settings that are not intuitive to specify, but very powerful when used correctly.

step

The step setting controls which steps should be applied. Each pipeline step is represented by a capital letter: S for Simulate, F for Format, T for Train, E for Estimate, P for Plot, and A for all steps.

For example, the following two commands are equivalent:

./run_phyddle.py --step A
./run_phyddle.py --step SFTEP

whereas calling

./run_phyddle.py --step SF

instructs phyddle to perform the Simulate and Format steps, but not the Train, Estimate, or Plot steps.
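The mapping from a step string to the steps that run can be sketched as below. The helper expand_steps is hypothetical and not part of phyddle; running the selected steps in pipeline order regardless of the order typed, and rejecting unknown letters, are assumptions made for the sketch.

```python
ALL_STEPS = 'SFTEP'   # Simulate, Format, Train, Estimate, Plot

def expand_steps(step):
    """Hypothetical helper: list the pipeline steps selected by a step string."""
    if 'A' in step:
        # 'A' is shorthand for every step
        return list(ALL_STEPS)
    unknown = [s for s in step if s not in ALL_STEPS]
    if unknown:
        raise ValueError(f'unknown step(s): {unknown}')
    # Assumption: selected steps run in pipeline order
    return [s for s in ALL_STEPS if s in step]

print(expand_steps('A'))    # ['S', 'F', 'T', 'E', 'P']
print(expand_steps('SF'))   # ['S', 'F']
```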

proj

The proj setting controls how project names are assigned to different pipeline steps. Typically, proj is provided a single project name that is shared across all pipeline steps. For example, calling

./run_phyddle.py --proj my_project

causes all results from this phyddle analysis to be stored in a subdirectory called my_project. The proj setting can also be used to specify different project names for individual pipeline steps. For example, calling

./run_phyddle.py --proj my_project,E:new_estimate,P:new_plot

would use new_estimate as the project name for the E step (Estimate), new_plot for the P step (Plot), and my_project for all other steps.
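The behavior described above can be sketched as a small parser that maps each pipeline step to a project name. The helper parse_proj is hypothetical and only mirrors the description in this section; phyddle's own parsing may differ in details such as error handling.

```python
def parse_proj(proj, steps='SFTEP'):
    """Hypothetical helper: map each pipeline step to its project name."""
    default = None
    overrides = {}
    for token in proj.split(','):
        if ':' in token:
            # 'E:new_estimate' assigns a project name to a single step
            step, name = token.split(':', 1)
            overrides[step] = name
        else:
            # A bare name is the project shared by all other steps
            default = token
    return {s: overrides.get(s, default) for s in steps}

print(parse_proj('my_project,E:new_estimate,P:new_plot'))
# {'S': 'my_project', 'F': 'my_project', 'T': 'my_project', 'E': 'new_estimate', 'P': 'new_plot'}
```

Step-specific project names of this kind make it possible to estimate and plot results for a new dataset while reusing the simulated data and trained network from an earlier project.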