Workspace
Note
Important: This section assumes the project name is ‘example’, while actual projects will likely use different names.
This section describes how phyddle organizes files and directories in its workspace. Visit Formats to learn more about file formats. Visit Configuration to learn more about managing directories and projects within a workspace.
By default, phyddle saves work from its pipeline steps to the workspace directory. Briefly, the workspace directory contains six subdirectories: one for each of the five pipeline steps, plus one for logs:
* simulate : contains raw data generated by simulation
* format : contains data formatted into tensors for training networks
* train : contains trained networks and diagnostics
* estimate : contains new test datasets and their estimates
* plot : contains figures of training and validation procedures
* log : contains runtime logs for a phyddle project
This section assumes all steps use the example project bundled with phyddle, which was generated using the command
./scripts/run_phyddle.sh --cfg config --proj example --end_idx 25000
This corresponds to a 3-region equal-rates GeoSSE model. All directories contain the complete file set, except ./workspace/simulate/example, which contains only 20 original examples.
A standard configuration for a project named example would store pipeline work in these directories:
workspace/simulate/example # output of Simulate step
workspace/format/example # output of Format step
workspace/train/example # output of Train step
workspace/estimate/example # output of Estimate step
workspace/plot/example # output of Plot step
workspace/log/example # analysis logs
Next, we give an overview of the standard files and formats corresponding to each pipeline directory.
simulate
The Simulate step generates raw data from a simulating model; this raw data cannot yet be fed to the neural network for training. A typical simulation produces the following files:
workspace/simulate/example/sim.0.tre # Newick string
workspace/simulate/example/sim.0.dat.nex # Nexus file
workspace/simulate/example/sim.0.param_row.csv # data-generating params
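To illustrate the shape of these outputs, the following sketch writes and reads back a miniature example. The tree, parameter names, and values are invented for illustration; real phyddle output will differ.

```python
import csv
import os
import tempfile

# Hypothetical contents for one simulated example (index 0);
# actual phyddle output will differ in taxa, states, and parameters.
workdir = tempfile.mkdtemp()
with open(os.path.join(workdir, "sim.0.tre"), "w") as f:
    f.write("((A:1.0,B:1.0):0.5,C:1.5);\n")            # Newick string
with open(os.path.join(workdir, "sim.0.param_row.csv"), "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["w_0", "e_0", "d_0_1", "b_0_1"])  # parameter names
    writer.writerow([0.28, 0.02, 0.45, 0.06])          # data-generating values

# Read the data-generating parameters back as a name -> value dict
with open(os.path.join(workdir, "sim.0.param_row.csv")) as f:
    rows = list(csv.reader(f))
params = dict(zip(rows[0], map(float, rows[1])))
print(params["w_0"])  # 0.28
```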
format
Applying Format to a directory of simulated datasets outputs tensors containing the entire set of training examples, stored to, e.g., workspace/format/example. Which formatted files are created depends on the values of tensor_format and tree_width.
When tree_width is set to 200, Format yields two simulated dataset tensors: one for the training examples and another for the test examples.
If the tensor_format setting is 'csv' (comma-separated value, or CSV, format), the formatted files are:
test.nt200.phy_data.csv
test.nt200.aux_data.csv
test.nt200.labels.csv
train.nt200.phy_data.csv
train.nt200.aux_data.csv
train.nt200.labels.csv
where the phy_data.csv files contain one flattened Compact Phylogenetic Vector + States (CPV+S) entry per row, the aux_data.csv files contain one vector of auxiliary data values (summary statistics and known parameters) per row, and labels.csv contains one vector of labels (estimated parameters) per row. Each row across the three CSV files corresponds to a single, matched simulated training example. All files are stored in standard comma-separated value format, making them easy to read with standard CSV-reading functions.
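As a sketch of how these matched CSV rows can be consumed, the following snippet reads miniature aux_data and labels tensors into per-example dicts keyed by column name. The column names and values here are invented; real formatted files will have many more columns and rows.

```python
import csv
import io

# Invented miniature versions of the formatted CSV tensors;
# each row is one simulated training example, matched across files.
aux_csv = "n_taxa,tree_height\n3,1.5\n4,2.0\n"
labels_csv = "w_0,e_0\n0.28,0.02\n0.31,0.05\n"

def read_rows(text):
    """Parse a CSV tensor into a list of {column: float} dicts."""
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    return [dict(zip(header, map(float, r))) for r in data]

aux = read_rows(aux_csv)
labels = read_rows(labels_csv)
# Example i pairs aux[i] with labels[i] (and phy_data row i)
assert len(aux) == len(labels)
print(labels[0]["w_0"])  # 0.28
```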
If the tensor_format setting is 'hdf5', the resulting files are:
test.nt200.hdf5
train.nt200.hdf5
where each HDF5 file contains all phylogenetic-state (CPV+S) data, auxiliary data, and label data. Individual simulated training examples share the same ordering across the three internal datasets stored in the file. HDF5 format is not as easily readable as CSV format. However, phyddle uses gzip to automatically (de)compress records, which often produces files over twenty times smaller than equivalent uncompressed CSV-formatted tensors.
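A minimal sketch of this layout, assuming the h5py package is available. The dataset names and shapes here are illustrative, not necessarily the names phyddle uses internally.

```python
import numpy as np
import h5py

# Write a small HDF5 file with three gzip-compressed datasets, mirroring
# the phy_data / aux_data / labels layout. Dataset names and shapes are
# illustrative, not necessarily phyddle's internal names.
with h5py.File("train.nt200.hdf5", "w") as f:
    f.create_dataset("phy_data", data=np.zeros((2, 1400)), compression="gzip")
    f.create_dataset("aux_data", data=np.ones((2, 10)), compression="gzip")
    f.create_dataset("labels", data=np.full((2, 4), 0.5), compression="gzip")

# Example i shares the same row index i across all three datasets
with h5py.File("train.nt200.hdf5", "r") as f:
    first_labels = f["labels"][0].tolist()
    shapes = {name: f[name].shape for name in f}
print(first_labels)  # [0.5, 0.5, 0.5, 0.5]
```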
train
Training a network creates the following files in the workspace/train/my_project directory:
network_nt200.cpi_adjustments.csv
network_nt200.hdf5
network_nt200.train_aux_data_norm.csv
network_nt200.train_est.csv
network_nt200.train_est.labels.csv
network_nt200.train_history.json
network_nt200.train_label_est_nocalib.csv
network_nt200.train_label_norm.csv
network_nt200.train_true.labels.csv
For example, the network prefix sim_batchsize128_numepoch20_nt500 indicates a network trained with a batch size of 128 samples for 20 epochs on the tree-width size category of at most 500 taxa.
Descriptions of the files are as follows, with train_prefix omitted for brevity:
* network.hdf5 : a saved copy of the trained neural network that can be loaded by TensorFlow
* train_label_norm.csv and train_aux_data_norm.csv : the location-scale values from the training dataset used to (de)normalize the labels and auxiliary data from any dataset
* train_true.labels.csv : the true label values for the training and test datasets, where columns correspond to estimated labels (e.g. model parameters)
* train_est.labels.csv : the trained network's label estimates for the training and test datasets, with calibrated prediction intervals, where columns correspond to the point estimate and the lower and upper CPI bounds for each named label (e.g. model parameter)
* train_label_est_nocalib.csv : the trained network's label estimates for the training and test datasets, with uncalibrated prediction intervals
* train_history.json : the metrics monitored across training epochs
* cpi_adjustments.csv : calibrated prediction interval adjustments, where columns correspond to parameters, the first row contains lower-bound adjustments, and the second row contains upper-bound adjustments
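As an illustration of the cpi_adjustments.csv layout described above, this snippet parses a hypothetical file (contents and parameter names invented) into lower- and upper-bound adjustment mappings:

```python
import csv
import io

# Invented cpi_adjustments.csv content: columns are parameters, the
# first data row holds lower-bound adjustments, the second upper-bound.
text = "w_0,e_0\n-0.01,-0.002\n0.03,0.008\n"
rows = list(csv.reader(io.StringIO(text)))
names = rows[0]
lower = dict(zip(names, map(float, rows[1])))  # lower-bound adjustments
upper = dict(zip(names, map(float, rows[2])))  # upper-bound adjustments
print(lower["w_0"], upper["w_0"])  # -0.01 0.03
```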
estimate
The Estimate step reads new (biological) datasets from the project directory, saves new intermediate files, and stores output estimates in the same directory, located at e.g. workspace/estimate/example:
new.1.tre # input: initial tree
new.1.dat.nex # input: character data
new.1.known_params.csv # input: params for aux. data (optional)
new.1.extant.tre # intermediate: pruned tree
new.1.phy_data.csv # intermediate: CPV+S tensor data
new.1.aux_data.csv # intermediate: aux. data tensor data
new.1.info.csv # intermediate: formatting info
new.1.network_nt200.est_labels.csv # output: estimates
All files have previously been explained in the simulate, format, or train workspace sections, except for two.
The known_params.csv file is optional, and is used to provide “known” data-generating parameter values to the network as part of the auxiliary dataset. If provided, it contains a row of names for known parameters followed by a row of respective values.
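A sketch of that two-row layout, using invented parameter names and values:

```python
import csv

# Write a hypothetical known_params.csv: one row of "known" parameter
# names, then one row of their values (names and values are invented).
with open("new.1.known_params.csv", "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["sample_frac", "clock_rate"])
    w.writerow([0.5, 0.001])

# Read it back as a name -> value mapping
with open("new.1.known_params.csv") as f:
    names, values = list(csv.reader(f))
known = dict(zip(names, map(float, values)))
print(known)  # {'sample_frac': 0.5, 'clock_rate': 0.001}
```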
The est_labels.csv file reports the point estimate and the lower and upper CPI estimates for all targeted parameters. Estimates appear across columns, grouped first by label (e.g. parameter) and then by statistic (e.g. value, lower bound, upper bound). For example:
$ cat new.1.sim_batchsize128_numepoch20_nt500.est_labels.csv
w_0_value,w_0_lower,w_0_upper,e_0_value,e_0_lower,e_0_upper,d_0_1_value,d_0_1_lower,d_0_1_upper,b_0_1_value,b_0_1_lower,b_0_1_upper
0.2867125345651129,0.1937433853918723,0.45733220552078013,0.02445545359384659,0.002880695707341881,0.10404499205878459,0.4502031713887769,0.1966340488593367,0.5147956690178682,0.06199703190510973,0.0015074254823161301,0.27544015163806645
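Because the columns are grouped by label and then statistic, the header can be parsed mechanically. A sketch using a truncated version of the example above:

```python
import csv
import io

# Parse an est_labels.csv header/row (truncated from the example above)
# into {label: {"value": ..., "lower": ..., "upper": ...}}.
text = (
    "w_0_value,w_0_lower,w_0_upper,e_0_value,e_0_lower,e_0_upper\n"
    "0.2867,0.1937,0.4573,0.0245,0.0029,0.1040\n"
)
header, row = list(csv.reader(io.StringIO(text)))
estimates = {}
for name, value in zip(header, row):
    label, stat = name.rsplit("_", 1)  # e.g. "w_0_value" -> ("w_0", "value")
    estimates.setdefault(label, {})[stat] = float(value)
print(estimates["w_0"]["value"])  # 0.2867
```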
plot
The Plot step generates visualizations for results previously generated by Format, Train, and (when available) Estimate. Typical outputs include:
est_CPI.pdf # results from Estimate step
density_{label,aux_data}.pdf # densities from Simulate/Format steps
pca_contour_{label,aux_data}.pdf # PCA of Simulate/Format steps
estimate_{test,train}_{param}.pdf # estimation accuracy from Train step
history.pdf # training history for entire network
history_param_{statistic}.pdf # training history for each estimation target
network_architecture.pdf # neural network architecture
summary.pdf # compiled report of all figures
Visit Pipeline to learn more about the files.