Interpreting Benchmark Results#
This guide explains the output files generated by the benchmarking system and how to interpret the metrics.
Output Files Overview#
After running benchmarks, results are stored in several files:
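| File | Contents |
|---|---|
| inference/raw_metrics.csv | Detailed inference performance metrics |
| visualization/raw_metrics.csv | Visualization performance metrics |
| summary.json | Complete results with system information |
| tables/inference_comparison.tex | LaTeX table for publications |
| | Performance scaling visualizations |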
Inference Metrics (inference/raw_metrics.csv)#
This CSV file contains one row per method-dataset combination with the following columns:
Identification Columns#
| Column | Description |
|---|---|
| method | Inference method name |
| dataset | Dataset name |
Performance Metrics#
| Column | Unit | Description |
|---|---|---|
| wall_time_s | seconds | Total elapsed time from start to finish |
| cpu_time_s | seconds | CPU time consumed (may differ from wall time due to I/O waits) |
| peak_memory_mb | megabytes | Maximum memory usage during inference |
Dataset Characteristics#
| Column | Description |
|---|---|
| num_nodes | Total number of nodes in the tree sequence |
| num_samples | Number of sample nodes |
| num_trees | Number of local trees |
Accuracy Metrics (Spatial)#
For spatial inference methods (fastgaia, gaia-quadratic, gaia-linear, midpoint, sparg, spacetrees):
| Column | Unit | Description |
|---|---|---|
| spatial_mean_error | km (or coord units) | Mean distance between inferred and true locations |
| spatial_median_error | km (or coord units) | Median distance error |
| spatial_rmse | km (or coord units) | Root mean squared error |
Accuracy Metrics (Temporal)#
For temporal inference (tsdate):
| Column | Description |
|---|---|
| | Mean absolute difference between inferred and true node times |
| | Pearson correlation between inferred and true times |
Error Column#
| Column | Description |
|---|---|
| error | Error message if inference failed, empty otherwise |
Example Output#
method,dataset,wall_time_s,cpu_time_s,peak_memory_mb,num_nodes,num_samples,num_trees,spatial_mean_error,spatial_median_error,spatial_rmse,error
fastgaia,samples_50,1.23,1.15,156.2,1247,50,892,12.5,10.2,15.8,
fastgaia,samples_100,3.45,3.21,245.8,2534,100,1523,11.8,9.5,14.2,
midpoint,samples_50,0.45,0.42,89.3,1247,50,892,18.7,16.4,22.1,
tsdate,samples_50,2.87,2.65,198.4,1247,50,892,,,,
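As a minimal sketch of reading this format without extra dependencies, the rows can be parsed with Python's standard csv module. The helper name parse_metrics is hypothetical, and the sample text is an abbreviated copy of the example above rather than real benchmark output. Blank cells are mapped to None so that missing accuracy values and empty error fields are easy to test for.

```python
import csv
import io

# Abbreviated copy of the example output (illustrative values only).
SAMPLE_CSV = """\
method,dataset,wall_time_s,cpu_time_s,peak_memory_mb,num_nodes,num_samples,num_trees,spatial_mean_error,spatial_median_error,spatial_rmse,error
fastgaia,samples_50,1.23,1.15,156.2,1247,50,892,12.5,10.2,15.8,
tsdate,samples_50,2.87,2.65,198.4,1247,50,892,,,,
"""

# Columns that hold text; every other column is treated as numeric.
TEXT_COLUMNS = {"method", "dataset", "error"}


def parse_metrics(text):
    """Parse metrics CSV text into dicts; blank cells become None."""
    rows = []
    for raw in csv.DictReader(io.StringIO(text)):
        row = {}
        for key, value in raw.items():
            if value == "":
                row[key] = None
            elif key in TEXT_COLUMNS:
                row[key] = value
            else:
                row[key] = float(value)
        rows.append(row)
    return rows


rows = parse_metrics(SAMPLE_CSV)
for row in rows:
    has_spatial = row["spatial_mean_error"] is not None
    print(row["method"], row["wall_time_s"], "spatial metrics:", has_spatial)
```

Note that a row with no spatial values is not necessarily a failed run: the error column is the field that signals failure.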
Visualization Metrics (visualization/raw_metrics.csv)#
This CSV file contains visualization performance measurements:
Columns#
| Column | Unit | Description |
|---|---|---|
| | | Visualization tool name |
| | | Dataset name |
| | milliseconds | Time to initial render |
| | milliseconds | Time until visualization responds to input |
| | frames/second | Frame rate during pan operations |
| | frames/second | Frame rate during zoom operations |
| | megabytes | JavaScript heap memory usage |
| | | Number of nodes in the tree sequence |
| | | Number of samples |
| | | Error message if benchmark failed |
Interpreting Visualization Metrics#
Render time and time to interactive:
- Lower values indicate faster loading
- Large datasets may have significantly longer load times
- Time to interactive should be close to render time for responsive UIs
Frame rates (FPS):
- 60 FPS is ideal for smooth interaction
- 30+ FPS is acceptable
- Below 15 FPS indicates performance issues
- FPS typically decreases with larger datasets
Heap size:
- Indicates browser memory consumption
- Higher values may cause browser performance issues
- Track this metric to identify memory leaks
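The frame-rate guidance above can be turned into a small helper for labelling results. This is only a sketch: classify_fps is a hypothetical name, and the "marginal" label for the 15 to 30 FPS range is an assumption, since the guide does not name that band.

```python
def classify_fps(fps):
    """Map a measured frame rate to a qualitative band."""
    if fps >= 60:
        return "ideal"
    if fps >= 30:
        return "acceptable"
    if fps >= 15:
        return "marginal"  # assumption: this range is not named in the guide
    return "poor"


for value in (60.0, 42.5, 20.0, 9.0):
    print(value, classify_fps(value))
```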
Summary JSON (summary.json)#
The summary file contains:
System Information#
{
  "system_info": {
    "timestamp": "2024-01-15T10:30:00.000000",
    "argscape_version": "1.0.0",
    "python_version": "3.11.0",
    "platform": "macOS-14.0-arm64",
    "processor": "arm",
    "machine": "arm64",
    "dependencies": {
      "tskit": "0.5.6",
      "msprime": "1.3.0",
      "fastgaia": "0.2.0",
      "gaiapy": "0.1.0",
      "playwright": "1.40.0"
    }
  }
}
Complete Results#
The JSON also contains the complete inference_results and visualization_results arrays, matching the CSV contents but in JSON format for programmatic access.
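For programmatic access, the summary can be read with the standard json module. The sketch below uses a heavily abbreviated, hypothetical document in place of a real summary.json. It assumes that each record in inference_results uses the same keys as the CSV columns and that a successful run has a null or empty error field; neither detail is confirmed by this guide.

```python
import json

# Hypothetical, abbreviated stand-in for the contents of summary.json.
SUMMARY_TEXT = """
{
  "system_info": {
    "argscape_version": "1.0.0",
    "dependencies": {"tskit": "0.5.6"}
  },
  "inference_results": [
    {"method": "fastgaia", "dataset": "samples_50", "wall_time_s": 1.23, "error": null}
  ],
  "visualization_results": []
}
"""

summary = json.loads(SUMMARY_TEXT)

# System information is nested under "system_info".
tskit_version = summary["system_info"]["dependencies"]["tskit"]

# Assumed record layout: keep runs whose error field is null or empty.
successful = [r for r in summary["inference_results"] if not r.get("error")]

print("tskit:", tskit_version)
print("successful inference runs:", len(successful))
```

With a real file, replace json.loads(SUMMARY_TEXT) with json.load on an open file handle.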
Accuracy Interpretation#
Spatial Accuracy#
Spatial accuracy metrics compare inferred ancestor locations to ground truth:
- Mean error: Average distance error across all compared nodes. Lower is better.
- Median error: Middle value of errors. Less sensitive to outliers than mean.
- RMSE: Penalizes large errors more heavily. Useful for identifying methods with occasional large mistakes.
Spatial accuracy is only computed for non-sample nodes (samples have known locations used as input).
When interpreting spatial errors:
- Units depend on the coordinate system (km for geographic, coordinate units otherwise)
- Compare methods on the same dataset for fair comparison
- Consider the spatial extent of your data when evaluating error magnitudes
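To see how the three spatial summaries react to an outlier, the following sketch computes them from a list of per-node distance errors. The function name summarize_errors and the input values are made up for illustration; the benchmarking system's own implementation is not shown in this guide.

```python
import math
import statistics


def summarize_errors(distances):
    """Return the mean, median, and RMSE of per-node distance errors."""
    mean = statistics.fmean(distances)
    median = statistics.median(distances)
    rmse = math.sqrt(statistics.fmean([d * d for d in distances]))
    return {"mean": mean, "median": median, "rmse": rmse}


# Made-up errors: three small values and one large outlier.
error_summary = summarize_errors([3.0, 4.0, 5.0, 20.0])
print(error_summary)
```

Here the single outlier pulls the mean (8.0) well above the median (4.5), and the RMSE (about 10.6) is larger still, which is why a gap between these values hints at occasional large mistakes.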
Temporal Accuracy#
Temporal accuracy metrics compare inferred node times to ground truth:
- Mean/Median error: Mean or median absolute difference between inferred and true node times. Units match the tree sequence time units.
- Correlation: Pearson correlation between inferred and true times. Values close to 1.0 indicate that inferred times track true times linearly, even if absolute times differ.
A high correlation with moderate mean error suggests the method captures the relative timing well but may have systematic bias.
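The combination of high correlation and non-trivial mean error can be reproduced with a toy example. In this sketch the inferred times are exactly twice the true times, a deliberately artificial systematic bias; the function names and values are illustrative and not taken from the benchmarking code.

```python
import math


def mean_absolute_error(true_times, inferred_times):
    """Average absolute difference between paired times."""
    total = sum(abs(i - t) for t, i in zip(true_times, inferred_times))
    return total / len(true_times)


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


true_times = [10.0, 20.0, 30.0, 40.0]
inferred_times = [2.0 * t for t in true_times]  # artificial 2x bias

mae = mean_absolute_error(true_times, inferred_times)
r = pearson_correlation(true_times, inferred_times)
print(f"mean absolute error: {mae:.1f}, correlation: {r:.3f}")
```

The correlation is 1.0 because the relationship is perfectly linear, yet the mean absolute error is 25.0 time units, so the two metrics answer different questions.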
Scaling Analysis#
The benchmarking system generates scaling plots showing how performance changes with dataset size:
Time Scaling#
The time scaling plot shows wall time vs. number of samples on log-log axes. Look for:
- Linear scaling (slope ~1): Time grows proportionally with input size
- Quadratic scaling (slope ~2): Time grows with the square of input size
- Methods with better scaling: Lines with lower slopes grow more slowly and are more efficient for large datasets
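The slope on a log-log plot can also be estimated numerically with an ordinary least-squares fit of log(time) against log(samples). The sketch below uses synthetic timings, not benchmark output, and loglog_slope is a hypothetical helper name.

```python
import math


def loglog_slope(sizes, times):
    """Least-squares slope of log(time) against log(size)."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(t) for t in times]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator


sizes = [50, 100, 200, 400]
linear_times = [0.5, 1.0, 2.0, 4.0]       # doubles when size doubles
quadratic_times = [0.5, 2.0, 8.0, 32.0]   # quadruples when size doubles

linear_slope = loglog_slope(sizes, linear_times)
quadratic_slope = loglog_slope(sizes, quadratic_times)
print(f"linear: {linear_slope:.2f}, quadratic: {quadratic_slope:.2f}")
```

With real measurements the fitted slope will be noisy, especially when only a few dataset sizes are available, so treat it as a rough guide rather than a complexity proof.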
Memory Scaling#
The memory scaling plot shows peak memory vs. number of samples:
- Memory usage often correlates with the number of nodes
- Some methods have higher base memory requirements
- Look for methods that remain practical for your target dataset sizes
LaTeX Tables#
The generated LaTeX table (tables/inference_comparison.tex) summarizes method performance:
\begin{table}[htbp]
\centering
\caption{Inference Method Performance Comparison}
\begin{tabular}{lrrrr}
\toprule
Method & Time (s) & Memory (MB) & Mean Error & Median Error \\
\midrule
fastgaia & 2.34 & 201.5 & 12.15 & 9.85 \\
midpoint & 0.89 & 112.3 & 17.42 & 15.21 \\
...
\bottomrule
\end{tabular}
\label{tab:inference-comparison}
\end{table}
Values are averaged across all datasets for each method.
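A per-method average of this kind can be sketched as follows. The example assumes a plain arithmetic mean over each method's dataset rows; how the actual table generator treats failed runs or missing values is not described here. The rows reuse wall times from the example CSV earlier in this guide, and average_by_method is a hypothetical helper.

```python
from collections import defaultdict

# Wall times taken from the example CSV rows earlier in this guide.
rows = [
    {"method": "fastgaia", "dataset": "samples_50", "wall_time_s": 1.23},
    {"method": "fastgaia", "dataset": "samples_100", "wall_time_s": 3.45},
    {"method": "midpoint", "dataset": "samples_50", "wall_time_s": 0.45},
]


def average_by_method(rows, column):
    """Arithmetic mean of one column, grouped by method."""
    values = defaultdict(list)
    for row in rows:
        values[row["method"]].append(row[column])
    return {method: sum(v) / len(v) for method, v in values.items()}


averages = average_by_method(rows, "wall_time_s")
for method, value in averages.items():
    # Format each result like a row fragment of the LaTeX table.
    print(f"{method} & {value:.2f} \\\\")
```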
Common Analysis Tasks#
Finding the Fastest Method#
Sort by wall_time_s for a given dataset:
import pandas as pd
df = pd.read_csv("results/inference/raw_metrics.csv")
dataset_results = df[df["dataset"] == "samples_100"]
fastest = dataset_results.sort_values("wall_time_s").iloc[0]
print(f"Fastest: {fastest['method']} ({fastest['wall_time_s']:.2f}s)")
Finding the Most Accurate Method#
Sort by accuracy metrics:
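# For spatial methods (reuses df from the previous snippet).
# Restrict to one dataset so that errors are comparable across methods.
spatial = df[(df["dataset"] == "samples_100") & df["spatial_mean_error"].notna()]
most_accurate = spatial.sort_values("spatial_mean_error").iloc[0]
print(f"Most accurate: {most_accurate['method']} ({most_accurate['spatial_mean_error']:.2f} units)")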
Identifying Failed Benchmarks#
Filter for errors:
failed = df[df["error"].notna()]
print(f"Failed benchmarks: {len(failed)}")
print(failed[["method", "dataset", "error"]])
Comparing Methods Across Dataset Sizes#
Pivot the data to compare scaling:
pivot = df.pivot_table(
    values="wall_time_s",
    index="dataset",
    columns="method",
    aggfunc="mean",
)
print(pivot)