Interpreting Benchmark Results#

This guide explains the output files generated by the benchmarking system and how to interpret the metrics.

Output Files Overview#

After running benchmarks, results are stored in several files:

| File | Contents |
| --- | --- |
| inference/raw_metrics.csv | Detailed inference performance metrics |
| visualization/raw_metrics.csv | Visualization performance metrics |
| summary.json | Complete results with system information |
| tables/inference_comparison.tex | LaTeX table for publications |
| figures/scaling_plots.png | Performance scaling visualizations |

Inference Metrics (inference/raw_metrics.csv)#

This CSV file contains one row per method-dataset combination with the following columns:

Identification Columns#

| Column | Description |
| --- | --- |
| method | Inference method name (e.g., fastgaia, midpoint, tsdate) |
| dataset | Dataset name (e.g., samples_100, seqlen_1000000) |

Performance Metrics#

| Column | Unit | Description |
| --- | --- | --- |
| wall_time_s | seconds | Total elapsed time from start to finish |
| cpu_time_s | seconds | CPU time consumed (may differ from wall time due to I/O waits or multi-core execution) |
| peak_memory_mb | megabytes | Maximum memory usage during inference |

Dataset Characteristics#

| Column | Description |
| --- | --- |
| num_nodes | Total number of nodes in the tree sequence |
| num_samples | Number of sample nodes |
| num_trees | Number of local trees |

Accuracy Metrics (Spatial)#

For spatial inference methods (fastgaia, gaia-quadratic, gaia-linear, midpoint, sparg, spacetrees):

| Column | Unit | Description |
| --- | --- | --- |
| spatial_mean_error | km (or coordinate units) | Mean distance between inferred and true locations |
| spatial_median_error | km (or coordinate units) | Median distance error |
| spatial_rmse | km (or coordinate units) | Root mean squared error |

Accuracy Metrics (Temporal)#

For temporal inference (tsdate):

| Column | Description |
| --- | --- |
| temporal_mean_error | Mean absolute difference between inferred and true node times |
| temporal_correlation | Pearson correlation between inferred and true times |

Error Column#

| Column | Description |
| --- | --- |
| error | Error message if inference failed, empty otherwise |

Example Output#

method,dataset,wall_time_s,cpu_time_s,peak_memory_mb,num_nodes,num_samples,num_trees,spatial_mean_error,spatial_median_error,spatial_rmse,error
fastgaia,samples_50,1.23,1.15,156.2,1247,50,892,12.5,10.2,15.8,
fastgaia,samples_100,3.45,3.21,245.8,2534,100,1523,11.8,9.5,14.2,
midpoint,samples_50,0.45,0.42,89.3,1247,50,892,18.7,16.4,22.1,
tsdate,samples_50,2.87,2.65,198.4,1247,50,892,,,,

Visualization Metrics (visualization/raw_metrics.csv)#

This CSV file contains visualization performance measurements:

Columns#

| Column | Unit | Description |
| --- | --- | --- |
| tool | | Visualization tool name (argscape-2d, argscape-3d) |
| dataset | | Dataset name |
| render_time_ms | milliseconds | Time to initial render |
| time_to_interactive_ms | milliseconds | Time until visualization responds to input |
| fps_pan | frames/second | Frame rate during pan operations |
| fps_zoom | frames/second | Frame rate during zoom operations |
| heap_size_mb | megabytes | JavaScript heap memory usage |
| num_nodes | | Number of nodes in the tree sequence |
| num_samples | | Number of samples |
| error | | Error message if benchmark failed |

Interpreting Visualization Metrics#

Render time and time to interactive:

  • Lower values indicate faster loading

  • Large datasets may have significantly longer load times

  • Time to interactive should be close to render time for responsive UIs

Frame rates (FPS):

  • 60 FPS is ideal for smooth interaction

  • 30+ FPS is acceptable

  • Below 15 FPS indicates performance issues

  • FPS typically decreases with larger datasets

Heap size:

  • Indicates browser memory consumption

  • Higher values may cause browser performance issues

  • Track this metric to identify memory leaks
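The thresholds above can be turned into a small helper for scanning many rows at once. This is a minimal sketch, not part of the benchmarking system: the function names and the sample measurements are hypothetical, and the "marginal" label for the 15 to 30 FPS range is an assumption, since the guide does not name that band.

```python
def classify_fps(fps: float) -> str:
    """Map a frame rate to the qualitative bands described above."""
    if fps >= 60:
        return "ideal"
    if fps >= 30:
        return "acceptable"
    if fps >= 15:
        return "marginal"  # assumption: the guide does not name the 15-30 band
    return "poor"


def interactive_lag_ms(render_time_ms: float, time_to_interactive_ms: float) -> float:
    """Delay between the first render and the point where input is handled."""
    return time_to_interactive_ms - render_time_ms


# Hypothetical measurements, not real benchmark output.
rows = [
    {"tool": "argscape-2d", "render_time_ms": 420.0,
     "time_to_interactive_ms": 480.0, "fps_pan": 58.0},
    {"tool": "argscape-3d", "render_time_ms": 900.0,
     "time_to_interactive_ms": 2100.0, "fps_pan": 12.0},
]

for row in rows:
    lag = interactive_lag_ms(row["render_time_ms"], row["time_to_interactive_ms"])
    print(f"{row['tool']}: pan is {classify_fps(row['fps_pan'])}, "
          f"interactive lag {lag:.0f} ms")
```

A large interactive lag alongside a low frame rate, as in the second hypothetical row, is the pattern worth investigating first.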

Summary JSON (summary.json)#

The summary file contains:

System Information#

{
  "system_info": {
    "timestamp": "2024-01-15T10:30:00.000000",
    "argscape_version": "1.0.0",
    "python_version": "3.11.0",
    "platform": "macOS-14.0-arm64",
    "processor": "arm",
    "machine": "arm64",
    "dependencies": {
      "tskit": "0.5.6",
      "msprime": "1.3.0",
      "fastgaia": "0.2.0",
      "gaiapy": "0.1.0",
      "playwright": "1.40.0"
    }
  }
}

Complete Results#

The JSON also contains the complete inference_results and visualization_results arrays, matching the CSV contents but in JSON format for programmatic access.
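As a sketch of programmatic access, the snippet below parses a trimmed stand-in for summary.json using only the standard library. The top-level keys come from this guide; the shape of each inference_results entry (one object per CSV row, keyed by column name) is an assumption that should be checked against a real file.

```python
import json

# Stand-in for the contents of summary.json. For a real run, read the file
# instead, e.g. with open("results/summary.json") as fh: summary = json.load(fh)
# (the results/ path is an assumption).
summary_text = """
{
  "system_info": {"argscape_version": "1.0.0", "platform": "macOS-14.0-arm64"},
  "inference_results": [
    {"method": "fastgaia", "dataset": "samples_50", "wall_time_s": 1.23, "error": null},
    {"method": "midpoint", "dataset": "samples_50", "wall_time_s": 0.45, "error": null}
  ],
  "visualization_results": []
}
"""

summary = json.loads(summary_text)

info = summary["system_info"]
print(f"argscape {info['argscape_version']} on {info['platform']}")

# Keep only runs without an error message, then pick the quickest one.
ok = [r for r in summary["inference_results"] if not r.get("error")]
fastest = min(ok, key=lambda r: r["wall_time_s"])
print(f"Fastest recorded run: {fastest['method']} ({fastest['wall_time_s']:.2f}s)")
```

Recording system_info next to the results makes it easier to tell whether two summary files are comparable before their timings are compared.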

Accuracy Interpretation#

Spatial Accuracy#

Spatial accuracy metrics compare inferred ancestor locations to ground truth:

  • Mean error: Average distance error across all compared nodes. Lower is better.

  • Median error: Middle value of errors. Less sensitive to outliers than mean.

  • RMSE: Penalizes large errors more heavily. Useful for identifying methods with occasional large mistakes.

Spatial accuracy is only computed for non-sample nodes (samples have known locations used as input).
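To see how the three statistics react to an outlier, here is a small worked example on hypothetical per-node distance errors. The values are made up for illustration, and the code is a sketch of the standard definitions rather than the benchmarking system's own implementation.

```python
import math
import statistics

# Hypothetical distance errors for five non-sample nodes (km); one is an outlier.
errors = [2.0, 3.0, 4.0, 5.0, 36.0]

mean_error = statistics.mean(errors)
median_error = statistics.median(errors)
rmse = math.sqrt(statistics.mean([e * e for e in errors]))

print(f"mean={mean_error:.2f}  median={median_error:.2f}  rmse={rmse:.2f}")
```

The single large error pulls the mean well above the median and the RMSE higher still, which is why a wide gap between median and RMSE points to occasional large mistakes rather than uniformly poor estimates.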

When interpreting spatial errors:

  • Units depend on the coordinate system (km for geographic, coordinate units otherwise)

  • Compare methods on the same dataset for fair comparison

  • Consider the spatial extent of your data when evaluating error magnitudes

Temporal Accuracy#

Temporal accuracy metrics compare inferred node times to ground truth:

  • Mean error: Mean absolute difference between inferred and true node times. Units match the tree sequence time units.

  • Correlation: Pearson correlation between inferred and true times. Values close to 1.0 indicate that inferred times track true times closely in relative terms, even if absolute times differ.

A high correlation with moderate mean error suggests the method captures the relative timing well but may have systematic bias.

Scaling Analysis#

The benchmarking system generates scaling plots showing how performance changes with dataset size:

Time Scaling#

The time scaling plot shows wall time vs. number of samples on log-log axes. Look for:

  • Linear scaling (slope ~1): Time grows proportionally with input size

  • Quadratic scaling (slope ~2): Time grows with the square of input size

  • Methods with better scaling: Lines with lower slopes grow more slowly with input size, which makes those methods more efficient on large datasets
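The slope can also be estimated numerically instead of read off the plot. The sketch below fits a least-squares line to log(time) against log(size). The helper name loglog_slope and the timing values are hypothetical, chosen so that one series is exactly linear and the other exactly quadratic.

```python
import math


def loglog_slope(sizes, times):
    """Least-squares slope of log(time) against log(size)."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(t) for t in times]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    numerator = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    denominator = sum((x - x_mean) ** 2 for x in xs)
    return numerator / denominator


sizes = [50, 100, 200, 400]
linear_times = [0.5, 1.0, 2.0, 4.0]     # time doubles when size doubles
quadratic_times = [0.1, 0.4, 1.6, 6.4]  # time quadruples when size doubles

print(f"linear series slope:    {loglog_slope(sizes, linear_times):.2f}")
print(f"quadratic series slope: {loglog_slope(sizes, quadratic_times):.2f}")
```

With real measurements the fitted slope will rarely be a clean integer, and fixed startup costs tend to flatten the curve for the smallest datasets, so the fit is more informative when it covers the larger sizes.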

Memory Scaling#

The memory scaling plot shows peak memory vs. number of samples:

  • Memory usage often correlates with the number of nodes

  • Some methods have higher base memory requirements

  • Look for methods that remain practical for your target dataset sizes
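One rough way to separate base memory from per-node growth is to fit a straight line through peak memory against node count. The sketch below uses the two fastgaia rows from the example CSV in this guide. With only two points the line is exact rather than fitted, the linear model itself is an assumption, and the helper predict_peak_memory_mb is hypothetical, so treat any extrapolation as a rough guide only.

```python
# Two fastgaia rows from the example CSV earlier in this guide.
runs = [
    {"num_nodes": 1247, "peak_memory_mb": 156.2},
    {"num_nodes": 2534, "peak_memory_mb": 245.8},
]

small, large = runs

# Slope: extra megabytes per additional node.
mb_per_node = (large["peak_memory_mb"] - small["peak_memory_mb"]) / (
    large["num_nodes"] - small["num_nodes"]
)
# Intercept: memory that would remain at zero nodes (the base requirement).
base_mb = small["peak_memory_mb"] - mb_per_node * small["num_nodes"]


def predict_peak_memory_mb(num_nodes: int) -> float:
    """Extrapolate peak memory under the assumed linear model."""
    return base_mb + mb_per_node * num_nodes


print(f"about {mb_per_node * 1000:.1f} MB per 1000 nodes, base about {base_mb:.0f} MB")
print(f"extrapolated peak for 10000 nodes: {predict_peak_memory_mb(10_000):.0f} MB")
```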

LaTeX Tables#

The generated LaTeX table (tables/inference_comparison.tex) summarizes method performance:

\begin{table}[htbp]
\centering
\caption{Inference Method Performance Comparison}
\begin{tabular}{lrrrr}
\toprule
Method & Time (s) & Memory (MB) & Mean Error & Median Error \\
\midrule
fastgaia & 2.34 & 201.5 & 12.15 & 9.85 \\
midpoint & 0.89 & 112.3 & 17.42 & 15.21 \\
...
\bottomrule
\end{tabular}
\label{tab:inference-comparison}
\end{table}

Values are averaged across all datasets for each method.
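The averaging step can be reproduced from the raw CSV. The sketch below uses only the standard library and the spatial rows from the example CSV in this guide; it is an illustration of per-method averaging, not the benchmarking system's own table generator. With pandas, the same aggregation is df.groupby("method")[columns].mean().

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Spatial rows copied from the example CSV earlier in this guide.
csv_text = """method,dataset,wall_time_s,peak_memory_mb,spatial_mean_error,spatial_median_error
fastgaia,samples_50,1.23,156.2,12.5,10.2
fastgaia,samples_100,3.45,245.8,11.8,9.5
midpoint,samples_50,0.45,89.3,18.7,16.4
"""

columns = ["wall_time_s", "peak_memory_mb", "spatial_mean_error", "spatial_median_error"]

# Collect each method's values per column across all of its datasets.
values = defaultdict(lambda: defaultdict(list))
for row in csv.DictReader(io.StringIO(csv_text)):
    for col in columns:
        values[row["method"]][col].append(float(row[col]))

# One averaged row per method.
averages = {
    method: {col: mean(vals) for col, vals in cols.items()}
    for method, cols in values.items()
}

# Print the rows in LaTeX tabular syntax.
for method, row in averages.items():
    cells = " & ".join(f"{row[col]:.2f}" for col in columns)
    print(f"{method} & {cells} \\\\")
```

Because each method is averaged over whichever datasets it completed, a method that failed on the largest datasets can look faster than it is; checking the error column before comparing averages avoids that trap.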

Common Analysis Tasks#

Finding the Fastest Method#

Sort by wall_time_s for a given dataset:

import pandas as pd

df = pd.read_csv("results/inference/raw_metrics.csv")
dataset_results = df[df["dataset"] == "samples_100"]
fastest = dataset_results.sort_values("wall_time_s").iloc[0]
print(f"Fastest: {fastest['method']} ({fastest['wall_time_s']:.2f}s)")

Finding the Most Accurate Method#

Sort by accuracy metrics:

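# For spatial methods: rank within one dataset so errors are comparable
# (df is the DataFrame loaded in the previous example)
spatial = df[(df["dataset"] == "samples_100") & df["spatial_mean_error"].notna()]
most_accurate = spatial.sort_values("spatial_mean_error").iloc[0]
print(f"Most accurate: {most_accurate['method']} ({most_accurate['spatial_mean_error']:.2f} units)")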

Identifying Failed Benchmarks#

Filter for errors:

failed = df[df["error"].notna()]
print(f"Failed benchmarks: {len(failed)}")
print(failed[["method", "dataset", "error"]])

Comparing Methods Across Dataset Sizes#

Pivot the data to compare scaling:

pivot = df.pivot_table(
    values="wall_time_s",
    index="dataset",
    columns="method",
    aggfunc="mean"
)
print(pivot)