Benchmark Datasets#
The benchmarking system includes tools for generating standardized datasets with known ground truth. These datasets enable fair comparison of inference methods and measurement of accuracy.
Generating Datasets#
Use the generate-datasets subcommand to create benchmark datasets:
argscape_benchmark generate-datasets
This creates simulated tree sequences with spatial locations assigned to samples.
Command Options#
argscape_benchmark generate-datasets [OPTIONS]
| Option | Short | Description | Default |
|---|---|---|---|
| --output | | Output directory | |
| --samples | | Comma-separated sample counts | |
| --sequence-lengths | | Comma-separated sequence lengths | |
| --seed | | Random seed for reproducibility | 42 |
| | | Suppress progress output | False |
Standard Dataset Configuration#
By default, the system generates two series of datasets:
Sample Scaling Series#
Datasets with varying numbers of samples but fixed sequence length (1 Mb):
| Dataset Name | Samples | Sequence Length |
|---|---|---|
| samples_50 | 50 | 1,000,000 bp |
| samples_100 | 100 | 1,000,000 bp |
| | 250 | 1,000,000 bp |
| | 500 | 1,000,000 bp |
This series tests how methods scale with the number of individuals.
Sequence Length Scaling Series#
Datasets with fixed sample count (100) but varying sequence lengths:
| Dataset Name | Samples | Sequence Length |
|---|---|---|
| | 100 | 100,000 bp |
| | 100 | 1,000,000 bp |
| | 100 | 10,000,000 bp |
This series tests how methods scale with genome size (and consequently, the number of local trees).
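As a rough illustration, the two default series can be written out as (samples, sequence length) pairs. This is a sketch of the grid described above, not the tool's internal representation; the names `SAMPLE_SERIES`, `LENGTH_SERIES`, and `unique_configurations` are hypothetical.

```python
# Default grid as described above: (num_samples, sequence_length_bp).
SAMPLE_SERIES = [(n, 1_000_000) for n in (50, 100, 250, 500)]
LENGTH_SERIES = [(100, length) for length in (100_000, 1_000_000, 10_000_000)]


def unique_configurations(*series):
    """Merge series, keeping first-seen order and dropping repeated pairs."""
    seen = []
    for configs in series:
        for config in configs:
            if config not in seen:
                seen.append(config)
    return seen


configs = unique_configurations(SAMPLE_SERIES, LENGTH_SERIES)
for num_samples, length in configs:
    print(f"{num_samples:>4} samples, {length:>10,} bp")
```

Note that the 100-sample, 1 Mb configuration appears in both series. Whether the tool generates it once or twice is not stated here, so the de-duplication is only an assumption made for this sketch.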
Custom Dataset Generation#
Different Sample Counts#
argscape_benchmark generate-datasets --samples 25,50,75,100,150,200
Different Sequence Lengths#
argscape_benchmark generate-datasets --sequence-lengths 5e4,1e5,5e5,1e6
Combined Customization#
argscape_benchmark generate-datasets \
    --output my_datasets \
    --samples 50,100,200 \
    --sequence-lengths 1e5,1e6 \
    --seed 12345
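When dataset generation is driven from a script, the same command can be assembled programmatically. The following is a minimal sketch that uses only the flags shown in the examples above; the helper `build_generate_command` is hypothetical and not part of the benchmarking package.

```python
import shlex


def build_generate_command(output=None, samples=None, sequence_lengths=None, seed=None):
    """Assemble an argscape_benchmark generate-datasets command as an argv list."""
    cmd = ["argscape_benchmark", "generate-datasets"]
    if output is not None:
        cmd += ["--output", str(output)]
    if samples:
        # The CLI takes one comma-separated value, e.g. "50,100,200".
        cmd += ["--samples", ",".join(str(n) for n in samples)]
    if sequence_lengths:
        cmd += ["--sequence-lengths", ",".join(str(x) for x in sequence_lengths)]
    if seed is not None:
        cmd += ["--seed", str(seed)]
    return cmd


cmd = build_generate_command(
    output="my_datasets",
    samples=[50, 100, 200],
    sequence_lengths=["1e5", "1e6"],
    seed=12345,
)
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Passing the command as a list avoids shell quoting issues; `shlex.join` is used only to print a copy-pasteable form.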
Simulation Parameters#
Datasets are generated using msprime with the following parameters:
| Parameter | Value | Description |
|---|---|---|
| Recombination rate | 1e-8 | Per-base per-generation |
| Mutation rate | 1e-8 | Per-base per-generation |
| Population size | 10,000 | Effective population size |
| Random seed | 42 (default) | For reproducibility |
These values are typical of those used to approximate human populations, so the simulated tree sequences are comparable in scale to human genetic data.
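The parameters in the table map naturally onto msprime's `sim_ancestry` and `sim_mutations` keyword arguments. The sketch below only collects them in that form; the class `SimulationParams` is hypothetical, and whether the benchmark tool calls msprime exactly this way is an assumption. Note also that `sim_ancestry` interprets `samples` as a number of individuals, diploid by default, so how the dataset's sample count maps onto that argument is likewise assumed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SimulationParams:
    recombination_rate: float = 1e-8  # per base, per generation
    mutation_rate: float = 1e-8       # per base, per generation
    population_size: int = 10_000     # effective population size
    random_seed: int = 42

    def ancestry_kwargs(self, num_samples, sequence_length):
        """Keyword arguments in the form msprime.sim_ancestry accepts."""
        return {
            "samples": num_samples,
            "sequence_length": sequence_length,
            "recombination_rate": self.recombination_rate,
            "population_size": self.population_size,
            "random_seed": self.random_seed,
        }

    def mutation_kwargs(self):
        """Keyword arguments in the form msprime.sim_mutations accepts."""
        return {"rate": self.mutation_rate, "random_seed": self.random_seed}


params = SimulationParams()
ancestry = params.ancestry_kwargs(num_samples=100, sequence_length=1_000_000)
print(ancestry)
# With msprime installed, the simulation itself would be:
#   ts = msprime.sim_ancestry(**ancestry)
#   ts = msprime.sim_mutations(ts, **params.mutation_kwargs())
```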
Spatial Locations#
Sample individuals are assigned spatial locations using a uniform distribution within a square area (100 x 100 coordinate units by default).
The spatial distribution options include:
uniform_square: Uniform random locations in a square area
uniform_globe: Random latitude/longitude coordinates (avoiding extreme latitudes)
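The two distributions can be sketched with Python's standard `random` module. This is an illustration only: the function bodies, the square's origin at (0, 0), and the 60-degree latitude cutoff are assumptions, since the exact bounds the tool uses are not given here.

```python
import random


def uniform_square(n, side=100.0, seed=42):
    """Draw n (x, y) points uniformly from a side x side square at the origin."""
    rng = random.Random(seed)
    return [(rng.uniform(0.0, side), rng.uniform(0.0, side)) for _ in range(n)]


def uniform_globe(n, max_abs_latitude=60.0, seed=42):
    """Draw n (latitude, longitude) pairs, keeping latitude within a band.

    max_abs_latitude is an illustrative cutoff, not the tool's actual value.
    """
    rng = random.Random(seed)
    return [
        (rng.uniform(-max_abs_latitude, max_abs_latitude), rng.uniform(-180.0, 180.0))
        for _ in range(n)
    ]


square = uniform_square(5)
globe = uniform_globe(5)
print(square)
print(globe)
```

Using a dedicated `random.Random(seed)` instance keeps the draws reproducible without touching the global random state.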
Ground Truth#
Each generated dataset includes known ground truth for accuracy testing:
What Constitutes Ground Truth#
For simulated datasets:
Sample locations: The assigned spatial coordinates
Ancestor times: The true coalescence times from the simulation
Tree topology: The true genealogical relationships
How Accuracy is Measured#
During benchmarking, inferred values are compared to ground truth:
Spatial inference: Inferred ancestor locations are compared to the simulation’s internal locations (if available) or the expected locations based on the dispersal model
Temporal inference: Inferred node times are compared to true coalescence times
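The comparison can be pictured with two simple error measures. The specific metrics below (mean Euclidean distance for locations, root-mean-square error for times) are assumptions chosen for illustration; the benchmark may report different ones.

```python
import math


def mean_location_error(inferred, truth):
    """Mean Euclidean distance between paired (x, y) locations."""
    if len(inferred) != len(truth):
        raise ValueError("inferred and truth must have the same length")
    return sum(math.dist(a, b) for a, b in zip(inferred, truth)) / len(truth)


def time_rmse(inferred, truth):
    """Root-mean-square error between paired node times."""
    if len(inferred) != len(truth):
        raise ValueError("inferred and truth must have the same length")
    squared = sum((a - b) ** 2 for a, b in zip(inferred, truth))
    return math.sqrt(squared / len(truth))


# Toy values, not benchmark output.
location_error = mean_location_error([(0.0, 0.0), (0.0, 0.0)], [(0.0, 0.0), (3.0, 4.0)])
time_error = time_rmse([1.0, 2.0, 5.0], [1.0, 2.0, 3.0])
print(location_error, time_error)
```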
Dataset Metadata#
Each .trees file has an accompanying .json metadata file:
{
  "name": "samples_100",
  "num_samples": 100,
  "sequence_length": 1000000.0,
  "num_trees": 1523,
  "num_nodes": 2534,
  "recombination_rate": 1e-08,
  "population_size": 10000,
  "random_seed": 42,
  "spatial_distribution": "uniform_square",
  "has_ground_truth": true
}
This metadata is used during benchmarking to:
Report dataset characteristics in results
Determine if accuracy metrics can be computed
Ensure reproducibility
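A minimal sketch of how a script might read this metadata and decide whether accuracy metrics apply. The helpers `load_metadata` and `can_compute_accuracy` are hypothetical and are not the benchmark runner's own functions; treating missing metadata as "no ground truth" is an assumption.

```python
import json
import tempfile
from pathlib import Path


def load_metadata(trees_path):
    """Return the metadata dict stored next to a .trees file, or None if absent."""
    meta_path = Path(trees_path).with_suffix(".json")
    if not meta_path.exists():
        return None
    with meta_path.open() as handle:
        return json.load(handle)


def can_compute_accuracy(metadata):
    """Accuracy metrics need ground truth; missing metadata counts as none."""
    return bool(metadata and metadata.get("has_ground_truth", False))


# Demonstration in a throwaway directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    trees_path = Path(tmp) / "samples_100.trees"
    trees_path.with_suffix(".json").write_text(
        json.dumps({"name": "samples_100", "has_ground_truth": True})
    )
    metadata = load_metadata(trees_path)
    missing = load_metadata(Path(tmp) / "other.trees")

print(metadata["name"], can_compute_accuracy(metadata))
print(missing, can_compute_accuracy(missing))
```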
Using Your Own Datasets#
You can benchmark against your own datasets by specifying a directory:
argscape_benchmark run --datasets /path/to/your/datasets
Requirements for custom datasets:
Files must have the .trees extension
Files must be valid tskit tree sequences
For accuracy testing, samples should have locations in individual metadata
Adding Metadata for Custom Datasets#
Create a .json file with the same base name as your .trees file:
{
  "name": "my_dataset",
  "num_samples": 150,
  "sequence_length": 5000000.0,
  "num_trees": 3200,
  "num_nodes": 4500,
  "has_ground_truth": false
}
Set has_ground_truth to true only if your dataset includes known true locations/times for accuracy comparison.
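If you have many custom datasets, the metadata file can be written by a small script. The sketch below uses a hypothetical helper, `write_metadata`, and defaults `name` to the file's base name, which is an assumption. With tskit installed, the counts could be read from a loaded tree sequence (`ts.num_samples`, `ts.sequence_length`, `ts.num_trees`, `ts.num_nodes`) rather than typed by hand.

```python
import json
import tempfile
from pathlib import Path


def write_metadata(trees_path, num_samples, sequence_length, num_trees, num_nodes,
                   has_ground_truth=False, name=None):
    """Write a .json metadata file next to trees_path and return its path."""
    trees_path = Path(trees_path)
    metadata = {
        "name": name or trees_path.stem,
        "num_samples": num_samples,
        "sequence_length": float(sequence_length),
        "num_trees": num_trees,
        "num_nodes": num_nodes,
        "has_ground_truth": has_ground_truth,
    }
    meta_path = trees_path.with_suffix(".json")
    meta_path.write_text(json.dumps(metadata, indent=2) + "\n")
    return meta_path


# Demonstration in a throwaway directory; the values match the example above.
with tempfile.TemporaryDirectory() as tmp:
    meta_path = write_metadata(Path(tmp) / "my_dataset.trees", 150, 5_000_000, 3200, 4500)
    written = json.loads(meta_path.read_text())

print(written)
```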
Dataset Organization#
The default directory structure:
datasets/
  simulated/
    samples_50.trees
    samples_50.json
    samples_100.trees
    samples_100.json
    ...
  real/
    (for real-world datasets)
The benchmark runner searches both simulated/ and real/ subdirectories by default.
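The search over both subdirectories can be sketched with `pathlib`. This mirrors the behavior described above but is not the runner's actual code; `find_datasets` is a hypothetical name, and looking only at files directly inside each subdirectory is an assumption.

```python
import tempfile
from pathlib import Path


def find_datasets(root, subdirs=("simulated", "real")):
    """Return .trees paths found directly inside the given subdirectories."""
    root = Path(root)
    found = []
    for sub in subdirs:
        directory = root / sub
        if directory.is_dir():  # a missing subdirectory is simply skipped
            found.extend(sorted(directory.glob("*.trees")))
    return found


# Demonstration in a throwaway directory with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    simulated = Path(tmp) / "simulated"
    simulated.mkdir()
    for filename in ("samples_50.trees", "samples_50.json", "samples_100.trees"):
        (simulated / filename).touch()
    names = [path.name for path in find_datasets(tmp)]

print(names)
```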
Reproducibility#
Ensuring Identical Datasets#
Use the same random seed to generate identical datasets:
argscape_benchmark generate-datasets --seed 42
Different seeds produce different random topologies and locations.
Version Considerations#
Dataset characteristics may vary slightly between msprime versions. Record the msprime version (stored in summary.json) for reproducibility.
Memory Considerations#
Large datasets require more memory for both generation and benchmarking:
| Samples | Sequence Length | Approx. File Size | Memory for Inference |
|---|---|---|---|
| 50 | 1 Mb | ~1 MB | ~100-200 MB |
| 100 | 1 Mb | ~3 MB | ~200-400 MB |
| 250 | 1 Mb | ~10 MB | ~500-800 MB |
| 500 | 1 Mb | ~30 MB | ~1-2 GB |
| 100 | 10 Mb | ~25 MB | ~500 MB - 1 GB |
Plan your benchmark suite according to available system resources.
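One way to apply the table is to filter configurations against a memory budget. The sketch below copies the upper memory estimates from the table and treats 1 GB as 1000 MB; the figures are approximate, and the helper `datasets_within_budget` is hypothetical.

```python
# (samples, sequence length, upper memory estimate in MB), from the table above.
MEMORY_ESTIMATES_MB = [
    (50, "1 Mb", 200),
    (100, "1 Mb", 400),
    (250, "1 Mb", 800),
    (500, "1 Mb", 2000),
    (100, "10 Mb", 1000),
]


def datasets_within_budget(budget_mb, estimates=MEMORY_ESTIMATES_MB):
    """Return (samples, length) pairs whose upper estimate fits the budget."""
    return [(n, length) for n, length, upper in estimates if upper <= budget_mb]


affordable = datasets_within_budget(900)
print(affordable)
```

Comparing against the upper end of each range is a deliberately conservative choice, since actual memory use depends on the inference method being benchmarked.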
Tips for Dataset Selection#
For Method Comparison#
Use multiple sample counts to understand scaling behavior
Include both small datasets (fast iteration) and large datasets (stress testing)
Use consistent random seeds across benchmark runs
For Accuracy Analysis#
Simulated datasets provide ground truth for quantitative accuracy metrics
Real datasets may be used for qualitative evaluation but lack ground truth
Consider the spatial extent and distribution when interpreting spatial errors
For Publication#
Document all simulation parameters
Use consistent seeds for reproducibility
Generate datasets covering the range of interest for your research question