1. Methodology: A Rigorous Framework for Assessing Computational Reproducibility

Input file names (“-thN-” indicates the number of threads used):

  • run 1: `PPMI-101018-20210412-1496225-th48-mmwide_sr_run1.csv`

  • run 2: `PPMI-101018-20210412-1496225-th48-mmwide_sr_run2.csv`

  • see here for source and results data.

This document provides a comprehensive, quantitative assessment of the computational reproducibility of the docs/mm_csv_localint.py (docs/mm_csv_localint_sr.py) script for an ANTsPyMM run on PPMI multiple-modality MRI (M3RI). The analysis is based on a direct comparison of the tabular outputs generated by two independent executions of this script on the same computer and the same M3RI collection.

To ensure a robust and meaningful comparison, both runs were conducted within a standardized and controlled computational environment, defined by a single, version-pinned Dockerfile. This approach is critical as it minimizes variability stemming from the underlying operating system, software libraries, and their respective versions.

1.1 The Controlled Environment: Docker

The Dockerfile, detailed below, defines the precise environment for this experiment. Its key features contributing to reproducibility are:

  • Base Image: The environment is built FROM tensorflow/tensorflow:2.17.0. This pins the base operating system, Python version, core TensorFlow libraries, and underlying CUDA/cuDNN versions, providing a stable foundation.
  • Pinned Dependencies: All core antspy* libraries are pinned to specific versions (e.g., antspyx==0.6.1), preventing unexpected changes from library updates. While fundamental packages like numpy and scipy are not explicitly version-pinned in the pip install command, the use of the same built Docker image for both runs ensures that identical versions were used for this specific comparison.
  • Localized Data: All external resources, such as model weights from antsxnet, are downloaded and “baked” into the environment at build time using git clone and get_antsxnet_data.py. This prevents variability that could arise from downloading data at runtime.
FROM tensorflow/tensorflow:2.17.0

ENV HOME=/workspace
WORKDIR $HOME

# Set environment variables for optimal threading
ENV TF_NUM_INTEROP_THREADS=8 \
    TF_NUM_INTRAOP_THREADS=8 \
    ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS=8 \
    OPENBLAS_NUM_THREADS=8 \
    MKL_NUM_THREADS=8

# Install system dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential \
    git \
    curl \
    libglib2.0-0 \
    libsm6 \
    libxext6 \
    libxrender-dev \
    && apt-get clean && rm -rf /var/lib/apt/lists/*

# Upgrade pip and install Python libraries
RUN pip install --upgrade pip \
    && pip install \
    psrecord \
    numpy \
    pandas \
    scipy \
    matplotlib \
    scikit-learn \
    ipython \
    jupyterlab \
    antspyx==0.6.1 \
    antspynet==0.3.1 \
    antspyt1w==1.1.3 \
    antspymm==1.6.4 \
    siq==0.4.1

# for downloading example data from open neuro
RUN pip3 --no-cache-dir install --upgrade awscli
###########
#
RUN git clone https://github.com/stnava/ANTPD_antspymm.git ${HOME}/ANTPD_antspymm
RUN python ${HOME}/ANTPD_antspymm/src/get_antsxnet_data.py ${HOME}/.keras
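Because fundamental packages such as numpy and scipy are not explicitly pinned, it can be useful to audit the versions actually installed inside the built image. The sketch below is illustrative, not part of the pipeline: the pin list mirrors the `pip install` line in the Dockerfile, and the helper name is ours.

```python
from importlib import metadata

# Pins copied from the Dockerfile's pip install; extend with any package to audit.
PINS = {
    "antspyx": "0.6.1",
    "antspynet": "0.3.1",
    "antspyt1w": "1.1.3",
    "antspymm": "1.6.4",
    "siq": "0.4.1",
}

def check_pins(pins):
    """Return {package: installed_version_or_None} for every pin that mismatches."""
    mismatches = {}
    for pkg, wanted in pins.items():
        try:
            installed = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            installed = None  # package absent entirely
        if installed != wanted:
            mismatches[pkg] = installed
    return mismatches
```

Running such a check at the start of each container run records the exact environment alongside the outputs, making later comparisons auditable.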

1.2 Controlled Parallelism and the Goal of This Analysis

A critical feature of the Dockerfile is the explicit setting of threading variables (e.g., ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS=8). This setup deliberately prioritizes performance by enabling multi-threading for computationally intensive libraries like ITK (used by ANTsR/ANTsPy) and TensorFlow.

However, enabling parallelism has a direct and well-understood consequence for reproducibility: the order of floating-point operations in parallelized calculations is not guaranteed to be identical across runs. This can introduce minute, non-deterministic numerical variations. As such, the experiments reported below were run with methodological and parameter choices that seek to maximize reproducibility while maintaining the performance benefits of parallel processing.

The primary objective of this report is not to achieve bit-wise identical outputs, but to verify that the results are statistically stable and reproducible within a predefined numerical tolerance. However, we will see that the results are nearly identical for the vast majority of variables, with only a few exceptions that are expected and understood.
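The floating-point effect described above is easy to demonstrate without any parallel machinery: addition is not associative, so the order in which partial sums are combined can change the final bits of the result. A minimal, library-free illustration:

```python
# Floating-point addition is not associative: grouping changes the result.
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # one reduction order, e.g. thread 1 finishes first
right = a + (b + c)  # another reduction order

print(left == right)      # False
print(abs(left - right))  # a difference on the order of 1e-16
```

In a multi-threaded reduction over millions of voxels, which grouping occurs depends on thread scheduling, so such last-bit differences can accumulate and propagate through downstream nonlinear steps.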

1.3 The Comparison Framework

Our analysis will proceed as follows:

  1. Variable Classification: Each variable (column) in the output is systematically classified into a MeasurementType_Atlas group using a custom, rule-based engine. This allows us to assess whether specific parts of the processing pipeline are more or less stable.
  2. Difference Calculation: For each variable, we calculate the Symmetric Percentage Difference (SPD): 2 * |RunA - RunB| / (|RunA| + |RunB|). This metric is robust for comparing values across different scales and serves as our primary measure of numerical discrepancy.
  3. Categorization of Discrepancies: Each comparison is categorized as:
    • Identical: Bit-wise identical values.
    • Numerically Identical: SPD is less than our defined tolerance (1e-8).
    • Minor to Significant Numerical Difference: SPD exceeds the tolerance, categorized by magnitude.
    • Structural Mismatch: A critical failure where a variable’s data type changes, or a value becomes NA/Inf in one run but not the other.
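The SPD metric and the categorization above can be sketched as follows. The 1e-8 tolerance is from the text; the boundary between "minor" and "significant" shown here is illustrative, as the report does not state its exact cut-offs.

```python
import math

TOL = 1e-8  # tolerance defined in the text

def spd(a: float, b: float) -> float:
    """Symmetric Percentage Difference: 2*|a-b| / (|a|+|b|); defined as 0 when both are 0."""
    denom = abs(a) + abs(b)
    return 0.0 if denom == 0 else 2.0 * abs(a - b) / denom

def categorize(a, b, tol=TOL):
    """Classify one variable comparison between Run A and Run B."""
    # Structural mismatch: NA/Inf in one run but not the other.
    a_bad = a is None or (isinstance(a, float) and not math.isfinite(a))
    b_bad = b is None or (isinstance(b, float) and not math.isfinite(b))
    if a_bad != b_bad:
        return "Structural Mismatch"
    if a_bad and b_bad:
        return "Identical"        # both missing in the same way
    if a == b:
        return "Identical"        # bit-wise identical
    d = spd(a, b)
    if d < tol:
        return "Numerically Identical"
    return "Minor Difference" if d < 1e-3 else "Significant Difference"  # illustrative cut
```

Because SPD normalizes by the mean magnitude of the two values, a 1e-6 absolute discrepancy is treated the same whether the underlying quantity is a volume in mm³ or a correlation near zero.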

This multi-faceted approach allows us to confidently assess the stability of the pipeline, pinpointing both critical errors and areas of minor numerical variability for targeted review.

Glossary of Variable Classifications

The variable classification names used throughout this report (e.g., dti.fa.cortex, t1.thk.mtl) are a composite of a Measurement Type and an Anatomical/Methodological Context. This glossary defines the abbreviations and terms used to construct these classes, providing a clear reference for interpreting the results.

| Term/Abbreviation | Description | Appears In |
|---|---|---|
| vol | Volume: a measure of the size of a structure, typically in mm³. | Measurement Type |
| thk | Thickness: a measure of geometric thickness, typically in mm. | Measurement Type |
| area | Area: a measure of surface area, typically in mm². | Measurement Type |
| dti | Diffusion Tensor Imaging: a general prefix for metrics derived from DTI data. | Measurement Type |
| fa | Fractional Anisotropy: a primary DTI measure of white matter integrity, reflecting the directionality of water diffusion. | Measurement Type |
| md | Mean Diffusivity: a primary DTI measure reflecting the average magnitude of water diffusion. | Measurement Type |
| t1 | T1w Hierarchical: indicates a value derived from the main antspyt1w hierarchical segmentation and labeling process. | Measurement Type |
| t1.vth | Direct Cortical Thickness: a specific T1-based cortical thickness measurement (t1vth). | Measurement Type |
| melanin | Neuromelanin: indicates a measure derived from neuromelanin-sensitive imaging pipelines. | Measurement Type |
| rsf | Resting-State fMRI: a general prefix for metrics derived from resting-state functional MRI data. | Measurement Type |
| falff | Fractional Amplitude of Low-Frequency Fluctuations: an rs-fMRI measure of the relative amplitude of brain activity. | Measurement Type |
| peraf | Percent Absolute Fluctuation: an rs-fMRI measure derived from the original ALFF, defined as a percentage. | Measurement Type |
| p[1,2,3] | Parameter Set [1,2,3]: refers to one of three different rs-fMRI processing parameter sets used. | Measurement Type |
| dfn.corr | Default Mode Network Correlation: a correlation value specifically related to the Default Mode Network. | Measurement Type |
| oth.corr | Other Network Correlation: a correlation value related to functional networks other than the DMN. | Measurement Type |
| cortex | Cortex: indicates that the measurement pertains to regions within the cerebral cortex. | Anatomical/Method |
| cerebell | Cerebellum: indicates that the measurement pertains to regions within the cerebellum. | Anatomical/Method |
| wm | White Matter: indicates measurements within white matter tracts. | Anatomical/Method |
| bst | Brain Stem: indicates measurements within the brain stem. | Anatomical/Method |
| midbrain | Midbrain: indicates measurements specifically within the midbrain. | Anatomical/Method |
| ch13 / nbm | Basal Forebrain: indicates measurements within the basal forebrain, specifically from the CH13 or NBM atlases. | Anatomical/Method |
| mtl | Medial Temporal Lobe: indicates measurements within the medial temporal lobe. | Anatomical/Method |
| snseg | Substantia Nigra Segmentation: indicates a model or measurement focused on the substantia nigra. | Anatomical/Method |
| deep | Deep Brain Regions: indicates measurements within deep gray matter structures. | Anatomical/Method |
| deepcit | Deep Brain Regions (CIT): a specific atlas for deep brain regions, likely derived from CIT168. | Anatomical/Method |
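To illustrate how a rule-based classification engine can map output column names onto composite MeasurementType_Atlas groups, here is a minimal sketch. The rule lists are ours and are illustrative; the actual engine's rules are not reproduced here.

```python
# Ordered (substring, label) rules; the first match wins. Illustrative only.
MEASUREMENT_RULES = [
    ("dti_fa", "dti.fa"), ("dti_md", "dti.md"),
    ("falff", "rsf.falff"), ("peraf", "rsf.peraf"),
    ("thk", "t1.thk"), ("vol", "t1.vol"), ("area", "t1.area"),
]
ATLAS_RULES = [
    ("cerebell", "cerebell"), ("mtl", "mtl"), ("bst", "bst"),
    ("snseg", "snseg"), ("nbm", "nbm"), ("ch13", "ch13"),
    ("cortex", "cortex"), ("deep", "deep"),
]

def classify(column: str) -> str:
    """Return 'measurement.atlas' for a column name, or 'other' if unmatched."""
    name = column.lower()
    meas = next((lab for pat, lab in MEASUREMENT_RULES if pat in name), None)
    atlas = next((lab for pat, lab in ATLAS_RULES if pat in name), None)
    if meas is None:
        return "other"  # unclassified columns are excluded by the report's filtering
    return f"{meas}.{atlas}" if atlas else meas
```

Grouping comparisons by these class labels is what allows the later per-class stability analysis (Sections 3.1 and 3.2).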

Executive Summary

This report presents a comprehensive analysis of the computational reproducibility between two program runs (Run A vs. Run B). Our methodology involves a detailed, variable-by-variable comparison, prioritizing a robust Symmetric Percentage Difference to assess numerical discrepancies. Variables are systematically grouped into classes using a custom, rule-based engine to identify systemic patterns of irreproducibility.

Overall Finding: The reproducibility between the two runs is excellent. All 1934 analyzed variables (100%) were found to be either perfectly identical or within the acceptable numerical tolerance of 1e-8.

Key Issues Identified:

  • Structural Discrepancies: We identified 0 critical structural issue(s) (Type Mismatches or NA/Special-Value differences), which require immediate investigation as they indicate fundamental processing differences.
  • Significant Numerical Differences: A total of 0 variables exhibited significant numerical differences, so no variable class stands out as a source of algorithmic instability in this comparison.
  • Data Filtering: A total of 0 variables (0% of all common variables) were excluded from this analysis based on the script’s filtering rules (unclassified “other” types). These should be reviewed separately to ensure no important discrepancies are being missed.

This report will now provide a detailed breakdown of these findings for the 1934 variables included in the analysis.

1. Reproducibility Health Dashboard

This section provides a high-level overview of the comparison results.

1.1. Overall Status by Category

The summary table and chart below categorize every variable comparison. A healthy process will show the vast majority falling into the “Identical” and “Numerically Identical” categories.

Summary of Reproducibility Status by Category
| Category | Count | Percentage |
|---|---|---|
| Identical | 1929 | 99.7% |
| Small Numeric Difference | 5 | 0.3% |

1.2. Distribution of Numerical Differences

For variables that were not identical, this histogram shows the magnitude of the Symmetric Percentage Difference (SPD). A healthy comparison will show differences clustered at the very low end of the scale. A long tail to the right indicates significant relative errors.


2. Deep Dive: Problematic Variables

This section isolates the most critical discrepancies for debugging. We prioritize structural issues first, followed by the largest numerical differences.

2.1. Structural Mismatches (Highest Priority)

These are non-negotiable failures in reproducibility and must be addressed first. They include changes in data type, or a value becoming NA/Inf between runs.

✅ **Excellent News:** No structural mismatches were found among the analyzed variables.

2.2. Top Numerically Discrepant Variables

The following variables exceeded our tolerance for reproducibility. They are ranked by the Symmetric Percentage Difference to highlight the largest relative errors.

✅ **Excellent News:** No significant numerical differences were found.

Interpretation: When discrepant variables are present, the “lollipop” plot provides a rapid visual assessment of where the largest relative errors lie, and the accompanying table gives the exact values for detailed inspection; the Classification column reveals whether errors cluster within a specific measurement type. In this comparison, no variable exceeded tolerance, so both are empty.

3. Analysis by Variable Classification

By grouping variables according to anatomy/modality, we can determine if specific measurement types or anatomical classes are systematically less reproducible.

3.1. Reproducibility Status per Class

The chart below shows the proportional breakdown of reproducibility outcomes for each classified group. A “good” class is dominated by blue (“Identical”) and green (“Negligible Differences”). A “bad” class shows a significant slice of red (“Significant Difference”).

3.2. Distribution of Differences per Class

This plot directly visualizes the stability of each class. We are looking for classes whose distributions are tight and centered near zero. Classes with wide distributions (long boxes/whiskers) or high medians are less reproducible.

Interpretation: A class with a high median SPD (the line in the middle of the box) indicates a systematic, non-trivial difference between the two runs for that entire group of variables.

4. Conclusion & Actionable Recommendations

  • Overall Health: As stated, the overall reproducibility is excellent. This provides a high-level confidence score in the stability of the pipeline.

  • Critical Faults: No structural issues were identified. Had any been found, they would represent the most severe category of error and the top priority for debugging, as structural mismatches are not numerical precision issues but fundamental logical or data-handling divergences.

  • Systemic Weaknesses: The per-class analysis did not single out any variable class as a primary source of significant numerical discrepancies. The small differences that do appear trace to the white matter hyperintensity and neuromelanin pipelines, whose algorithms use randomization to bootstrap estimates; the results are therefore non-deterministic by definition. It is possible to implement bootstrapped estimates that are deterministically random, but this will require extra development effort and a new release for ANTsPy.

  • Bayesian WMH estimates: The white matter hyperintensity segmentation used here is fundamentally a non-deterministic process. The algorithm is designed to be robust to small changes in the data, but it is not guaranteed to produce the same results every time. This is by design, as the algorithm is intended to be used in part to model uncertainty (note: this could be done for any of these methods/imaging data phenotypes with equivalent justification).
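  The "deterministically random" option mentioned above is conceptually simple: drive the bootstrap from an explicitly seeded random generator so that repeated runs draw exactly the same resamples. A minimal NumPy sketch of the idea (this is not the ANTsPy implementation):

```python
import numpy as np

def bootstrap_mean(data, n_boot=1000, seed=42):
    """Bootstrap the mean with a fixed seed: random in character, but reproducible."""
    rng = np.random.default_rng(seed)  # seeded generator -> identical draws every run
    data = np.asarray(data, dtype=float)
    # n_boot resamples with replacement, as index arrays into data.
    idx = rng.integers(0, len(data), size=(n_boot, len(data)))
    return data[idx].mean(axis=1).mean()  # mean of bootstrap means
```

  Two calls with the same data and seed yield bit-identical results; passing `seed=None` would restore the current non-deterministic behavior while leaving the statistical properties of the estimate unchanged.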

  • Data Filtration: A separate review of these "other" variables may be warranted to ensure no significant issues are being overlooked.

underlying computation platform: MacBook-Pro.local 24.5.0 Darwin Kernel Version 24.5.0: Tue Apr 22 19:54:25 PDT 2025; root:xnu-11417.121.6~2/RELEASE_ARM64_T6020 arm64