Input file names (“-thN-” indicates the number of threads used):

- run 1: `PPMI-101018-20210412-1496225-th48-mmwide_sr_run1.csv`
- run 2: `PPMI-101018-20210412-1496225-th48-mmwide_sr_run2.csv`
see here for source and results data.
The core purpose of this document is to provide a comprehensive and
quantitative assessment of the computational reproducibility of the
docs/mm_csv_localint.py
(docs/mm_csv_localint_sr.py) script for an
ANTsPyMM run on PPMI multiple modality MRI
(M3RI). The analysis is based on a direct comparison of the tabular
outputs generated from two independent executions of this script on the
same computer and the same M3RI collection.
To ensure a robust and meaningful comparison, both runs were
conducted within a standardized and controlled computational
environment, defined by a single, version-pinned
Dockerfile. This approach is critical as it minimizes
variability stemming from the underlying operating system, software
libraries, and their respective versions.
The Dockerfile, detailed below, defines the precise
environment for this experiment. Its key features contributing to
reproducibility are:
- `FROM tensorflow/tensorflow:2.17.0`: this pins the base operating system, Python version, core TensorFlow libraries, and underlying CUDA/cuDNN versions, providing a stable foundation.
- The `antspy*` libraries are pinned to specific versions (e.g., `antspyx==0.6.1`), preventing unexpected changes from library updates. While fundamental packages like `numpy` and `scipy` are not explicitly version-pinned in the `pip install` command, the use of the same built Docker image for both runs ensures that identical versions were used for this specific comparison.
- Pretrained data dependencies, such as `antsxnet`, are downloaded and “baked” into the environment at build time using `git clone` and `get_antsxnet_data.py`. This prevents variability that could arise from downloading data at runtime.

```dockerfile
FROM tensorflow/tensorflow:2.17.0
ENV HOME=/workspace
WORKDIR $HOME

# Set environment variables for optimal threading
ENV TF_NUM_INTEROP_THREADS=8 \
    TF_NUM_INTRAOP_THREADS=8 \
    ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS=8 \
    OPENBLAS_NUM_THREADS=8 \
    MKL_NUM_THREADS=8

# Install system dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential \
    git \
    curl \
    libglib2.0-0 \
    libsm6 \
    libxext6 \
    libxrender-dev \
    && apt-get clean && rm -rf /var/lib/apt/lists/*

# Upgrade pip and install Python libraries
RUN pip install --upgrade pip \
    && pip install \
    psrecord \
    numpy \
    pandas \
    scipy \
    matplotlib \
    scikit-learn \
    ipython \
    jupyterlab \
    antspyx==0.6.1 \
    antspynet==0.3.1 \
    antspyt1w==1.1.3 \
    antspymm==1.6.4 \
    siq==0.4.1

# for downloading example data from open neuro
RUN pip3 --no-cache-dir install --upgrade awscli

###########
# bake ANTsXNet data into the image at build time
RUN git clone https://github.com/stnava/ANTPD_antspymm.git ${HOME}/ANTPD_antspymm
RUN python ${HOME}/ANTPD_antspymm/src/get_antsxnet_data.py ${HOME}/.keras
```
A critical feature of the Dockerfile is the explicit
setting of threading variables (e.g.,
ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS=8). This setup
deliberately prioritizes performance by enabling multi-threading for
computationally intensive libraries like ITK (used by ANTsR/ANTsPy) and
TensorFlow.
However, enabling parallelism has a direct and well-understood consequence for reproducibility: the order of floating-point operations in parallelized calculations is not guaranteed to be identical across runs. This can introduce minute, non-deterministic numerical variations. As such, the experiments reported below were run with methodological and parameter choices that seek to maximize reproducibility while retaining the performance benefits of parallel processing.
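The root cause can be demonstrated in a few lines, independent of any pipeline code: IEEE-754 floating-point addition is not associative, so a parallel reduction that changes the accumulation order can change the low-order bits of a result. A minimal illustration:

```python
# Floating-point addition is not associative: the grouping chosen by a
# thread scheduler can change the low-order bits of a reduction.
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # one possible accumulation order
right = a + (b + c)  # another possible accumulation order

print(left == right)      # the two orders disagree in the last bit
print(abs(left - right))  # difference on the order of 1e-16
```

Differences of this magnitude are exactly why the report compares values against a small numerical tolerance rather than requiring bit-wise equality.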
The primary objective of this report is not to achieve bit-wise identical outputs, but to verify that the results are statistically stable and reproducible within a predefined numerical tolerance. However, we will see that the results are nearly identical for the vast majority of variables, with only a few exceptions that are expected and understood.
Our analysis will proceed as follows:

- **Classification:** Each variable is assigned to a `MeasurementType_Atlas` group using a custom, rule-based engine. This allows us to assess whether specific parts of the processing pipeline are more or less stable.
- **Symmetric Percentage Difference (SPD):** Numerical discrepancies are quantified as `2 * |RunA - RunB| / (|RunA| + |RunB|)`. This metric is robust for comparing values across different scales and serves as our primary measure of numerical discrepancy.
- **Numerical tolerance:** Differences with SPD below a predefined tolerance (`1e-8`) are treated as numerically identical.
- **Structural checks:** We flag structural mismatches, such as a value that is `NA`/`Inf` in one run but not the other.

This multi-faceted approach allows us to confidently assess the stability of the pipeline, pinpointing both critical errors and areas of minor numerical variability for targeted review.
The variable classification names used throughout this report (e.g.,
dti.fa.cortex, t1.thk.mtl) are a composite of
a Measurement Type and an
Anatomical/Methodological Context. This glossary
defines the abbreviations and terms used to construct these classes,
providing a clear reference for interpreting the results.
| Term/Abbreviation | Description | Appears In |
|---|---|---|
| vol | Volume: A measure of the size of a structure, typically in mm³. | Measurement Type |
| thk | Thickness: A measure of geometric thickness, typically in mm. | Measurement Type |
| area | Area: A measure of surface area, typically in mm². | Measurement Type |
| dti | Diffusion Tensor Imaging: A general prefix for metrics derived from DTI data. | Measurement Type |
| fa | Fractional Anisotropy: A primary DTI measure of white matter integrity, reflecting the directionality of water diffusion. | Measurement Type |
| md | Mean Diffusivity: A primary DTI measure reflecting the average magnitude of water diffusion. | Measurement Type |
| t1 | T1w Hierarchical: Indicates a value derived from the main antspyt1w hierarchical segmentation and labeling process. | Measurement Type |
| t1.vth | Direct Cortical Thickness: A specific T1-based cortical thickness measurement (t1vth). | Measurement Type |
| melanin | Neuromelanin: Indicates a measure derived from neuromelanin-sensitive imaging pipelines. | Measurement Type |
| rsf | Resting-State fMRI: A general prefix for metrics derived from resting-state functional MRI data. | Measurement Type |
| falff | Fractional Amplitude of Low-Frequency Fluctuations: An rs-fMRI measure of the relative amplitude of brain activity. | Measurement Type |
| peraf | Percent Absolute Fluctuation: An rs-fMRI measure derived from the original ALFF, defined as a percentage. | Measurement Type |
| p[1,2,3] | Parameter Set [1,2,3]: Refers to one of three different rs-fMRI processing parameter sets used. | Measurement Type |
| dfn.corr | Default Mode Network Correlation: A correlation value specifically related to the Default Mode Network. | Measurement Type |
| oth.corr | Other Network Correlation: A correlation value related to functional networks other than the DMN. | Measurement Type |
| cortex | Cortex: Indicates that the measurement pertains to regions within the cerebral cortex. | Anatomical/Method |
| cerebell | Cerebellum: Indicates that the measurement pertains to regions within the cerebellum. | Anatomical/Method |
| wm | White Matter: Indicates measurements within white matter tracts. | Anatomical/Method |
| bst | Brain Stem: Indicates measurements within the brain stem. | Anatomical/Method |
| midbrain | Midbrain: Indicates measurements specifically within the midbrain. | Anatomical/Method |
| ch13 / nbm | Basal Forebrain: Indicates measurements within the basal forebrain, specifically from the CH13 or NBM atlases. | Anatomical/Method |
| mtl | Medial Temporal Lobe: Indicates measurements within the medial temporal lobe. | Anatomical/Method |
| snseg | Substantia Nigra Segmentation: Indicates a model or measurement focused on the substantia nigra. | Anatomical/Method |
| deep | Deep Brain Regions: Indicates measurements within deep gray matter structures. | Anatomical/Method |
| deepcit | Deep Brain Regions (CIT): A specific atlas for deep brain regions, likely derived from CIT168. | Anatomical/Method |
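A rule-based engine of the kind described above can be sketched as an ordered list of pattern/label rules, where the first match wins. The patterns below are illustrative, assembled from a few glossary terms; the actual rules used by the report's engine may differ:

```python
import re

# Ordered (pattern, label) rules: first match wins.  Patterns are
# illustrative examples built from the glossary, not the report's
# actual rule set.
MEASUREMENT_RULES = [
    (re.compile(r"dti.*fa"), "dti.fa"),
    (re.compile(r"dti.*md"), "dti.md"),
    (re.compile(r"thk"), "t1.thk"),
    (re.compile(r"falff"), "rsf.falff"),
]
ATLAS_RULES = [
    (re.compile(r"cortex"), "cortex"),
    (re.compile(r"mtl"), "mtl"),
    (re.compile(r"cerebell"), "cerebell"),
]

def classify(name: str) -> str:
    """Map a raw variable name to a MeasurementType.Atlas class."""
    name = name.lower()
    mtype = next((lab for rx, lab in MEASUREMENT_RULES if rx.search(name)), "other")
    atlas = next((lab for rx, lab in ATLAS_RULES if rx.search(name)), "other")
    return f"{mtype}.{atlas}"

print(classify("DTI_mean_fa_left_cortex"))  # -> "dti.fa.cortex"
print(classify("t1_thk_mtl"))               # -> "t1.thk.mtl"
```

Variables matching no rule fall into an "other" bucket, which is why some classes in the results are reported as unclassified.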
This report presents a comprehensive analysis of the computational
reproducibility between two program runs (Run A
vs. Run B). Our methodology involves a detailed,
variable-by-variable comparison, prioritizing a robust Symmetric
Percentage Difference to assess numerical discrepancies.
Variables are systematically grouped into classes using a custom,
rule-based engine to identify systemic patterns of
irreproducibility.
Overall Finding: The reproducibility between the two runs is excellent. All 1934 analyzed variables (100%) were found to be either perfectly identical or within the acceptable numerical tolerance of 1e-8.
Key Issues Identified: none. No structural mismatches and no numerical differences exceeding tolerance were found.
This report will now provide a detailed breakdown of these findings for the 1934 variables included in the analysis.
This section provides a high-level overview of the comparison results.
The summary table and chart below categorize every variable comparison. A healthy process will show the vast majority falling into the “Identical” and “Numerically Identical” categories.
| Category | Count | Percentage |
|---|---|---|
| Identical | 1929 | 99.7% |
| Small Numeric Difference | 5 | 0.3% |
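The categorization behind this table can be sketched as a simple decision rule per variable pair. The SPD cutoff separating "small" from "significant" differences is an illustrative assumption here; the report only specifies the 1e-8 identity tolerance:

```python
TOL = 1e-8          # tolerance used in the report
SMALL_CUTOFF = 1e-2  # assumed boundary for "Small Numeric Difference"

def categorize(a: float, b: float) -> str:
    """Assign a single Run A vs. Run B comparison to a summary category."""
    if a == b:
        return "Identical"
    spd = 2.0 * abs(a - b) / (abs(a) + abs(b))
    if spd <= TOL:
        return "Numerically Identical"
    if spd < SMALL_CUTOFF:
        return "Small Numeric Difference"
    return "Significant Difference"

print(categorize(1.0, 1.0))          # -> "Identical"
print(categorize(100.0, 100.5))      # -> "Small Numeric Difference"
```

Counting these labels over all 1934 variables yields a summary table of the form shown above.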
For variables that were not identical, this histogram shows the magnitude of the Symmetric Percentage Difference (SPD). A healthy comparison will show differences clustered at the very low end of the scale. A long tail to the right indicates significant relative errors.
This section isolates the most critical discrepancies for debugging. We prioritize structural issues first, followed by the largest numerical differences.
These are non-negotiable failures in reproducibility and must be
addressed first. They include changes in data type, or a value becoming
NA/Inf between runs.
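A structural check of this kind can be sketched as a comparison of each value's "kind" (finite number, NA, Inf, or other type) between runs; this is an illustrative helper, not the report's actual implementation:

```python
import math

def value_kind(v) -> str:
    """Collapse a value to a structural kind: 'na', 'inf', or its type name."""
    if v is None:
        return "na"
    if isinstance(v, float):
        if math.isnan(v):
            return "na"
        if math.isinf(v):
            return "inf"
    return type(v).__name__

def structural_mismatch(a, b) -> bool:
    """True when the two runs disagree structurally (NA/Inf/type change)."""
    return value_kind(a) != value_kind(b)

print(structural_mismatch(1.0, float("nan")))  # True: NA in one run only
print(structural_mismatch(2.5, 2.6))           # False: numeric in both runs
```

Only pairs passing this check proceed to the numerical SPD comparison; failures are reported here regardless of magnitude.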
## ✅ **Excellent News:** No structural mismatches were found among the analyzed variables.
The following variables exceeded our tolerance for reproducibility. They are ranked by the Symmetric Percentage Difference to highlight the largest relative errors.
## ✅ **Excellent News:** No significant numerical differences were found.
Interpretation: The “lollipop” plot provides a rapid
visual assessment of where the largest relative errors lie. The table
below provides the exact values for detailed inspection. Pay close
attention to the Classification column to see if errors
cluster within a specific measurement type.
By grouping variables according to anatomy/modality, we can determine if specific measurement types or anatomical classes are systematically less reproducible.
The chart below shows the proportional breakdown of reproducibility outcomes for each classified group. A “good” class is dominated by blue (“Identical”) and green (“Negligible Differences”). A “bad” class shows a significant slice of red (“Significant Difference”).
This plot directly visualizes the stability of each class. We are looking for classes whose distributions are tight and centered near zero. Classes with wide distributions (long boxes/whiskers) or high medians are less reproducible.
Interpretation: A class with a high median SPD (the line in the middle of the box) indicates a systematic, non-trivial difference between the two runs for that entire group of variables.
Overall Health: As stated, the overall reproducibility is excellent. This provides a high-level confidence score in the stability of the pipeline.
Critical Faults: No structural issues were identified. Structural issues are the most severe category of error and, had any been found, would have represented the top priority for debugging: they are not numerical-precision issues but fundamental logical or data-handling divergences.
Systemic Weaknesses: The analysis of reproducibility by variable class points to the white matter hyperintensity and neuromelanin classes as the primary source of the small numerical discrepancies that were observed. The algorithms used in both classes take advantage of randomization to bootstrap estimates. As such, the results are by definition non-deterministic. It is possible to implement bootstrapped estimates that are deterministically random, but this would require extra development effort and a new release for ANTsPy.
Bayesian WMH estimates: The white matter hyperintensity segmentation used here is fundamentally a non-deterministic process. The algorithm is designed to be robust to small changes in the data, but it is not guaranteed to produce the same results every time. This is by design, as the algorithm is intended to be used in part to model uncertainty (note: this could be done for any of these methods/imaging data phenotypes with equivalent justification).
Data Filtration: A separate review of the unclassified (“other”) variables may be warranted to ensure no significant issues are being overlooked.
Underlying computation platform: `MacBook-Pro.local 24.5.0 Darwin Kernel Version 24.5.0: Tue Apr 22 19:54:25 PDT 2025; root:xnu-11417.121.6~2/RELEASE_ARM64_T6020 arm64`