AirfRANS Reynolds extrapolation: envelope-weakener error-gap (vs RANS ground truth)
  group sizes: fired=498  not-fired=99

True-error distributions (|prediction − RANS reference|), grounded measurements:
  coeff group        count      median        mean          IQR [Q1, Q3]
  cl    fired          498      0.1461      0.1801     [0.08145, 0.2358]
  cl    not-fired       99     0.07222      0.0933     [0.02959, 0.1186]
  cd    fired          498     0.04711     0.05881    [0.02026, 0.08827]
  cd    not-fired       99      0.0215     0.02526    [0.01077, 0.03615]
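
The per-group columns above could be reproduced from per-case absolute errors roughly as follows. This is a minimal sketch, not the script that produced this report: the helper name summarize and the toy inputs are hypothetical, and the quartile convention (method="inclusive") is an assumption, since the report does not state which one was used.

```python
import statistics

def summarize(pred, ref):
    """Summarize |prediction - reference| for one coefficient and one group.

    Hypothetical helper; the quartile method is an assumption.
    """
    errors = [abs(p - r) for p, r in zip(pred, ref)]
    q1, _, q3 = statistics.quantiles(errors, n=4, method="inclusive")
    return {
        "count": len(errors),
        "median": statistics.median(errors),
        "mean": statistics.fmean(errors),
        "iqr": (q1, q3),  # reported as the bounds [Q1, Q3], not the width Q3 - Q1
    }

# Toy values for illustration only; these are not AirfRANS measurements.
toy_pred = [1.0, 2.0, 3.0, 4.0, 5.0]
toy_ref = [0.9, 2.3, 3.0, 3.6, 5.2]
print(summarize(toy_pred, toy_ref))
```

Calling summarize once per (coefficient, group) pair would give one table row each; a different quartile convention can shift the IQR bounds slightly for small groups.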

Headline (median true-error ratio, fired ÷ not-fired):
  Cl: 2.0× (0.1461 ÷ 0.07222; median error of fired cases is 2.0× that of not-fired cases)
  Cd: 2.2× (0.04711 ÷ 0.0215; median error of fired cases is 2.2× that of not-fired cases)
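
The headline ratios follow directly from the medians in the table. The short check below recomputes them; the dictionary layout and the function name median_ratio are illustrative choices, and the numbers are copied from this report.

```python
# Medians of |prediction - RANS reference| copied from the table in this report.
medians = {
    "cl": {"fired": 0.1461, "not_fired": 0.07222},
    "cd": {"fired": 0.04711, "not_fired": 0.0215},
}

def median_ratio(coeff):
    """Median true-error ratio, fired / not-fired, for one coefficient."""
    group = medians[coeff]
    return group["fired"] / group["not_fired"]

for coeff in ("cl", "cd"):
    print(f"{coeff}: {median_ratio(coeff):.1f}x")
# prints "cl: 2.0x" then "cd: 2.2x"
```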

Plausibility of fired (flagged) predictions (the invisible-danger trap):
  165/498 (33%) fired predictions are finite and in a believable Cl/Cd range,
  yet their median Cd true error is 0.01478: plausible-looking, measurably wrong.
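
A plausibility filter of this kind might look like the sketch below. The bounds CL_RANGE and CD_RANGE are hypothetical placeholders, because the report does not state what range it treats as believable; the function names and toy cases are likewise illustrative.

```python
import math
import statistics

# Hypothetical plausibility bounds; the report does not state the ranges it used.
CL_RANGE = (-2.0, 2.5)
CD_RANGE = (0.0, 0.5)

def looks_plausible(cl, cd):
    """True if both coefficients are finite and inside the assumed ranges."""
    if not (math.isfinite(cl) and math.isfinite(cd)):
        return False
    return CL_RANGE[0] <= cl <= CL_RANGE[1] and CD_RANGE[0] <= cd <= CD_RANGE[1]

def plausible_cd_error(cases):
    """Count plausible-looking cases and return their median Cd true error.

    Each case is a tuple (cl_pred, cd_pred, cd_ref).
    """
    errors = [abs(cd - ref) for cl, cd, ref in cases if looks_plausible(cl, cd)]
    return len(errors), (statistics.median(errors) if errors else None)

# Toy cases for illustration only; not AirfRANS data.
toy = [
    (0.8, 0.02, 0.01),           # plausible
    (1.1, 0.05, 0.02),           # plausible
    (float("nan"), 0.03, 0.01),  # rejected: non-finite Cl
    (0.9, 3.0, 0.02),            # rejected: Cd outside the assumed range
]
print(plausible_cd_error(toy))
```

The filter looks only at the predictions, never at the reference, which is why a case can pass it and still carry a sizeable true error.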

Scope: this measures out-of-envelope inadequacy specifically, that is, a surrogate that is accurate in-envelope but silently wrong outside it. It does not measure surrogates that are inaccurate even in-envelope; that requires a different check.
