AirfRANS AoA extrapolation: envelope-weakener error gap (vs RANS ground truth)
  group sizes: fired=196  not-fired=161

True-error distributions (|prediction − RANS reference|), grounded measurements:
  coeff group          count      median        mean          IQR [Q1, Q3]
  cl    fired            196      0.0614     0.09996     [0.03223, 0.1286]
  cl    not-fired        161     0.05117     0.09031      [0.0248, 0.1056]
  cd    fired            196      0.0223     0.02711    [0.01074, 0.04178]
  cd    not-fired        161     0.01372     0.01596   [0.005782, 0.02339]
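
The per-group rows above could be reproduced with a sketch like the following. It is not the report's actual code: the function name `summarize_errors` is hypothetical, and the quartile method is an assumption (the report does not say which interpolation it used, so `method="inclusive"` is only one reasonable choice).

```python
from statistics import fmean, median, quantiles


def summarize_errors(predictions, references):
    """Summarize |prediction - reference| for one coefficient and one group.

    Hypothetical helper: returns count, median, mean and the [Q1, Q3] pair.
    """
    errors = [abs(p - r) for p, r in zip(predictions, references)]
    if len(errors) < 2:
        raise ValueError("need at least two cases to compute quartiles")
    # Assumption: inclusive quartiles; the report's method is not stated.
    q1, _, q3 = quantiles(errors, n=4, method="inclusive")
    return {
        "count": len(errors),
        "median": median(errors),
        "mean": fmean(errors),
        "iqr": (q1, q3),
    }
```

Calling it once per (coefficient, group) pair, for example on the Cd predictions of the fired cases, would give one row of the table.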

Headline (ratio of median true errors, fired ÷ not-fired; "flagged" = fired):
  CL: 1.2× (0.0614 ÷ 0.05117; median flagged-case error is 1.2× the median unflagged-case error)
  CD: 1.6× (0.0223 ÷ 0.01372; median flagged-case error is 1.6× the median unflagged-case error)
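
The headline figure is a ratio of medians, not a median of per-case ratios. A minimal sketch, assuming the per-case absolute errors are available as two lists (the name `median_error_ratio` is hypothetical):

```python
from statistics import median


def median_error_ratio(fired_errors, not_fired_errors):
    """Return median(fired errors) / median(not-fired errors)."""
    denominator = median(not_fired_errors)
    if denominator == 0:
        raise ValueError("not-fired median error is zero; ratio undefined")
    return median(fired_errors) / denominator


# Sanity check against the reported medians (single-element lists stand in
# for the full error arrays, which are not part of this report).
cl_ratio = median_error_ratio([0.0614], [0.05117])
cd_ratio = median_error_ratio([0.0223], [0.01372])
print(f"CL: {cl_ratio:.1f}x  CD: {cd_ratio:.1f}x")
```

With the reported medians this rounds to 1.2 for CL and 1.6 for CD, matching the headline.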

Plausibility of flagged predictions (the invisible-danger trap):
  125/196 flagged predictions (64%) are finite and in a believable Cl/Cd range,
  yet their median Cd true error is 0.01532, above the 0.01372 median of the
  not-fired group: plausible-looking, measurably wrong.
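
The plausibility check above can be sketched as follows. The report does not state the "believable" Cl/Cd bounds, so the default ranges here are placeholder assumptions, and the record keys (`cl_pred`, `cd_pred`, `cd_ref`) and the function name are hypothetical.

```python
import math
from statistics import median


def plausible_but_wrong(cases, cl_range=(-2.0, 2.5), cd_range=(0.0, 0.5)):
    """Count flagged cases whose predictions look plausible, and report
    the median Cd true error among them.

    `cases` is a list of dicts with keys cl_pred, cd_pred, cd_ref.
    The default ranges are illustrative placeholders, not the report's.
    """
    def in_range(value, bounds):
        # Non-finite values (NaN, inf) are never considered plausible.
        return math.isfinite(value) and bounds[0] <= value <= bounds[1]

    plausible = [
        c for c in cases
        if in_range(c["cl_pred"], cl_range) and in_range(c["cd_pred"], cd_range)
    ]
    cd_errors = [abs(c["cd_pred"] - c["cd_ref"]) for c in plausible]
    return len(plausible), (median(cd_errors) if cd_errors else None)
```

Passing only the fired cases would yield the count of plausible flagged predictions together with their median Cd error, the two numbers quoted in this section.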

Scope: this measures out-of-envelope inadequacy specifically, that is, a surrogate that is accurate in-envelope but silently wrong outside it. It does not cover surrogates that are inaccurate even inside the envelope; that is a different check.
