The following is a review of CRPS_Conformal_Regression.tex


  Review: "CRPS-Optimal Binning for Conformal Regression"
                                                                                                                             
  Summary                                                                                                                    
                                                                                                                             
  The paper proposes a method for non-parametric conditional distribution estimation based on partitioning covariate-sorted  
  observations into contiguous bins, with bin boundaries chosen to minimise total leave-one-out CRPS. The optimal K-partition
   is found exactly via dynamic programming in O(n²K). Since within-sample LOO-CRPS fails for model selection (monotone
  decrease in K), the authors propose an alternating-split cross-validation criterion. Prediction is done via Venn prediction
   bands and CRPS-based conformal prediction sets. The method is demonstrated on a synthetic heteroscedastic example and two
  real datasets (Old Faithful, motorcycle).

  ---
  Originality and Significance

  Strengths.
  1. The closed-form LOO-CRPS cost (Proposition 1) is a clean and useful result. The reduction to a single scalar W with the
  prefactor m/(m-1)² is elegant and enables the DP machinery.
  2. The diagnosis of why within-sample LOO-CRPS fails for K-selection is insightful — the interaction between the m/(m-1)²
  prefactor and the DP's ability to exploit near-duplicate y-pairs is well explained, and the counterexample (Example 1) is
  effective.
  3. The connection to the Venn predictor framework, framing the within-bin ECDF as a regression analogue of Venn-ABERS, is a
   worthwhile conceptual contribution.
  4. The k-NN non-convex extension for multimodal bins (Section 6.4) is a thoughtful addition that addresses a genuine
  limitation of CRPS-based conformal sets.

  Weaknesses.
  1. Limited novelty of individual components. The DP recurrence is standard (acknowledged by the authors via Auger &
  Lawrence, Killick et al.). The conformal prediction step is a direct application of full-data conformal with a standard
  nonconformity score. The cross-validation for K is straightforward. The novelty lies in their composition and the specific
  LOO-CRPS closed form, but each piece is individually well-trodden.
  2. One-dimensional covariate only. This is a severe limitation for practical use. The paper acknowledges it but does not
  seriously engage with extensions. In the era of CQR, CQR-QRF, and distributional random forests, a univariate-only method
  occupies a narrow niche.
  3. No theoretical results beyond what is standard. Proposition 2 (coverage guarantee) is the standard conformal coverage
  result. Proposition 3 (optimal substructure) is a textbook observation. There is no consistency result, no rate of
  convergence, no finite-sample efficiency bound. The four "open problems" in the conclusion are all substantive — the paper
  would be significantly stronger if it resolved even one of them.

  ---
  Technical Accuracy

  Proposition 1 (LOO-CRPS cost). The algebra is correct. I verified the derivation: summing the per-observation LOO-CRPS over
   the bin, the cancellation leading to mD/(2(m-1)²) = mW/(m-1)² checks out.
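
  The identity can be checked numerically. The sketch below is a minimal check, not the authors' code. It assumes that W
  is the sum of |y_i - y_j| over unordered pairs in the bin (so D = 2W is the sum over ordered pairs), which is inferred
  from the stated equality rather than quoted from the paper, and it uses the standard kernel form
  CRPS(F, y) = E|X - y| - (1/2) E|X - X'| for an empirical CDF. The function names are illustrative.

```python
import random

def crps_ecdf(atoms, y):
    """CRPS of the empirical CDF on `atoms`, evaluated at the observation y."""
    k = len(atoms)
    term1 = sum(abs(a - y) for a in atoms) / k
    term2 = sum(abs(a - b) for a in atoms for b in atoms) / (2.0 * k * k)
    return term1 - term2

def loo_crps_brute(ys):
    """Sum over i of the CRPS of the ECDF built without y_i, scored at y_i."""
    return sum(crps_ecdf(ys[:i] + ys[i + 1:], ys[i]) for i in range(len(ys)))

def loo_crps_closed(ys):
    """Closed form m * W / (m - 1)^2, with W the sum over unordered pairs (assumed)."""
    m = len(ys)
    W = sum(abs(ys[i] - ys[j]) for i in range(m) for j in range(i + 1, m))
    return m * W / (m - 1) ** 2

random.seed(0)
y_bin = [random.gauss(0.0, 1.0) for _ in range(12)]  # one synthetic bin, m = 12
brute = loo_crps_brute(y_bin)
closed = loo_crps_closed(y_bin)
print(abs(brute - closed) < 1e-9)
```

  The brute-force sum costs O(m³) per bin, whereas the closed form needs only the scalar W, which is what makes the DP
  practical.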

  Proposition 2 (conformal coverage). The proof is standard and correct, but there is a subtle issue with its applicability
  that the paper underplays. The proposition requires exchangeability of (y₁,...,yₘ,y*) within the bin. Since the bin
  boundaries are determined by the data (including the training y-values), the conditioning on bin membership breaks
  exchangeability even if the original data are i.i.d. The paper discusses "approximate exchangeability" in Section 6.5, but
  the formal statement of Proposition 2 does not carry a caveat, which is misleading. This is the most significant technical
  concern. The coverage guarantee is stated unconditionally, but it holds only if the bin assignment is independent of the
  y-values, which is not the case here because the DP uses y-values to determine boundaries. The paper should either:
  - State the guarantee conditional on a fixed partition (e.g., one determined on a training set), or
  - Acknowledge that the full-data conformal guarantee requires a transductive argument where the test point's y*
  participates in the DP, which is computationally infeasible.

  The reference to Barber et al. (2021) on jackknife+ is relevant but does not resolve this: jackknife+ addresses the
  fitting-on-training issue for point predictors, not for partition-based distributional methods where the partition itself
  depends on y.

  Proposition 3 (optimal substructure). Correct and standard.
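
  For reference, the recurrence in question can be sketched as follows. This is a generic segmentation DP under stated
  assumptions, not the authors' implementation: the bin cost is taken to be m·W/(m-1)² with W the sum of pairwise
  absolute differences, bins are required to have at least two points (the cost is undefined for m = 1), and the cost
  table is filled naively for brevity. Only the three nested loops of the recurrence correspond to the O(n²K) claim.
  The names bin_cost and optimal_partition are hypothetical.

```python
import math

def bin_cost(ys):
    """Assumed cost of one bin: m * W / (m - 1)^2, W = sum_{i<j} |y_i - y_j|."""
    m = len(ys)
    W = sum(abs(ys[i] - ys[j]) for i in range(m) for j in range(i + 1, m))
    return m * W / (m - 1) ** 2

def optimal_partition(ys, K, min_size=2):
    """Best split of ys (already sorted by x) into K contiguous bins.

    Returns (total cost, list of half-open index ranges)."""
    n = len(ys)
    # cost[i][j] is the cost of the bin ys[i:j]; naive fill, fine for a demo.
    cost = [[math.inf] * (n + 1) for _ in range(n + 1)]
    for i in range(n):
        for j in range(i + min_size, n + 1):
            cost[i][j] = bin_cost(ys[i:j])
    # best[k][j] is the minimal cost of splitting ys[:j] into k bins.
    best = [[math.inf] * (n + 1) for _ in range(K + 1)]
    back = [[-1] * (n + 1) for _ in range(K + 1)]
    best[0][0] = 0.0
    for k in range(1, K + 1):
        for j in range(1, n + 1):
            for i in range(j):
                c = best[k - 1][i] + cost[i][j]
                if c < best[k][j]:
                    best[k][j] = c
                    back[k][j] = i
    # Walk the back-pointers from the last bin to the first.
    bins, j = [], n
    for k in range(K, 0, -1):
        i = back[k][j]
        bins.append((i, j))
        j = i
    return best[K][n], bins[::-1]

ys = [0.0, 0.1, 0.2, 0.1, 5.0, 5.1, 4.9, 5.2]  # two obvious regimes
total, bins = optimal_partition(ys, K=2)
print(bins)
```

  On this toy input the single boundary falls between the two regimes, as one would hope.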

  Precomputation complexity. The claim of O(n² log n) precomputation via a Fenwick tree is correct. However, the abstract
  claims "O(n²) via pairwise absolute sums" — this is the storage complexity, not the computational complexity. The abstract
  should say O(n² log n) precomputation, or clarify.
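
  To make the distinction between time and storage concrete, here is one way such a precomputation could look. This is
  a sketch under my own assumptions, not the paper's code: for each start index the bin is extended one observation at
  a time, and two Fenwick trees over the value ranks (one for counts, one for sums) give the increment in O(log n).
  That yields O(n² log n) time for an O(n²) table. The class and function names are illustrative.

```python
class Fenwick:
    """Binary indexed tree supporting point updates and prefix sums (1-based)."""
    def __init__(self, size):
        self.tree = [0.0] * (size + 1)

    def add(self, idx, value):
        while idx < len(self.tree):
            self.tree[idx] += value
            idx += idx & -idx

    def prefix(self, idx):
        total = 0.0
        while idx > 0:
            total += self.tree[idx]
            idx -= idx & -idx
        return total

def all_pairwise_sums(ys):
    """W[i][j] = sum of |y_a - y_b| over i <= a < b <= j, for every i <= j."""
    n = len(ys)
    order = sorted(set(ys))
    rank = {v: r + 1 for r, v in enumerate(order)}
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        cnt, tot = Fenwick(len(order)), Fenwick(len(order))
        n_in, s_in = 0, 0.0  # number and sum of points currently in the bin
        for j in range(i, n):
            r = rank[ys[j]]
            c_le, s_le = cnt.prefix(r), tot.prefix(r)
            # Points <= y_j contribute y_j - y_k; points > y_j contribute y_k - y_j.
            inc = (ys[j] * c_le - s_le) + ((s_in - s_le) - ys[j] * (n_in - c_le))
            W[i][j] = (W[i][j - 1] if j > i else 0.0) + inc
            cnt.add(r, 1.0)
            tot.add(r, ys[j])
            n_in += 1
            s_in += ys[j]
    return W

ys = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
W = all_pairwise_sums(ys)
print(W[0][len(ys) - 1])
```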

  Section 6.4 (k-NN score). The claim that the 1-NN conformal p-value is super-uniform under exchangeability is correct.
  However, the assertion that Γ^{ε,(1)} is "approximately" the union of intervals around training points, with the
  approximation becoming exact as m→∞, deserves more care. The training scores α_j^(1)(y_h) depend on y_h (when y_h is near
  y_j, it changes y_j's nearest-neighbour distance), so the claim that p^(1)(y_h) behaves like a threshold on α^(1)(y_h)
  alone is only asymptotically valid. This is stated but could be made more precise.
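
  The dependence of the training scores on the candidate y_h is easy to see in a toy computation. The sketch below
  assumes the 1-NN nonconformity score is the distance from a point to its nearest other point in the augmented bin and
  that the p-value is the fraction of scores at least as large as the candidate's. The paper's exact definitions may
  differ, and the function names are illustrative.

```python
def nn_scores(points):
    """1-NN score: distance from each point to its nearest other point."""
    return [min(abs(p - q) for j, q in enumerate(points) if j != i)
            for i, p in enumerate(points)]

def nn_pvalue(train, y_h):
    """Conformal p-value of y_h, with all scores recomputed on the augmented bin."""
    scores = nn_scores(train + [y_h])
    alpha_h = scores[-1]
    return sum(1 for a in scores if a >= alpha_h) / len(scores)

train = [0.0, 1.0, 2.0, 3.0, 100.0, 101.0, 102.0, 103.0]  # bimodal toy bin
base = nn_scores(train)               # training scores before adding a candidate
aug = nn_scores(train + [1.5])[:-1]   # training scores after adding y_h = 1.5
p_gap = nn_pvalue(train, 50.0)        # candidate between the two modes
p_mode = nn_pvalue(train, 1.5)        # candidate inside the lower mode
print(base[1], aug[1], p_gap, p_mode)
```

  Adding y_h = 1.5 halves the scores of its two neighbours, so p^(1)(y_h) is not a function of α^(1)(y_h) alone; the
  effect is local and vanishes for candidates far from every training point.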

  ---
  Experimental Evaluation

  Synthetic example. Adequate as an illustration but not as evidence. A single DGP, a single n, and a single random seed do
  not establish robustness. The coverage is reported as 89.8% at ε=0.10 on 2000 test points — consistent with the guarantee —
   but a proper evaluation would report results across many replications.

  Real-data experiments. These are the paper's strongest empirical contribution, but several concerns arise:

  1. Fairness of comparison. The paper explicitly acknowledges (lines 1055–1062) that its method uses all n observations
  while split-conformal competitors use only n/2 for fitting. This is a substantial advantage (roughly doubling the effective
   sample size). The comparison is therefore not apples-to-apples. The suggestion to "apply our method to the training half
  only" is offered but not carried out. This experiment should be run and reported.
  2. Coverage on motorcycle. The method achieves 87.9% coverage vs. nominal 90%. With n_cal = 67, the standard error is
  √(0.88·0.12/67) ≈ 4.0 pp, so 87.9% is within one standard error of 90%. Nevertheless, the competitor methods all achieve
  92.4%, which hints at under-coverage by the proposed method on this dataset, although a single split of this size cannot
  establish it. The paper attributes the shortfall to approximate exchangeability, which is fair, but it weakens the claim
  of "maintaining near-nominal coverage."
  3. Only two real datasets, both very small. n=272 and n=133 are tiny. The method's O(n²K) complexity is not tested at
  scale. Modern conformal regression benchmarks (e.g., the suite in Romano et al. 2019, or the benchmarks in Sesia & Romano
  2021) include datasets with n in the thousands. The absence of these benchmarks is a significant gap.
  4. No reporting of variability. All numbers are point estimates from a single split. Standard practice would be to report
  means and standard deviations over multiple random splits or bootstrap replications.
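
  The standard-error arithmetic in item 2 can be reproduced with a binomial approximation. The helper name is
  illustrative, and the inputs (0.88, 67, 0.879, 0.90) are the figures quoted above.

```python
import math

def coverage_se(p, n):
    """Binomial standard error of an empirical coverage rate p over n points."""
    return math.sqrt(p * (1.0 - p) / n)

se = coverage_se(0.88, 67)   # standard error as a proportion
gap = 0.90 - 0.879           # shortfall relative to nominal coverage
print(round(100 * se, 1), round(gap / se, 2))
```

  The shortfall is roughly half a standard error, which is why replications are needed before drawing conclusions in
  either direction.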

  ---
  Exposition and Structure

  Strengths.
  - The paper is clearly written, well-organized, and largely self-contained.
  - The CRPS background (Section 3, including the geometric interpretation and Figure 1) is excellent — pedagogically
  valuable and well-executed.
  - The discussion of why LOO-CRPS fails for K-selection (Section 5.1) is one of the best parts of the paper: the argument is
   crisp, the counterexample is illustrative, and the connection to in-sample optimism is well-drawn.
  - The Bayesian regularisation subsection (5.2) is a nice touch — it saves the reader from trying an obvious "fix" that
  doesn't work, and the analysis is convincing.

  Weaknesses.
  - Abstract is too long and too detailed. It reads like a mini-paper. Specific numbers (89.8%, 20–43%, 2.3×) belong in the
  body. The parenthetical about the m/(m-1)² prefactor producing a U-shape is confusing at first read. Trim to ~150 words.
  - The Cramér distance subsection (Section 9.3) is an interesting observation but feels like an appendix. It proves that
  average CRPS and Cramér distance rank partitions identically — a nice fact, but it has no consequence for the method or its
   analysis. Consider moving to an appendix or shortening.
  - Section 6.4 (k-NN extension) is substantial but never evaluated empirically — not even on the synthetic bimodal example
  with actual coverage numbers. The bimodal illustration (Figure 6) shows scores, not coverage. If the k-NN extension is a
  contribution, it needs evaluation; if it is a remark, it is too long.
  - Notation. The notation y_i is overloaded: it denotes both the response paired with x_{(i)} (the i-th order statistic of
  x) in Section 2, and a generic response within a bin in Section 3 onward. This is workable but could confuse readers in the
   conformal section where both bin-level and global indices appear.

  ---
  Bibliographic References

  References are generally accurate and appropriate. Specific checks:

  - Gneiting & Raftery (2007): Correctly cited for strict properness of CRPS. ✓
  - Vovk, Gammerman & Shafer (2005): Correctly cited for conformal prediction and Venn predictors. ✓
  - Romano, Patterson & Candès (2019): Correctly cited for CQR. ✓
  - Killick, Fearnhead & Eckley (2012): Correctly cited for PELT. ✓
  - Barber et al. (2021): This paper is about jackknife+ and is cited to justify using all data for fitting and calibration.
  However, the jackknife+ result is specifically about exchangeability-based coverage with leave-one-out residuals in a
  non-partitioned setting. The connection to the current paper's full-data conformal approach is loose — the citation would
  benefit from a more precise statement of what is being borrowed.
  - Devroye (1988): Cited but never referenced in the text. This appears to be a dangling entry.
  - Székely & Rizzo (2013) and Baringhaus & Franz (2004): Both correctly cited for the Cramér/energy distance. ✓
  - Missing references: The paper does not cite Sesia & Romano (2021, "Conformal prediction using conditional histograms"),
  which builds conformal prediction intervals from histogram estimates of the conditional distribution of the response, a
  closely related method that should be discussed. Similarly, Lei et al. (2018, "Distribution-free predictive inference for
  regression") is a natural reference for distribution-free conformal regression that is absent.

  ---
  Minor Issues

  1. Line 87: "This scalar is updated in O(log n) per step" — should specify "per extension step" (i.e., extending a bin by
  one observation).
  2. Line 102: "always connected intervals by convexity of the score" — the word "always" is too strong given the caveat in
  Section 6.4 that it is only approximate for finite m. Rephrase.
  3. Line 400: The precomputation is stated as O(n² log n) in the body but O(n²) in the abstract. Reconcile.
  4. Line 544: "Proposition 2 holds for any m ≥ 2" — Proposition 2 requires exchangeability, which is not guaranteed for any
  m. This should read "the conformal construction is well-defined for any m ≥ 2."
  5. Line 652: p(y*) = R/(m+1) assumes a specific tie-breaking rule. The standard conformal literature (e.g., Vovk et al.
  2005) uses randomised tie-breaking (smoothed p-values) to obtain exact uniformity; without it the p-value is only
  super-uniform, i.e. conservative. The paper should clarify its tie-breaking convention.
  6. Table 1 and Table 2: Coverage and width should include standard errors or confidence intervals.
  7. Figures: The paper references 7+ figures. I cannot assess their quality without the PDFs, but the TikZ figure (Figure 1)
   is well-constructed from the source code.
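
  On item 5, the two tie-breaking conventions differ only in how scores equal to the test score are counted. The sketch
  below contrasts them. It assumes R counts the training scores at least as large as the test score plus the test
  point itself, which is my reading of R/(m+1) and not something the paper states. The function name is hypothetical.

```python
import random

def conformal_pvalue(train_scores, test_score, u=None):
    """Conservative p-value if u is None; smoothed (randomised) p-value otherwise.

    u should be a draw from Uniform(0, 1), independent of the data."""
    greater = sum(1 for a in train_scores if a > test_score)
    ties = sum(1 for a in train_scores if a == test_score) + 1  # +1 for the test point
    m = len(train_scores)
    if u is None:
        return (greater + ties) / (m + 1)
    return (greater + u * ties) / (m + 1)

scores = [0.2, 0.5, 0.5, 0.9]
p_cons = conformal_pvalue(scores, 0.5)                      # ties counted in full
p_rand = conformal_pvalue(scores, 0.5, u=random.random())   # ties counted fractionally
print(p_cons, p_rand)
```

  The randomised value never exceeds the conservative one, so prediction sets built from it are no larger.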

  ---
  Verdict

  The paper presents a clean, self-contained method with an elegant core result (the LOO-CRPS closed form). The exposition is
   above average, the diagnosis of K-selection failure is insightful, and the connections to Venn prediction and conformal
  inference are well-drawn. However, the contribution is incremental: the method is restricted to one dimension, the
  theoretical results do not go beyond standard conformal coverage, the experimental evaluation is thin (two small datasets,
  no variability reported, unfair sample-size comparison), and the most interesting extension (k-NN for multimodal bins) is
  not evaluated. The exchangeability issue in the full-data conformal setting is underplayed.

  Recommendation: Major revision. The paper is below the threshold for a top-tier venue in its current form but has a solid
  foundation. Addressing the exchangeability caveat, running the fair-comparison experiment (method on training half only),
  adding larger-scale benchmarks with replications, and resolving at least one of the stated open problems would
  substantially strengthen the contribution.


After the PDF was made available, the reviewer added the following:
Assessment of Figures

  The paper contains 12 figures and 3 tables. The overall quality of the visual presentation is high — figures are
  well-designed, clearly labelled, and genuinely informative. Several figures are among the best aspects of the paper.
  Specific assessments:

  Figure 1 (p. 4): CRPS geometric interpretation (TikZ). Excellent pedagogical figure. The blue/orange shading clearly
  distinguishes the two types of forecast error (too little mass above vs. below t). The annotation of the vertical gap is
  well-placed. The choice of m=4 atoms with one observation coinciding with y keeps the example simple without being trivial.
   This figure alone justifies the space devoted to the CRPS background. Minor: the y-axis has tick labels but no axis label
  — adding "CDF" or similar would help readers unfamiliar with CRPS.

  Figure 2 (p. 9): K-selection curves. This is a crucial figure and it delivers. The left panel (within-sample LOO-CRPS)
  shows the near-monotone decrease convincingly — the initial steep drop followed by a slow decline is exactly what the text
  predicts. The right panel (cross-validated test CRPS) shows a clean U-shape with K*=5 clearly marked by a dashed vertical
  line. The contrast between the two panels is the paper's strongest visual argument. Minor: the y-axis scales are very
  different (left: 1300–2000; right: 1.5–1.9), which is expected since one is a total and the other an average, but a note in
   the caption clarifying this would prevent misreadings. The left panel title says "monotone decreasing" in parentheses —
  this is slightly editorialising for a figure title; "Within-sample LOO-CRPS" alone would suffice.

  Figure 3 (p. 9): Venn prediction bands. The five-panel layout showing the Venn band, training ECDF, and true conditional
  CDF for each bin is well-conceived. The dashed red true CDF overlays are essential for judging the method. However, the
  band width 1/(m+1) is, as the text acknowledges, invisible at these bin sizes (m > 100). This means the "Venn band
  (shaded)" in the legend is essentially imperceptible in the rendered figure. The figure therefore illustrates a negative
  result (the Venn band is uninformative for large bins) rather than the method's strength. This is intellectually honest but
   the figure could be more useful if one panel showed a zoomed inset where the band is actually visible, or if a small-bin
  example were included alongside. The five subplots are quite small — axis labels and legends are legible but require close
  inspection.

  Figure 4 (p. 11): Conformal p-value curves. Well-executed three-panel figure showing p(y_h) at three test points. The
  convex, piecewise-linear shape is clearly visible. The shaded prediction sets and the vertical red lines (true 90%
  interval) provide an immediate visual comparison. The widening of the prediction set from left to right (x*=0.3 to 2.7) is
  a compelling demonstration of the method's heteroscedasticity adaptation. Very good.

  Figure 5 (p. 12): Conformal prediction intervals (fan plot). This is the paper's "money figure" for the synthetic example
  and it works well. The shaded blue prediction band clearly widens with x, the dashed red true 90% interval tracks closely,
  and the training scatter provides context. The dotted bin boundaries are visible but unobtrusive. The step-wise
  (piecewise-constant) nature of the conformal intervals — constant within each bin — is visible and honestly displayed; it
  illustrates both the method's strength (adaptation across bins) and its limitation (no adaptation within bins). Good.

  Figure 6 (p. 15): Bimodal bin comparison (CRPS vs. 1-NN scores). Excellent three-panel figure. The left panel (KDE + rug of
   training data) shows the bimodal structure clearly. The centre panel shows the convex CRPS score with the single wide
  prediction set spanning both modes — the wasteful inter-modal coverage is immediately apparent. The right panel shows the
  1-NN score with two local minima and two disjoint prediction sets. The contrast is striking and makes the theoretical
  argument concrete. The rug marks at the bottom of the centre and right panels are a nice touch that links the scores to the
   data. The threshold lines and shading are clearly rendered. One of the best figures in the paper.

  Figure 7 (p. 16): Training scatter with 5-bin partition. Clean and effective. The alternating pastel shading (red/green)
  distinguishes bins well. Dotted bin boundaries are clearly marked. The scatter reveals the heteroscedastic DGP. The title
  "K*=5 selected by cross-validation" usefully reminds the reader of the method step. Adequate — functional rather than
  exceptional.

  Figure 8 (p. 17): P-value histogram on test set. The histogram of conformal p-values against the Uniform(0,1) reference
  line is a standard calibration diagnostic. The slight conservatism (density below 1 at the left end, slightly above 1 at
  the right) is visible but not dramatic, consistent with the text's claim. The histogram has an appropriate number of bins
  (~10). Adequate. One concern: the histogram shows a noticeable spike near p=0.5 and depression near p=0.0, which is
  expected from the discreteness of the p-value grid (the p-values take values on {1/(m+1), 2/(m+1), ...}), but this is not
  mentioned in the caption.

  Figure 9 (p. 19): Old Faithful partition and within-bin ECDFs. Good two-panel figure. The left panel scatter with the 2-bin
   partition shows the regime transition clearly — the bimodal structure and the single boundary at 67.5 minutes are
  immediately visible. The right panel overlays the two within-bin ECDFs, which are well-separated and unimodal. The rug plot
   at the bottom of the right panel is a nice detail. The colour coding (green/orange matching the shading in the left panel)
   provides visual continuity. Good.

  Figure 10 (p. 20): Old Faithful prediction intervals comparison. This is the key comparative figure for the Old Faithful
  experiment and it is very effective. Four methods are displayed with distinct line styles (solid, dashed, dot-dashed,
  dotted) and colours (blue, orange, red, purple). The proposed method's piecewise-constant intervals are visibly tighter
  within each regime, while the competitors' intervals extend broadly across both regimes. The data scatter is included. The
  bin boundary at 67.5 minutes is where the proposed method's interval changes — this step change is visible and
  interpretable. The figure clearly supports the claims in the text. One issue: in the transition region around 67–70
  minutes, the proposed method's interval appears very wide (spanning both short and long eruptions within a single bin),
  which is the expected honest behaviour but isn't discussed in the caption.

  Figure 11 (p. 21): Motorcycle K-selection and partition. Good two-panel figure. The left panel shows the test CRPS curve
  for the motorcycle data. The U-shape is less clean than in the synthetic example (the curve is nearly flat from K=7 to
  K=10, with the minimum at K*=10), which is expected given the small sample size (n=133). The right panel shows the 10-bin
  partition with boundaries concentrated in the high-variance 15–30 ms region. The colour alternation of bins is useful. The
  figure honestly shows that K-selection on small datasets is noisier.

  Figure 12 (p. 22): Motorcycle prediction intervals comparison. The most informative comparative figure in the paper.
  Gaussian split conformal's constant-width interval (dashed orange) is dramatically wide — visually confirming the 180.7 g
  mean width. CQR (dot-dashed red) tracks the data better but still overshoots in the quiet regions. CQR-QRF (dotted purple)
  is more adaptive. The proposed method (solid blue shading) tightly tracks the non-stationary variance. The figure
  convincingly demonstrates the method's advantage on this benchmark. However, the proposed method's intervals appear to
  slightly miss some data points in the high-variance region (20–30 ms), consistent with the 87.9% coverage (below nominal).
  This is visible in the figure but not flagged in the caption.

  Tables 1–3. Clean and appropriately formatted. Table 1 (synthetic intervals) is compact. Tables 2 and 3 (real-data
  comparisons) clearly show coverage and width. Missing: standard errors or confidence intervals on coverage, as noted in the
   original review.

  ---
  Summary of Figure Assessment

  ┌─────────────┬───────┬───────────────────────────────────────────────────────────────────────────────────────┐
  │   Quality   │ Count │                                        Figures                                        │
  ├─────────────┼───────┼───────────────────────────────────────────────────────────────────────────────────────┤
  │ Excellent   │ 3     │ Fig. 1 (CRPS geometry), Fig. 6 (bimodal comparison), Fig. 10 (Old Faithful intervals) │
  ├─────────────┼───────┼───────────────────────────────────────────────────────────────────────────────────────┤
  │ Good        │ 5     │ Figs. 2, 4, 5, 9, 12                                                                  │
  ├─────────────┼───────┼───────────────────────────────────────────────────────────────────────────────────────┤
  │ Adequate    │ 3     │ Figs. 3, 7, 8                                                                         │
  ├─────────────┼───────┼───────────────────────────────────────────────────────────────────────────────────────┤
  │ Problematic │ 1     │ Fig. 11 (not problematic per se, but the noisy U-shape weakens the narrative)         │
  └─────────────┴───────┴───────────────────────────────────────────────────────────────────────────────────────┘

  Overall figure quality: above average for a methods paper. The figures are clearly the work of someone who has thought
  carefully about what each plot should communicate. The colour palette is consistent (blue for the proposed method,
  red/orange for comparators/truth), line styles are distinguishable, and captions are informative. The main weakness is that
   several captions could better flag visible features that are honest limitations (the sub-nominal coverage in Fig. 12, the
  p-value discreteness in Fig. 8, the wide transition-region intervals in Fig. 10).

  ---
  Updated Verdict

  The figure quality strengthens my assessment of the paper's exposition. The visual presentation is a genuine strength —
  pedagogically effective, honest about limitations, and well-integrated with the text. My substantive criticisms from the
  initial review remain unchanged:

  1. The exchangeability issue (Proposition 2 stated without caveat despite data-dependent bin boundaries) remains the most
  significant technical concern.
  2. The unfair sample-size comparison in the real-data experiments (full-data conformal vs. split-conformal competitors)
  still needs to be addressed by running the method on the training half.
  3. The thin experimental evaluation (two small datasets, no replications, no standard errors) is below the standard for a
  top-tier venue.
  4. The k-NN extension (beautifully illustrated in Fig. 6) is never evaluated with actual coverage numbers.
  5. The dangling reference [4] (Devroye 1988) is still unused in the text.

  Recommendation: Major revision. The paper has strong exposition and a solid methodological core. Addressing the technical
  caveat on exchangeability, running the fair-comparison experiment, adding replications with standard errors, and evaluating
   the k-NN extension empirically would bring it to the level of a top-tier venue.

