I have read the full manuscript. Here is my review.

  ---
  Review of "Optimal Binning for Conditional Distribution Estimation via Leave-One-Out CRPS Minimisation"

  Summary. The paper proposes a three-step pipeline: (1) partition covariate-sorted observations into K contiguous bins by minimising a closed-form LOO-CRPS cost via
  dynamic programming; (2) select K by cross-validated test CRPS; (3) form conformal prediction sets using the within-bin ECDF as both predictive distribution and
  nonconformity score. The idea is original, the technical execution is largely solid, and the real-data experiments are illustrative. However, the paper has several
  inconsistencies, unsupported claims, and structural weaknesses that would need to be addressed before acceptance at a top venue.

  ---
  1. Internal Inconsistencies

  Abstract contradicts body (monotonicity claim). The abstract states the DP "monotonically reduces [within-sample LOO-CRPS]." Section 5.1 then softens this to "tends to
  decrease nearly monotonically" and gives a concrete counterexample (Example 1). The abstract should be updated to match the more nuanced position in the body.

  "Typically" vs. "always" for the interval structure. Section 7.2 states Γ^ε is "typically a single interval centred near the empirical median." Section 7.4 proves it is
   always an interval (the test score α(y_h) is convex in y_h; its sublevel sets are intervals). The hedge in §7.2 is contradicted by the proof two sections later and
  should be removed.

  Section cross-reference error. Section 9.2 (motorcycle) references "Section~\ref{sec:bias-variance}" and "Section~5.3", but the bias–variance discussion is §10.2 and
  the CV criterion is §6.3. At least one of these labels is wrong.

  Contribution 4 vs. practical conclusion. The Introduction lists the Venn prediction band as a key contribution. Section 8 concludes it "carries no practically useful
  information beyond the ECDF itself." If the band is of marginal practical value, its prominence in the contributions list is misleading; it should be reframed as a
  theoretical connection to the Venn-ABERS framework, not a usable output.

  Abstract missing real-data results. The abstract describes only the synthetic n=200 example. The real-data experiments (Old Faithful, motorcycle) are arguably more
  impactful and more relevant to the claimed advantages; the abstract should mention them, even briefly.

  ---
  2. Inaccurate or Overconfident Claims

  "CRPS is uniquely suited to this purpose" (§1, Introduction). Three properties are listed, but the argument for uniqueness is not made. Other proper scoring rules
  (e.g., the energy score generalised to higher dimensions) share some of these properties. "Particularly well suited" is defensible; "uniquely suited" is not.

  Proof of Proposition 1 is effectively absent. The key technical result, cost(S) = mW/(m-1)², is stated, but the derivation is abbreviated to "summing over k ∈ S and
  using Σ_k d_k = D." For a theory paper, this is insufficient. The algebra is non-trivial (it involves showing the cross term telescopes correctly) and should be shown
  explicitly, even as a one-paragraph calculation. A reviewer will check it.
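
  As a sanity check on the identity cost(S) = mW/(m-1)², the sketch below compares it with a brute-force sum of leave-one-out CRPS values. It rests on assumptions not
  confirmed by the manuscript: W is taken to be the sum of |y_i − y_j| over pairs i < j, d_k the sum of |y_k − y_j| over j (so D = 2W), cost(S) the sum (not the mean)
  of the LOO scores, and the CRPS of an ECDF is computed in its kernel form E|X − y| − ½E|X − X'|. Under those assumptions the algebra is
  D/(m−1) − (m−2)D/(2(m−1)²) = mD/(2(m−1)²) = mW/(m−1)². All function names are hypothetical.

```python
import itertools
import random

def crps_ecdf(sample, y):
    """CRPS of the empirical CDF of `sample` at y, in kernel form:
    E|X - y| - 0.5 * E|X - X'| with X, X' drawn i.i.d. from the ECDF."""
    n = len(sample)
    term1 = sum(abs(x - y) for x in sample) / n
    term2 = sum(abs(a - b) for a in sample for b in sample) / (2 * n * n)
    return term1 - term2

def loo_crps_sum(ys):
    """Brute force: sum over k of CRPS(ECDF of ys without y_k, evaluated at y_k)."""
    return sum(crps_ecdf(ys[:k] + ys[k + 1:], ys[k]) for k in range(len(ys)))

def closed_form(ys):
    """m * W / (m - 1)^2, with W ASSUMED to be the sum of |y_i - y_j| over i < j."""
    m = len(ys)
    W = sum(abs(a - b) for a, b in itertools.combinations(ys, 2))
    return m * W / (m - 1) ** 2

random.seed(0)
ys = [random.gauss(0, 1) for _ in range(7)]
brute, closed = loo_crps_sum(ys), closed_form(ys)
print(brute, closed)  # the two values should agree up to rounding
```

  A check of this kind does not replace the written proof, but it would let readers confirm which normalisation of W the paper intends.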

  "Most other proper scoring rules do not admit an analogously compact closed form" (§10.3). This is asserted without proof or reference. The log score (for ECDF-based
  predictions) and the Brier score (for CDFs evaluated at a finite grid) may or may not have compact LOO formulas. The claim needs either a proof or removal.

  Bayesian regularisation claim (§6.2): "empirically, the Bayesian formulation produces more size-2 bins." This is stated without empirical demonstration or theoretical
  proof. The mechanistic argument (reduced pairwise coefficient dominates the prior CRPS floor) is plausible but not verified. Either include a supporting experiment or
  state it as a conjecture.

  Prediction set is always a single interval (§7.4). The convexity argument applies to the test score α(y_h). But the p-value p(y_h) = #{j: α_j(y_h) ≥ α(y_h)}/(m+1)
  involves training scores α_j(y_h), which are also functions of y_h through the augmented set. For small m, these can produce a non-monotone p-value. The paper asserts
  the interval claim is "not incidental" but the proof only covers the test score, not the full p-value. This gap should be addressed explicitly — either prove it for the
   full p-value or clearly state the claim holds only for large m.
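
  One way to probe this gap numerically is to evaluate the full p-value on a grid of candidate y_h values and count the runs where it exceeds ε. The sketch below is
  illustrative only. It assumes that every score, training and test alike, is the CRPS of the leave-one-out ECDF of the augmented bin evaluated at the left-out point,
  and that the count in the p-value includes the candidate itself. Neither detail is confirmed by the manuscript, and the function names are hypothetical.

```python
def crps_ecdf(sample, y):
    """CRPS of the empirical CDF of `sample` at y (kernel form)."""
    n = len(sample)
    return (sum(abs(x - y) for x in sample) / n
            - sum(abs(a - b) for a in sample for b in sample) / (2 * n * n))

def conformal_p_value(train, y_h):
    """Full-conformal p-value for candidate y_h.
    ASSUMPTION: each score is the CRPS of the ECDF of the augmented set
    with that point left out, evaluated at the left-out point."""
    aug = list(train) + [y_h]
    scores = [crps_ecdf(aug[:k] + aug[k + 1:], aug[k]) for k in range(len(aug))]
    alpha_h = scores[-1]
    # The count includes the candidate itself, so p >= 1 / (m + 1).
    return sum(s >= alpha_h for s in scores) / len(aug)

def prediction_set_runs(train, grid, eps):
    """Maximal runs of consecutive grid points with p > eps.
    More than one run means the grid-level set is not a single interval."""
    runs, current = [], []
    for y in grid:
        if conformal_p_value(train, y) > eps:
            current.append(y)
        elif current:
            runs.append((current[0], current[-1]))
            current = []
    if current:
        runs.append((current[0], current[-1]))
    return runs

train = [0.1, 0.4, 0.5, 0.9, 1.3]            # toy bin with m = 5
grid = [i / 50 for i in range(-100, 201)]     # candidates from -2.0 to 4.0
runs = prediction_set_runs(train, grid, eps=0.2)
print(runs)
```

  Running such a scan over small bins (m between 2 and 10, say) and several ε values would show whether non-interval sets actually occur under the paper's own score.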

  "Confirms the super-uniform guarantee" (§8, empirical coverage). A histogram of conformal p-values on a single test set of 2,000 points is not a confirmation of a
  theoretical guarantee; it is consistency with it. The wording should be softened.

  Coverage of 87.9% (motorcycle) presented as "near-nominal." Standard conformal theory guarantees coverage ≥ 1−ε = 90%, not approximately equal to 90%. Coverage of 87.9%
   is technically below the nominal guarantee, which is a direct consequence of the approximate (not exact) exchangeability within bins. This sub-nominal result should be
   explicitly discussed as a deviation from theory due to exchangeability violation, not merely described as "near-nominal."

  ---
  3. Methodological Weaknesses

  Unequal information for competitors. The proposed method uses all n observations for both partitioning and conformal calibration. Competitors (Gaussian conformal, CQR)
  use only the training half for model fitting and the calibration half for conformalization. This gives the proposed method an effective sample size advantage of roughly
   2×. The paper acknowledges this in one sentence but does not analyse whether the coverage/width advantage survives under a matched sample-size comparison. This is a
  significant confound.

  QRF comparison is confounded. QRF is included without conformal calibration. Its undercoverage (76.5% on Faithful, 81.8% on motorcycle) reflects the absence of
  calibration, not the quality of the forest estimates. Either use conformalized QRF (directly comparable) or relegate it to a footnote noting the missing calibration
  step.

  Single-draw synthetic experiment. The synthetic experiment (§8) reports coverage from a single random draw. No Monte Carlo replications, no error bars. Coverage
  estimates on a single n=200 training set and 2,000 test points have non-trivial variance. Table 1 (conformal intervals at three x*-values) likewise comes from a single
  realisation.
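
  To give a sense of scale, the sketch below computes the binomial standard error of an empirical coverage rate on 2,000 test points. The 90% level is an assumption
  made for illustration, and the calculation is conditional on one fixed training set with independent test points, so it captures test-set sampling noise only.
  Variability across training draws, which is what Monte Carlo replication would measure, comes on top of it.

```python
import math

def coverage_standard_error(coverage, n_test):
    """Binomial standard error of an empirical coverage rate,
    conditional on a fixed training set and assuming independent test points."""
    return math.sqrt(coverage * (1.0 - coverage) / n_test)

se = coverage_standard_error(0.90, 2000)   # 0.90 is an assumed nominal level
half_width = 1.96 * se                      # approximate 95% normal half-width
print(f"SE = {se:.4f}, 95% half-width = {half_width:.4f}")
# prints: SE = 0.0067, 95% half-width = 0.0131
```

  Even this lower bound on the uncertainty is about ±1.3 percentage points, which is enough to matter when coverage figures are compared across methods.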

  No multimodal synthetic example. The k-NN nonconformity score is motivated by the failure of CRPS to produce non-contiguous sets for bimodal within-bin distributions
  (§7.4). However, no experiment illustrates this scenario. The Old Faithful experiment works because the partition separates the two modes, leaving each bin unimodal. A
  synthetic bimodal experiment where bins are intentionally set to K=1 (or K is too small) would be the natural demonstration.

  K_max choice undiscussed. K_max = 10 or 20 is used throughout without justification. The U-shaped criterion needs to have its minimum within [1, K_max]; if K_max is too
   small the true minimum is never found. No sensitivity analysis is provided.

  ---
  4. Missing Formal Content

  Proposition 2 (coverage guarantee) has no proof or proof sketch. The statement is given; the "proof" is one sentence ("follows from the standard conformal argument").
  For a venue where conformal prediction is not assumed background, the paper should either provide the standard two-line proof (exchangeability → uniform rank →
  coverage) or give an explicit citation of the theorem being applied from Vovk et al. 2005.
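
  For reference, the standard argument can be sketched as follows. The notation is assumed rather than taken from the manuscript: α_1, …, α_m are the scores of the
  bin's observations, α_{m+1} is the score of the test point, all computed by a rule that is symmetric in its inputs, and Γ^ε is the set of candidates with p > ε.
  If the m+1 observations are exchangeable, so are the scores, and the rank of α_{m+1} among them is uniform (conservatively so under ties).

```latex
\[
  p \;=\; \frac{\#\{\, j \in \{1,\dots,m+1\} : \alpha_j \ge \alpha_{m+1} \,\}}{m+1},
  \qquad
  \Pr(p \le \varepsilon) \;\le\; \frac{\lfloor \varepsilon (m+1) \rfloor}{m+1} \;\le\; \varepsilon ,
\]
\[
  \Pr\bigl( y_{m+1} \in \Gamma^{\varepsilon} \bigr) \;=\; \Pr(p > \varepsilon) \;\ge\; 1 - \varepsilon .
\]
```

  The only substantive hypothesis is exchangeability within the bin, which is exactly where the paper's approximate-validity caveat enters.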

  No analysis of K* consistency. The cross-validated K* is proposed and illustrated, but there is no analysis of whether K* converges to a meaningful oracle K as n → ∞,
  or even whether TestCRPS(K) is guaranteed to have a unique minimum in population. This is a gap relative to, e.g., the consistency results available for cross-validated
   bandwidth selection in kernel density estimation.

  Minimum bin size issue. The DP formula cost(i,j) = (j−i+1)/(j−i)² × W(i,j) is undefined for i=j (m=1, division by 0). The recurrence uses i < j but the base case
  dp[1][j] = c(1,j) includes j=1 (single-observation bin). The paper does not address this degeneracy anywhere in the theory. The code enforces min bin size 2, but this
  is never stated as a constraint in the formal setup.
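
  A minimal sketch of how the constraint could be made explicit in the recurrence is given below. It is not the authors' implementation: W(i,j) is assumed to be the
  sum of |y_a − y_b| over pairs a < b in the bin, single-observation bins are assigned infinite cost so the DP never selects them, indices are 0-based, and the
  function names are hypothetical. The cost is recomputed naively for clarity rather than speed.

```python
import math

def bin_cost(ys, i, j):
    """Closed-form cost m * W / (m - 1)^2 of the bin ys[i..j] (inclusive).
    W is ASSUMED to be the sum of |y_a - y_b| over pairs a < b in the bin.
    A bin with m = 1 has no leave-one-out prediction, so its cost is +inf."""
    m = j - i + 1
    if m < 2:
        return math.inf
    seg = ys[i:j + 1]
    W = sum(abs(seg[a] - seg[b]) for a in range(m) for b in range(a + 1, m))
    return m * W / (m - 1) ** 2

def optimal_partition(ys, K):
    """Split ys (already sorted by the covariate) into K contiguous bins,
    each of size >= 2. Returns (total cost, start indices of bins 2..K)."""
    n = len(ys)
    if K < 1 or n < 2 * K:
        raise ValueError("need at least 2 observations per bin")
    # dp[k][j]: minimal cost of splitting ys[0..j] into k bins
    dp = [[math.inf] * n for _ in range(K + 1)]
    back = [[-1] * n for _ in range(K + 1)]
    for j in range(n):
        dp[1][j] = bin_cost(ys, 0, j)       # +inf when j == 0
    for k in range(2, K + 1):
        for j in range(n):
            for i in range(1, j + 1):       # last bin is ys[i..j]
                c = dp[k - 1][i - 1] + bin_cost(ys, i, j)
                if c < dp[k][j]:
                    dp[k][j], back[k][j] = c, i
    # Walk the back-pointers to recover where each bin starts.
    starts, j = [], n - 1
    for k in range(K, 1, -1):
        i = back[k][j]
        starts.append(i)
        j = i - 1
    return dp[K][n - 1], sorted(starts)

ys = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
total, starts = optimal_partition(ys, K=2)
print(total, starts)
```

  On this toy input the single boundary falls between the two clusters (the second bin starts at index 3). Stating the size constraint in the formal setup, in whatever
  form the authors prefer, would remove the division-by-zero case from the theory as well as from the code.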

  ---
  5. Organisational Issues

  The bias–variance discussion (§10.2) arrives too late. The approximate-exchangeability argument is essential for interpreting the synthetic results in §8 and the
  motorcycle K*=10 result in §9.2. The reader encounters conformal intervals that are approximately valid (not exactly valid) in §8 without conceptual preparation.
  Section 10.2 should be moved to immediately after the conformal prediction section (§7), or at least cross-referenced explicitly from §8.

  The k-NN extension (§7.4) is disproportionately long. The section occupies roughly as much space as the entire conformal prediction section (§7.1–7.3) yet contains no
  formal results, no experiments, and no implementation. It reads as a speculative excursion. It would be appropriate either to (a) formalise it with a proposition and
  experiment, or (b) reduce it to a remark of two paragraphs.

  The Cramér distance discussion (§10.4) adds no actionable content. The equivalence result (eq. cramer-crps) is mathematically interesting but the section concludes that
   the two criteria are "identical for every purpose that matters in our setting." If that is so, §10.4 does not need to be a full subsection; it could be a short remark
  at the end of §6.3.

  Baseline framing in the contributions list. Contribution 2 says the DP "improves on exhaustive search over $\binom{n-1}{K-1}$ boundary placements." The natural baseline for
   practitioners is greedy binary segmentation (O(n log n)), not exhaustive search. The claim is accurate but framed to look better than it is. Greedy search is mentioned
   in §4 and §2 but not included in the empirical comparisons.

  ---
  Minor Issues

  - The TikZ figure (Fig. 1, CRPS definition) has the Heaviside defined as 1[t ≥ y] (note: ≥). Check whether the integral formula in eq.(1) uses the same convention;
  some references define CRPS with 1[t < y], which changes the sign of the term inside the square but not the final value.
  - The bib entry for Devroye (1988) bundles a book reference (Devroye & Györfi 1985) inside the same entry. These should be separate items.
  - The paper does not have an author list or affiliations, which means the reproducibility section (datasets from a URL, code in save_figures.py) is incomplete. For
  submission, the code should be anonymised or properly referenced.

  ---
  Summary Verdict

  The core idea is sound and the technical derivation of Proposition 1 is the main contribution. The paper needs: (1) a written-out proof of Proposition 1; (2) resolution
   of the m=1 degeneracy in the DP; (3) clarification of the always-interval claim in §7.4; (4) acknowledgment that sub-nominal coverage on motorcycle is a
  theory-practice gap; (5) a fairer competitor comparison with matched information; and (6) structural changes to move §10.2 earlier and reduce the k-NN tangent. In its
  current form this would likely receive "major revision" from a top ML/statistics venue.

