First of all, the top tier venue was a stretch goal. Targeting this for COPA is perfectly acceptable.

One concern is that I may have jumped the gun: I have already put the manuscript on arXiv. I wonder whether it looks bad to have multiple versions of the manuscript within a relatively short time interval. On the other hand, I also hear that it's good to put the document out there and solicit feedback.

Let's now talk about the issues one by one.

## Tier 1

M1: I am OK with your take on it. I cannot see at present how we could extend it to multi-d, but I cannot come up with an explicit obstacle.

M3: The misleading cross-validation statement is what actually struck me most. It should really be fixed by running a proper K-fold cross-validation. If there's a reason not to implement K-fold cross-validation now, then we can do what you suggest and just fix the terminology.

M6: OK

m1 (Proposition 3): OK

m6: OK

## Tier 2

M2: Yes, I think we overstated the merits. I think this affects not only the main text but also the abstract and painfully reduces the value of the whole approach. Personally, I never quite liked evaluating methods on small datasets, but if this is what the community is used to, then we have to stick with it.

M3: Perhaps we should bite the bullet and do a K-fold CV with 50/50 random splits (and update the demo app accordingly).
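To make the two options concrete, here is a minimal sketch of the index generation for each. It assumes nothing about our package: the function names `k_fold_indices` and `repeated_half_splits` are hypothetical and only NumPy is used. Note that strict K-fold with K > 2 holds out 1/K of the data per fold, so "50/50 random splits" corresponds either to K = 2 or to repeated random half-splits; the sketch shows both so we can decide which one we mean.

```python
import numpy as np


def k_fold_indices(n, k, seed=0):
    """Yield (train_idx, test_idx) pairs for K-fold CV on n samples (k >= 2).

    Hypothetical helper: every sample lands in exactly one test fold.
    """
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    folds = np.array_split(perm, k)  # fold sizes differ by at most one
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx


def repeated_half_splits(n, repeats, seed=0):
    """Yield (train_idx, test_idx) pairs from repeated random 50/50 splits.

    Hypothetical helper: each repeat reshuffles, so test sets may overlap
    across repeats (unlike K-fold).
    """
    rng = np.random.default_rng(seed)
    half = n // 2
    for _ in range(repeats):
        perm = rng.permutation(n)
        yield perm[:half], perm[half:]
```

With either generator, the evaluation loop would fit on `train_idx`, score on `test_idx`, and report the mean and spread of the metric across splits.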

M5: I still have to learn more about IDR, but yes, we cannot ignore it. As to conditional coverage, it's a valid point. My take, though, is that it makes sense to evaluate it only on large enough datasets. In bins with few data points, the empirical coverage will vary widely.
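As a rough illustration of the small-bin concern, the sketch below computes the standard error of the empirical coverage in a bin. It assumes the coverage indicators within a bin behave like independent Bernoulli draws with success probability equal to the nominal level, which is a simplification and not something we have verified for our method; the function name `coverage_std_error` is hypothetical.

```python
import math


def coverage_std_error(nominal_coverage, n_points):
    """Binomial standard error of empirical coverage in a bin.

    Assumes independent Bernoulli coverage indicators (a simplification).
    """
    return math.sqrt(nominal_coverage * (1.0 - nominal_coverage) / n_points)


# Standard error at 90% nominal coverage for different bin sizes.
for n in (10, 100, 1000):
    print(n, round(coverage_std_error(0.9, n), 3))
```

At 90% nominal coverage this gives a standard error of roughly 0.095 for a bin of 10 points, 0.03 for 100, and 0.009 for 1000, so with a handful of points per bin a deviation of several percentage points would be indistinguishable from noise.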

## Tier 3

M4 (Exchangeability gap): A theoretical issue that requires deeper thought on my side. I would postpone it for this first round of fixes.

M5: I would love to use larger datasets. I cannot convince myself that taking a multi-dimensional dataset and applying the method to just the significant covariate makes sense. I have struggled to find large datasets that are meaningful for this. And I also wonder whether other papers always use large datasets.

m5 (quadrangle inequality): The quickest outcome would be to find a counter-example. I would have to spend some time working out whether one can construct an example where it doesn't apply. But I may enlist the help of a colleague with a mathematical bent.

## Tier 4

OK for all items

## Proposed action plan

Let's work on items 1 to 6 for now. I am leaning towards implementing K-fold CV, as it is the established standard. Its absence is the most glaring weakness for me. Was there a reason (that I cannot see now) that we did not go for it from the start?
Let's not forget that we will have to update the Python package and demo code accordingly.
Should we create a fork/branch at this point?

