RAIL Evaluation - Checking results against DC1 paper

The purpose of this notebook is to validate the reimplementation of the DC1 metrics, previously available in the GitHub repository PZDC1paper and now refactored into the RAIL Evaluation module. The metrics were reimplemented in object-oriented Python 3, following a superclass/subclass structure and inheriting features from qp.

DC1 results

The DC1 results are stored in the class DC1 (defined in the ancillary file utils.py), which exists only to provide the reference values.

To access individual metric values, one can index the dictionary dc1.results using the code and metric names as keys.

The lists of available codes and metrics can be accessed through the properties dc1.codes and dc1.metrics, as sketched below.
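A minimal usage sketch, assuming the DC1 class behaves as described above; the key strings and their nesting order are assumptions for illustration only:

```python
from utils import DC1  # ancillary file described above

dc1 = DC1()
print(dc1.codes)    # photo-z codes with reference results
print(dc1.metrics)  # metric names available

# Look up one reference value; the key names and nesting order
# (code first, then metric) are assumptions.
print(dc1.results["FlexZBoost"]["KS"])
```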


The data

In this notebook we use the same input dataset as the DC1 PZ paper (Schmidt et al. 2020), copied from NERSC's Cori (/global/cfs/cdirs/lsst/groups/PZ/PhotoZDC1/photoz_results/TESTDC1FLEXZ).

Metrics

The metrics are calculated from the PIT values, which are computed via the qp.Ensemble CDF method. The PIT array can be passed as an optional input to speed up the metric calculations; if no PIT array is provided, it is computed on the fly.
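A minimal sketch of the on-the-fly PIT computation described above, assuming `ensemble` is a qp.Ensemble of photo-z PDFs and `ztrue` is the array of true redshifts (both names are assumptions):

```python
import numpy as np

# Each object's PIT is its CDF evaluated at the true redshift.
# qp.Ensemble.cdf evaluates every PDF over the whole input grid, so the
# diagonal picks out object i evaluated at its own true redshift.
pit_values = np.diagonal(ensemble.cdf(ztrue))
print(pit_values.min(), pit_values.max())
```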

PIT-QQ plot
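A sketch of how such a plot can be produced, assuming the `pit_values` array from above and matplotlib for plotting:

```python
import numpy as np
import matplotlib.pyplot as plt

# Quantile-quantile comparison of the PIT distribution against U(0, 1):
# for a perfectly calibrated set of PDFs the curve follows the diagonal.
quants = np.linspace(0.001, 0.999, 100)
qq_data = np.quantile(pit_values, quants)

plt.plot(quants, qq_data, label="PIT")
plt.plot(quants, quants, "k--", label="ideal (uniform)")
plt.xlabel("Q_theory")
plt.ylabel("Q_data")
plt.legend()
plt.show()
```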


"Debugging"

Following Sam's suggestion, I also computed the metrics by reading the PIT values from the partial results of the DC1 paper, instead of calculating them from scratch.

Reading DC1 PIT values (PITs computed in the past for the paper):
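A hypothetical sketch of this step; the file name and format below are assumptions, not the actual paths used for the paper:

```python
import numpy as np

# Load the PIT values saved for the DC1 paper (hypothetical file name).
dc1_pits = np.loadtxt("dc1_flexzboost_pit.txt")
print(len(dc1_pits), dc1_pits.min(), dc1_pits.max())
```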

The values are slightly different.

Recalculating the metrics:
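As a sketch, equivalent PIT-based statistics can be recomputed directly from the loaded PIT values with scipy; this is not the RAIL Evaluation class interface, just the same underlying comparison against a uniform distribution:

```python
from scipy import stats

# Kolmogorov-Smirnov and Cramer-von Mises statistics of the paper's PITs
# against a uniform distribution on [0, 1].
ks_stat, ks_pvalue = stats.kstest(dc1_pits, "uniform")
cvm_result = stats.cramervonmises(dc1_pits, "uniform")
print(ks_stat, cvm_result.statistic)
```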

Using the original PIT values from the paper, all metrics match reasonably, except for the Anderson-Darling statistic.

Anderson-Darling

$$ \mathrm{AD}^2 \equiv N_{\mathrm{tot}} \int_{-\infty}^{\infty} \frac{\big( \mathrm{CDF}\big[\hat{f}, z\big] - \mathrm{CDF}\big[\tilde{f}, z\big] \big)^{2}}{\mathrm{CDF}\big[\tilde{f}, z\big]\,\big( 1 - \mathrm{CDF}\big[\tilde{f}, z\big] \big)}\, \mathrm{dCDF}(\tilde{f}, z) $$

The class AD uses the scipy.stats.anderson_ksamp method to compute the Anderson-Darling statistic for the PIT values, comparing them with samples from a uniform distribution between 0 and 1. As of the current scipy version (1.6.2), scipy.stats.anderson (the 1-sample test) does not support a uniform distribution as the reference, which is why the k-sample test is used instead.

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.anderson_ksamp.html
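A sketch of the comparison described above, using `pit_values` from earlier and a freshly drawn uniform reference sample (the sample size is an arbitrary choice here):

```python
import numpy as np
from scipy.stats import anderson_ksamp

rng = np.random.default_rng(42)
uniform_reference = rng.uniform(0.0, 1.0, size=100_000)

# k-sample Anderson-Darling test between the PIT values and samples
# drawn from U(0, 1).
result = anderson_ksamp([pit_values, uniform_reference])
print(result.statistic, result.significance_level)
```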

By default, the AD statistic is computed over the whole interval $0.0 \leq \mathrm{PIT} \leq 1.0$.

Five objects have PIT values outside this interval, which is unexpected.

It is possible to remove these extreme PIT values, as done in the paper (see the sketch below).
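A sketch of that trimming, assuming the interval quoted in the paper (0.01, 0.99):

```python
# Keep only PIT values inside the interval used in the paper before
# recomputing the AD statistic.
pit_min, pit_max = 0.01, 0.99
mask = (pit_values >= pit_min) & (pit_values <= pit_max)
trimmed_pits = pit_values[mask]
print(f"removed {len(pit_values) - len(trimmed_pits)} extreme PIT values")
```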


Point estimate metrics

These metrics are deprecated and might not be used in future analyses. They are included in this notebook only to reproduce the paper's results in full.


Conclusion

The metrics calculated with the new implementation are reasonably close to the expected values. Minor differences arise from differences in the computation of the PIT values: in both cases, here and in the paper, the PITs were calculated using qp functions, and the small discrepancies are attributed to changes in qp since the version used when the paper was produced.

When using the original PIT values, i.e., those calculated for the paper with the qp version available at the time, all metrics were reproduced except for the AD test. This particular metric is quite sensitive to the range of PIT values included in the calculation. Even using the same PIT interval as in the paper (0.01, 0.99), the result obtained with the new implementation diverges from the paper's value by 19.3%.