The purpose of this notebook is to demonstrate the application of the metrics scripts to be used on the photo-z PDF catalogs produced by the PZ working group. The first implementation of the evaluation module is based on a refactoring of the code used in Schmidt et al. 2020, available in the GitHub repository PZDC1paper.
To run this notebook, you must install qp and have the notebook in the same directory as utils.py
(available in RAIL's examples directory). You must also install some run-of-the-mill Python packages: numpy, scipy, matplotlib, and seaborn.
from rail.evaluation.metrics.pit import *
from rail.evaluation.metrics.cdeloss import *
from utils import read_pz_output, plot_pit_qq, ks_plot
from main import Summary
import qp
import os
import numpy as np
%matplotlib inline
%reload_ext autoreload
%autoreload 2
Found classifier FZBoost
Found classifier randomPZ
Found classifier simpleNN
Found classifier trainZ
To compute the photo-z metrics of a given test sample, it is necessary to read the output of a photo-z code containing the galaxies' photo-z PDFs. Let's use the toy data available in tests/data/
(test_dc2_training_9816.hdf5 and test_dc2_validation_9816.hdf5) and the configuration file available in examples/configs/FZBoost.yaml
to generate a small sample of photo-z PDFs using the FZBoost algorithm available in RAIL's estimation module.
Go to the directory <your_path>/RAIL/examples/estimation/
and run the command:
python main.py configs/FZBoost.yaml
The photo-z output files (inputs for this notebook) will be written to:
<your_path>/RAIL/examples/estimation/results/FZBoost/test_FZBoost.hdf5
Let's use the ancillary function read_pz_output to facilitate the reading of all necessary data.
my_path = '/Users/julia/github/RAIL' # replace it by your path to RAIL's parent dir
pdfs_file = os.path.join(my_path, "examples/estimation/results/FZBoost/test_FZBoost.hdf5")
ztrue_file = os.path.join(my_path, "tests/data/test_dc2_validation_9816.hdf5")
pdfs, zgrid, ztrue, photoz_mode = read_pz_output(pdfs_file, ztrue_file) # all numpy arrays
The inputs for the metrics shown above are the array of true (or spectroscopic) redshifts and an ensemble of photo-z PDFs (a qp.Ensemble object).
fzdata = qp.Ensemble(qp.interp, data=dict(xvals=zgrid, yvals=pdfs))
The Probability Integral Transform (PIT) is the Cumulative Distribution Function (CDF) of the photo-z PDF,

$$ \mathrm{CDF}(f, q)\ =\ \int_{-\infty}^{q}\ f(z)\ dz, $$

evaluated at the galaxy's true redshift for every galaxy $i$ in the catalog:

$$ \mathrm{PIT}(p_{i}(z);\ z_{i})\ =\ \int_{-\infty}^{z^{true}_{i}}\ p_{i}(z)\ dz. $$

pitobj = PIT(fzdata, ztrue)
spl_ens, metamets = pitobj.evaluate()
/Users/julia/github/RAIL/rail/evaluation/metrics/pit.py:172: UserWarning: p-value floored: true value smaller than 0.001
  ad_results = stats.anderson_ksamp([pits_clean, uniform_yvals])
/Users/julia/github/RAIL/rail/evaluation/metrics/pit.py:189: UserWarning: This KLD implementation is based on scipy.stats.entropy, therefore it uses a discrete distribution of PITs (internally obtained from PIT object).
  "therefore it uses a discrete distribution of PITs " +
The evaluate method of the PIT class returns two objects: a frozen distribution object built from the PIT samples, and a dictionary of meta-metrics associated with the PIT (detailed below).
spl_ens
<qp.pdf_gen.rv_frozen_rows at 0x7fd9ab557cd0>
spl_ens.samples
array([0.1140265 , 0.35794342, 0.42871298, ..., 1. , 0.85702247, 0.49560049])
metamets
{(rail.evaluation.metrics.pit.PITOutRate, 'default'): 0.00016060480817661022, (rail.evaluation.metrics.pit.PITAD, 'default'): stat_crit_sig(statistic=88.00142575366617, critical_values=array([0.325, 1.226, 1.961, 2.718, 3.752, 4.592, 6.546]), significance_level=0.001), (rail.evaluation.metrics.pit.PITKLD, 'default'): stat_and_pval(statistic=0.09179415534209365, p_value=None)}
PIT values
pit_vals = np.array(pitobj._pit_samps)
pit_vals
array([0.1140265 , 0.35794342, 0.42871298, ..., 1. , 0.85702247, 0.49560049])
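As a sanity check (not part of rail.evaluation), the PIT of a single galaxy can be recomputed by hand by integrating its gridded PDF up to its true redshift; the result should agree with the corresponding entry of pit_vals to within the grid resolution. A minimal sketch:

# Hand-computed PIT for the first galaxy (a sketch; small differences with respect
# to pit_vals[0] can arise from how qp integrates the interpolated PDF)
import numpy as np
i = 0
below = zgrid <= ztrue[i]                       # grid points below the true redshift
pit_by_hand = np.trapz(pdfs[i][below], zgrid[below])
print(pit_by_hand, pit_vals[i])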
The PIT outlier rate is a global metric defined as the fraction of galaxies in the sample with extreme PIT values. The lower and upper limits for considering a PIT value an outlier are optional parameters set when the metric is instantiated (default values: PIT $<10^{-4}$ or PIT $>0.9999$).
pit_out_rate = PITOutRate(spl_ens).evaluate()
print(f"PIT outlier rate of this sample: {pit_out_rate:.6f}")
PIT outlier rate of this sample: 0.000161
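For reference, roughly the same number can be reproduced directly from the PIT values with numpy, using the default limits quoted above (a minimal sketch, not the rail.evaluation implementation; small differences can arise from how the boundaries are treated):

# Outlier fraction computed by hand from the PIT samples
import numpy as np
pit_min, pit_max = 1.e-4, 0.9999                # default limits quoted above
n_out = np.sum((pit_vals < pit_min) | (pit_vals > pit_max))
print(f"Outlier fraction by hand: {n_out / len(pit_vals):.6f}")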
The histogram of PIT values is a useful tool for a qualitative assessment of the quality of the PDFs. It shows whether the PDFs are:

* biased (tilted histogram),
* under-dispersed (excess of counts near 0 and 1),
* over-dispersed (deficit of counts near 0 and 1),
* or well calibrated (flat histogram).
Following the standards of the DC1 paper, the PIT histogram is accompanied by the quantile-quantile (QQ) plot, which can be used to compare qualitatively the PIT distribution obtained from the PDFs against the ideal case (a uniform distribution). The closer the QQ plot is to the diagonal, the better the calibration of the PDFs.
plot_pit_qq(pdfs, zgrid, ztrue, title="PIT-QQ - toy data", code="FZBoost",
pit_out_rate=pit_out_rate, savefig=False)
/Users/julia/github/RAIL/rail/evaluation/metrics/pit.py:172: UserWarning: p-value floored: true value smaller than 0.001
  ad_results = stats.anderson_ksamp([pits_clean, uniform_yvals])
/Users/julia/github/RAIL/rail/evaluation/metrics/pit.py:189: UserWarning: This KLD implementation is based on scipy.stats.entropy, therefore it uses a discrete distribution of PITs (internally obtained from PIT object).
  "therefore it uses a discrete distribution of PITs " +
The black horizontal line represents the ideal case where the PIT histogram would behave as a uniform distribution U(0,1).
To globally evaluate the quality of the PDF estimates, rail.evaluation
provides a set of metrics that compare the empirical distribution of PIT values with the reference uniform distribution, U(0,1).
Let's start with the traditional Kolmogorov-Smirnov (KS) statistic test, which is the maximum difference between the empirical and the expected cumulative distributions of PIT values:

$$ \mathrm{KS} \equiv \max_{PIT} \Big( \left| \ \mathrm{CDF} \small[ \hat{f}, z \small] - \mathrm{CDF} \small[ \tilde{f}, z \small] \ \right| \Big), $$

where $\hat{f}$ is the PIT distribution and $\tilde{f}$ is U(0,1). The smaller the KS value, the closer the PIT distribution is to uniform. The evaluate method of the PITKS class returns a named tuple with the statistic and p-value.
ksobj = PITKS(spl_ens)
ks_stat_and_pval = ksobj.evaluate()
ks_stat_and_pval
stat_and_pval(statistic=0.03696702749284975, p_value=1.0248437261318777e-24)
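The statistic can be cross-checked with scipy's one-sample KS test, comparing the PIT values directly against U(0,1) (a sketch; not necessarily the exact implementation used by PITKS):

# One-sample KS test of the PIT values against the uniform distribution
from scipy import stats
print(stats.kstest(pit_vals, "uniform"))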
Visual interpretation of the KS statistic:
ks_plot(pitobj)
/Users/julia/github/RAIL/rail/evaluation/metrics/pit.py:172: UserWarning: p-value floored: true value smaller than 0.001
  ad_results = stats.anderson_ksamp([pits_clean, uniform_yvals])
/Users/julia/github/RAIL/rail/evaluation/metrics/pit.py:189: UserWarning: This KLD implementation is based on scipy.stats.entropy, therefore it uses a discrete distribution of PITs (internally obtained from PIT object).
  "therefore it uses a discrete distribution of PITs " +
print(f"KS metric of this sample: {ks_stat_and_pval.statistic:.4f}")
KS metric of this sample: 0.0370
Similarly, let's calculate the Cramer-von Mises (CvM) test, a variant of the KS statistic defined as the mean-square difference between the CDF of an empirical PDF and the true PDF:

$$ \mathrm{CvM}^2 \equiv \int_{-\infty}^{\infty} \Big( \mathrm{CDF} \small[ \hat{f}, z \small] \ - \ \mathrm{CDF} \small[ \tilde{f}, z \small] \Big)^{2} \mathrm{dCDF}(\tilde{f}, z), $$

computed on the distribution of PIT values, which should be uniform if the PDFs are perfect.
cvmobj = PITCvM(spl_ens)
cvm_stat_and_pval = cvmobj.evaluate()
print(f"CvM metric of this sample: {cvm_stat_and_pval.statistic:.4f}")
CvM metric of this sample: 7.8558
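As a cross-check, scipy (version 1.6 or later) provides a one-sample Cramer-von Mises test that can be applied directly to the PIT values (a sketch; not necessarily identical to the PITCvM implementation):

# One-sample CvM test of the PIT values against the uniform distribution
from scipy import stats
print(stats.cramervonmises(pit_vals, "uniform"))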
Another variation of the KS statistic is the Anderson-Darling (AD) test, a weighted mean-squared difference featuring enhanced sensitivity to discrepancies in the tails of the distribution.
$$ \mathrm{AD}^2 \equiv N_{tot} \int_{-\infty}^{\infty} \frac{\big( \mathrm{CDF} \small[ \hat{f}, z \small] \ - \ \mathrm{CDF} \small[ \tilde{f}, z \small] \big)^{2}}{\mathrm{CDF} \small[ \tilde{f}, z \small] \big( 1 \ - \ \mathrm{CDF} \small[ \tilde{f}, z \small] \big)}\mathrm{dCDF}(\tilde{f}, z) $$

adobj = PITAD(spl_ens)
ad_stat_crit_sig = adobj.evaluate()
ad_stat_crit_sig
/Users/julia/github/RAIL/rail/evaluation/metrics/pit.py:172: UserWarning: p-value floored: true value smaller than 0.001
  ad_results = stats.anderson_ksamp([pits_clean, uniform_yvals])
stat_crit_sig(statistic=88.00142575366617, critical_values=array([0.325, 1.226, 1.961, 2.718, 3.752, 4.592, 6.546]), significance_level=0.001)
print(f"AD metric of this sample: {ad_stat_crit_sig.statistic:.4f}")
AD metric of this sample: 88.0014
It is possible to remove catastrophic outliers before calculating the integral, for the sake of numerical stability. For instance, Schmidt et al. computed the Anderson-Darling statistic within the interval (0.01, 0.99).
ad_stat_crit_sig_cut = adobj.evaluate(pit_min=0.01, pit_max=0.99)
print(f"AD metric of this sample: {ad_stat_crit_sig.statistic:.4f}")
print(f"AD metric for 0.01 < PIT < 0.99: {ad_stat_crit_sig_cut.statistic:.4f}")
1883 PITs removed from the sample.
AD metric of this sample: 88.0014
AD metric for 0.01 < PIT < 0.99: 76.5858
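The warning printed above indicates that the AD statistic is obtained from scipy's k-sample Anderson-Darling test against a uniform reference sample. A rough cross-check along those lines (a sketch, using a hypothetical evenly spaced reference, so the number will not match exactly):

# Two-sample Anderson-Darling test: PIT values vs. an evenly spaced uniform reference
import numpy as np
from scipy import stats
uniform_ref = np.linspace(0., 1., len(pit_vals))   # hypothetical uniform reference sample
print(stats.anderson_ksamp([pit_vals, uniform_ref]).statistic)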
Another way to quantify the difference between two distributions is the Kullback–Leibler divergence (also known as relative entropy), defined as:
$$ \mathrm{D}_{KL}(P||Q) = \int_{-\infty}^{\infty} P(x) \log{ \left( \frac{P(x)}{Q(x)} \right) dx }, $$

in this case

$$ \mathrm{D}_{KL} = \int_{-\infty}^{\infty} \mathrm{CDF} \small[ \hat{f}, z \small] \ \log{ \left( \frac{\mathrm{CDF} \small[ \hat{f}, z \small]}{\mathrm{CDF}\small[\tilde{f}, z\small]}\right) dx }. $$

kldobj = PITKLD(spl_ens)
kld_stat_and_pval = kldobj.evaluate()
kld_stat_and_pval
/Users/julia/github/RAIL/rail/evaluation/metrics/pit.py:189: UserWarning: This KLD implementation is based on scipy.stats.entropy, therefore it uses a discrete distribution of PITs (internally obtained from PIT object).
  "therefore it uses a discrete distribution of PITs " +
stat_and_pval(statistic=0.09179415534209365, p_value=None)
print(f"CvM metric of this sample: {kld_stat_and_pval.statistic:.4f}")
CvM metric of this sample: 0.0918
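Mirroring the warning printed earlier, the KLD can be approximated by histogramming the PIT values and comparing the discrete distribution against a uniform reference with scipy.stats.entropy (a sketch; the bin count here is an arbitrary choice, so the value will only be close to the one above):

# Discrete KL divergence of the PIT histogram with respect to a uniform reference
import numpy as np
from scipy import stats
pit_hist, _ = np.histogram(pit_vals, bins=100, range=(0., 1.), density=True)
uniform_hist = np.ones_like(pit_hist)               # U(0,1) has constant density
print(stats.entropy(pit_hist, uniform_hist))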
In the absence of true photo-z posteriors, the metric used to evaluate individual PDFs is the Conditional Density Estimate (CDE) Loss, a metric analogous to the root-mean-squared error:
$$ L(f, \hat{f}) \equiv \int \int {\big(f(z | x) - \hat{f}(z | x) \big)}^{2} dz\, dP(x), $$

where $f(z | x)$ is the true photo-z PDF and $\hat{f}(z | x)$ is the estimated PDF, both in terms of the photometry $x$. Since $f(z | x)$ is unknown, we estimate the CDE loss as described in Izbicki & Lee, 2017 (arXiv:1704.08095):

$$ \mathrm{CDE} = \mathbb{E}\big( \int{{\hat{f}(z | X)}^2 dz} \big) - 2{\mathbb{E}}_{X, Z}\big(\hat{f}(Z, X) \big) + K_{f}, $$

where the first term is the expectation value of the photo-z posterior with respect to the marginal distribution of the covariates $X$, the second term is the expectation value with respect to the joint distribution of the observables $X$ and the space $Z$ of all possible redshifts (in practice, the centroids of the PDF bins), and the third term is a constant that depends on the true conditional densities $f(z | x)$.
cdelossobj = CDELoss(fzdata, zgrid, ztrue)
cde_stat_and_pval = cdelossobj.evaluate()
cde_stat_and_pval
stat_and_pval(statistic=6.022594953531476, p_value=None)
print(f"CDE loss of this sample: {cde_stat_and_pval.statistic:.2f}")
CDE loss of this sample: 6.02
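For intuition, the first two terms of the estimator above can be assembled by hand from the gridded PDFs (a sketch; the constant $K_f$ is unknown and omitted, so only relative comparisons are meaningful, and the value should be close to the one reported above):

# Empirical CDE loss estimator built directly from the gridded PDFs
import numpy as np
term1 = np.mean(np.trapz(pdfs ** 2, x=zgrid, axis=1))        # E[ int f_hat(z|X)^2 dz ]
term2 = np.mean([np.interp(zt, zgrid, pdf)                   # E_{X,Z}[ f_hat(Z|X) ]
                 for zt, pdf in zip(ztrue, pdfs)])
print(f"CDE loss by hand: {term1 - 2. * term2:.2f}")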
summary = Summary(pdfs, zgrid, ztrue)
summary.markdown_metrics_table(pitobj=pitobj)  # pitobj as optional input to speed up the metrics evaluation
Metric | Value |
---|---|
PIT out rate | 0.0002 |
KS | 0.0370 |
CvM | 7.8558 |
AD | 88.0014 |
CDE loss | 6.02 |
KLD | 0.0918 |
summary.markdown_metrics_table(pitobj=pitobj, show_dc1="FlexZBoost")
Metric | Value | DC1 reference value |
---|---|---|
PIT out rate | 0.0002 | 0.0202 |
KS | 0.0370 | 0.0240 |
CvM | 7.8558 | 68.8300 |
AD | 88.0014 | 478.8000 |
KLD | 0.0918 | N/A |
CDE loss | 6.02 | -10.60 |