June 28 update: I modified the summarizers to output not just the N sampled N(z) distributions (saved to the file specified via the output keyword), but also the single fiducial N(z) estimate (saved to the file specified via the single_NZ keyword). I also updated NZDir and included it in this example notebook.

In [1]:
import os
import rail
import numpy as np
import pandas as pd
import tables_io
import matplotlib.pyplot as plt
%matplotlib inline
In [2]:
from rail.estimation.algos.knnpz import Inform_KNearNeighPDF, KNearNeighPDF
In [3]:
from rail.estimation.algos.varInference import VarInferenceStack
from rail.estimation.algos.naiveStack import NaiveStack
from rail.estimation.algos.pointEstimateHist import PointEstimateHist
from rail.estimation.algos.NZDir import Inform_NZDir, NZDir
from rail.core.data import TableHandle, QPHandle
from rail.core.stage import RailStage
In [4]:
import qp
In [5]:
DS = RailStage.data_store
DS.__class__.allow_overwrite = True

To create some N(z) distributions, we'll first need some per-galaxy PDFs to work with. For a quick demo, let's run some photo-z's with the KNearNeighPDF estimator, training on the ~10,000 training galaxies and generating ~20,000 PDFs for the test galaxies, using data from healpix 9816 of cosmoDC2_v1.1.4 that is included in the RAIL repo:

In [6]:
knn_dict = dict(zmin=0.0, zmax=3.0, nzbins=301, trainfrac=0.75,
                sigma_grid_min=0.01, sigma_grid_max=0.07, ngrid_sigma=10,
                nneigh_min=3, nneigh_max=7, hdf5_groupname='photometry')
In [7]:
pz_train = Inform_KNearNeighPDF.make_stage(name='inform_KNN', model='demo_knn.pkl', **knn_dict)
In [8]:
# Load up the example healpix 9816 data and stick in the DataStore
from rail.core.utils import RAILDIR
trainFile = os.path.join(RAILDIR, 'rail/examples/testdata/test_dc2_training_9816.hdf5')
testFile = os.path.join(RAILDIR, 'rail/examples/testdata/test_dc2_validation_9816.hdf5')
training_data = DS.read_file("training_data", TableHandle, trainFile)
test_data = DS.read_file("test_data", TableHandle, testFile)
In [9]:
# train knnpz
pz_train.inform(training_data)
split into 7669 training and 2556 validation samples
finding best fit sigma and NNeigh...
best fit values are sigma=0.03 and numneigh=7
Inserting handle into data store.  model_inform_KNN: inprogress_demo_knn.pkl, inform_KNN
Out[9]:
<rail.core.data.ModelHandle at 0x7fc618ad1fa0>
In [10]:
pz = KNearNeighPDF.make_stage(name='KNN', hdf5_groupname='photometry',
                              model=pz_train.get_handle('model'))
qp_data = pz.estimate(test_data)
Process 0 running estimator on chunk 0 - 10000
Process 0 estimating PZ PDF for rows 0 - 10,000
Inserting handle into data store.  output_KNN: inprogress_output_KNN.hdf5, KNN
Process 0 running estimator on chunk 10000 - 20000
Process 0 estimating PZ PDF for rows 10,000 - 20,000
Process 0 running estimator on chunk 20000 - 20449
Process 0 estimating PZ PDF for rows 20,000 - 20,449

So, qp_data now contains the ~20,000 PDFs from KNearNeighPDF; we can feed these into the summarizers to generate an overall N(z) distribution. We won't bother with any tomographic selections for this demo, just the overall N(z). The ensemble is stored in the DataStore as qp_data, but it has also been saved to disk as output_KNN.hdf5 (per the log message above). If you want to grab the qp Ensemble at a later stage, you can read that file back with qp, as sketched below.
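A minimal sketch of re-loading the saved ensemble in a later session (assuming the stage above has finalized, i.e. the inprogress_ file has been renamed to output_KNN.hdf5):

import qp
ens = qp.read("output_KNN.hdf5")
print(ens.npdf)  # number of PDFs stored in the ensemble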

I coded up quick and dirty bootstrap versions of the NaiveStack, PointEstimateHist, and VarInference summarizers. These are not optimized and not parallel (an issue has been created for a future update), but they do produce N different bootstrap realizations of the overall N(z), which are returned as a qp Ensemble. (Note: the previous versions of these summarizers returned only the single overall N(z) rather than samples.)
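Conceptually, each bootstrap realization just resamples the galaxy indices with replacement before summarizing; a rough sketch of one realization for the stacking case (illustrative only, not the actual stage internals):

rng = np.random.default_rng()
zgrid = np.linspace(0.0, 3.0, 41)
pdfs = qp_data.data.pdf(zgrid)                       # (ngal, ngrid) PDF evaluations
idx = rng.integers(0, pdfs.shape[0], pdfs.shape[0])  # resample indices with replacement
boot_nz = pdfs[idx].sum(axis=0)
boot_nz /= np.trapz(boot_nz, zgrid)                  # normalize this realization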

Naive Stack

Naive stack just "stacks", i.e. sums up, the PDFs and returns a qp.interp distribution with bins defined by np.linspace(zmin, zmax, nzbins). We will create a stack with 41 bins and generate 20 bootstrap realizations.
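For reference, wrapping such a stacked grid of values as a qp.interp ensemble looks roughly like this (a sketch reusing the hypothetical zgrid and boot_nz arrays from the bootstrap sketch above):

nz_ens = qp.Ensemble(qp.interp, data=dict(xvals=zgrid, yvals=np.atleast_2d(boot_nz)))

The stage handles all of this (plus the bootstrapping) internally; let's set it up: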

In [11]:
stacker = NaiveStack.make_stage(zmin=0.0, zmax=3.0, nzbins=41, nsamples=20, output="Naive_samples.hdf5", single_NZ="NaiveStack_NZ.hdf5")
In [12]:
naive_results = stacker.summarize(qp_data)
Inserting handle into data store.  output: inprogress_Naive_samples.hdf5, NaiveStack
Inserting handle into data store.  single_NZ: inprogress_NaiveStack_NZ.hdf5, NaiveStack

The results are now in naive_results, but because of the DataStore the actual ensemble is stored in its .data attribute. Let's grab the ensemble and plot a few of the bootstrap sample N(z) estimates:

In [13]:
newens = naive_results.data
In [14]:
fig, axs = plt.subplots(figsize=(8,6))
for i in range(0, 20, 2):
    newens[i].plot_native(axes=axs, label=f"sample {i}")
axs.plot([0,3],[0,0],'k--')
axs.set_xlim(0,3)
axs.legend(loc='upper right')
Out[14]:
<matplotlib.legend.Legend at 0x7fc618a6e4c0>

The summarizer also outputs a second file containing the fiducial N(z); we saved it as "NaiveStack_NZ.hdf5". Let's read that N(z) estimate in with qp and plot it with the native plotter:

In [15]:
naive_nz = qp.read("NaiveStack_NZ.hdf5")
naive_nz.plot_native(xlim=(0,3))
Out[15]:
<Axes: xlabel='redshift', ylabel='p(z)'>

Point Estimate Hist

PointEstimateHist takes the point-estimate mode of each PDF and then histograms those values. We'll again generate 20 bootstrap samples of this and plot a few of the resultant histograms. Note: for some reason the plotting of the histogram distribution in qp is a little wonky (alpha appears to be broken), so the plot below is not the best.
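The underlying mode-then-histogram idea is simple; a sketch (illustrative only, not the stage internals):

zgrid = np.linspace(0.0, 3.0, 301)
modes = qp_data.data.mode(grid=zgrid).flatten()  # one point estimate per PDF
hist, edges = np.histogram(modes, bins=41, range=(0.0, 3.0), density=True)

Now let's run the actual stage: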

In [16]:
pointy = PointEstimateHist.make_stage(zmin=0.0, zmax=3.0, nzbins=41, nsamples=20, single_NZ="point_NZ.hdf5", output="point_samples.hdf5")
In [17]:
%%time
pointy_results = pointy.summarize(qp_data)
Inserting handle into data store.  output: inprogress_point_samples.hdf5, PointEstimateHist
Inserting handle into data store.  single_NZ: inprogress_point_NZ.hdf5, PointEstimateHist
CPU times: user 35.8 ms, sys: 2.74 ms, total: 38.5 ms
Wall time: 38.2 ms
In [18]:
pens = pointy_results.data
In [19]:
fig, axs = plt.subplots(figsize=(8,6))
pens[0].plot_native(axes=axs, fc=[0, 0, 1, 0.01])
pens[1].plot_native(axes=axs, fc=[0, 1, 0, 0.01])
pens[4].plot_native(axes=axs, fc=[1, 0, 0, 0.01])
axs.set_xlim(0,3)
axs.legend()
Out[19]:
<matplotlib.legend.Legend at 0x7fc5dc871f10>

Again, we have saved the fiducial N(z) in a separate file, "point_NZ.hdf5"; we could read that data in if we desired.

varInference

VarInference implements Markus' variational inference scheme and returns a qp.interp gridded distribution. varInference tends to get a little wonky if you use too many bins, so we'll only use 25 bins here. Again, let's generate 20 samples and plot a few.
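For intuition, schemes of this family iteratively update histogram bin fractions given the per-galaxy PDF evaluations; a rough EM-style sketch (illustrative only; see the VarInferenceStack source for the actual variational update):

zmid = np.linspace(0.0, 3.0, 25)
p = qp_data.data.pdf(zmid)           # (ngal, nbins) PDF evaluations
fz = np.ones(len(zmid)) / len(zmid)  # start from a flat N(z)
for _ in range(10):
    resp = p * fz                    # per-galaxy bin responsibilities
    resp /= resp.sum(axis=1, keepdims=True) + 1e-300  # normalize per galaxy
    fz = resp.sum(axis=0)
    fz /= fz.sum()                   # updated bin fractions

Now the actual stage: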

In [20]:
runner = VarInferenceStack.make_stage(name='test_varinf', zmin=0.0, zmax=3.0, nzbins=25, niter=10, nsamples=20,
                                      output="sampletest.hdf5", single_NZ="varinf_NZ.hdf5")
In [21]:
%%time
varinf_results = runner.summarize(qp_data)
Inserting handle into data store.  output_test_varinf: inprogress_sampletest.hdf5, test_varinf
Inserting handle into data store.  single_NZ_test_varinf: inprogress_varinf_NZ.hdf5, test_varinf
CPU times: user 1.94 s, sys: 127 ms, total: 2.07 s
Wall time: 2.07 s
In [22]:
vens = varinf_results.data
vens
Out[22]:
<qp.ensemble.Ensemble at 0x7fc5dc655190>

Let's plot the fiducial N(z) for this distribution:

In [23]:
varinf_nz = qp.read("varinf_NZ.hdf5")
varinf_nz.plot_native(xlim=(0,3))
Out[23]:
<Axes: xlabel='redshift', ylabel='p(z)'>

NZDir

NZDir is a different type of summarizer: it reconstructs the redshift distribution of the photometric sample by weighting a set of training spectroscopic objects according to their photometric neighbors. I implemented a bootstrap of the spectroscopic data rather than the photometric data, both because it was much easier computationally and because I think the spectroscopic variance is more important to account for than a simple bootstrap of the large photometric sample. We must first run the Inform_NZDir stage to train the k-nearest-neighbor tree used by NZDir, then run NZDir to actually construct the N(z) estimate.

Like PointEstimateHist, NZDir returns a qp.hist ensemble of samples.
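Very roughly, the method counts how many photometric galaxies fall near each spectroscopic object in color space and uses those counts as weights when histogramming the spec-z's; a hypothetical sketch with scipy (spec_colors, phot_colors, and spec_z are assumed arrays, not defined in this notebook):

from scipy.spatial import cKDTree
spec_tree = cKDTree(spec_colors)               # colors of the spec sample
dists, _ = spec_tree.query(spec_colors, k=8)   # distance to the 8th spec neighbor
phot_tree = cKDTree(phot_colors)               # colors of the phot sample
weights = np.array([len(phot_tree.query_ball_point(pt, r))
                    for pt, r in zip(spec_colors, dists[:, -1])])
nz, edges = np.histogram(spec_z, bins=31, range=(0.0, 3.0),
                         weights=weights, density=True)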

In [24]:
inf_nz = Inform_NZDir.make_stage(n_neigh=8, hdf5_groupname="photometry", model="nzdir_model.pkl")
In [25]:
inf_nz.inform(training_data)
Inserting handle into data store.  model: inprogress_nzdir_model.pkl, Inform_NZDir
Out[25]:
<rail.core.data.ModelHandle at 0x7fc618af8d60>
In [26]:
nzd = NZDir.make_stage(leafsize=20, zmin=0.0, zmax=3.0, nzbins=31, model="nzdir_model.pkl", hdf5_groupname='photometry',
                       output='NZDir_samples.hdf5', single_NZ='NZDir_NZ.hdf5')
In [27]:
nzd_res = nzd.estimate(test_data)
Inserting handle into data store.  output: inprogress_NZDir_samples.hdf5, NZDir
Inserting handle into data store.  single_NZ: inprogress_NZDir_NZ.hdf5, NZDir
In [28]:
nzd_ens = nzd_res.data
In [29]:
nzdir_nz = qp.read("NZDir_NZ.hdf5")
In [30]:
fig, axs = plt.subplots(figsize=(10,8))
nzd_ens[0].plot_native(axes=axs, fc=[0, 0, 1, 0.01])
nzd_ens[1].plot_native(axes=axs, fc=[0, 1, 0, 0.01])
nzd_ens[4].plot_native(axes=axs, fc=[1, 0, 0, 0.01])
axs.set_xlim(0,3)
axs.legend()
Out[30]:
<matplotlib.legend.Legend at 0x7fc5dc615820>

As we also wrote out the single estimate of N(z), we can read that data from the second file written (specified by the single_NZ argument given to NZDir.make_stage above, in this case "NZDir_NZ.hdf5"):

In [31]:
nzdir_nz = qp.read("NZDir_NZ.hdf5")
In [32]:
nzdir_nz.plot_native(xlim=(0,3))
Out[32]:
<Axes: xlabel='redshift', ylabel='p(z)'>

Results

All of the results files are qp distributions: NaiveStack and varInference return qp.interp distributions, while PointEstimateHist and NZDir return qp.hist distributions. Even with the different representations, you can use qp functionality to do things like determine the means, modes, etc. of the distributions. You could then use the std dev of any of these to estimate a 1-sigma "shift", etc.

In [33]:
zgrid = np.linspace(0,3,41)
names = ['naive', 'point', 'varinf', 'nzdir']
enslist = [newens, pens, vens, nzd_ens]
results_dict = {}
for nm, en in zip(names, enslist):
    results_dict[f'{nm}_modes'] = en.mode(grid=zgrid).flatten()
    results_dict[f'{nm}_means'] = en.mean().flatten()
    results_dict[f'{nm}_std'] = en.std().flatten()
In [34]:
results_dict
Out[34]:
{'naive_modes': array([0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9,
        0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9]),
 'naive_means': array([0.91149502, 0.90689545, 0.91136832, 0.90765168, 0.90561592,
        0.91330512, 0.91197592, 0.90824209, 0.91451169, 0.90922789,
        0.9087516 , 0.91218998, 0.91076003, 0.91391544, 0.90809329,
        0.90649634, 0.90949125, 0.91140398, 0.90799529, 0.91011015]),
 'naive_std': array([0.45672834, 0.45779718, 0.45799255, 0.45608257, 0.45772315,
        0.45806096, 0.45664706, 0.45644607, 0.45852141, 0.45562806,
        0.46094525, 0.45920772, 0.4539561 , 0.45695217, 0.45396763,
        0.45660786, 0.45550625, 0.45593187, 0.45400031, 0.456491  ]),
 'point_modes': array([0.9  , 0.9  , 0.9  , 0.9  , 0.9  , 0.9  , 0.9  , 0.9  , 0.9  ,
        0.9  , 0.9  , 0.9  , 0.9  , 0.9  , 0.9  , 0.9  , 0.9  , 0.975,
        0.9  , 0.9  ]),
 'point_means': array([0.90757435, 0.90355244, 0.90913087, 0.90599994, 0.90128744,
        0.91024727, 0.90898774, 0.90411422, 0.91178947, 0.90571726,
        0.90523062, 0.90943859, 0.90735965, 0.91075179, 0.90469031,
        0.90337353, 0.90669411, 0.90870864, 0.90467242, 0.90689091]),
 'point_std': array([0.44423985, 0.44567538, 0.4462743 , 0.44507997, 0.44419829,
        0.44549633, 0.4453005 , 0.44285937, 0.44654285, 0.44260597,
        0.44841522, 0.44692404, 0.44150332, 0.44463125, 0.44197287,
        0.44544701, 0.44275794, 0.44302682, 0.44123106, 0.44419575]),
 'varinf_modes': array([0.975, 0.975, 0.9  , 0.9  , 0.975, 0.975, 0.975, 0.9  , 0.9  ,
        0.9  , 0.9  , 0.975, 0.975, 0.975, 0.975, 0.9  , 0.975, 0.975,
        0.975, 0.975]),
 'varinf_means': array([0.88854022, 0.89158319, 0.89110634, 0.89315605, 0.89387354,
        0.89097531, 0.89294196, 0.89228821, 0.88712166, 0.88899743,
        0.88741712, 0.88722119, 0.89265368, 0.89404193, 0.89334961,
        0.89796394, 0.89413918, 0.88890824, 0.89510198, 0.88763958]),
 'varinf_std': array([0.41459107, 0.41830515, 0.41732654, 0.41270001, 0.41330109,
        0.41402322, 0.41774209, 0.41649308, 0.41708491, 0.41661354,
        0.41329181, 0.417675  , 0.41721839, 0.41584016, 0.41277957,
        0.41845365, 0.41651765, 0.41395451, 0.41459264, 0.4135    ]),
 'nzdir_modes': array([0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9,
        0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9]),
 'nzdir_means': array([0.91798956, 0.92176052, 0.91434658, 0.92033863, 0.91362204,
        0.92948617, 0.92304575, 0.91872134, 0.91385976, 0.92238134,
        0.9252893 , 0.92597024, 0.92721059, 0.92450514, 0.92192798,
        0.91739566, 0.92603753, 0.92105897, 0.91872165, 0.9210996 ]),
 'nzdir_std': array([0.46447022, 0.46621283, 0.46657924, 0.4652038 , 0.46269032,
        0.46635063, 0.46901116, 0.46889646, 0.46817098, 0.46865234,
        0.46976555, 0.46632455, 0.46713524, 0.46255762, 0.4702946 ,
        0.46672201, 0.46883034, 0.46173737, 0.46291822, 0.46649785])}

You can also use qp to compute quantities like the pdf, cdf, ppf, etc. on any grid that you want; much of the functionality of scipy.stats distributions has been inherited by qp ensembles:

In [35]:
newgrid = np.linspace(0.005,2.995, 35)
naive_pdf = newens.pdf(newgrid)
point_cdf = pens.cdf(newgrid)
var_ppf = vens.ppf(newgrid)
In [36]:
plt.plot(newgrid, naive_pdf[0])
Out[36]:
[<matplotlib.lines.Line2D at 0x7fc5dc524490>]
In [37]:
plt.plot(newgrid, point_cdf[0])
Out[37]:
[<matplotlib.lines.Line2D at 0x7fc5dc45b6d0>]
In [38]:
plt.plot(newgrid, var_ppf[0])
Out[38]:
[<matplotlib.lines.Line2D at 0x7fc5dc391dc0>]

Shifts

If you want to "shift" a PDF, you can just evaluate it on a shifted grid: for example, to shift the PDF by +0.0675 in redshift, you evaluate it on a grid offset by -0.0675 and plot against the original grid, as below. For now we can do this "by hand"; we could easily implement shift functionality in qp, I think.

In [39]:
def_grid = np.linspace(0., 3., 41)
shift_grid = def_grid - 0.0675
native_nz = newens.pdf(def_grid)
shift_nz = newens.pdf(shift_grid)
In [40]:
fig=plt.figure(figsize=(12,10))
plt.plot(def_grid, native_nz[0], label="original")
plt.plot(def_grid, shift_nz[0], label="shifted +0.0675")
plt.legend(loc='upper right')
Out[40]:
<matplotlib.legend.Legend at 0x7fc5dc371f70>

You can estimate how much shift you might expect from the statistics of the bootstrap samples, e.g. the std dev of the means for the NZDir-derived distribution:

In [41]:
results_dict['nzdir_means']
Out[41]:
array([0.91798956, 0.92176052, 0.91434658, 0.92033863, 0.91362204,
       0.92948617, 0.92304575, 0.91872134, 0.91385976, 0.92238134,
       0.9252893 , 0.92597024, 0.92721059, 0.92450514, 0.92192798,
       0.91739566, 0.92603753, 0.92105897, 0.91872165, 0.9210996 ])
In [42]:
spread = np.std(results_dict['nzdir_means'])
In [43]:
spread
Out[43]:
0.0043607326241909454

Again, not a huge spread in predicted mean redshifts based solely on bootstraps, even with only ~20,000 galaxies.
