author: Sam Schmidt
Last successfully run: Apr 18, 2022
The switchover to a ceci-based backend has increased the complexity of data access and IO; this notebook will demonstrate a variety of ways that users may interact with data in RAIL.
In addition to the main RAIL code, we have developed a companion package, tables_io, available here on Github. tables_io aims to simplify IO for reading/writing some of the most common file formats used within DESC: HDF5 files, parquet files, Astropy tables, and qp ensembles. There are several examples of tables_io usage in the nb directory of the tables_io repository, and we will demonstrate usage in several places in this notebook as well; for further examples, consult those tables_io notebooks.
In short, tables_io aims to simplify file IO, and much of the IO is sorted out for you automatically if your files have the appropriate extensions: you can simply do tables_io.read("file.fits") to read in a FITS file, or tables_io.read("newfile.pq") to read in a dataframe in parquet format. Similarly, you can specify the output format via the extension as well. This functionality is extended to qp and RAIL through their use of tables_io, and file extensions will control how files are read and written unless explicitly overridden.
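As a concrete sketch of that extension-driven IO (the file names here are hypothetical, and tables_io.write is assumed to mirror tables_io.read):

import tables_io

# the extension alone selects the reader: read a FITS table...
data = tables_io.read("file.fits")
# ...and the ".pq" extension selects the parquet writer on output
# (hypothetical file names, purely for illustration)
tables_io.write(data, "newfile.pq")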
Another concept used in the ceci-based RAIL when working in a Jupyter notebook is the DataStore and its DataHandle file specifications (see RAIL/rail/core/data.py for the actual code implementing these). ceci requires that each pipeline stage have defined input and output files, and it is primarily geared toward pipelines rather than interactive runs in a Jupyter notebook. The DataStore enables interactive use of files in Jupyter. We will demonstrate some useful features of the DataStore below.
Let's start out with some imports:
import os
import tables_io
import rail
import qp
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
First, let's use tables_io to read in some example data. Two example files ship with RAIL, each containing a small amount of cosmoDC2 data from healpix pixel 9816: one for "training" and one for "validation". They are located in the rail/examples/testdata/ directory of the RAIL repository. Let's read in one of those data files with tables_io:
(NOTE: for historical reasons, our example files contain data in hdf5 format in which all of the data arrays actually live in a single hdf5 group named "photometry". We will grab the data specifically from that hdf5 group by reading in the file and specifying ["photometry"] as the group in the cell below. We'll call our dataset "traindata_io" to indicate that we've read it in via tables_io, and to distinguish it from the data that we'll place in the DataStore in later steps.)
from rail.core.utils import RAILDIR
trainFile = os.path.join(RAILDIR, 'rail/examples/testdata/test_dc2_training_9816.hdf5')
testFile = os.path.join(RAILDIR, 'rail/examples/testdata/test_dc2_validation_9816.hdf5')
traindata_io = tables_io.read(trainFile)["photometry"]
tables_io reads this data in as an ordered dictionary of numpy arrays by default, though it can be converted to other data formats, such as a pandas dataframe. Let's print out the keys of the ordered dict showing the available columns, then convert the data to a pandas dataframe and look at a few of the columns as a demo:
traindata_io.keys()
odict_keys(['id', 'mag_err_g_lsst', 'mag_err_i_lsst', 'mag_err_r_lsst', 'mag_err_u_lsst', 'mag_err_y_lsst', 'mag_err_z_lsst', 'mag_g_lsst', 'mag_i_lsst', 'mag_r_lsst', 'mag_u_lsst', 'mag_y_lsst', 'mag_z_lsst', 'redshift'])
traindata_pq = tables_io.convert(traindata_io, tables_io.types.PD_DATAFRAME)
traindata_pq.head()
|   | id | mag_err_g_lsst | mag_err_i_lsst | mag_err_r_lsst | mag_err_u_lsst | mag_err_y_lsst | mag_err_z_lsst | mag_g_lsst | mag_i_lsst | mag_r_lsst | mag_u_lsst | mag_y_lsst | mag_z_lsst | redshift |
|---|----|----------------|----------------|----------------|----------------|----------------|----------------|------------|------------|------------|------------|------------|------------|----------|
| 0 | 8062500000 | 0.005001 | 0.005001 | 0.005001 | 0.005046 | 0.005003 | 0.005001 | 16.960892 | 16.506310 | 16.653412 | 18.040369 | 16.423904 | 16.466377 | 0.020435 |
| 1 | 8062500062 | 0.005084 | 0.005075 | 0.005048 | 0.009552 | 0.005804 | 0.005193 | 20.709402 | 20.437565 | 20.533852 | 21.615589 | 20.388210 | 20.408886 | 0.019361 |
| 2 | 8062500124 | 0.005057 | 0.005016 | 0.005015 | 0.011148 | 0.005063 | 0.005023 | 20.437067 | 19.312630 | 19.709715 | 21.851952 | 18.770441 | 18.953411 | 0.036721 |
| 3 | 8062500186 | 0.005011 | 0.005007 | 0.005005 | 0.005477 | 0.005041 | 0.005014 | 19.128675 | 18.619995 | 18.803484 | 19.976501 | 18.479452 | 18.546589 | 0.039469 |
| 4 | 8062500248 | 0.005182 | 0.005118 | 0.005084 | 0.015486 | 0.006211 | 0.005308 | 21.242783 | 20.731707 | 20.911802 | 22.294912 | 20.645004 | 20.700289 | 0.026994 |
Next, let's set up the DataStore, so that our RAIL modules will know where to fetch data. We will set "allow overwrite" so that we can overwrite data files without throwing errors while in our Jupyter notebook:
#import RailStage stuff
from rail.core.data import TableHandle
from rail.core.stage import RailStage
DS = RailStage.data_store
DS.__class__.allow_overwrite = True
We need to add our data to the DataStore. We can add previously read data, like our traindata_io, or add data to the DataStore directly from a file via the DS.read_file method, which we will do with our "test data" below. For data already in memory we use DS.add_data; since we want our data as a numpy ordered dict, we specify the handle type as TableHandle. If, instead, we were storing a qp ensemble, we would use a QPHandle. The DataHandles are defined in RAIL/rail/core/data.py, where you can see the specific code and the available handle types.
#add data that is already read in
train_data = DS.add_data("train_data", traindata_io, TableHandle)
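For completeness, here is a hedged sketch of what storing a qp ensemble would look like with a QPHandle; the tiny ensemble below is fabricated purely for illustration:

from rail.core.data import QPHandle
import qp

# build a toy single-PDF ensemble on an interpolation grid (illustrative only)
zgrid = np.linspace(0, 3, 301)
yvals = np.exp(-0.5 * ((zgrid - 1.0) / 0.1) ** 2).reshape(1, -1)
toy_ens = qp.Ensemble(qp.interp, data=dict(xvals=zgrid, yvals=yvals))
# same add_data call as above, but with QPHandle as the handle type
toy_handle = DS.add_data("toy_ens", toy_ens, QPHandle)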
To read in data from a file, we can use DS.read_file. Once again we want a TableHandle, and we can feed it the testFile path defined above:
#add test data directly to datastore from file:
test_data = DS.read_file("test_data", TableHandle, testFile)
Let's list the data available to us in the DataStore:
DS
DataStore { train_data:<class 'rail.core.data.TableHandle'> None, (d) test_data:<class 'rail.core.data.TableHandle'> /opt/hostedtoolcache/Python/3.9.16/x64/lib/python3.9/site-packages/rail/examples/testdata/test_dc2_validation_9816.hdf5, (wd) }
Note that the DataStore is just a dictionary of data handles. Each handle object contains the actual data, which is accessible via the .data property for that handle. While not particularly designed for it, you can manipulate the data via these dictionaries, which is handy for on-the-fly exploration in notebooks.
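As a quick hedged check of that equivalence, calling the handle and reading its .data property should return the same underlying table for data already held in memory:

# expected to print True for in-memory data (a sketch, based on the
# DataHandle description above)
print(train_data() is train_data.data)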
For example, say we want to add an additional column to the train_data: a "FakeID" column with a simpler identifier than the long ObjID contained in the id column:
train_data().keys()
numgals = len(train_data()['id'])  # number of galaxies in the table
train_data()['FakeID'] = np.arange(numgals)  # add a simple sequential ID column
Let's convert our train_data to a pandas dataframe with tables_io, and our new "FakeID" column should now be present:
train_table = tables_io.convertObj(train_data(), tables_io.types.PD_DATAFRAME)
train_table.head()
|   | id | mag_err_g_lsst | mag_err_i_lsst | mag_err_r_lsst | mag_err_u_lsst | mag_err_y_lsst | mag_err_z_lsst | mag_g_lsst | mag_i_lsst | mag_r_lsst | mag_u_lsst | mag_y_lsst | mag_z_lsst | redshift | FakeID |
|---|----|----------------|----------------|----------------|----------------|----------------|----------------|------------|------------|------------|------------|------------|------------|----------|--------|
| 0 | 8062500000 | 0.005001 | 0.005001 | 0.005001 | 0.005046 | 0.005003 | 0.005001 | 16.960892 | 16.506310 | 16.653412 | 18.040369 | 16.423904 | 16.466377 | 0.020435 | 0 |
| 1 | 8062500062 | 0.005084 | 0.005075 | 0.005048 | 0.009552 | 0.005804 | 0.005193 | 20.709402 | 20.437565 | 20.533852 | 21.615589 | 20.388210 | 20.408886 | 0.019361 | 1 |
| 2 | 8062500124 | 0.005057 | 0.005016 | 0.005015 | 0.011148 | 0.005063 | 0.005023 | 20.437067 | 19.312630 | 19.709715 | 21.851952 | 18.770441 | 18.953411 | 0.036721 | 2 |
| 3 | 8062500186 | 0.005011 | 0.005007 | 0.005005 | 0.005477 | 0.005041 | 0.005014 | 19.128675 | 18.619995 | 18.803484 | 19.976501 | 18.479452 | 18.546589 | 0.039469 | 3 |
| 4 | 8062500248 | 0.005182 | 0.005118 | 0.005084 | 0.015486 | 0.006211 | 0.005308 | 21.242783 | 20.731707 | 20.911802 | 22.294912 | 20.645004 | 20.700289 | 0.026994 | 4 |
And there it is: a new "FakeID" column added at the end of the dataset. Success!
Now that we have our data in place, we can use it in a RAIL stage. As an example, we'll estimate photo-z's for our data. Let's train the KNearNeighPDF algorithm with our train_data, and then estimate photo-z's for the test_data. We need to make a RAIL stage for each of these steps; first, we train/inform our nearest-neighbor algorithm with the train_data:
from rail.estimation.algos.knnpz import Inform_KNearNeighPDF, KNearNeighPDF
inform_knn = Inform_KNearNeighPDF.make_stage(name='inform_knn', input='train_data',
                                             nondetect_val=99.0, model='knnpz.pkl',
                                             hdf5_groupname='')
inform_knn.inform(train_data)
split into 7669 training and 2556 validation samples
finding best fit sigma and NNeigh...
best fit values are sigma=0.03166666666666667 and numneigh=7
Inserting handle into data store. model_inform_knn: inprogress_knnpz.pkl, inform_knn
<rail.core.data.ModelHandle at 0x7f65ecb5f7c0>
Running the inform method on the training data has created the "knnpz.pkl" file, which contains our trained tree, along with the sigma bandwidth parameter and numneigh (the number of neighbors to use in the PDF estimation). In the future, you could skip the inform stage and simply load this pkl file directly into the estimation stage to save time.
Now, let's stage and run the actual PDF estimation on the test data. NOTE: we have set hdf5_groupname to "photometry" here, as the original data does have all of our needed photometry in a single hdf5 group named "photometry"!
estimate_knn = KNearNeighPDF.make_stage(name='estimate_knn', hdf5_groupname='photometry',
                                        nondetect_val=99.0, model='knnpz.pkl',
                                        output="KNNPZ_estimates.hdf5")
Inserting handle into data store. model: knnpz.pkl, estimate_knn
Note that we have specified the name of the output file here with the kwarg output="KNNPZ_estimates.hdf5". If no output is specified, the DataStore will construct its own name based on the name of the stage, and it will default to a particular storage format; for many of the estimator codes this is a FITS file named "output_[stage name].fits".
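For illustration, a hedged sketch of relying on that default naming; the stage name here is hypothetical:

# with no `output` kwarg the stage constructs its own output name from the
# stage name (e.g. "output_estimate_knn_default.fits" per the note above)
estimate_knn_default = KNearNeighPDF.make_stage(name='estimate_knn_default',
                                                hdf5_groupname='photometry',
                                                nondetect_val=99.0,
                                                model='knnpz.pkl')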
knn_estimated = estimate_knn.estimate(test_data)
Process 0 running estimator on chunk 0 - 10000
Process 0 estimating PZ PDF for rows 0 - 10,000
Inserting handle into data store. output_estimate_knn: inprogress_KNNPZ_estimates.hdf5, estimate_knn
Process 0 running estimator on chunk 10000 - 20000
Process 0 estimating PZ PDF for rows 10,000 - 20,000
Process 0 running estimator on chunk 20000 - 20449
Process 0 estimating PZ PDF for rows 20,000 - 20,449
We have successfully estimated PDFs for the ~20,000 galaxies in the test file! Note that the PDFs are in qp format! Also note that they have been written to disk as "KNNPZ_estimates.hdf5"; however, they are also still available to us via the knn_estimated handle in the DataStore. Let's plot an example PDF from our data in the DataStore:
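For instance, a minimal sketch that evaluates a single PDF on a redshift grid and plots it (index 0 is chosen arbitrarily):

# evaluate the first galaxy's PDF on a grid of redshifts and plot it
zgrid = np.linspace(0, 3, 301)
plt.plot(zgrid, knn_estimated().pdf(zgrid)[0])
plt.xlabel("redshift", fontsize=15)
plt.ylabel("p(z)", fontsize=15)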
We can also do a quick plot to check our photo-z's overall. Our qp Ensemble is returned by calling knn_estimated() (it is likewise available as knn_estimated.data), and the Ensemble can calculate the mode of each PDF if we give it a grid of redshift values to check, which we can plot against our true redshifts from the test data:
pzmodes = knn_estimated().mode(grid=np.linspace(0,3,301)).flatten()
true_zs = test_data()['photometry']['redshift']
plt.figure(figsize=(8,8))
plt.scatter(true_zs, pzmodes, label='photoz mode for KNearNeigh',s=2)
plt.xlabel("redshift", fontsize=15)
plt.ylabel("photoz mode", fontsize=15)
plt.legend(loc='upper center', fontsize=12)
<matplotlib.legend.Legend at 0x7f65ecb27bb0>
As an alternative, we can read the data back from the file and make the same plot, to show that you don't need to use the DataStore; you can instead operate directly on the output files:
newens = qp.read("KNNPZ_estimates.hdf5")
newpzmodes = newens.mode(grid=np.linspace(0,3,301))
plt.figure(figsize=(8,8))
plt.scatter(true_zs, newpzmodes, label='photoz mode for KNearNeigh',s=2)
plt.xlabel("redshift", fontsize=15)
plt.ylabel("photoz mode", fontsize=15)
plt.legend(loc='upper center', fontsize=12)
<matplotlib.legend.Legend at 0x7f65eca23310>
That's about it! For more usage examples, including how to chain together multiple stages by feeding the results of one into another via their DataStore names, see goldenspike.ipynb in the examples/goldenspike directory.