Data, files, IO, and RAIL

author: Sam Schmidt
Last successfully run: Apr 18, 2022

The switchover to a ceci-based backend has made data access and IO more complex; this notebook will demonstrate a variety of ways that users may interact with data in RAIL.

In addition to the main RAIL code, we have developed a companion package, tables_io, available on GitHub.

tables_io aims to simplify IO for reading/writing some of the most common file formats used within DESC: HDF5 files, parquet files, Astropy tables, and qp ensembles. There are several examples of tables_io usage in the nb directory of the tables_io repository, and we will demonstrate usage in several places in this notebook as well. For further examples, consult the tables_io nb examples.

In short, tables_io aims to simplify file IO, and much of the IO is sorted out for you automatically if your files have the appropriate extensions: you can simply do tables_io.read("file.fits") to read in a FITS file, or tables_io.read("newfile.pq") to read in a dataframe in parquet format. Similarly, you can specify the output format via the extension as well. This functionality is extended to qp and RAIL through their use of tables_io, and file extensions will control how files are read and written unless explicitly overridden.
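
For example, here is a minimal sketch of extension-driven IO (the filenames are hypothetical, and we assume a file.fits containing a table exists on disk):

import tables_io

data = tables_io.read("file.fits")        # the .fits extension selects the FITS reader
tables_io.write(data, "newfile.pq")       # the .pq extension selects the parquet writer
data_back = tables_io.read("newfile.pq")  # read back, by default as a pandas dataframe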

Another concept used by the ceci-based RAIL is the DataStore and the DataHandle file specifications (see RAIL/rail/core/data.py for the actual code implementing these). ceci requires that each pipeline stage have defined input and output files, and it is primarily geared toward pipelines rather than interactive runs in a Jupyter notebook; the DataStore is what enables interactive use of files in Jupyter. We will demonstrate some useful features of the DataStore below.

Let's start out with some imports:

In [1]:
import os
import tables_io
import rail
import qp
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

First, let's use tables_io to read in some example data. Two example files ship with RAIL, each containing a small amount of cosmoDC2 data from healpix pixel 9816: one for "training" and one for "validation". They are located in the rail/examples/testdata/ directory of the RAIL repository. Let's read in one of those data files with tables_io:

(NOTE: for historical reasons, our example files store their data in hdf5 format with all of the data arrays inside a single hdf5 group named "photometry". We will grab the data specifically from that hdf5 group by reading in the file and specifying ["photometry"] as the group in the cell below. We'll call our dataset "traindata_io" to indicate that we've read it in via tables_io, and to distinguish it from the data that we'll place in the DataStore in later steps.)

In [2]:
from rail.core.utils import RAILDIR
trainFile = os.path.join(RAILDIR, 'rail/examples/testdata/test_dc2_training_9816.hdf5')
testFile = os.path.join(RAILDIR, 'rail/examples/testdata/test_dc2_validation_9816.hdf5')

traindata_io = tables_io.read(trainFile)["photometry"]

tables_io reads this data in as an ordered dictionary of numpy arrays by default, though it can be converted to other data formats, such as a pandas dataframe. Let's print out the keys of the ordered dict showing the available columns, then convert the data to a pandas dataframe and look at a few of the columns as a demo:

In [3]:
traindata_io.keys()
Out[3]:
odict_keys(['id', 'mag_err_g_lsst', 'mag_err_i_lsst', 'mag_err_r_lsst', 'mag_err_u_lsst', 'mag_err_y_lsst', 'mag_err_z_lsst', 'mag_g_lsst', 'mag_i_lsst', 'mag_r_lsst', 'mag_u_lsst', 'mag_y_lsst', 'mag_z_lsst', 'redshift'])
In [4]:
traindata_pq = tables_io.convert(traindata_io, tables_io.types.PD_DATAFRAME)
In [5]:
traindata_pq.head()
Out[5]:
id mag_err_g_lsst mag_err_i_lsst mag_err_r_lsst mag_err_u_lsst mag_err_y_lsst mag_err_z_lsst mag_g_lsst mag_i_lsst mag_r_lsst mag_u_lsst mag_y_lsst mag_z_lsst redshift
0 8062500000 0.005001 0.005001 0.005001 0.005046 0.005003 0.005001 16.960892 16.506310 16.653412 18.040369 16.423904 16.466377 0.020435
1 8062500062 0.005084 0.005075 0.005048 0.009552 0.005804 0.005193 20.709402 20.437565 20.533852 21.615589 20.388210 20.408886 0.019361
2 8062500124 0.005057 0.005016 0.005015 0.011148 0.005063 0.005023 20.437067 19.312630 19.709715 21.851952 18.770441 18.953411 0.036721
3 8062500186 0.005011 0.005007 0.005005 0.005477 0.005041 0.005014 19.128675 18.619995 18.803484 19.976501 18.479452 18.546589 0.039469
4 8062500248 0.005182 0.005118 0.005084 0.015486 0.006211 0.005308 21.242783 20.731707 20.911802 22.294912 20.645004 20.700289 0.026994

Next, let's set up the DataStore so that our RAIL modules know where to fetch data. We will set allow_overwrite so that we can overwrite data files without throwing errors while in our Jupyter notebook:

In [6]:
#import RailStage stuff
from rail.core.data import TableHandle
from rail.core.stage import RailStage
In [7]:
DS = RailStage.data_store
DS.__class__.allow_overwrite = True

We need to add our data to the DataStore. We can add previously read data, like our traindata_pq, or add data to the DataStore directly from file via the DS.read_file method, which we will do with our "test data". For data already in memory we use DS.add_data; we want our data as a numpy ordered dict, so we specify the handle type as TableHandle. If, instead, we were storing a qp ensemble, we would use a QPHandle. The DataHandle subclasses are defined in RAIL/rail/core/data.py, and you can see the specific code and available handles there.
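
For instance, a minimal sketch of what storing a qp ensemble under a QPHandle might look like (the toy ensemble and its name are purely illustrative, not part of this example's data):

from rail.core.data import QPHandle
import numpy as np
import qp

# a toy ensemble of two Gaussian PDFs, just to illustrate the handle type
toy_ens = qp.Ensemble(qp.stats.norm,
                      data=dict(loc=np.array([[0.5], [1.0]]),
                                scale=np.array([[0.1], [0.2]])))
# registration mirrors add_data with a TableHandle; we leave it commented
# out so the DataStore listing below stays unchanged
# toy_handle = DS.add_data("toy_pdfs", toy_ens, QPHandle)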

In [8]:
#add data that is already read in
train_data = DS.add_data("train_data", traindata_io, TableHandle )

To read in data from a file, we can use DS.read_file; once again we want a TableHandle, and we can feed it the testFile path defined in Cell #2 above:

In [9]:
#add test data directly to datastore from file:
test_data = DS.read_file("test_data", TableHandle, testFile)

Let's list the data available to us in the DataStore:

In [10]:
DS
Out[10]:
DataStore
{  train_data:<class 'rail.core.data.TableHandle'> None, (d)
  test_data:<class 'rail.core.data.TableHandle'> /opt/hostedtoolcache/Python/3.9.16/x64/lib/python3.9/site-packages/rail/examples/testdata/test_dc2_validation_9816.hdf5, (wd)
}

Note that the DataStore is just a dictionary of handles. Each handle object contains the actual data, which is accessible via that handle's .data property. While not particularly designed for it, you can manipulate the data through these handles, which is handy for on-the-fly exploration in notebooks.
For example, say we want to add an additional column to the train_data, "FakeID", with a simpler identifier than the long ObjID contained in the id column:

In [11]:
train_data().keys()
numgals = len(train_data()['id'])
train_data()['FakeID'] = np.arange(numgals)
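
(As an aside, calling a handle and accessing its .data attribute should return the same underlying object once the data is in memory; a minimal check, assuming train_data has been loaded as above:)

# the handle call and the .data attribute should refer to the same object
assert train_data() is train_data.data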

Let's convert our train_data to a pandas dataframe with tables_io; our new "FakeID" column should now be present:

In [12]:
train_table = tables_io.convertObj(train_data(), tables_io.types.PD_DATAFRAME)
train_table.head()
Out[12]:
id mag_err_g_lsst mag_err_i_lsst mag_err_r_lsst mag_err_u_lsst mag_err_y_lsst mag_err_z_lsst mag_g_lsst mag_i_lsst mag_r_lsst mag_u_lsst mag_y_lsst mag_z_lsst redshift FakeID
0 8062500000 0.005001 0.005001 0.005001 0.005046 0.005003 0.005001 16.960892 16.506310 16.653412 18.040369 16.423904 16.466377 0.020435 0
1 8062500062 0.005084 0.005075 0.005048 0.009552 0.005804 0.005193 20.709402 20.437565 20.533852 21.615589 20.388210 20.408886 0.019361 1
2 8062500124 0.005057 0.005016 0.005015 0.011148 0.005063 0.005023 20.437067 19.312630 19.709715 21.851952 18.770441 18.953411 0.036721 2
3 8062500186 0.005011 0.005007 0.005005 0.005477 0.005041 0.005014 19.128675 18.619995 18.803484 19.976501 18.479452 18.546589 0.039469 3
4 8062500248 0.005182 0.005118 0.005084 0.015486 0.006211 0.005308 21.242783 20.731707 20.911802 22.294912 20.645004 20.700289 0.026994 4

And there it is: a new "FakeID" column has been added at the end of the dataset. Success!
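
If we wanted to persist this augmented table, the extension-driven write described earlier applies; a minimal sketch (the output filename is our own choice, not part of this example):

# illustrative only: the .pq extension selects the parquet writer
tables_io.write(train_table, "train_data_with_fakeid.pq")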

Using the data in a pipeline stage: photo-z estimation example

Now that we have our data in place, we can use it in a RAIL stage. As an example, we'll estimate photo-z's for our data: we'll train the KNearNeighPDF algorithm with our train_data and then estimate photo-z's for the test_data. We need to make a RAIL stage for each of these steps. First, we train/inform our nearest-neighbor algorithm with the train_data:

In [13]:
from rail.estimation.algos.knnpz import Inform_KNearNeighPDF, KNearNeighPDF
In [14]:
inform_knn = Inform_KNearNeighPDF.make_stage(name='inform_knn', input='train_data', 
                                            nondetect_val=99.0, model='knnpz.pkl',
                                            hdf5_groupname='')
In [15]:
inform_knn.inform(train_data)
split into 7669 training and 2556 validation samples
finding best fit sigma and NNeigh...
best fit values are sigma=0.03166666666666667 and numneigh=7
Inserting handle into data store.  model_inform_knn: inprogress_knnpz.pkl, inform_knn
Out[15]:
<rail.core.data.ModelHandle at 0x7f65ecb5f7c0>

Running the inform method on the training data has created the "knnpz.pkl" file, which contains our trained tree along with the sigma bandwidth parameter and numneigh (the number of neighbors to use in the PDF estimation). In the future, you could skip the inform stage and simply load this pkl file directly into the estimation stage to save time.
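
(If you are curious, the model file is an ordinary pickle, so you can peek at it directly; a quick sketch, noting that the exact contents depend on the algorithm that produced it:)

import pickle

# the inform stage wrote the model as a pickle; inspect what it contains
with open("knnpz.pkl", "rb") as f:
    model = pickle.load(f)
print(type(model))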

Now, let's stage and run the actual PDF estimation on the test data. NOTE: we have set hdf5_groupname to "photometry", as the original data does have all of our needed photometry in a single hdf5 group named "photometry"!

In [16]:
estimate_knn = KNearNeighPDF.make_stage(name='estimate_knn', hdf5_groupname='photometry', nondetect_val=99.0,
                                        model='knnpz.pkl', output="KNNPZ_estimates.hdf5")
Inserting handle into data store.  model: knnpz.pkl, estimate_knn

Note that we have specified the name of the output file here with the kwarg output="KNNPZ_estimates.hdf5". If no output is specified, the DataStore will construct its own name based on the name of the stage, and it will also default to a particular storage format; for many of the estimator codes this is a FITS file named "output_[stage name].fits".

In [17]:
knn_estimated = estimate_knn.estimate(test_data)
Process 0 running estimator on chunk 0 - 10000
Process 0 estimating PZ PDF for rows 0 - 10,000
Inserting handle into data store.  output_estimate_knn: inprogress_KNNPZ_estimates.hdf5, estimate_knn
Process 0 running estimator on chunk 10000 - 20000
Process 0 estimating PZ PDF for rows 10,000 - 20,000
Process 0 running estimator on chunk 20000 - 20449
Process 0 estimating PZ PDF for rows 20,000 - 20,449

We have successfully estimated PDFs for the ~20,000 galaxies in the test file! Note that the PDFs are in qp format. Also note that they have been written to disk as "KNNPZ_estimates.hdf5"; however, they are also still available to us via the knn_estimated handle in the DataStore.

We can do a quick plot to check our photo-z's. Our qp Ensemble is returned by knn_estimated() (and is likewise stored in knn_estimated.data), and the Ensemble can calculate the mode of each PDF if we give it a grid of redshift values to check, which we can plot against our true redshifts from the test data:

In [18]:
pzmodes = knn_estimated().mode(grid=np.linspace(0,3,301)).flatten()
true_zs = test_data()['photometry']['redshift']
In [19]:
plt.figure(figsize=(8,8))
plt.scatter(true_zs, pzmodes, label='photoz mode for KNearNeigh',s=2)
plt.xlabel("redshift", fontsize=15)
plt.ylabel("photoz mode", fontsize=15)
plt.legend(loc='upper center', fontsize=12)
Out[19]:
<matplotlib.legend.Legend at 0x7f65ecb27bb0>

As an alternative, we can read the data back from the file and make the same plot, to show that you don't need to use the DataStore: you can instead operate directly on the output files:

In [20]:
newens = qp.read("KNNPZ_estimates.hdf5")
newpzmodes = newens.mode(grid=np.linspace(0,3,301))
In [21]:
plt.figure(figsize=(8,8))
plt.scatter(true_zs, newpzmodes, label='photoz mode for KNearNeigh',s=2)
plt.xlabel("redshift", fontsize=15)
plt.ylabel("photoz mode", fontsize=15)
plt.legend(loc='upper center', fontsize=12)
Out[21]:
<matplotlib.legend.Legend at 0x7f65eca23310>

That's about it. For more examples of usage, including how to chain together multiple stages and feed the results of one stage into another via their DataStore names, see goldenspike.ipynb in the examples/goldenspike directory.
