authors: John Franklin Crenshaw, Sam Schmidt, Eric Charles, others...
last run successfully: March 16, 2022
This notebook demonstrates how to use RAIL Engines to create galaxy samples, and how to use Degraders to add various errors and biases to the sample.
Note that in the parlance of the Creation Module, "degradation" is any post-processing that occurs to the "true" sample generated by the Engine. This can include adding photometric errors, applying quality cuts, introducing systematic biases, etc.
In this notebook, we will first learn how to draw samples from a RAIL Engine object. Then we will demonstrate how to use the following RAIL Degraders: LSSTErrorModel, QuantityCut, InvRedshiftIncompleteness, and LineConfusion.
Throughout the notebook, we will show how you can chain all these Degraders together to build a more complicated degrader. Hopefully, this will allow you to see how you can build your own degrader.
Note on generating redshift posteriors: regardless of what Degraders you apply, when you use a Creator to estimate posteriors, the posteriors will always be calculated with respect to the "true" distribution. This is the whole point of the Creation Module -- you can generate degraded samples for which we still have access to the true posteriors. For an example of how to calculate posteriors, see posterior-demo.ipynb.
import matplotlib.pyplot as plt
from pzflow.examples import get_example_flow
from rail.creation.engines.flowEngine import FlowCreator
from rail.creation.degradation import (
InvRedshiftIncompleteness,
LineConfusion,
LSSTErrorModel,
QuantityCut,
)
from rail.core.stage import RailStage
import pzflow
import os
flow_file = os.path.join(os.path.dirname(pzflow.__file__), 'example_files', 'example-flow.pzflow.pkl')
We'll start by setting up the RAIL Data Store. RAIL uses ceci, which is designed for pipelines rather than interactive notebooks; the Data Store works around that and enables us to use data interactively. See the rail/examples/goldenspike/goldenspike.ipynb example notebook for more details on the Data Store.
DS = RailStage.data_store
DS.__class__.allow_overwrite = True
First, let's make an Engine that has no degradation. We can use it to generate a "true" sample, to which we can compare all the degraded samples below.
Note: in this example, we will use a normalizing flow engine from the pzflow package. However, everything in this notebook is totally agnostic to what the underlying engine is.
The Engine is a type of RailStage object, so we can make one using the RailStage.make_stage function for the class of Engine that we want. We then pass in the configuration parameters as arguments to make_stage.
n_samples = int(1e5)
flowCreator_truth = FlowCreator.make_stage(name='truth', model=flow_file, n_samples=n_samples)
No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
Inserting handle into data store. model: /opt/hostedtoolcache/Python/3.9.16/x64/lib/python3.9/site-packages/pzflow/example_files/example-flow.pzflow.pkl, truth
flowCreator_truth.get_data('model')
<pzflow.flow.Flow at 0x7f1b0495e6a0>
Let's use the sample method to generate some samples. Note that this will return a DataHandle object, which can keep both the data itself and the path to where the data is written. When talking to RAIL stages, we can pass a DataHandle as an argument as though it were the underlying data; this allows the stages to keep track of where their inputs are coming from.
samples_truth = flowCreator_truth.sample(n_samples, seed=0)
print(samples_truth())
print("Data was written to ", samples_truth.path)
Inserting handle into data store. output_truth: inprogress_output_truth.pq, truth
        redshift          u          g          r          i          z          y
0       0.890625  27.370831  26.712662  26.025223  25.327188  25.016500  24.926821
1       1.978239  29.557049  28.361185  27.587231  27.238544  26.628109  26.248560
2       0.974287  26.566015  25.937716  24.787413  23.872456  23.139563  22.832047
3       1.317979  29.042730  28.274593  27.501106  26.648790  26.091450  25.346500
4       1.386366  26.292624  25.774778  25.429958  24.806530  24.367950  23.700010
...          ...        ...        ...        ...        ...        ...        ...
99995   2.147172  26.550978  26.349937  26.135286  26.082020  25.911032  25.558136
99996   1.457508  27.362207  27.036276  26.823139  26.420132  26.110037  25.524904
99997   1.372992  27.736044  27.271955  26.887581  26.416138  26.043434  25.456165
99998   0.855022  28.044552  27.327116  26.599014  25.862331  25.592169  25.506388
99999   1.723768  27.049067  26.526745  26.094595  25.642971  25.197956  24.900501

[100000 rows x 7 columns]
Data was written to  output_truth.pq
Now, we will demonstrate the LSSTErrorModel, which adds photometric errors using a model similar to the model from Ivezic et al. 2019 (specifically, it uses the model from that paper without making the high-SNR assumption; to restore this assumption and therefore use the exact model from the paper, set highSNR=True).
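As a hedged aside, if you wanted the exact high-SNR model, the flag would presumably be passed like any other LSSTErrorModel configuration parameter of make_stage (the stage name here is illustrative); we stick with the default below:
errorModel_highSNR = LSSTErrorModel.make_stage(name="error_model_highSNR", highSNR=True)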
Let's create an error model with the default settings:
errorModel = LSSTErrorModel.make_stage(name='error_model')
To see the details of the model, including the default settings we are using, you can just print the model:
errorModel
LSSTErrorModel parameters:

Model for bands: u, g, r, i, z, y
Using error type point
Exposure time = 30.0 s
Number of years of observations = 10.0
Mean visits per year per band:
  u: 5.6, g: 8.0, r: 18.4, i: 18.4, z: 16.0, y: 16.0
Airmass = 1.2
Irreducible system error = 0.005
Magnitudes dimmer than 30.0 are set to nan
gamma for each band:
  u: 0.038, g: 0.039, r: 0.039, i: 0.039, z: 0.039, y: 0.039
The coadded 5-sigma limiting magnitudes are:
  u: 26.04, g: 27.29, r: 27.31, i: 26.87, z: 26.23, y: 25.30
The following single-visit 5-sigma limiting magnitudes are calculated using the parameters that follow them:
  u: 23.83, g: 24.90, r: 24.47, i: 24.03, z: 23.46, y: 22.53
Cm for each band:
  u: 23.09, g: 24.42, r: 24.44, i: 24.32, z: 24.16, y: 23.73
Median zenith sky brightness in each band:
  u: 22.99, g: 22.26, r: 21.2, i: 20.48, z: 19.6, y: 18.61
Median zenith seeing FWHM (in arcseconds) for each band:
  u: 0.81, g: 0.77, r: 0.73, i: 0.71, z: 0.69, y: 0.68
Extinction coefficient for each band:
  u: 0.491, g: 0.213, r: 0.126, i: 0.096, z: 0.069, y: 0.17
Now let's add this error model as a degrader and draw some samples with photometric errors.
samples_w_errs = errorModel(samples_truth)
samples_w_errs()
Inserting handle into data store. output_error_model: inprogress_output_error_model.pq, error_model
|  | redshift | u | u_err | g | g_err | r | r_err | i | i_err | z | z_err | y | y_err |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.890625 | NaN | NaN | 26.562721 | 0.105583 | 26.084861 | 0.068194 | 25.340978 | 0.052257 | 25.021891 | 0.069445 | 25.047443 | 0.159796 |
1 | 1.978239 | NaN | NaN | 28.038419 | 0.362520 | 27.490722 | 0.229680 | 28.102581 | 0.525461 | 26.066428 | 0.172483 | 25.834953 | 0.307316 |
2 | 0.974287 | 26.873697 | 0.389236 | 25.882633 | 0.057988 | 24.797719 | 0.021944 | 23.873355 | 0.014716 | 23.128763 | 0.013557 | 22.861474 | 0.023448 |
3 | 1.317979 | 27.914048 | 0.817339 | 27.705399 | 0.277971 | 27.204204 | 0.180633 | 26.703293 | 0.172092 | 25.931166 | 0.153677 | 25.795159 | 0.297649 |
4 | 1.386366 | 26.336934 | 0.253759 | 25.750773 | 0.051593 | 25.483414 | 0.039993 | 24.809233 | 0.032626 | 24.301733 | 0.036670 | 23.576059 | 0.043921 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
99995 | 2.147172 | 26.643909 | 0.325091 | 26.212954 | 0.077661 | 26.220695 | 0.076900 | 26.027656 | 0.095907 | 26.102146 | 0.177794 | 25.635739 | 0.261534 |
99996 | 1.457508 | 26.621966 | 0.319467 | 26.982388 | 0.151845 | 26.542811 | 0.102093 | 26.446734 | 0.138137 | 25.959232 | 0.157414 | 25.461991 | 0.226646 |
99997 | 1.372992 | 26.679523 | 0.334399 | 27.416936 | 0.219265 | 27.042587 | 0.157411 | 26.480484 | 0.142215 | 26.165722 | 0.187622 | 24.902178 | 0.141068 |
99998 | 0.855022 | 26.886674 | 0.393155 | 27.355825 | 0.208363 | 26.494891 | 0.097896 | 25.783669 | 0.077364 | 25.514723 | 0.107157 | 25.333237 | 0.203557 |
99999 | 1.723768 | 27.557109 | 0.643300 | 26.442709 | 0.095055 | 26.216528 | 0.076618 | 25.710465 | 0.072517 | 25.169914 | 0.079153 | 24.799610 | 0.129108 |
100000 rows × 13 columns
Notice that some of the magnitudes are NaNs. These are non-detections: the observed magnitude was dimmer than the 30 mag limit that is the default in LSSTErrorModel. You can change this limit and the corresponding flag by setting magLim=... and ndFlag=... in the constructor for LSSTErrorModel.
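For example, here is a sketch of an error model with a shallower limit and a different non-detection flag. The specific values are illustrative, and we assume magLim and ndFlag are accepted by make_stage like the other configuration parameters:
import numpy as np

errorModel_shallow = LSSTErrorModel.make_stage(
    name="error_model_shallow",
    magLim=28.0,     # flag magnitudes dimmer than 28 as non-detections
    ndFlag=np.inf,   # use +inf instead of NaN to flag non-detections
)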
Let's plot the error as a function of magnitude
fig, ax = plt.subplots(figsize=(5, 4), dpi=100)
for band in "ugrizy":
# pull out the magnitudes and errors
mags = samples_w_errs.data[band].to_numpy()
errs = samples_w_errs.data[band + "_err"].to_numpy()
# sort them by magnitude
mags, errs = mags[mags.argsort()], errs[mags.argsort()]
# plot errs vs mags
ax.plot(mags, errs, label=band)
ax.legend()
ax.set(xlabel="Magnitude (AB)", ylabel="Error (mags)")
plt.show()
You can see that the photometric error increases as magnitude gets dimmer, just like you would expect. Notice, however, that we have galaxies as dim as magnitude 30. This is because the Flow produces a sample much deeper than the LSST 5-sigma limiting magnitudes. There are no galaxies dimmer than magnitude 30 because LSSTErrorModel sets magnitudes > 30 equal to NaN (the default flag for non-detections).
Recall that the sample above has galaxies as dim as magnitude 30. This is well beyond the LSST 5-sigma limiting magnitudes, so it is useful to apply cuts that filter out these super-dim galaxies. We can apply these cuts using the QuantityCut degrader, which removes any galaxies that do not pass all of the specified cuts.
Let's make a degrader that cuts at i < 25.3, which defines the LSST gold sample.
gold_cut = QuantityCut.make_stage(name='cuts', cuts={"i": 25.3})
Now we can run this degrader on the sample with photometric errors to produce a new, cut sample.
samples_gold_w_errs = gold_cut(samples_w_errs)
Inserting handle into data store. output_cuts: inprogress_output_cuts.pq, cuts
If you look at the i column, you will see there are no longer any samples with i > 25.3. The number of galaxies returned has been nearly cut in half relative to the input sample and, unlike with the LSSTErrorModel degrader, is not equal to the number of input objects. In general, for degraders that remove galaxies from the sample, the size of the output will not equal that of the input.
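As a quick, purely illustrative check using the DataHandles we already have:
# Confirm the cut removed all galaxies with i > 25.3 and shrank the sample
print("max i after cut:", samples_gold_w_errs()["i"].max())
print("galaxies kept:", len(samples_gold_w_errs()), "of", len(samples_w_errs()))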
One more note: it is easy to use the QuantityCut degrader as an SNR cut on the magnitudes. The magnitude equation is $m = -2.5 \log_{10}(f)$. Taking the derivative, we have
$$
dm = \frac{2.5}{\ln(10)} \frac{df}{f} = \frac{2.5}{\ln(10)} \frac{1}{\mathrm{SNR}}.
$$
So if you want to make a cut on galaxies above a certain SNR, you can make a cut
$$
dm < \frac{2.5}{\ln(10)} \frac{1}{\mathrm{SNR}}.
$$
For example, an SNR cut on the i band would look like this: QuantityCut({"i_err": 2.5/np.log(10) * 1/SNR}).
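As a minimal sketch written with the same make_stage pattern used above (the stage name "snr_cut" and the SNR threshold are illustrative choices, not part of the original example):
import numpy as np

snr = 10  # keep galaxies with i-band SNR > 10 (illustrative threshold)
snr_cut = QuantityCut.make_stage(name="snr_cut", cuts={"i_err": 2.5 / np.log(10) / snr})
# This stage can then be applied to a sample with errors, e.g. snr_cut(samples_w_errs)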
Next, we will demonstrate the InvRedshiftIncompleteness degrader. It applies a selection function, which keeps galaxies with probability $p_{\text{keep}}(z) = \min(1, \frac{z_p}{z})$, where $z_p$ is the "pivot" redshift. We'll use $z_p = 0.8$.
inv_incomplete = InvRedshiftIncompleteness.make_stage(name='incompleteness', pivot_redshift=0.8)
samples_incomplete_gold_w_errs = inv_incomplete(samples_gold_w_errs)
Inserting handle into data store. output_incompleteness: inprogress_output_incompleteness.pq, incompleteness
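As an aside, here is a minimal, purely illustrative sketch of the selection function itself, just evaluating $p_{\text{keep}}(z) = \min(1, z_p/z)$ for the pivot redshift we chose:
import numpy as np

z_grid = np.linspace(0.01, 2.5, 250)
p_keep = np.clip(0.8 / z_grid, 0.0, 1.0)  # p_keep(z) = min(1, z_p / z) with z_p = 0.8

fig, ax = plt.subplots(figsize=(5, 4), dpi=100)
ax.plot(z_grid, p_keep)
ax.set(xlabel="Redshift", ylabel="p_keep")
plt.show()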
Let's plot the redshift distributions of the samples we have generated so far:
fig, ax = plt.subplots(figsize=(5, 4), dpi=100)
zmin = 0
zmax = 2.5
hist_settings = {
"bins": 50,
"range": (zmin, zmax),
"density": True,
"histtype": "step",
}
ax.hist(samples_truth()["redshift"], label="Truth", **hist_settings)
ax.hist(samples_gold_w_errs()["redshift"], label="Gold", **hist_settings)
ax.hist(samples_incomplete_gold_w_errs()["redshift"], label="Incomplete Gold", **hist_settings)
ax.legend(title="Sample")
ax.set(xlim=(zmin, zmax), xlabel="Redshift", ylabel="Galaxy density")
plt.show()
You can see that the Gold sample has significantly fewer high-redshift galaxies than the truth. This is because many of the high-redshift galaxies have i > 25.3.
You can further see that the Incomplete Gold sample has even fewer high-redshift galaxies. This is exactly what we expected from this degrader.
LineConfusion is a degrader that simulates spectroscopic errors resulting from the confusion of different emission lines.
For this example, let's use the degrader to simulate a scenario in which 2% of [OII] lines are mistaken as [OIII] lines, and 1% of [OIII] lines are mistaken as [OII] lines. (Note that I do not know how realistic this scenario is!)
OII = 3727
OIII = 5007
lc_2p_0II_0III = LineConfusion.make_stage(name='lc_2p_0II_0III',
true_wavelen=OII, wrong_wavelen=OIII, frac_wrong=0.02)
lc_1p_0III_0II = LineConfusion.make_stage(name='lc_1p_0III_0II',
true_wavelen=OIII, wrong_wavelen=OII, frac_wrong=0.01)
samples_conf_inc_gold_w_errs = lc_1p_0III_0II(lc_2p_0II_0III(samples_incomplete_gold_w_errs))
Inserting handle into data store. output_lc_2p_0II_0III: inprogress_output_lc_2p_0II_0III.pq, lc_2p_0II_0III
Inserting handle into data store. output_lc_1p_0III_0II: inprogress_output_lc_1p_0III_0II.pq, lc_1p_0III_0II
Let's plot the redshift distributions one more time
fig, ax = plt.subplots(figsize=(5, 4), dpi=100)
zmin = 0
zmax = 2.5
hist_settings = {
"bins": 50,
"range": (zmin, zmax),
"density": True,
"histtype": "step",
}
ax.hist(samples_truth()["redshift"], label="Truth", **hist_settings)
ax.hist(samples_gold_w_errs()["redshift"], label="Gold", **hist_settings)
ax.hist(samples_incomplete_gold_w_errs()["redshift"], label="Incomplete Gold", **hist_settings)
ax.hist(samples_conf_inc_gold_w_errs()["redshift"], label="Confused Incomplete Gold", **hist_settings)
ax.legend(title="Sample")
ax.set(xlim=(zmin, zmax), xlabel="Redshift", ylabel="Galaxy density")
plt.show()
You can see that the redshift distribution of this new sample is essentially identical to the Incomplete Gold sample, with small perturbations that result from the line confusion.
However the real impact of this degrader isn't on the redshift distribution, but rather that it introduces erroneous spec-z's into the photo-z training sets! To see the impact of this effect, let's plot the true spec-z's as present in the Incomplete Gold sample, vs the spec-z's listed in the new sample with Oxygen Line Confusion.
fig, ax = plt.subplots(figsize=(6, 6), dpi=85)
ax.scatter(samples_incomplete_gold_w_errs()["redshift"], samples_conf_inc_gold_w_errs()["redshift"],
marker=".", s=1)
ax.set(
xlim=(0, 2.5), ylim=(0, 2.5),
xlabel="True spec-z (in Incomplete Gold sample)",
ylabel="Spec-z listed in the Confused sample",
)
plt.show()
Now we can clearly see the spec-z errors! The galaxies above the line y=x are the [OII] -> [OIII] galaxies, while the ones below are the [OIII] -> [OII] galaxies.
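As a hedged sketch of why the points split this way, assuming the confused redshift follows z_wrong = (1 + z_true) * lambda_wrong / lambda_true - 1 (the convention consistent with the plot described above):
import numpy as np

z_true = np.linspace(0, 2.5, 100)
z_OII_as_OIII = (1 + z_true) * OIII / OII - 1  # [OII] mistaken as [OIII]: larger assigned z, above y = x
z_OIII_as_OII = (1 + z_true) * OII / OIII - 1  # [OIII] mistaken as [OII]: smaller assigned z, below y = x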