rainforest.ml package¶
This submodule deals with the training and evaluation of machine learning QPE methods. It also allows reading trained models from pickle files stored in the rf_models subfolder.
rainforest.ml.rf
: main module used to train RF regressors
rainforest.ml.rfdefinitions
: reference module that contains definitions of the RF regressors and allows to load them from files
rainforest.ml.rf_train
: command-line utility to train RF models and prepare input features
rainforest.ml.utils
: small utilities used in this module only (for example for vertical aggregation)
rainforest.ml.rf module¶
Main module used to train RF regressors
-
class
rainforest.ml.rf.
RFTraining
(db_location, input_location=None, force_regenerate_input=False)¶ Bases:
object
This is the main class used to prepare data for random forest training, train random forests and perform cross-validation of trained models.
Initializes the class and, if needed, prepares the input data for the training.
Note that when calling this constructor the input data is only generated for the central pixel (NX = NY = 0, i.e. the location of the gauge). To regenerate the inputs for all neighbour pixels, call self.prepare_input(only_center=False).
- Parameters
db_location (str) – Location of the main directory of the database (with subfolders ‘reference’, ‘gauge’ and ‘radar’ on the filesystem)
input_location (str) – Location of the prepared input data. If this data cannot be found in this folder, it will be computed and stored there. The default is a subfolder called rf_input_data within db_location
force_regenerate_input (bool) – if True the input parquet files will always be regenerated from the database even if already present in the input_location folder
-
fit_models
(config_file, features_dic, tstart=None, tend=None, output_folder=None)¶ Fits new RF models that can be used to compute QPE realizations and saves them to disk in pickle format
- Parameters
config_file (str) – Location of the RF training configuration file, if not provided the default one in the ml submodule will be used
features_dic (dict) – A dictionary whose keys are the names of the models you want to create (strings) and whose values are lists of the features you want to use. For example, {'RF_dualpol': ['RADAR', 'zh_VISIB_mean', 'zv_VISIB_mean', 'KDP_mean', 'RHOHV_mean', 'T', 'HEIGHT', 'VISIB_mean']} will train a model with all these features, which will then be stored under the name RF_dualpol_BC_<type of BC>.p in the ml/rf_models directory
tstart (datetime) – the starting time of the training time interval, default is to start at the beginning of the time interval covered by the database
tend (datetime) – the end time of the training time interval, default is to end at the end of the time interval covered by the database
output_folder (str) – Location where to store the trained models in pickle format, if not provided it will store them in the standard location <library_path>/ml/rf_models
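As a minimal sketch of the naming convention described above, the snippet below builds a features_dic and derives the file names under which fitted models would be stored. The helper model_filename is hypothetical and is not part of the rainforest API. The list of bias-correction types is taken from the RandomForestRegressorBC documentation; which of them are actually trained is assumed to depend on the configuration file.

```python
# Feature list taken from the example in the fit_models documentation.
features_dic = {
    "RF_dualpol": ["RADAR", "zh_VISIB_mean", "zv_VISIB_mean", "KDP_mean",
                   "RHOHV_mean", "T", "HEIGHT", "VISIB_mean"],
}


def model_filename(model_name, bctype):
    """Hypothetical helper following the pattern <name>_BC_<type of BC>.p."""
    return f"{model_name}_BC_{bctype}.p"


# Assumption: one file per model name and bias-correction type.
for name in features_dic:
    for bctype in ("raw", "cdf", "spline"):
        print(model_filename(name, bctype))
```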
-
model_intercomparison
(features_dic, intercomparison_configfile, output_folder, reference_products=['CPC', 'RZC'], bounds10=[0, 2, 10, 100], bounds60=[0, 1, 10, 100], K=5)¶ Performs an intercomparison (cross-validation) of different RF models and reference products (RZC, CPC, …) and produces the performance plots
- Parameters
features_dic (dict) – A dictionary whose keys are the names of the models you want to compare (strings) and whose values are lists of the features you want to use. For example, {'RF_dualpol': ['RADAR', 'zh_VISIB_mean', 'zv_VISIB_mean', 'KDP_mean', 'RHOHV_mean', 'T', 'HEIGHT', 'VISIB_mean'], 'RF_hpol': ['RADAR', 'zh_VISIB_mean', 'T', 'HEIGHT', 'VISIB_mean']} will compare an RF model with polarimetric information to a model with only horizontal polarization
output_folder (str) – Location where to store the output plots
intercomparison_configfile (str) – Location of the intercomparison configuration file, a YAML file that gives, for every model key of features_dic, the training parameters you want to use (see the file intercomparison_config_example.yml in this module for an example)
reference_products (list of str) – Names of the reference products to which the RF will be compared. They need to be in the reference table of the database
bounds10 (list of float) – list of precipitation bounds for which to compute scores separately at 10 min time resolution. For example, [0,2,10,100] will give scores in the ranges [0-2], [2-10] and [10-100]
bounds60 (list of float) – list of precipitation bounds for which to compute scores separately at hourly time resolution. For example, [0,1,10,100] will give scores in the ranges [0-1], [1-10] and [10-100]
K (int) – Number of splits (iterations) to perform in the K-fold cross-validation
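The way a list of bounds translates into score ranges can be illustrated with a short sketch. The helpers score_ranges and assign_range are hypothetical names, and treating each range as half-open [low, high) is an assumption; the library may handle the edges differently.

```python
def score_ranges(bounds):
    """Turn a list of bounds into consecutive (low, high) ranges."""
    return list(zip(bounds[:-1], bounds[1:]))


def assign_range(value, bounds):
    """Return the index of the range containing value, or None if outside.

    Assumption: ranges are half-open, i.e. low <= value < high.
    """
    for i, (low, high) in enumerate(score_ranges(bounds)):
        if low <= value < high:
            return i
    return None


bounds10 = [0, 2, 10, 100]
print(score_ranges(bounds10))
print(assign_range(5.0, bounds10))
```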
-
prepare_input
(only_center=True)¶ Reads the data from the database in db_location, processes it to create easy-to-use parquet input files for the ML training, and stores them in input_location. The processing steps are:
For every neighbour of the station (i.e. from (NX, NY) = (-1, -1) to (+1, +1)):
Replace missing flags by nans
Filter out timesteps which are not present in the three tables (gauge, reference and radar)
Filter out incomplete hours (i.e. where fewer than six 10-min timesteps are available)
Add height above ground and height of iso0 to radar data
Save a separate parquet file for radar, gauge and reference data
Save a grouping_idx pickle file containing grp_vertical index (groups all radar rows with same timestep and station), grp_hourly (groups all timesteps with same hours) and tstamp_unique (list of all unique timestamps)
- Parameters
only_center (bool) – If set to True, only the input data for the central neighbour, i.e. NX = NY = 0 (the location of the gauge), will be recomputed. This takes much less time and is the default option, since the neighbour values are so far not used in the training of the RF QPE
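The "incomplete hours" filtering step listed above can be sketched in plain Python. This is an illustration, not the library's implementation: the function name complete_hour_mask is hypothetical, timestamps are assumed to be UNIX seconds, and hours are assumed to be aligned on clock hours.

```python
from collections import Counter


def complete_hour_mask(timestamps, steps_per_hour=6):
    """Flag timestamps that belong to an hour with all 10-min steps present.

    Assumption: an hour is identified by timestamp // 3600.
    """
    # Count distinct timesteps per hour so duplicates are not counted twice
    counts = Counter(t // 3600 for t in set(timestamps))
    return [counts[t // 3600] >= steps_per_hour for t in timestamps]


# First hour is complete (6 steps), second hour has only 2 steps
tstamps = [0, 600, 1200, 1800, 2400, 3000, 3600, 4200]
mask = complete_hour_mask(tstamps)
kept = [t for t, keep in zip(tstamps, mask) if keep]
print(kept)
```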
rainforest.ml.rf_train module¶
Command line script to prepare input features and train RF models
see rf_train
-
rainforest.ml.rf_train.
main
()¶
rainforest.ml.rfdefinitions module¶
Class declarations and reading functions required to unpickle trained RandomForest models
Daniel Wolfensberger MeteoSwiss/EPFL daniel.wolfensberger@epfl.ch December 2019
-
class
rainforest.ml.rfdefinitions.
MyCustomUnpickler
¶ Bases:
_pickle.Unpickler
This is an extension of the pickle Unpickler that handles the bookkeeping of references to the RandomForestRegressorBC class
-
find_class
(module, name)¶ Return an object from a specified module.
If necessary, the module will be imported. Subclasses may override this method (e.g. to restrict unpickling of arbitrary classes and functions).
This method is called whenever a class or a function object is needed. Both arguments passed are str objects.
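The general technique of overriding find_class can be sketched with the standard library alone. The example below is not the code of MyCustomUnpickler: the class RemappingUnpickler, the module name legacy_collections and the mapping are all hypothetical, chosen only to show how a reference stored in a pickle can be redirected to the module where the class currently lives.

```python
import io
import pickle
from collections import OrderedDict


class RemappingUnpickler(pickle.Unpickler):
    """Unpickler that redirects selected module names before importing."""

    # Hypothetical mapping: module path stored in the pickle -> current path
    MODULE_MAP = {"legacy_collections": "collections"}

    def find_class(self, module, name):
        # Called by the unpickler for every class or function reference
        module = self.MODULE_MAP.get(module, module)
        return super().find_class(module, name)


# Hand-written protocol-0 pickle that references the class
# "legacy_collections.OrderedDict" and calls it with no arguments.
payload = b"clegacy_collections\nOrderedDict\n(tR."
restored = RemappingUnpickler(io.BytesIO(payload)).load()
print(type(restored).__name__)
```

With the plain pickle.Unpickler the same payload would fail to import legacy_collections; the override resolves the reference before the import is attempted.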
-
-
class
rainforest.ml.rfdefinitions.
RandomForestRegressorBC
(variables, beta, degree=1, bctype='cdf', n_estimators=100, criterion='mse', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False)¶ Bases:
sklearn.ensemble.forest.RandomForestRegressor
This is an extension of the RandomForestRegressor regressor class of sklearn that does additional bias correction, is able to apply a rounding function to the outputs on the fly and adds a bit of metadata:
bctype : type of bias correction method variables : name of input features beta : weighting factor in vertical aggregation degree : order of the polyfit used in some bias-correction methods
For bctype, the available methods are currently “raw”: simple linear fit between prediction and observation, “cdf”: linear fit between sorted predictions and sorted observations, and “spline”: spline fit between sorted predictions and sorted observations. Any new method must be added to this class in order to be used.
For any information regarding the sklearn parent class see
https://github.com/scikit-learn/scikit-learn/blob/b194674c4/sklearn/ensemble/_forest.py#L1150
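The difference between the “raw” and “cdf” methods can be illustrated with a small pure-Python sketch. It only covers a first-degree fit (degree = 1), and the function names, as well as the choice of fitting observations as a function of predictions, are assumptions made for illustration rather than the class's actual code.

```python
def linear_fit(x, y):
    """Ordinary least-squares fit y ~ a * x + b, returns (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    a = sxy / sxx
    return a, mean_y - a * mean_x


def fit_bias_correction(pred, obs, bctype="cdf"):
    """Return (a, b) such that corrected = a * prediction + b."""
    if bctype == "raw":
        # Pairwise fit between each prediction and its observation
        return linear_fit(pred, obs)
    if bctype == "cdf":
        # Fit between the sorted samples, i.e. between the two distributions
        return linear_fit(sorted(pred), sorted(obs))
    raise ValueError(f"unsupported bctype in this sketch: {bctype}")


pred = [1.0, 2.0, 3.0, 4.0]
obs = [8.0, 2.0, 6.0, 4.0]
a_cdf, b_cdf = fit_bias_correction(pred, obs, "cdf")
a_raw, b_raw = fit_bias_correction(pred, obs, "raw")
corrected = [a_cdf * p + b_cdf for p in pred]
print(a_cdf, b_cdf, a_raw, b_raw)
```

Because the “cdf” variant sorts both samples first, it matches the distribution of predictions to that of observations and is insensitive to how individual pairs line up, which is why the two fits differ here.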
-
fit
(X, y, sample_weight=None)¶ Fit both the estimator and the a-posteriori bias correction
- Parameters
X – The input samples. Use dtype=np.float32 for maximum efficiency. Sparse matrices are also supported, use sparse csc_matrix for maximum efficiency.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights. If None, then samples are equally weighted. Splits that would create child nodes with net zero or negative weight are ignored while searching for a split in each node. In the case of classification, splits are also ignored if they would result in any single class carrying a negative weight in either child node.
- Returns
self
- Return type
object
-
predict
(X, round_func=None, bc=True)¶ Predict regression target for X. The predicted regression target of an input sample is computed as the mean predicted regression targets of the trees in the forest.
- Parameters
X – The input samples. Internally, its dtype will be converted to dtype=np.float32. If a sparse matrix is provided, it will be converted into a sparse csr_matrix.
round_func (lambda function) – Optional function to apply to outputs (for example to discretize them using MCH lookup tables). If not provided f(x) = x will be applied (i.e. no function)
bc (bool) – if True the bias correction function will be applied
- Returns
y – The predicted values.
- Return type
array-like of shape (n_samples,) or (n_samples, n_outputs)
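A rounding function of the kind that could be passed as round_func can be sketched as follows. The levels in LEVELS are invented for illustration and are not the MCH lookup table, and the function works on plain Python sequences, whereas the real predictions are presumably numpy arrays.

```python
import bisect

# Hypothetical discretization levels (mm/h), NOT the MCH lookup table
LEVELS = [0.0, 0.1, 0.5, 1.0, 2.0, 5.0, 10.0]


def round_to_levels(values, levels=LEVELS):
    """Map each value to the largest level that does not exceed it."""
    out = []
    for v in values:
        idx = bisect.bisect_right(levels, v) - 1
        # Values below the first level are clipped to the first level
        out.append(levels[max(idx, 0)])
    return out


print(round_to_levels([0.05, 0.7, 3.0, 50.0]))
```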
-
-
rainforest.ml.rfdefinitions.
read_rf
(rf_name)¶ Reads a random forest model from the RF models folder using pickle. All custom classes and functions used in the construction of these pickled models must be defined in the script ml/rfdefinitions.py
- Parameters
rf_name (str) – Name of the random forest model. It must be stored in the folder /ml/rf_models and must have been computed with the rf.RFTraining.fit_models method
- Returns
A trained sklearn random forest instance with a predict() method,
which can be used to predict precipitation intensities for new points
rainforest.ml.utils module¶
Utility functions for the ML submodule
-
rainforest.ml.utils.
nesteddictvalues
(d)¶
-
rainforest.ml.utils.
split_event
(timestamps, n=5, threshold_hr=12)¶ Splits the dataset into n subsets by separating the observations into separate precipitation events and attributing these events randomly to the subsets
- Parameters
timestamps (int array) – array containing the UNIX timestamps of the precipitation observations
n (int) – number of subsets to create
threshold_hr (int) – threshold in hours to distinguish precipitation events. Two timestamps are considered to belong to different events if there are at least threshold_hr hours without observations (no rain) between them.
- Returns
split_idx – array containing the subset grouping, with values from 0 to n - 1
- Return type
int array
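The event-based splitting described above can be sketched in a few lines of standard-library Python. This is an illustrative reimplementation under stated assumptions, not the library's code: the function name split_events_sketch and the seed parameter are hypothetical, and each event is simply assigned a uniformly random subset.

```python
import random


def split_events_sketch(timestamps, n=5, threshold_hr=12, seed=None):
    """Assign every timestamp to one of n subsets, event by event."""
    rng = random.Random(seed)
    threshold_s = threshold_hr * 3600
    # Visit the observations in chronological order
    order = sorted(range(len(timestamps)), key=lambda i: timestamps[i])
    split_idx = [0] * len(timestamps)
    subset = rng.randrange(n)
    previous = None
    for i in order:
        t = timestamps[i]
        # A gap of at least threshold_hr hours starts a new event
        if previous is not None and t - previous >= threshold_s:
            subset = rng.randrange(n)
        split_idx[i] = subset
        previous = t
    return split_idx


tstamps = [0, 600, 1200, 100000, 100600]
idx = split_events_sketch(tstamps, n=5, threshold_hr=12, seed=0)
print(idx)
```

Keeping all timestamps of one event in the same subset avoids having strongly correlated observations on both sides of a cross-validation split.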
-
rainforest.ml.utils.
vert_aggregation
(radar_data, vert_weights, grp_vertical, visib_weight=True, visib=None)¶ Performs vertical aggregation of radar observations aloft to the ground using a weighted average. Categorical variables such as 'RADAR', 'HYDRO' and 'TCOUNT' will be converted to dummy variables, and these dummy variables will be aggregated, resulting in columns such as RADAR_propA, which gives the weighted proportion of radar observations aloft that were obtained with the Albis radar
- Parameters
radar_data (Pandas DataFrame) – A Pandas DataFrame containing all required input features aloft as explained in the rf.py module
vert_weights (np.array of float) – vertical weights to use for every observation in radar, must have the same len as radar_data
grp_vertical (np.array of int) – grouping index for the vertical aggregation. It must have the same len as radar_data. All observations corresponding to the same timestep must have the same label
visib_weight (bool) – if True the input features will be weighted by the visibility when doing the vertical aggregation to the ground
visib (np array) – visibility of every observation, required only if visib_weight = True
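The core of the vertical aggregation, a weighted average per group, can be sketched without pandas. The function weighted_group_mean is a hypothetical name, it handles a single numeric variable only, and expressing visibility in percent is an assumption; the dummy-variable handling of categorical columns is not shown.

```python
from collections import defaultdict


def weighted_group_mean(values, weights, groups, visib=None):
    """Weighted average of values within each group label."""
    num = defaultdict(float)
    den = defaultdict(float)
    for i, (v, w, g) in enumerate(zip(values, weights, groups)):
        if visib is not None:
            # Assumption: visibility is given in percent (0 to 100)
            w = w * visib[i] / 100.0
        num[g] += w * v
        den[g] += w
    # Groups whose total weight is zero are left out
    return {g: num[g] / den[g] for g in num if den[g] > 0}


values = [10.0, 20.0, 30.0]
weights = [1.0, 3.0, 1.0]
groups = [0, 0, 1]
plain = weighted_group_mean(values, weights, groups)
with_visib = weighted_group_mean(values, weights, groups, visib=[100, 0, 50])
print(plain, with_visib)
```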