13  HPT: River

River is a Python library for online machine learning (Montiel et al. 2021). It aims to be the most user-friendly library for doing machine learning on streaming data. River is the result of a merger between creme and scikit-multiflow.
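The incremental API is River's defining feature: models consume one observation at a time via learn_one and predict_one. A minimal sketch with a toy stream (illustrative only, not part of the experiments below):

from river import linear_model, metrics

model = linear_model.LinearRegression()
metric = metrics.MAE()
# Prequential evaluation: predict on each observation, then learn from it
for x, y in [({"x": 1.0}, 2.0), ({"x": 2.0}, 4.0)]:
    y_pred = model.predict_one(x)
    model.learn_one(x, y)
    metric.update(y, y_pred)
print(metric)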

13.1 Step 1: Setup

Before we consider the detailed experimental setup, we select the parameters that affect the run time (MAX_TIME), the initial design size (INIT_SIZE), and the size of the data set (K).

Caution: Run time and initial design size should be increased for real experiments
  • MAX_TIME is set to one minute for demonstration purposes. For real experiments, this should be increased to at least 1 hour.
  • INIT_SIZE is set to 5 for demonstration purposes. For real experiments, this should be increased to at least 10.
  • K, the scaling factor for the size of the Friedman drift data set, is set to 0.1 for demonstration purposes. For real experiments, this should be increased to at least 1.
MAX_TIME = 1
INIT_SIZE = 5
K = .1
10-river_bartz09_1min_5init_2023-06-19_03-02-03
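The string above is the experiment name, which is reused below to build file names for results and figures. The generating code is not shown here; a minimal stand-in is to store the printed name directly (the original presumably derived it from a prefix, the host name, MAX_TIME, INIT_SIZE, and a timestamp):

# Stand-in: in the original notebook this name was generated automatically
experiment_name = "10-river_bartz09_1min_5init_2023-06-19_03-02-03"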

13.1.1 river Hyperparameter Tuning: HATR (Hoeffding Adaptive Tree Regressor) with Friedman Drift Data

pip list | grep "spot[RiverPython]"
spotPython                 0.2.38
spotRiver                  0.0.93
Note: you may need to restart the kernel to use updated packages.
# import sys
# !{sys.executable} -m pip install --upgrade build
# !{sys.executable} -m pip install --upgrade --force-reinstall spotPython

13.2 Step 2: Initialization of the fun_control Dictionary

from spotPython.utils.init import fun_control_init
fun_control = fun_control_init(task="regression",
                               tensorboard_path=None)

13.3 Step 3: Load the Friedman Drift Data

horizon = 7*24
k = K
n_total = int(k*100_000)
n_samples = n_total
p_1 = int(k*25_000)
p_2 = int(k*50_000)
position = (p_1, p_2)
n_train = 1_000
# Window [a, b] of 12 samples around the first drift point (for plotting)
a = n_train + p_1 - 12
b = a + 12
  • Since we also need a river version of the data for plotting the model below, the corresponding data set is generated here. Note: spotRiver uses the train and test data sets, while river uses the X and y data sets.
from river.datasets import synth
import pandas as pd
dataset = synth.FriedmanDrift(
    drift_type='gra',
    position=position,
    seed=123
)
data_dict = {key: [] for key in list(dataset.take(1))[0][0].keys()}
data_dict["y"] = []
for x, y in dataset.take(n_total):
    for key, value in x.items():
        data_dict[key].append(value)
    data_dict["y"].append(y)
df = pd.DataFrame(data_dict)
# Name the first 10 columns of the dataframe x1 to x10 and the last column y
df.columns = [f"x{i}" for i in range(1, 11)] + ["y"]

train = df[:n_train]
test = df[n_train:]
target_column = "y"
#
fun_control.update({"data": None, # dataset,
               "train": train,
               "test": test,
               "n_samples": n_samples,
               "target_column": target_column})

13.4 Step 4: Specification of the Preprocessing Model

from river import preprocessing
prep_model = preprocessing.StandardScaler()
fun_control.update({"prep_model": prep_model})

13.5 Step 5: Select algorithm and core_model_hyper_dict

  • The river model (HATR) is selected.
  • Furthermore, the corresponding hyperparameters (incl. type information, names, and bounds) are selected, see: https://riverml.xyz/0.15.0/api/tree/HoeffdingTreeRegressor/.
  • The corresponding hyperparameter dictionary is added to the fun_control dictionary.
  • Alternatively, you can load a local hyper_dict. Simply set the filename to river_hyper_dict.json. If filename is set to None, the hyper_dict is loaded from the spotRiver package.
from river.tree import HoeffdingAdaptiveTreeRegressor
from spotRiver.data.river_hyper_dict import RiverHyperDict
from spotPython.hyperparameters.values import add_core_model_to_fun_control
core_model = HoeffdingAdaptiveTreeRegressor
fun_control = add_core_model_to_fun_control(core_model=core_model,
                                            fun_control=fun_control,
                                            hyper_dict=RiverHyperDict,
                                            filename=None)

The corresponding entries for the core_model class are shown below.

fun_control['core_model_hyper_dict']
{'grace_period': {'type': 'int',
  'default': 200,
  'transform': 'None',
  'lower': 10,
  'upper': 1000},
 'max_depth': {'type': 'int',
  'default': 20,
  'transform': 'transform_power_2_int',
  'lower': 2,
  'upper': 20},
 'delta': {'type': 'float',
  'default': 1e-07,
  'transform': 'None',
  'lower': 1e-08,
  'upper': 1e-06},
 'tau': {'type': 'float',
  'default': 0.05,
  'transform': 'None',
  'lower': 0.01,
  'upper': 0.1},
 'leaf_prediction': {'levels': ['mean', 'model', 'adaptive'],
  'type': 'factor',
  'default': 'mean',
  'transform': 'None',
  'core_model_parameter_type': 'str',
  'lower': 0,
  'upper': 2},
 'leaf_model': {'levels': ['LinearRegression', 'PARegressor', 'Perceptron'],
  'type': 'factor',
  'default': 'LinearRegression',
  'transform': 'None',
  'class_name': 'river.linear_model',
  'core_model_parameter_type': 'instance()',
  'lower': 0,
  'upper': 2},
 'model_selector_decay': {'type': 'float',
  'default': 0.95,
  'transform': 'None',
  'lower': 0.9,
  'upper': 0.99},
 'splitter': {'levels': ['EBSTSplitter', 'TEBSTSplitter', 'QOSplitter'],
  'type': 'factor',
  'default': 'EBSTSplitter',
  'transform': 'None',
  'class_name': 'river.tree.splitter',
  'core_model_parameter_type': 'instance()',
  'lower': 0,
  'upper': 2},
 'min_samples_split': {'type': 'int',
  'default': 5,
  'transform': 'None',
  'lower': 2,
  'upper': 10},
 'bootstrap_sampling': {'levels': [0, 1],
  'type': 'factor',
  'default': 0,
  'transform': 'None',
  'core_model_parameter_type': 'bool',
  'lower': 0,
  'upper': 1},
 'drift_window_threshold': {'type': 'int',
  'default': 300,
  'transform': 'None',
  'lower': 100,
  'upper': 500},
 'switch_significance': {'type': 'float',
  'default': 0.05,
  'transform': 'None',
  'lower': 0.01,
  'upper': 0.1},
 'binary_split': {'levels': [0, 1],
  'type': 'factor',
  'default': 0,
  'transform': 'None',
  'core_model_parameter_type': 'bool',
  'lower': 0,
  'upper': 1},
 'max_size': {'type': 'float',
  'default': 500.0,
  'transform': 'None',
  'lower': 100.0,
  'upper': 1000.0},
 'memory_estimate_period': {'type': 'int',
  'default': 1000000,
  'transform': 'None',
  'lower': 100000,
  'upper': 1000000},
 'stop_mem_management': {'levels': [0, 1],
  'type': 'factor',
  'default': 0,
  'transform': 'None',
  'core_model_parameter_type': 'bool',
  'lower': 0,
  'upper': 1},
 'remove_poor_attrs': {'levels': [0, 1],
  'type': 'factor',
  'default': 0,
  'transform': 'None',
  'core_model_parameter_type': 'bool',
  'lower': 0,
  'upper': 1},
 'merit_preprune': {'levels': [0, 1],
  'type': 'factor',
  'default': 0,
  'transform': 'None',
  'core_model_parameter_type': 'bool',
  'lower': 0,
  'upper': 1}}
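Note the transform column in this dictionary: for max_depth, the tuner searches the base-2 exponent rather than the depth itself, and transform_power_2_int maps the tuned value x back to 2**x. A quick check for the default value:

# transform_power_2_int: the tuned value is an exponent
print(2**20)  # 1048576, the max_depth shown in the default model in Section 13.9.4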

13.6 Step 6: Modify hyper_dict Hyperparameters for the Selected Algorithm aka core_model

13.6.1 Modify hyperparameter of type factor

# from spotPython.hyperparameters.values import modify_hyper_parameter_levels
# fun_control = modify_hyper_parameter_levels(fun_control, "leaf_model", ["LinearRegression"])
# fun_control["core_model_hyper_dict"]

13.6.2 Modify hyperparameters of type numeric and integer (boolean)

from spotPython.hyperparameters.values import modify_hyper_parameter_bounds
fun_control = modify_hyper_parameter_bounds(fun_control, "delta", bounds=[1e-10, 1e-6])
# fun_control = modify_hyper_parameter_bounds(fun_control, "min_samples_split", bounds=[3, 20])
# Setting lower == upper fixes merit_preprune at 0 and removes it from the search
fun_control = modify_hyper_parameter_bounds(fun_control, "merit_preprune", bounds=[0, 0])

13.7 Step 7: Selection of the Objective (Loss) Function

There are three metrics:

1. `metric_river` is used for the river-based evaluation via `eval_oml_iter_progressive`.
2. `metric_sklearn` is used for the sklearn-based evaluation via `eval_oml_horizon`.
3. `metric_torch` is used for the PyTorch-based evaluation (not used in this chapter).
import numpy as np
from river import metrics
from sklearn.metrics import mean_absolute_error

from spotRiver.fun.hyperriver import HyperRiver
fun = HyperRiver(seed=123, log_level=50).fun_oml_horizon
# Weights for the three objective components: prediction error, computation
# time, and memory consumption. With these values, the error dominates.
weights = np.array([1, 1/1000, 1/1000])*10_000.0
horizon = 7*24
oml_grace_period = 2
step = 100
weight_coeff = 1.0

fun_control.update({
               "horizon": horizon,
               "oml_grace_period": oml_grace_period,
               "weights": weights,
               "step": step,
               "log_level": 50,
               "weight_coeff": weight_coeff,
               "metric_river": metrics.MAE(),
               "metric_sklearn": mean_absolute_error
               })
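The weights vector controls how the three components of the objective (prediction error, computation time, and memory consumption) are traded off against each other. As a rough sketch for intuition (an assumption about how fun_oml_horizon aggregates the components, not its exact implementation):

def weighted_objective(error, time_s, memory, weights):
    # Sketch (assumption): weighted sum of error, time, and memory
    return weights[0] * error + weights[1] * time_s + weights[2] * memory

With weights = [10000, 10, 10] as above, one unit of MAE influences the objective as much as 1000 units of time or memory.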

13.8 Step 8: Calling the SPOT Function

13.8.1 Prepare the SPOT Parameters

  • Get types and variable names as well as lower and upper bounds for the hyperparameters.
from spotPython.hyperparameters.values import (
    get_var_type,
    get_var_name,
    get_bound_values
    )
var_type = get_var_type(fun_control)
var_name = get_var_name(fun_control)
fun_control.update({"var_type": var_type,
                    "var_name": var_name})

lower = get_bound_values(fun_control, "lower")
upper = get_bound_values(fun_control, "upper")
from spotPython.utils.eda import gen_design_table
print(gen_design_table(fun_control))
| name                   | type   | default          |      lower |    upper | transform             |
|------------------------|--------|------------------|------------|----------|-----------------------|
| grace_period           | int    | 200              |     10     | 1000     | None                  |
| max_depth              | int    | 20               |      2     |   20     | transform_power_2_int |
| delta                  | float  | 1e-07            |      1e-10 |    1e-06 | None                  |
| tau                    | float  | 0.05             |      0.01  |    0.1   | None                  |
| leaf_prediction        | factor | mean             |      0     |    2     | None                  |
| leaf_model             | factor | LinearRegression |      0     |    2     | None                  |
| model_selector_decay   | float  | 0.95             |      0.9   |    0.99  | None                  |
| splitter               | factor | EBSTSplitter     |      0     |    2     | None                  |
| min_samples_split      | int    | 5                |      2     |   10     | None                  |
| bootstrap_sampling     | factor | 0                |      0     |    1     | None                  |
| drift_window_threshold | int    | 300              |    100     |  500     | None                  |
| switch_significance    | float  | 0.05             |      0.01  |    0.1   | None                  |
| binary_split           | factor | 0                |      0     |    1     | None                  |
| max_size               | float  | 500.0            |    100     | 1000     | None                  |
| memory_estimate_period | int    | 1000000          | 100000     |  1000000 | None                  |
| stop_mem_management    | factor | 0                |      0     |    1     | None                  |
| remove_poor_attrs      | factor | 0                |      0     |    1     | None                  |
| merit_preprune         | factor | 0                |      0     |    0     | None                  |

13.8.2 Run the Spot Optimizer

  • Run SPOT for approx. MAX_TIME minutes (here: 1 minute).
  • Note: the run takes longer than MAX_TIME, because the evaluation time of the initial design (here: INIT_SIZE = 5 points) is not counted against the time budget.
from spotPython.hyperparameters.values import get_default_hyperparameters_as_array
hyper_dict=RiverHyperDict().load()
X_start = get_default_hyperparameters_as_array(fun_control, hyper_dict)
from spotPython.spot import spot
from math import inf
import numpy as np
spot_tuner = spot.Spot(fun=fun,
                       lower=lower,
                       upper=upper,
                       fun_evals=inf,
                       fun_repeats=1,
                       max_time=MAX_TIME,
                       noise=False,
                       tolerance_x=np.sqrt(np.spacing(1)),
                       var_type=var_type,
                       var_name=var_name,
                       infill_criterion="y",
                       n_points=1,
                       seed=123,
                       log_level=50,
                       show_models=False,
                       show_progress=True,
                       fun_control=fun_control,
                       design_control={"init_size": INIT_SIZE,
                                       "repeats": 1},
                       surrogate_control={"noise": True,
                                          "cod_type": "norm",
                                          "min_theta": -4,
                                          "max_theta": 3,
                                          "n_theta": len(var_name),
                                          "model_fun_evals": 10_000,
                                          "log_level": 50
                                          })
spot_tuner.run(X_start=X_start)
spotPython tuning: 2.128878913065623 [####------] 40.66% 
spotPython tuning: 2.128878913065623 [######----] 64.18% 
spotPython tuning: 2.098625936740872 [########--] 81.83% 
spotPython tuning: 2.098625936740872 [##########] 100.00% Done...
<spotPython.spot.spot.Spot at 0x2a4bcab90>
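The returned Spot object stores all evaluated points and objective values. The best result found can be inspected directly; both attributes are used again in Section 13.9:

print(min(spot_tuner.y))   # best objective value (here: 2.098625936740872)
print(spot_tuner.min_X)    # corresponding point in the search space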

13.9 Step 9: Results

import pickle
SAVE = False
LOAD = False

if SAVE:
    result_file_name = "res_" + experiment_name + ".pkl"
    with open(result_file_name, 'wb') as f:
        pickle.dump(spot_tuner, f)

if LOAD:
    result_file_name = "res_ch10-friedman-hpt-0_maans03_60min_20init_1K_2023-04-14_10-11-19.pkl"
    with open(result_file_name, 'rb') as f:
        spot_tuner =  pickle.load(f)
  • Show the Progress of the hyperparameter tuning:
spot_tuner.plot_progress(log_y=True, filename="./figures/" + experiment_name+"_progress.pdf")

  • Print the Results
print(gen_design_table(fun_control=fun_control, spot=spot_tuner))
| name                   | type   | default          |    lower |     upper |             tuned | transform             |   importance | stars   |
|------------------------|--------|------------------|----------|-----------|-------------------|-----------------------|--------------|---------|
| grace_period           | int    | 200              |     10.0 |    1000.0 |             206.0 | None                  |         0.00 |         |
| max_depth              | int    | 20               |      2.0 |      20.0 |              19.0 | transform_power_2_int |         0.00 |         |
| delta                  | float  | 1e-07            |    1e-10 |     1e-06 |             1e-06 | None                  |         0.00 |         |
| tau                    | float  | 0.05             |     0.01 |       0.1 |               0.1 | None                  |         0.00 |         |
| leaf_prediction        | factor | mean             |      0.0 |       2.0 |               2.0 | None                  |         0.00 |         |
| leaf_model             | factor | LinearRegression |      0.0 |       2.0 |               0.0 | None                  |         2.56 | *       |
| model_selector_decay   | float  | 0.95             |      0.9 |      0.99 |              0.99 | None                  |         0.00 |         |
| splitter               | factor | EBSTSplitter     |      0.0 |       2.0 |               2.0 | None                  |       100.00 | ***     |
| min_samples_split      | int    | 5                |      2.0 |      10.0 |               5.0 | None                  |         0.00 |         |
| bootstrap_sampling     | factor | 0                |      0.0 |       1.0 |               0.0 | None                  |         0.00 |         |
| drift_window_threshold | int    | 300              |    100.0 |     500.0 |             123.0 | None                  |         0.00 |         |
| switch_significance    | float  | 0.05             |     0.01 |       0.1 |              0.01 | None                  |         0.00 |         |
| binary_split           | factor | 0                |      0.0 |       1.0 |               0.0 | None                  |         0.00 |         |
| max_size               | float  | 500.0            |    100.0 |    1000.0 | 130.8922869390374 | None                  |         0.00 |         |
| memory_estimate_period | int    | 1000000          | 100000.0 | 1000000.0 |          991680.0 | None                  |         0.00 |         |
| stop_mem_management    | factor | 0                |      0.0 |       1.0 |               0.0 | None                  |         0.00 |         |
| remove_poor_attrs      | factor | 0                |      0.0 |       1.0 |               1.0 | None                  |         0.29 | .       |
| merit_preprune         | factor | 0                |      0.0 |       0.0 |               0.0 | None                  |         0.00 |         |

13.9.1 Show variable importance

spot_tuner.plot_importance(threshold=0.0025, filename="./figures/" + experiment_name+"_importance.pdf")

13.9.2 Build and Evaluate HATR Model with Tuned Hyperparameters

m = test.shape[0]
# Window of 50 samples in the middle of the test set for the prediction plots
a = int(m/2) - 50
b = int(m/2)

13.9.3 The Large Data Set (k=0.2)

Caution: Increased Friedman-Drift Data Set
  • The Friedman-Drift Data Set is increased by a factor of two to show the transferability of the hyperparameter tuning results.
  • Larger values of k lead to a longer run time.
horizon = 7*24
k = .2
n_total = int(k*100_000)
n_samples = n_total
p_1 = int(k*25_000)
p_2 = int(k*50_000)
position = (p_1, p_2)
n_train = 1_000
a = n_train + p_1 - 12
b = a + 12
dataset = synth.FriedmanDrift(
    drift_type='gra',
    position=position,
    seed=123
)
data_dict = {key: [] for key in list(dataset.take(1))[0][0].keys()}
data_dict["y"] = []
for x, y in dataset.take(n_total):
    for key, value in x.items():
        data_dict[key].append(value)
    data_dict["y"].append(y)
df = pd.DataFrame(data_dict)
# Name the first 10 columns of the dataframe x1 to x10 and the last column y
df.columns = [f"x{i}" for i in range(1, 11)] + ["y"]

train = df[:n_train]
test = df[n_train:]
target_column = "y"
#
fun_control.update({"data": None, # dataset,
               "train": train,
               "test": test,
               "n_samples": n_samples,
               "target_column": target_column})

13.9.4 Get Default Hyperparameters

# fun_control was modified above; restore the original default
# hyperparameter dictionary before building the default model
from spotPython.hyperparameters.values import get_one_core_model_from_X
fc = fun_control
fc.update({"core_model_hyper_dict":
    hyper_dict[fun_control["core_model"].__name__]})
model_default = get_one_core_model_from_X(X_start, fun_control=fc)
model_default
HoeffdingAdaptiveTreeRegressor
HoeffdingAdaptiveTreeRegressor ( grace_period=200 max_depth=1048576 delta=1e-07 tau=0.05 leaf_prediction="mean" leaf_model=LinearRegression ( optimizer=SGD ( lr=Constant ( learning_rate=0.01 ) ) loss=Squared () l2=0. l1=0. intercept_init=0. intercept_lr=Constant ( learning_rate=0.01 ) clip_gradient=1e+12 initializer=Zeros () ) model_selector_decay=0.95 nominal_attributes=None splitter=EBSTSplitter () min_samples_split=5 bootstrap_sampling=0 drift_window_threshold=300 drift_detector=ADWIN ( delta=0.002 clock=32 max_buckets=5 min_window_length=5 grace_period=10 ) switch_significance=0.05 binary_split=0 max_size=500. memory_estimate_period=1000000 stop_mem_management=0 remove_poor_attrs=0 merit_preprune=0 seed=None )
from spotRiver.evaluation.eval_bml import eval_oml_horizon

df_eval_default, df_true_default = eval_oml_horizon(
                    model=model_default,
                    train=fun_control["train"],
                    test=fun_control["test"],
                    target_column=fun_control["target_column"],
                    horizon=fun_control["horizon"],
                    oml_grace_period=fun_control["oml_grace_period"],
                    metric=fun_control["metric_sklearn"],
                )
from spotRiver.evaluation.eval_bml import plot_bml_oml_horizon_metrics, plot_bml_oml_horizon_predictions
df_labels=["default"]
plot_bml_oml_horizon_metrics(df_eval = [df_eval_default], log_y=False, df_labels=df_labels, metric=fun_control["metric_sklearn"])
plot_bml_oml_horizon_predictions(df_true = [df_true_default[a:b]], target_column=target_column,  df_labels=df_labels)

13.9.5 Get SPOT Results

from spotPython.hyperparameters.values import get_one_core_model_from_X
X = spot_tuner.to_all_dim(spot_tuner.min_X.reshape(1,-1))
model_spot = get_one_core_model_from_X(X, fun_control)
model_spot
HoeffdingAdaptiveTreeRegressor
HoeffdingAdaptiveTreeRegressor ( grace_period=206 max_depth=524288 delta=1e-06 tau=0.1 leaf_prediction="adaptive" leaf_model=LinearRegression ( optimizer=SGD ( lr=Constant ( learning_rate=0.01 ) ) loss=Squared () l2=0. l1=0. intercept_init=0. intercept_lr=Constant ( learning_rate=0.01 ) clip_gradient=1e+12 initializer=Zeros () ) model_selector_decay=0.99 nominal_attributes=None splitter=QOSplitter ( radius=0.25 allow_multiway_splits=False ) min_samples_split=5 bootstrap_sampling=0 drift_window_threshold=123 drift_detector=ADWIN ( delta=0.002 clock=32 max_buckets=5 min_window_length=5 grace_period=10 ) switch_significance=0.01 binary_split=0 max_size=130.892287 memory_estimate_period=991680 stop_mem_management=0 remove_poor_attrs=1 merit_preprune=0 seed=None )
df_eval_spot, df_true_spot = eval_oml_horizon(
                    model=model_spot,
                    train=fun_control["train"],
                    test=fun_control["test"],
                    target_column=fun_control["target_column"],
                    horizon=fun_control["horizon"],
                    oml_grace_period=fun_control["oml_grace_period"],
                    metric=fun_control["metric_sklearn"],
                )
df_labels=["default", "spot"]
plot_bml_oml_horizon_metrics(df_eval = [df_eval_default, df_eval_spot], log_y=False, df_labels=df_labels, metric=fun_control["metric_sklearn"], filename="./figures/" + experiment_name+"_metrics.pdf")

# Shift the plotting window; m was computed in Section 13.9.2
a = int(m/2) + 20
b = int(m/2) + 50
plot_bml_oml_horizon_predictions(df_true = [df_true_default[a:b], df_true_spot[a:b]], target_column=target_column,  df_labels=df_labels, filename="./figures/" + experiment_name+"_predictions.pdf")

from spotPython.plot.validation import plot_actual_vs_predicted
plot_actual_vs_predicted(y_test=df_true_default["y"], y_pred=df_true_default["Prediction"], title="Default")
plot_actual_vs_predicted(y_test=df_true_spot["y"], y_pred=df_true_spot["Prediction"], title="SPOT")

13.9.6 Visualize Regression Trees

dataset_f = dataset.take(n_total)
for x, y in dataset_f:
    model_default.learn_one(x, y)
Caution: Large Trees
  • Since the trees are large, the visualization is suppressed by default.
  • To visualize the trees, uncomment the following line.
# model_default.draw()
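If a drawing is needed, recent river versions accept a max_depth argument in draw() to limit the rendering (an assumption about the installed version; check the river documentation):

# model_default.draw(max_depth=3)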
model_default.summary
{'n_nodes': 35,
 'n_branches': 17,
 'n_leaves': 18,
 'n_active_leaves': 96,
 'n_inactive_leaves': 0,
 'height': 6,
 'total_observed_weight': 39002.0,
 'n_alternate_trees': 21,
 'n_pruned_alternate_trees': 6,
 'n_switch_alternate_trees': 2}

13.9.7 Spot Model

dataset_f = dataset.take(n_total)
for x, y in dataset_f:
    model_spot.learn_one(x, y)
Caution: Large Trees
  • Since the trees are large, the visualization is suppressed by default.
  • To visualize the trees, uncomment the following line.
# model_spot.draw()
model_spot.summary
{'n_nodes': 205,
 'n_branches': 102,
 'n_leaves': 103,
 'n_active_leaves': 106,
 'n_inactive_leaves': 0,
 'height': 22,
 'total_observed_weight': 39002.0,
 'n_alternate_trees': 40,
 'n_pruned_alternate_trees': 39,
 'n_switch_alternate_trees': 0}
from spotPython.utils.eda import compare_two_tree_models
print(compare_two_tree_models(model_default, model_spot))
| Parameter                |   Default |   Spot |
|--------------------------|-----------|--------|
| n_nodes                  |        35 |    205 |
| n_branches               |        17 |    102 |
| n_leaves                 |        18 |    103 |
| n_active_leaves          |        96 |    106 |
| n_inactive_leaves        |         0 |      0 |
| height                   |         6 |     22 |
| total_observed_weight    |     39002 |  39002 |
| n_alternate_trees        |        21 |     40 |
| n_pruned_alternate_trees |         6 |     39 |
| n_switch_alternate_trees |         2 |      0 |
min(spot_tuner.y), max(spot_tuner.y)
(2.098625936740872, 13.363333596188284)

13.9.8 Detailed Hyperparameter Plots

filename = "./figures/" + experiment_name
spot_tuner.plot_important_hyperparameter_contour(filename=filename)
leaf_model:  2.5635330719992817
splitter:  100.0
remove_poor_attrs:  0.2945517088142445

13.9.9 Parallel Coordinates Plots

spot_tuner.parallel_plot()

13.9.10 Plot all Combinations of Hyperparameters

  • Warning: this may take a while.
PLOT_ALL = False
if PLOT_ALL:
    n = spot_tuner.k
    # Use the observed objective range as a common color scale
    min_z = min(spot_tuner.y)
    max_z = max(spot_tuner.y)
    for i in range(n-1):
        for j in range(i+1, n):
            spot_tuner.plot_contour(i=i, j=j, min_z=min_z, max_z=max_z)