This document is a tutorial on how to use the sam package to train a 'decent' model easily, and how to go from there.
The data used can be found on the NAS as 'ChicagoWater.feather'.
%load_ext autoreload
%autoreload 2
import warnings
from pathlib import Path
import os
# Hide some of the tensorflow warnings, only to make the notebook more readable
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
# Hide the FutureWarnings thrown by tensorflow, to make the notebook more readable
warnings.simplefilter("ignore", category=FutureWarning)
import sam
from sam.models import SamQuantileMLP
from sam.preprocessing import sam_format_to_wide
import pandas as pd
import numpy as np
import tensorflow as tf
# Hide some of the tensorflow warnings, only to make the notebook more readable
tf.logging.set_verbosity(tf.logging.ERROR)
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
# Set the folder to read data from
datafolder = Path('../data')
# We configure tensorflow to not take up all GPU memory. Useful when sharing a GPU with others
config = tf.ConfigProto()
config.gpu_options.allow_growth=True
sess = tf.Session(config=config)
The first thing we do is read some data in 'SAM' format. This format consists of the columns TIME, ID, TYPE and VALUE. If your data is not in this format, try bringing it into this format: if your data is in wide format, you can use sam.preprocessing.wide_to_sam_format. However, this is not strictly necessary, since we will shortly convert to wide format anyway. Unfortunately, since this preprocessing is very data-specific, we cannot give any specific pointers here.
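To make the format concrete, here is a minimal synthetic example of what such a dataframe looks like (the values, and the 'Air Temperature' type, are made up; the real data is read in the next cell):
# Purely illustrative: a tiny dataframe in SAM format with made-up values
example_sam = pd.DataFrame({
    'TIME': pd.to_datetime(['2017-07-18 01:00', '2017-07-18 01:00', '2017-07-18 02:00']),
    'ID': ['Ohio Street Beach', 'Oak Street Weather Station', 'Ohio Street Beach'],
    'TYPE': ['Water Temperature', 'Air Temperature', 'Water Temperature'],
    'VALUE': [20.1, 24.3, 20.3],
})
example_sam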
We have to do some data-specific preprocessing: we decide to focus only on 'Oak Street Weather Station' and 'Ohio Street Beach'. This is obviously data-specific, so we will not dwell on it for too long.
data = pd.read_feather(datafolder / 'ChicagoWater.feather')
# Data-specific preprocessing: for this demo, we just take one beach and one weather station
data = data.loc[data['ID'].isin(['Ohio Street Beach', 'Oak Street Weather Station'])]
# Data-specific processing: This type is almost only nans, so we just get rid of it
data = data.loc[data['TYPE'] != 'Transducer Depth']
# Here, we show the locations, and timestamps
# It looks like we have about 6 years of data, but for more in-depth insights, use the sam-exploration dashboard instead
display(data.TIME.min())
display(data.TIME.max())
display(data.ID.unique())
data.head()
The SamQuantileMLP does a lot of processing by itself, but there are a few things that (for now) need to be handled before passing in the data. In order:
1. The data needs to be in wide format. For this, we can use sam.preprocessing.sam_format_to_wide.
2. The data needs to be monospaced. Our data is not monospaced yet, since the winters are missing. For this, we can use pd.resample. This method does not work if there are duplicate timestamps, in which case you can use sam.preprocessing.normalize_timestamp (a pandas-only alternative is sketched right after this list).
3. X_train, X_test, y_train and y_test need to be split. In our case, we will use sklearn's train_test_split, with Ohio Street Beach_Water Temperature as the target.
4. The data needs to be imputed. The target does not need to be imputed, since the SamQuantileMLP will automatically throw away rows where the target is missing.
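Our data has no duplicate timestamps, but if yours does, a minimal pandas-only sketch (averaging the duplicate measurements, here on a made-up column) could look like this:
# Sketch only: our data does not need this, it just illustrates removing duplicate timestamps
dup_example = pd.DataFrame({
    'TIME': pd.to_datetime(['2017-07-18 01:00', '2017-07-18 01:00', '2017-07-18 02:00']),
    'sensor': [1.0, 2.0, 3.0],  # made-up measurement column
})
dup_example.groupby('TIME', as_index=False).mean()  # one row per timestamp, duplicates averaged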
# Step 1
data_wide = sam.preprocessing.sam_format_to_wide(data)
data_wide.head()
# Step 2
# For missing timestamps, we can choose to forwardfill or use other 'smart' interpolation methods
# In our case we don't want to interpolate; we just want to ignore all the measurements in the winter, so we fill them with nans
data_wide = (data_wide
.set_index('TIME') # resample requires a time index
.resample('1H') # Make sure the data is monospaced with a 1 hour frequency
.asfreq(None) # fill all new values with nans
.reset_index()
)
# Step 3
# Now let's create train/test set
targetname = 'Ohio Street Beach_Water Temperature'
target = data_wide[targetname]
data_wide = data_wide.drop(targetname, axis=1)
# Keep in mind to use `shuffle=False`, since we want the test to be *after* the train, not intermixed
X_train, X_test, y_train, y_test = train_test_split(data_wide, target, test_size=0.33, shuffle=False)
# Step 4
# Now we will impute X_train and X_test
# Again, this may be added to SamQuantileMLP in the future.
# This is an extremely basic imputation method. In a real project, consider a 'better' method
X_train = X_train.ffill().bfill()
X_test = X_test.ffill().bfill()
Here, we will create and fit a SamQuantileMLP with near-default parameters, but we will still explicitly list and explain the parameters.
Note that the model, by default, does differencing when predicting the future! This should not be noticeable to the user, but it means that the current value of the target is required at predict time, in order to predict the future value.
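To make this concrete, here is a conceptual sketch of what differencing means here (an illustration of the idea only, not the actual internals of SamQuantileMLP):
# Conceptual sketch of differencing (illustration only, not the model internals):
# the network learns the *change* of the target, and the current value of y is
# added back at predict time. This is why predict() needs the present value of y.
y_now = np.array([20.0, 20.5, 21.0])            # current target values (made up)
predicted_change = np.array([0.3, -0.1, 0.2])   # what a differencing model would output
y_now + predicted_change                        # the resulting prediction for 5 timesteps ahead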
model = SamQuantileMLP(
predict_ahead=5, # We will predict 5 timesteps ahead. NOTE: you can also predict the present by setting this to 0, or predict multiple timesteps by passing a list
quantiles=(0.3, 0.7), # We will predict two quantiles. NOTE: by default, no quantiles are predicted
use_y_as_feature=True, # We will use y as a feature in our model
timecol='TIME', # We will use 'TIME' as a feature column. NOTE: by default, the model assumes there is no time column
time_components=['hour', 'weekday'], # We will use these components as features
time_cyclicals=['hour', 'weekday'], # We will use these components as cyclical features instead of regular
rolling_window_size=(4, 19), # We will use these window sizes to calculate rolling features. 19 makes sense because 19 + 5 (our prediction horizon) = 24 timesteps, exactly one day.
rolling_features=['lag'], # We will calculate these rolling features
# Note: For these parameters, see `sam.models.keras_quantile_mlp`
n_neurons=200, # Our MLP will use this many neurons
n_layers=2, # Our MLP will use this many layers
batch_size=16, # Our MLP will have this batch size
epochs=10, # Our MLP will use this many epochs
lr=0.001, # Our MLP will use this learning rate
dropout=0.5, # Our MLP will use this dropout rate (None for no dropout)
momentum=None, # Our MLP will use this batchnorm rate (None for no batchnorm )
verbose=True # Our MLP will have this verbosity
)
# Now fit the model. This method has an almost identical signature to the regular Keras Sequential fit method (https://keras.io/models/sequential/)
history = model.fit(X_train, y_train, validation_data = (X_test, y_test))
# Show the features passed to the keras model
print(model.get_feature_names())
# Note that the model.score method should give the same output as the val_loss above:
print("\nval_loss:", model.score(X_test, y_test))
Making a prediction is simple with model.predict(X_test, y_test). Note that, because we use y as a feature and also do differencing, we need to pass the present value of y to the predict function. Note that this is not 'cheating', because we are actually predicting the future value, not the present value.
Also note that the transformer will build rolling features from scratch. If X_test is not large enough (for example, only 12 datapoints when you need a window of 24), all the predictions will be nan.
prediction = model.predict(X_test, y_test)
# Note that we are showing the tail, not the head. Due to incomplete rolling windows, the first couple of predictions are always nan
prediction.tail(5)
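A quick way to see how many rows were affected by incomplete rolling windows is to count the rows where every prediction column is nan:
# Count the rows for which all prediction columns are nan
# (these are the rows at the start of X_test whose rolling windows were not yet complete)
print('rows with all-nan predictions:', prediction.isna().all(axis=1).sum())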
The prediction has three columns, all pertaining to lead_5: the mean prediction and the 0.3 and 0.7 quantiles. This means that for the datapoint in that row, these are our predictions for 5 timesteps ahead.
Of course, we might want to know more about this prediction, compare it to the true value, etcetera. We can do the following things:
- Join X_test and prediction on index: they have the same index, and X_test contains the data and the timestamp at which the prediction was made (not the timestamp that the prediction pertains to!)
- Join y_test and prediction on index. This is useful, for instance, when you want to compare to the persistence benchmark: simply predicting y_test would be the same as the persistence benchmark.
- Use model.get_actual(y_test). This function will return a dataframe similar to prediction, but with the 'true' values. This is used for evaluating the model.
actual = model.get_actual(y_test)
# Note that we are predicting 5 timesteps ahead. Therefore the last timesteps have no 'actual' value: we have no data about what the 'true' value would be 5 steps in the future.
# If we were predicting multiple timesteps ahead (for example: 1, 2 and 3 timesteps ahead), this function would return a dataframe instead of a series
actual.tail(10)
# Combine everything:
# We ignore the quantile predictions for now.
all_prediction_data = pd.DataFrame({
'TIME': X_test['TIME'],
'actual': model.get_actual(y_test),
'persistence': y_test,
'prediction': prediction['predict_lead_5_mean']
})
all_prediction_data.tail(10)
# Show the prediction vs the actual value
# There is not (yet) a single plotting solution
# So here, we just show a single way to plot a prediction
%matplotlib inline
(all_prediction_data
.loc[all_prediction_data['TIME'] > '2017-07-18 00:00:00']
.loc[all_prediction_data['TIME'] < '2017-07-23 00:00:00']
.set_index('TIME')
.loc[:, ['actual', 'prediction']]
).plot()
# Score the predictions
# Note that the performance isn't great yet due to the very basic feature engineering. Ask Fenno or Daan about progress on this, but better feature engineering is coming ASAP!
# Also, it may be worth tinkering with some of the parameters such as the number of neurons, learning rate, dropout, etc. However, from experience, good feature engineering
# has by far the biggest contribution to good performance.
# We dropna because mean_squared_error crashes if it gets passed nans
print('model MSE:')
print(mean_squared_error(all_prediction_data.dropna()['actual'], all_prediction_data.dropna()['prediction']))
print('persistence benchmark:')
print(mean_squared_error(all_prediction_data.dropna()['actual'], all_prediction_data.dropna()['persistence']))
In this section, we show how we can inspect the model and predictions in a bit more detail. Also, we show how to dump/load the model, and show a summary, because that didn't really fit in any other section
# Quantile Feature importance (ELI5)
# This will output a dataframe with n_iter rows, and a column for each column used in the model.
# The lower the score, the more important it is. If a certain column only has scores around 0, it is not a very important feature
# We only use 100 samples here. This is purely for performance reasons. The more data samples and the more iterations, the more accurate it will be.
score_decreases = model.quantile_feature_importances(X_test[-100:], y_test[-100:], n_iter=5)
# We can average the trials to get an approximate feature importance
feature_importances = score_decreases.mean().sort_values()
feature_importances
# Or we can show the score decreases with errorbars. Note that we only show the top 15 most important features, since it would get too long otherwise
import seaborn
%matplotlib inline
chart = seaborn.barplot(data=score_decreases[feature_importances.index[:15]], orient='h')
# SHAP
%matplotlib inline
import shap
# Obtain a SHAP explainer that works on our (non-keras) model
explainer = model.get_explainer(X_test, y_test)
shap_values = explainer.shap_values(X_test[0:10], y_test[0:10])
test_values = explainer.test_values(X_test[0:10], y_test[0:10])
# After obtaining shap values, we can use the normal force plot
shap.force_plot(explainer.expected_value[0], shap_values[0][-1,:], test_values.iloc[-1,:], matplotlib=True)
from sam.utils import sum_grouped_columns
# Code taken straight from http://10.2.0.20/sam/utils.html?highlight=sum#sam.utils.sum_grouped_columns
# Used to sum the shap values that come from the same feature
df = pd.DataFrame(shap_values[0], columns = model.get_feature_names(), index = X_test[0:10].index)
summed_shaps = sum_grouped_columns(df)
empty_X = pd.DataFrame(np.full_like(summed_shaps, "", dtype=str), columns=summed_shaps.columns)
shap.force_plot(explainer.expected_value[0], summed_shaps.values[-1,:], empty_X.iloc[-1,:], matplotlib=True)
# We see that, even though battery life is not that important overall, for this particular sample it seems to be quite important.
# You can dump and load
model.dump(foldername='../output', prefix='mymodel')
model = SamQuantileMLP.load(foldername='../output', prefix='mymodel')
# You can view summary
model.summary()
In the previous sections, we saw the model performed well, but not amazingly, likely at least in part due to the feature engineering. Better feature engineering for the SAM model is in the pipeline (no pun intended), but in general, it may happen that for a project you want to add your own features or perform feature engineering yourself. This can be done in two ways:
1. Do the feature engineering yourself, before passing the data to the model (for example, by adding extra columns to X).
2. Subclass SamQuantileMLP and override its feature engineering.
There is no consensus yet on what is the best option, so for now, choose what you want, and report back to the SAM team about your findings!
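For the first option, a minimal sketch could look like the following (the 24-hour rolling mean is just an illustration; which extra features actually help is project-specific):
# Option 1 (sketch): engineer extra features yourself before passing the data to SamQuantileMLP
# Here we add a 24-hour rolling mean of every numeric column as an extra feature
X_train_extra = X_train.copy()
X_test_extra = X_test.copy()
for col in X_train.select_dtypes('number').columns:
    X_train_extra[col + '_rollmean_24'] = X_train[col].rolling(24, min_periods=1).mean()
    X_test_extra[col + '_rollmean_24'] = X_test[col].rolling(24, min_periods=1).mean()
# X_train_extra and X_test_extra can then be passed to SamQuantileMLP.fit and predict as usual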
The second option may need some explanation, so we will give an example here. Note that this approach is versatile, and can be used to change SamQuantileMLP in any way you want.
# Here, we show an example of subclassing and changing the SAM model. In this case, we are actually making it worse.
# However, you can change the model in any way you want, even making it better!
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
class MyBetterSamModel(SamQuantileMLP):
def get_feature_engineer(self):
# Function that returns a scikit-learn transformer
scaler = StandardScaler()
imputer = SimpleImputer()
return Pipeline([('scaler', scaler),
('imputer', imputer)])
# Note: When changing `get_feature_engineer`, this function should also be changed, since it depends on the feature engineer.
# It is not obvious what the feature names should be in general, especially because a lot of transformers output
# numpy arrays. For that reason, we need an additional function to decide the feature names
def set_feature_names(self, X, X_transformed):
# Note that this function can access the fitted feature engineer with `self.feature_engineer_`
# In this case, we are not changing the columns, so just return X.columns
return X.columns
model = MyBetterSamModel(predict_ahead=5, timecol=None, epochs=5) # Our model cannot handle a time column, so we won't pass it
# We need to drop the timecol to prevent StandardScaler from crashing
history = model.fit(X_train.drop('TIME', axis=1), y_train,
validation_data = (X_test.drop('TIME', axis=1), y_test))
predictions = pd.DataFrame({
'actual': model.get_actual(y_test),
'persistence': y_test,
'prediction': model.predict(X_test.drop('TIME', axis=1), y_test) # note that since we are only predicting a single column, this will return a series
}).dropna()
print("Benchmark: ", mean_squared_error(predictions['actual'], predictions['persistence']))
print("Model: ", mean_squared_error(predictions['actual'], predictions['prediction']))
# As we can see, as expected, the model does worse than the default model (since we are doing no feature engineering)
We cannot use SamQuantileMLP with sam.models.create_keras_quantile_rnn as a model instead of the default MLP. The reason is that LSTMs require 3D input.
If there is demand, expect SamQuantileRnn in the future!