---
title: Cross Validation (Intermediate)
keywords: fastai
sidebar: home_sidebar
summary: "How to perform various Cross Validation methodologies"
description: "How to perform various Cross Validation methodologies"
nb_path: "nbs/03_tab.cv.ipynb"
---
from fastai.tabular.all import *
path = untar_data(URLs.ADULT_SAMPLE)
Let's open it in Pandas:
df = pd.read_csv(path/'adult.csv')
df.head()
Next we want to create a constant test set and declare our various variables and `procs`. We'll just be using the last 10% of the data; however, figuring out how to build your test set is a very important problem. To read more, see Rachel Thomas' article on How (and why) to create a good validation set.
{% include note.html content='we call it a test set here as we make our own mini validation sets when we’re training' %}
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
And now we'll split our dataset:
print(f'10% of our data is {int(len(df) * .1)} rows')
start_val = len(df) - 3256; start_val
train = df.iloc[:start_val]
test = df.iloc[start_val:]
Now that we have the `DataFrame`s, let's look into a few different CV methods:
Every Cross Validation method is slightly different, and which version you should use depends on the dataset you are working with. The general idea of Cross Validation is that we split the dataset into `n` folds (usually five is enough), train a separate model on each fold, and at the end ensemble them all together. In theory this gives us a group of models that performs better than a single model trained on the entire dataset.

As we train, there is zero overlap between the validation sets whatsoever; as a result we end up with five distinct validation sets.
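To make that concrete, here's a minimal sketch (not from the original notebook; `idxs` and `kf_demo` are throwaway names) checking that the validation folds produced by sklearn's `KFold` are disjoint and together cover every row exactly once:

```python
from sklearn.model_selection import KFold
import numpy as np

idxs = np.arange(20)                    # pretend we have 20 rows
kf_demo = KFold(n_splits=5, shuffle=False)

valid_folds = [valid for _, valid in kf_demo.split(idxs)]
all_valid = np.concatenate(valid_folds)

# every row appears in exactly one validation fold -> zero overlap
assert len(all_valid) == len(idxs)
assert len(np.unique(all_valid)) == len(idxs)
```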
Now for the k-fold itself. We'll first be using sklearn's `KFold` class. This method works by running through all of the available indices and separating out the folds. For a minimal example, take the following:
train_idxs = list(range(0,9))
test_idxs = [10]
We now have some training indices and a test set:
train_idxs, test_idxs
Now we can instantiate a `KFold` object, passing in the number of splits, whether to shuffle the data before splitting into folds, and potentially a seed:
from sklearn.model_selection import KFold
dummy_kf = KFold(n_splits=5, shuffle=False); dummy_kf
And now we can run through our splits by iterating over the train and valid indices. We pass our `x` data into `dummy_kf.split` to get the indices. You could also pass in your `y`'s instead:
for train_idx, valid_idx in dummy_kf.split(train_idxs):
    print(f'Train: {train_idx}, Valid: {valid_idx}')
Now the question is: how can we use this when training on our data?

When we preprocess our tabular training dataset, we build our `procs` based upon it. When doing CV (Cross Validation), part of the data gets pushed to the validation set on each fold, so the `procs` only ever see a subset of the training data, leading to errors such as:
So how do we fix this? We should preprocess the entire training `DataFrame` into a `TabularPandas` object first; that way we can extract all of the `proc` information. Let's do that now:
to_base = TabularPandas(train, procs, cat_names, cont_names, y_names='salary')
Next we need to extract all the information we need. This includes:

- `Categorify`'s `classes`
- `Normalize`'s `means` and `stds`
- `FillMissing`'s `fill_vals` and `na_dict`
classes = to_base.classes
means, stds = to_base.normalize.means, to_base.normalize.stds
fill_vals, na_dict = to_base.fill_missing.fill_vals, to_base.fill_missing.na_dict
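If you're curious what these hold, a quick peek can help (purely illustrative, not part of the original notebook):

```python
# classes maps each categorical column to the categories seen during setup
print(classes['workclass'])
# means/stds are the per-column statistics that Normalize will reuse
print(means, stds)
# na_dict records which continuous columns contained missing values
print(na_dict)
```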
Now we could generate new procs based on those and apply them to our dataset:
procs = [Categorify(classes), Normalize.from_tab(means, stds), FillMissing(fill_strategy=FillStrategy.median, fill_vals=fill_vals, na_dict=na_dict)]
Now that we have our adjusted `procs`, let's try training.

We'll want to make a loop that will do the following:

- Create the `KFold` and split
- Build a `TabularPandas` object given our splits
- Train, then predict on the `test` set, and potentially keep track of any statistics

Let's do so below:
val_pct, tst_preds = L(), L()
kf = KFold(n_splits=5, shuffle=False)
for train_idx, valid_idx in kf.split(train.index):
    splits = (L(list(train_idx)), L(list(valid_idx)))
    procs = [Categorify(classes), Normalize.from_tab(means, stds), FillMissing(fill_strategy=FillStrategy.median, fill_vals=fill_vals, na_dict=na_dict)]
    to = TabularPandas(train, procs, cat_names, cont_names, y_names='salary',
                       splits=splits)
    dls = to.dataloaders(bs=512)
    learn = tabular_learner(dls, layers=[200,100], metrics=accuracy)
    learn.fit(3, 1e-2)
    test_dl = learn.dls.test_dl(test)
    with learn.no_bar():
        val_pct.append(learn.validate()[-1])
        tst_preds.append(learn.get_preds(dl=test_dl))
Now let's take a look at our results:
for i, (pred, truth) in enumerate(tst_preds):
    print(f'Fold {i+1}: {accuracy(pred, truth)}')
Let's try ensembling them and see what happens:
sum_preds = []
for i, (pred, truth) in enumerate(tst_preds):
    sum_preds.append(pred.numpy())
avg_preds = np.sum(sum_preds, axis=0) / 5
print(f'Average Accuracy: {accuracy(tensor(avg_preds), tst_preds[0][1])}')
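As an aside, the same averaging can be written a bit more directly with torch by stacking the per-fold probabilities and taking the mean over the fold dimension. This is a sketch of an equivalent approach, not how the notebook does it:

```python
import torch

# stack -> shape (n_folds, n_test, n_classes), then average over the folds
stacked = torch.stack([pred for pred, _ in tst_preds])
ens_preds = stacked.mean(dim=0)
print(f'Ensemble Accuracy: {accuracy(ens_preds, tst_preds[0][1])}')
```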
As we can see, ensembling all the models together boosted our score by 0.1%. Not the largest of increases! Let's try another CV method and see if it works better.
While the first example simply split our dataset either randomly (if we pass `shuffle=True`) or straight down the indices, which is fine when our classes are reasonably balanced, there are a multitude of cases where they won't be. What can we do in such a situation?

Stratified K-Fold validation allows us to split our data while also preserving the percentage of samples in each class. We'll follow the same methodology as before, with a few minor changes to make it work with Stratified K-Fold.
from sklearn.model_selection import StratifiedKFold
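Before training, here's a quick illustrative check (a sketch, not part of the original notebook; `skf_check` is just a throwaway name) that stratified folds keep roughly the same salary class balance as the full training set:

```python
# class balance of the full training set
print(train['salary'].value_counts(normalize=True))

skf_check = StratifiedKFold(n_splits=5, shuffle=False)
for i, (_, valid_idx) in enumerate(skf_check.split(train.index, train['salary'])):
    # each validation fold should show (almost) the same proportions
    print(f"Fold {i+1}:\n{train['salary'].iloc[valid_idx].value_counts(normalize=True)}")
```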
The only difference is that along with `train.index` we also need to pass in our `y`'s so it can gather the class distributions:
val_pct, tst_preds = L(), L()
skf = StratifiedKFold(n_splits=5, shuffle=False)
for train_idx, valid_idx in skf.split(train.index, train['salary']): # <- pass the labels here
    splits = (L(list(train_idx)), L(list(valid_idx)))
    procs = [Categorify(classes), Normalize.from_tab(means, stds), FillMissing(fill_strategy=FillStrategy.median, fill_vals=fill_vals, na_dict=na_dict)]
    to = TabularPandas(train, procs, cat_names, cont_names, y_names='salary',
                       splits=splits)
    dls = to.dataloaders(bs=512)
    learn = tabular_learner(dls, layers=[200,100], metrics=accuracy)
    learn.fit(3, 1e-2)
    test_dl = learn.dls.test_dl(test)
    with learn.no_bar():
        val_pct.append(learn.validate()[-1])
        tst_preds.append(learn.get_preds(dl=test_dl))
Let's see how our new version fares:
for i, (pred, truth) in enumerate(tst_preds):
    print(f'Fold {i+1}: {accuracy(pred, truth)}')
We can see that so far it looks a bit better (we actually have one with 84%!).
Now let's try the ensemble:
sum_preds = []
for i, (pred, truth) in enumerate(tst_preds):
    sum_preds.append(pred.numpy())
avg_preds = np.sum(sum_preds, axis=0) / 5
print(f'Average Accuracy: {accuracy(tensor(avg_preds), tst_preds[0][1])}')
The ensemble doesn't do quite as well here (down by 0.1%); however, I would trust this version much more than the regular `KFold`.

Why?

Stratification maintains the original distribution of our `y` values in every fold, so if we have rare classes they will always show up and be trained on. Now let's look at a multi-label example.
To run Multi-Label Stratified K-Fold I'll show an example below, but we won't actually run it, as there currently isn't a suitable dataset available outside of Kaggle.
First we'll need to import `MultilabelStratifiedKFold` from `iterstrat` (installable with `pip install iterative-stratification`):
from iterstrat.ml_stratifiers import MultilabelStratifiedKFold
Then, following our method above (make sure you have your `loss_func`, etc. properly set up), we simply replace our `for train_idx, valid_idx` loop with:
mskf = MultilabelStratifiedKFold(n_splits=5)
for train_idx, val_idx in mskf.split(X=train, y=train[y_names]):
"blah"