---
title: Cross Validation (Intermediate)
keywords: fastai
sidebar: home_sidebar
summary: "How to perform various Cross Validation methodologies"
description: "How to perform various Cross Validation methodologies"
nb_path: "nbs/03_tab.cv.ipynb"
---
{% raw %}
{% endraw %} {% raw %}

This article is also a Jupyter Notebook available to be run from the top down. There will be code snippets that you can then run in any environment.

Below are the versions of fastai, fastcore, scikit-learn, and iterative-stratification at the time of writing:

  • fastai: 2.0.14
  • fastcore: 1.0.14
  • scikit-learn: 0.22.2.post1
  • iterative-stratification: 0.1.6

{% endraw %}

Introduction

In this tutorial we will show how to use various cross validation methodologies inside of fastai with the tabular and vision libraries. First, let's walk through a tabular example.

Tabular

Importing the Library and the Dataset

We'll be using the tabular module for the first example, along with the ADULT_SAMPLE dataset. Let's grab those:

{% raw %}
from fastai.tabular.all import *
{% endraw %} {% raw %}
path = untar_data(URLs.ADULT_SAMPLE)
{% endraw %}

Let's open it in Pandas:

{% raw %}
df = pd.read_csv(path/'adult.csv')
{% endraw %} {% raw %}
df.head()
|   | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary |
|---|-----|-----------|--------|-----------|---------------|----------------|------------|--------------|------|-----|--------------|--------------|----------------|----------------|--------|
| 0 | 49 | Private | 101320 | Assoc-acdm | 12.0 | Married-civ-spouse | NaN | Wife | White | Female | 0 | 1902 | 40 | United-States | >=50k |
| 1 | 44 | Private | 236746 | Masters | 14.0 | Divorced | Exec-managerial | Not-in-family | White | Male | 10520 | 0 | 45 | United-States | >=50k |
| 2 | 38 | Private | 96185 | HS-grad | NaN | Divorced | NaN | Unmarried | Black | Female | 0 | 0 | 32 | United-States | <50k |
| 3 | 38 | Self-emp-inc | 112847 | Prof-school | 15.0 | Married-civ-spouse | Prof-specialty | Husband | Asian-Pac-Islander | Male | 0 | 0 | 40 | United-States | >=50k |
| 4 | 42 | Self-emp-not-inc | 82297 | 7th-8th | NaN | Married-civ-spouse | Other-service | Wife | Black | Female | 0 | 0 | 50 | United-States | <50k |
{% endraw %}

Next we want to create a constant test set and declare our various variables and procs. We'll just be using the last 10% of the data; however, figuring out how to build your test set is a very important problem. To read more, see Rachel Thomas' article on How (and why) to create a good validation set. {% include note.html content='we call it a test set here as we make our own mini validation sets when we’re training' %}

{% raw %}
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
{% endraw %}

And now we'll split our dataset:

{% raw %}
print(f'10% of our data is {int(len(df) * .1)} rows')
10% of our data is 3256 rows
{% endraw %} {% raw %}
start_val = len(df) - 3256; start_val
29305
{% endraw %} {% raw %}
train = df.iloc[:start_val]
test = df.iloc[start_val:]
{% endraw %}

Now that we have the DataFrames, let's look into a few different CV methods:

K-Fold

Every Cross Validation method is slightly different, and which version you should use depends on the dataset you are working with. The general idea of Cross Validation is that we split the dataset into n folds (five is usually enough), train a separate model on each split, and then ensemble them together at the end. In theory this should produce a group of models that performs better than a single model trained on the entire dataset.

As we train, there is zero overlap between the validation sets: across the folds we create five distinct validation sets that together cover all of the data.
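A quick way to convince yourself of this (a standalone sanity check, not part of the tutorial's pipeline) is to split some dummy indices and verify that the validation folds are disjoint and together cover every sample:

{% raw %}
import numpy as np
from sklearn.model_selection import KFold

# collect the validation indices of each of the five folds over 20 dummy samples
folds = [set(valid_idx) for _, valid_idx in KFold(n_splits=5).split(np.arange(20))]
# no two validation folds share an index...
assert all(a.isdisjoint(b) for i, a in enumerate(folds) for b in folds[i+1:])
# ...and together they cover every sample exactly once
assert set.union(*folds) == set(range(20))
{% endraw %}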

Introduction

Now for the K-Fold itself. We'll first be using sklearn's KFold class. This method works by running through all the available indices and separating out the folds. For a minimal example, take the following:

{% raw %}
train_idxs = list(range(0,9))
test_idxs = [10]
{% endraw %}

We now have some training indices and a test set:

{% raw %}
train_idxs, test_idxs
([0, 1, 2, 3, 4, 5, 6, 7, 8], [10])
{% endraw %}

Now we can instantiate a KFold object, passing in the number of splits, whether to shuffle the data before splitting into folds, and potentially a seed:

{% raw %}
from sklearn.model_selection import KFold
{% endraw %} {% raw %}
dummy_kf = KFold(n_splits=5, shuffle=False); dummy_kf
KFold(n_splits=5, random_state=None, shuffle=False)
{% endraw %}

And now we can run through our splits by iterating over the train and valid indices. We pass our x data through dummy_kf.split to get the indices back.

You could also pass in your y's instead:

{% raw %}
for train_idx, valid_idx in dummy_kf.split(train_idxs):
    print(f'Train: {train_idx}, Valid: {valid_idx}')
Train: [2 3 4 5 6 7 8], Valid: [0 1]
Train: [0 1 4 5 6 7 8], Valid: [2 3]
Train: [0 1 2 3 6 7 8], Valid: [4 5]
Train: [0 1 2 3 4 5 8], Valid: [6 7]
Train: [0 1 2 3 4 5 6 7], Valid: [8]
{% endraw %}

Extra Preprocessing

Now the question is: how can we use this when training on our data?

When we preprocess our tabular training dataset, we build our procs based upon it. When doing CV (Cross Validation), some data is excluded from each fold's training split as it gets pushed to the validation set, which can lead to errors such as:

{% raw %}
AssertionError: nan values in `education-num` but not in setup training set
{% endraw %}
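To see the mechanism at work, here is a quick check (an aside, using the KFold splitter we imported earlier): `education-num` contains missing values, and a plain split can leave a fold's training subset with a different NaN pattern than its validation subset:

{% raw %}
# count the NaNs in `education-num` on each side of every fold
for i, (train_idx, valid_idx) in enumerate(KFold(n_splits=5).split(train.index)):
    n_train = train['education-num'].iloc[train_idx].isna().sum()
    n_valid = train['education-num'].iloc[valid_idx].isna().sum()
    print(f'Fold {i+1}: NaNs in train: {n_train}, NaNs in valid: {n_valid}')
{% endraw %}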

So how do we fix this? We should preprocess the entire training DataFrame into TabularPandas first; this way we can extract all the proc information. Let's do that now:

{% raw %}
to_base = TabularPandas(train, procs, cat_names, cont_names, y_names='salary')
{% endraw %}

Next we extract all the information we'll need. This includes:

{% raw %}
classes = to_base.classes
means, stds = to_base.normalize.means, to_base.normalize.stds
fill_vals, na_dict = to_base.fill_missing.fill_vals, to_base.fill_missing.na_dict
{% endraw %}

Now we can generate new procs based on those values and apply them to our dataset:

{% raw %}
procs = [Categorify(classes), Normalize.from_tab(means, stds), FillMissing(fill_strategy=FillStrategy.median, fill_vals=fill_vals, na_dict=na_dict)]
{% endraw %}

Now Let's Train

Now that we have our adjusted procs, let's try training.

We'll want to make a loop that will do the following:

  1. Make our KFold and split
  2. Build a TabularPandas object given our splits
  3. Train for some training regimen
  4. Get predictions on the test set, and potentially keep track of any statistics.

Let's do so below:

{% raw %}
val_pct, tst_preds = L(), L()
kf = KFold(n_splits=5, shuffle=False)
for train_idx, valid_idx in kf.split(train.index):
    splits = (L(list(train_idx)), L(list(valid_idx)))
    # rebuild the procs each fold from the statistics we extracted earlier
    procs = [Categorify(classes), Normalize.from_tab(means, stds), FillMissing(fill_strategy=FillStrategy.median, fill_vals=fill_vals, na_dict=na_dict)]
    to = TabularPandas(train, procs, cat_names, cont_names, y_names='salary',
                       splits=splits)
    dls = to.dataloaders(bs=512)
    learn = tabular_learner(dls, layers=[200,100], metrics=accuracy)
    learn.fit(3, 1e-2)
    # grab this fold's validation accuracy and its predictions on the test set
    test_dl = learn.dls.test_dl(test)
    with learn.no_bar():
        val_pct.append(learn.validate()[-1])
        tst_preds.append(learn.get_preds(dl=test_dl))
| epoch | train_loss | valid_loss | accuracy | time |
|-------|------------|------------|----------|------|
| 0 | 0.379017 | 0.380708 | 0.826992 | 00:00 |
| 1 | 0.364980 | 0.359392 | 0.832281 | 00:00 |
| 2 | 0.355631 | 0.361775 | 0.825456 | 00:00 |

| epoch | train_loss | valid_loss | accuracy | time |
|-------|------------|------------|----------|------|
| 0 | 0.382340 | 0.376627 | 0.829039 | 00:00 |
| 1 | 0.362212 | 0.366542 | 0.832111 | 00:00 |
| 2 | 0.355434 | 0.372222 | 0.830063 | 00:00 |

| epoch | train_loss | valid_loss | accuracy | time |
|-------|------------|------------|----------|------|
| 0 | 0.385911 | 0.374170 | 0.843542 | 00:00 |
| 1 | 0.368800 | 0.339751 | 0.842348 | 00:00 |
| 2 | 0.360772 | 0.349895 | 0.843542 | 00:00 |

| epoch | train_loss | valid_loss | accuracy | time |
|-------|------------|------------|----------|------|
| 0 | 0.377877 | 0.358854 | 0.835523 | 00:00 |
| 1 | 0.362264 | 0.362680 | 0.833646 | 00:00 |
| 2 | 0.355874 | 0.363413 | 0.833134 | 00:00 |

| epoch | train_loss | valid_loss | accuracy | time |
|-------|------------|------------|----------|------|
| 0 | 0.380469 | 0.358595 | 0.838423 | 00:00 |
| 1 | 0.363201 | 0.352324 | 0.837912 | 00:00 |
| 2 | 0.356388 | 0.350427 | 0.837741 | 00:00 |
{% endraw %}

Now let's take a look at our results:

{% raw %}
for i, (pred, truth) in enumerate(tst_preds):
    print(f'Fold {i+1}: {accuracy(pred, truth)}')
Fold 1: 0.8390663266181946
Fold 2: 0.834152340888977
Fold 3: 0.8320024609565735
Fold 4: 0.8356879353523254
Fold 5: 0.8329238295555115
{% endraw %}

Let's try ensembling them and see what happens:

{% raw %}
sum_preds = []
for i, (pred, truth) in enumerate(tst_preds):
    sum_preds.append(pred.numpy())
# average the five folds' predictions (soft voting)
avg_preds = np.sum(sum_preds, axis=0) / 5
print(f'Average Accuracy: {accuracy(tensor(avg_preds), tst_preds[0][1])}')
Average Accuracy: 0.8366093635559082
{% endraw %}

As we can see, ensembling the models together brings us above every individual fold except the best one. Not the largest of increases though! Let's try another CV method and see if it works better.

Stratified K-Fold

While the first example split our dataset either randomly (if we pass shuffle=True) or straight down the indices, plain K-Fold only really shines when our classes are balanced, and there are a multitude of cases where they won't be. What can we do in such a situation?

Stratified K-Fold validation allows us to split our data while also preserving the percentage of samples inside each class. We'll follow the same methodology as before, with a few minor changes to make it work with Stratified K-Fold.

{% raw %}
from sklearn.model_selection import StratifiedKFold
{% endraw %}
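Before training, we can sanity-check what stratification gives us (a quick aside using the `train` DataFrame from above): each validation fold should keep roughly the original `salary` class balance:

{% raw %}
# print the class proportions inside each stratified validation fold
skf_demo = StratifiedKFold(n_splits=5)
for _, valid_idx in skf_demo.split(train.index, train['salary']):
    print(train['salary'].iloc[valid_idx].value_counts(normalize=True).round(3).to_dict())
{% endraw %}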

The only difference is that along with our train.index we also need to pass in our y's, so it can gather the class distributions:

{% raw %}
val_pct, tst_preds = L(), L()
skf = StratifiedKFold(n_splits=5, shuffle=False)
for train_idx, valid_idx in skf.split(train.index, train['salary']): # note we now pass the y's in too
    splits = (L(list(train_idx)), L(list(valid_idx)))
    procs = [Categorify(classes), Normalize.from_tab(means, stds), FillMissing(fill_strategy=FillStrategy.median, fill_vals=fill_vals, na_dict=na_dict)]
    to = TabularPandas(train, procs, cat_names, cont_names, y_names='salary',
                       splits=splits)
    dls = to.dataloaders(bs=512)
    learn = tabular_learner(dls, layers=[200,100], metrics=accuracy)
    learn.fit(3, 1e-2)
    test_dl = learn.dls.test_dl(test)
    with learn.no_bar():
        val_pct.append(learn.validate()[-1])
        tst_preds.append(learn.get_preds(dl=test_dl))
| epoch | train_loss | valid_loss | accuracy | time |
|-------|------------|------------|----------|------|
| 0 | 0.377596 | 0.366456 | 0.831599 | 00:00 |
| 1 | 0.360850 | 0.361772 | 0.827674 | 00:00 |
| 2 | 0.356481 | 0.359992 | 0.831257 | 00:00 |

| epoch | train_loss | valid_loss | accuracy | time |
|-------|------------|------------|----------|------|
| 0 | 0.377417 | 0.388749 | 0.822726 | 00:00 |
| 1 | 0.360371 | 0.376890 | 0.824774 | 00:00 |
| 2 | 0.352614 | 0.368503 | 0.833817 | 00:00 |

| epoch | train_loss | valid_loss | accuracy | time |
|-------|------------|------------|----------|------|
| 0 | 0.387596 | 0.358673 | 0.842177 | 00:00 |
| 1 | 0.368236 | 0.347018 | 0.844907 | 00:00 |
| 2 | 0.362123 | 0.345612 | 0.841324 | 00:00 |

| epoch | train_loss | valid_loss | accuracy | time |
|-------|------------|------------|----------|------|
| 0 | 0.375481 | 0.365665 | 0.836205 | 00:00 |
| 1 | 0.358180 | 0.362090 | 0.832111 | 00:00 |
| 2 | 0.351982 | 0.360600 | 0.830404 | 00:00 |

| epoch | train_loss | valid_loss | accuracy | time |
|-------|------------|------------|----------|------|
| 0 | 0.385218 | 0.363116 | 0.831428 | 00:00 |
| 1 | 0.363915 | 0.349798 | 0.836717 | 00:00 |
| 2 | 0.356412 | 0.354061 | 0.837400 | 00:00 |
{% endraw %}

Let's see how our new version fares:

{% raw %}
for i, (pred, truth) in enumerate(tst_preds):
    print(f'Fold {i+1}: {accuracy(pred, truth)}')
Fold 1: 0.8335380554199219
Fold 2: 0.835073709487915
Fold 3: 0.8316953182220459
Fold 4: 0.8412162065505981
Fold 5: 0.8387592434883118
{% endraw %}

We can see that so far it looks a bit better (one fold actually reaches 84%!).

Now let's try the ensemble:

{% raw %}
sum_preds = []
for i, (pred, truth) in enumerate(tst_preds):
    sum_preds.append(pred.numpy())
avg_preds = np.sum(sum_preds, axis=0) / 5
print(f'Average Accuracy: {accuracy(tensor(avg_preds), tst_preds[0][1])}')
Average Accuracy: 0.835995078086853
{% endraw %}

The ensemble doesn't do quite as well (down by about 0.1%); however, I would trust this version much more than the regular K-Fold.

Why?

Stratification maintains the original distribution of our y values, so if we have rare classes they will always show up and be trained on. Now let's look at a multi-label example.
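To make that concrete, here is a hypothetical toy example (not drawn from our dataset) where one class is rare. Plain KFold without shuffling can concentrate the rare class into a single fold, while StratifiedKFold spreads it across all of them:

{% raw %}
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# hypothetical labels: class 1 is rare (5 of 100) and sorted to the end
y = np.array([0] * 95 + [1] * 5)
X = np.zeros((len(y), 1))

for name, splitter in [('KFold', KFold(n_splits=5)),
                       ('StratifiedKFold', StratifiedKFold(n_splits=5))]:
    counts = [int((y[valid_idx] == 1).sum()) for _, valid_idx in splitter.split(X, y)]
    print(f'{name}: rare-class samples per validation fold: {counts}')
{% endraw %}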

Multi-Label Stratified K-Fold

To run Multi-Label Stratified K-Fold, I will show an example below, but we will not run it (as there isn't quite a suitable dataset available outside of Kaggle right now).

First we'll need to import MultilabelStratifiedKFold from iterstrat:

{% raw %}
from iterstrat.ml_stratifiers import MultilabelStratifiedKFold
{% endraw %}

Then, following our method above (ensure you have your loss_func, etc. properly set up), we simply replace our for train_idx, valid_idx line with:

{% raw %}
mskf = MultilabelStratifiedKFold(n_splits=5)
for train_idx, valid_idx in mskf.split(X=train, y=train[y_names]):
    ...  # build the splits and train as before
{% endraw %}
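Putting the pieces together, a minimal sketch of the full loop might look like the following. Here `y_names` stands in for your list of label columns, and `procs`, `cat_names`, and `cont_names` are assumed to be set up for your own dataset:

{% raw %}
mskf = MultilabelStratifiedKFold(n_splits=5)
for train_idx, valid_idx in mskf.split(X=train, y=train[y_names]):
    splits = (L(list(train_idx)), L(list(valid_idx)))
    to = TabularPandas(train, procs, cat_names, cont_names, y_names=y_names,
                       splits=splits)
    dls = to.dataloaders(bs=512)
    # multi-label targets need an appropriate loss, e.g. BCEWithLogitsLossFlat
    learn = tabular_learner(dls, layers=[200,100], loss_func=BCEWithLogitsLossFlat())
    learn.fit(3, 1e-2)
{% endraw %}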