---
title: Lesson 3 - Cross-Validation
keywords: fastai
sidebar: home_sidebar
nb_path: "nbs/course2020/vision/03_Cross_Validation.ipynb"
---
{% raw %}
{% endraw %}

Lesson Video:

{% raw %}
{% endraw %} {% raw %}

This article is also a Jupyter Notebook available to be run from the top down. There will be code snippets that you can then run in any environment.

Below are the versions of fastai, fastcore, and wwf currently running at the time of writing this:

  • fastai: 2.1.10
  • fastcore: 1.3.13
  • wwf: 0.0.5

{% endraw %}

What is K-Fold Cross-Validation?

  • A way to get the most out of your data
  • More models
  • Ensembling
  • Requires more training

What is needed?

  • Training set
  • Test set

  • Why no validation? Each fold takes its turn as the validation set, so we don't set one aside up front (see the sketch below)

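To make that concrete, here is a minimal sketch (a toy example, not from the lesson) using scikit-learn's KFold. With 5 folds, every sample lands in a validation fold exactly once, so we never need to carve out a fixed validation set:

{% raw %}
from sklearn.model_selection import KFold

# Ten samples, five folds: each fold holds out a different 20% for validation
data = list(range(10))
kf = KFold(n_splits=5)
for fold, (train_idx, val_idx) in enumerate(kf.split(data)):
    print(f"Fold {fold}: train={train_idx}, valid={val_idx}")
{% endraw %}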
Importing the Library

We will be doing a vision task so we'll import the vision library

{% raw %}
from fastai.vision.all import *
{% endraw %}

Below you will find the exact imports for everything we use today

{% raw %}
from fastcore.foundation import L

from fastai.callback.fp16 import to_fp16
from fastai.callback.progress import ProgressCallback
from fastai.callback.schedule import fit_one_cycle

from fastai.data.core import Datasets, show_at
from fastai.data.external import untar_data, URLs
from fastai.data.transforms import IntToFloatTensor, Normalize, ToTensor, IndexSplitter, get_image_files, parent_label, Categorize

from fastai.metrics import accuracy

from fastai.vision.augment import aug_transforms, RandomResizedCrop
from fastai.vision.core import PILImage, imagenet_stats
from fastai.vision.learner import cnn_learner

import numpy as np
import random

from sklearn.model_selection import StratifiedKFold

from torchvision.models.resnet import resnet34
{% endraw %}

ImageWoof

{% raw %}
path = untar_data(URLs.IMAGEWOOF)
{% endraw %} {% raw %}
path.ls()
(#2) [Path('/home/ml1/.fastai/data/imagewoof2/val'),Path('/home/ml1/.fastai/data/imagewoof2/train')]
{% endraw %}

Scenario:

  • We have a training set
  • We have a test set
{% raw %}
item_tfms = [ToTensor(), RandomResizedCrop(460, min_scale=0.75, ratio=(1.,1.))]
batch_tfms = [IntToFloatTensor(), *aug_transforms(size=224, max_warp=0), Normalize.from_stats(*imagenet_stats)]
bs=64
{% endraw %}

We'll use the IndexSplitter just to get to know it. What we wind up with is equivalent to an 80/20 RandomSplitter split (a sketch of which follows).
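For reference, the equivalent 80/20 random split would look like this (a sketch; train_imgs is defined a few cells below):

{% raw %}
from fastai.data.transforms import RandomSplitter

splitter = RandomSplitter(valid_pct=0.2, seed=42)  # hold out a random 20%
splits = splitter(train_imgs)                      # (train idxs, valid idxs)
{% endraw %}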

We can see IndexSplitter's source code by doing:

{% raw %}
IndexSplitter??
{% endraw %}
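In short, IndexSplitter takes the indices that should become the validation set and returns a function that splits any list of items into (train, valid) indices. A toy example (not from the lesson):

{% raw %}
from fastai.data.transforms import IndexSplitter

items = ['a', 'b', 'c', 'd', 'e']
splitter = IndexSplitter([2, 3])          # items 2 and 3 go to the validation set
train_idxs, valid_idxs = splitter(items)
print(train_idxs, valid_idxs)             # (#3) [0,1,4] (#2) [2,3]
{% endraw %}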

Next let's get our images

{% raw %}
train_imgs = get_image_files(path/'train')
tst_imgs = get_image_files(path/'val')
{% endraw %}

We'll shuffle our training set so that all classes are almost guaranteed to show up in both splits

{% raw %}
random.shuffle(train_imgs)
{% endraw %} {% raw %}
len(train_imgs)
9025
{% endraw %}

And then we will do the 80/20 split

{% raw %}
train_imgs
(#9025) [Path('/root/.fastai/data/imagewoof2/train/n02093754/n02093754_476.JPEG'),Path('/root/.fastai/data/imagewoof2/train/n02086240/n02086240_573.JPEG'),Path('/root/.fastai/data/imagewoof2/train/n02089973/n02089973_663.JPEG'),Path('/root/.fastai/data/imagewoof2/train/n02088364/n02088364_13644.JPEG'),Path('/root/.fastai/data/imagewoof2/train/n02111889/n02111889_4147.JPEG'),Path('/root/.fastai/data/imagewoof2/train/n02093754/n02093754_594.JPEG'),Path('/root/.fastai/data/imagewoof2/train/n02087394/n02087394_12539.JPEG'),Path('/root/.fastai/data/imagewoof2/train/n02089973/n02089973_9145.JPEG'),Path('/root/.fastai/data/imagewoof2/train/n02115641/n02115641_11714.JPEG'),Path('/root/.fastai/data/imagewoof2/train/n02088364/n02088364_12304.JPEG')...]
{% endraw %} {% raw %}
start_val = len(train_imgs) - int(len(train_imgs)*.2)
idxs = list(range(start_val, len(train_imgs)))
splitter = IndexSplitter(idxs)
splits = splitter(train_imgs)
{% endraw %}

Since we want to include our test set along with these splits, we'll make a split_list holding all three of our splits (train, valid, test)

{% raw %}
split_list = [splits[0], splits[1]]
{% endraw %}

And we'll add in the range for our test set here:

{% raw %}
split_list.append(L(range(len(train_imgs), len(train_imgs)+len(tst_imgs))))
{% endraw %} {% raw %}
split_list
[(#7220) [0,1,2,3,4,5,6,7,8,9...],
 (#1805) [7220,7221,7222,7223,7224,7225,7226,7227,7228,7229...],
 (#3929) [9025,9026,9027,9028,9029,9030,9031,9032,9033,9034...]]
{% endraw %}

Let's check that everything worked as intended. First building the Datasets:

{% raw %}
dsrc = Datasets(train_imgs+tst_imgs, tfms=[[PILImage.create], [parent_label, Categorize]],
                splits = split_list)
{% endraw %}

We can look at an item:

{% raw %}
show_at(dsrc.train, 3)
<matplotlib.axes._subplots.AxesSubplot at 0x7f8b121ee438>
{% endraw %}
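Under the hood each item is the (input, target) tuple produced by our two transform pipelines; a quick type check (not in the original notebook) confirms this:

{% raw %}
x, y = dsrc.train[3]
type(x), type(y)   # (PILImage, TensorCategory)
{% endraw %}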

And if we check n_subsets, we can see there are three (one for each of our splits)

{% raw %}
dsrc.n_subsets
3
{% endraw %}

Now let's build some DataLoaders

{% raw %}
dls = dsrc.dataloaders(bs=bs, after_item=item_tfms, after_batch=batch_tfms)
{% endraw %} {% raw %}
dls.show_batch()
{% endraw %}

We can see the subsets were passed down here as well:

{% raw %}
dls.n_subsets
3
{% endraw %}

This means that while dls.train and dls.valid return what we would expect, we can also index directly into our DataLoaders to find our testing data:

{% raw %}
dls[2].show_batch()
{% endraw %}

Let's do a quick baseline

{% raw %}
learn = cnn_learner(dls, resnet34, pretrained=False, metrics=accuracy).to_fp16()
{% endraw %} {% raw %}
learn.fit_one_cycle(1)
epoch train_loss valid_loss accuracy time
0 2.827092 2.138509 0.223269 01:33
{% endraw %}

Now how do we check it?

We can run learn.validate on our subset

{% raw %}
learn.validate(ds_idx=2)
(#2) [2.11087965965271,0.21252226829528809]
{% endraw %}
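learn.validate returns the loss followed by each metric, so you can unpack it directly (a small convenience, not from the lesson):

{% raw %}
tst_loss, tst_acc = learn.validate(ds_idx=2)
print(f"Test loss: {tst_loss:.4f}, test accuracy: {tst_acc:.4f}")
{% endraw %}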

Now how do I do Cross-Validation?

First let's import our KFold

{% raw %}
from sklearn.model_selection import StratifiedKFold
{% endraw %}

And grab the labels for our training images (note that dsrc.items contains the test images too, so we only take the first len(train_imgs) of them)

{% raw %}
# dsrc.items holds train + test images; we only want labels for the training ones
train_labels = L(dsrc.items[:len(train_imgs)]).map(dsrc.tfms[1])
{% endraw %}

Now let's make our K-Fold. We'll use a StratifiedKFold so each fold keeps the overall class proportions; we start with 5 splits here, though the loop below uses 10

{% raw %}
kf = StratifiedKFold(n_splits=5, shuffle=True)
{% endraw %}
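Why stratified rather than a plain KFold? Each fold keeps roughly the same class proportions as the full set. A quick toy example (not from the lesson):

{% raw %}
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.zeros(12)               # features are ignored when computing the split
y = np.array([0]*8 + [1]*4)    # imbalanced labels: two thirds class 0
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
for fold, (_, val_idx) in enumerate(skf.split(X, y)):
    print(f"Fold {fold}: valid labels -> {y[val_idx]}")  # always two 0s, one 1
{% endraw %}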

Finally, we need to define a training loop that iterates over all of our folds, gathering the validation accuracy and test predictions as it goes

{% raw %}
n_splits = 10
{% endraw %} {% raw %}
random.shuffle(train_imgs)
# train_imgs was just reshuffled, so recompute the labels to keep them aligned
train_labels = L(train_imgs).map(parent_label)
{% endraw %}

What's our loop going to look like?

{% raw %}
val_pct = []
tst_preds = []
skf = StratifiedKFold(n_splits=n_splits, shuffle=True)
for _, val_idx in skf.split(np.array(train_imgs), train_labels):
  # Build this fold's splits: train, valid, plus the fixed test set appended at the end
  splits = IndexSplitter(val_idx)
  split = splits(train_imgs)
  split_list = [split[0], split[1]]
  split_list.append(L(range(len(train_imgs), len(train_imgs)+len(tst_imgs))))
  dsrc = Datasets(train_imgs+tst_imgs, tfms=[[PILImage.create], [parent_label, Categorize]],
                  splits=split_list)
  dls = dsrc.dataloaders(bs=bs, after_item=item_tfms, after_batch=batch_tfms)
  learn = cnn_learner(dls, resnet34, pretrained=False, metrics=accuracy)
  learn.fit_one_cycle(1)
  # Record this fold's validation accuracy and its predictions on the test set
  val_pct.append(learn.validate()[1])
  a,b = learn.get_preds(ds_idx=2)
  tst_preds.append(a)
fold epoch train_loss valid_loss accuracy time
1 0 2.719507 2.042494 0.238095 01:31
2 0 2.751266 2.086211 0.227021 01:34
3 0 2.707963 2.138007 0.234773 01:37
4 0 2.796831 2.056918 0.256921 01:33
5 0 2.770414 2.128132 0.211517 01:35
6 0 2.797058 2.139611 0.211752 01:30
7 0 2.778126 2.101697 0.252772 01:29
8 0 2.709981 2.061131 0.258315 01:27
9 0 2.767529 2.067217 0.252772 01:26
10 0 2.787489 2.056555 0.252772 01:26
{% endraw %}
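Notice we also stored each fold's validation accuracy in val_pct; summarizing it gives a cross-validated estimate of model performance (a sketch, not from the lesson):

{% raw %}
print(f"CV accuracy: {np.mean(val_pct):.4f} +/- {np.std(val_pct):.4f}")
{% endraw %}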

Now how do we combine all our predictions? We sum them together, then divide by the number of folds (this is known as a voting ensemble)

First let's check the accuracy of one fold:

{% raw %}
tst_preds_copy = tst_preds.copy()
accuracy(tst_preds_copy[0], b)
TensorCategory(0.2627)
{% endraw %}

Then we can print the test accuracy of each fold. We can see our highest single-fold accuracy on the test set was 26.27%

{% raw %}
for i in tst_preds_copy:
  print(accuracy(i, b))
TensorCategory(0.2627)
TensorCategory(0.2451)
TensorCategory(0.2349)
TensorCategory(0.2527)
TensorCategory(0.2349)
TensorCategory(0.2306)
TensorCategory(0.2403)
TensorCategory(0.2568)
TensorCategory(0.2451)
TensorCategory(0.2420)
{% endraw %}

Now let's perform our vote:

{% raw %}
# Clone first so the in-place additions don't also modify tst_preds[0]
hat = tst_preds[0].clone()
for pred in tst_preds[1:]:
  hat += pred
{% endraw %} {% raw %}
hat
tensor([[0.3902, 1.5066, 0.9855,  ..., 0.2361, 0.2046, 2.4240],
        [0.4634, 1.7225, 0.9483,  ..., 0.2826, 0.1854, 1.1360],
        [0.2495, 2.1242, 0.4055,  ..., 0.0850, 0.1285, 1.2317],
        ...,
        [0.2474, 2.0897, 0.2683,  ..., 0.1040, 0.1051, 2.4141],
        [0.6661, 1.5541, 0.5160,  ..., 0.4528, 0.5402, 2.0298],
        [0.3856, 1.6319, 0.5222,  ..., 0.1828, 0.1399, 2.2385]])
{% endraw %} {% raw %}
hat /= len(tst_preds)
{% endraw %}
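If you prefer, the same averaging can be done in one line with plain PyTorch (a sketch, assuming tst_preds is the list we built above):

{% raw %}
import torch

hat = torch.stack(tst_preds).mean(dim=0)  # average the per-fold test predictions
{% endraw %}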

And see what our new accuracy is

{% raw %}
accuracy(hat, b)
TensorCategory(0.2899)
{% endraw %}

That's an improvement of roughly 2.5%! Not bad!

Ensembling in this way can have diminishing returns, so the right number of folds is something you should work out through trial and error on subsamples of your dataset first (or, if you're on Kaggle, see how many folds other folks are using!)