Evaluating a Machine Learning model can be quite tricky. Usually, we split the data set into training and testing sets, train the model on the training set, and test it on the testing set. We then evaluate the model's performance with an error metric to determine its accuracy. This method, however, is not very reliable: the accuracy obtained on one test set can be very different from the accuracy obtained on another _(as shown here)_. K-fold Cross-Validation (CV) addresses this problem by dividing the data into folds and ensuring that each fold is used as a testing set at some point.
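To make that variability concrete, here is a minimal, hypothetical sketch (on a synthetic data set, not our defaults data) of how the test-set accuracy of a Logistic Regression can shift when only the random train/test split changes:
# Hedged sketch: the actual split-variance experiment lives in the linked analysis;
# this only illustrates how accuracy can move with the choice of test set.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=500, random_state=0)  # synthetic stand-in data
for seed in range(3):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print(f"split {seed}: accuracy = {accuracy_score(y_te, model.predict(X_te)):.3f}")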
As in Issue #3, we want to investigate how much the performance score computed using cross-validation depends on the number of folds, e.g. how would our performance estimate change if we used 10-fold rather than 5-fold cross-validation?
Write a function that takes a scikit-learn estimator (Logistic Regression) and a dataset, then computes an evaluation metric using repeated K-fold cross-validation over a grid of K values from 1 to n. It should output a table of K against the average metric value across the folds, one for each repeat. A preliminary analysis of our data set (defaults.csv) can be referenced here.
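As a rough sketch of what such a function could look like (an illustration built on scikit-learn's RepeatedKFold and cross_val_score, not the summary.cv_folds helper used below; the function name, column names, and default metric are assumptions):
# Sketch of the requested helper, assuming the data is already split into X and y.
# Note that KFold requires K >= 2, so the practical grid starts at 2.
import numpy as np
import pandas as pd
from sklearn.model_selection import RepeatedKFold, cross_val_score

def cv_over_k(estimator, X, y, k_values, n_repeats=5, scoring="accuracy", random_state=0):
    """Repeated K-fold CV for each K; one row per (K, repeat) with the mean fold score."""
    rows = []
    for k in k_values:
        cv = RepeatedKFold(n_splits=k, n_repeats=n_repeats, random_state=random_state)
        scores = cross_val_score(estimator, X, y, cv=cv, scoring=scoring)
        # cross_val_score returns n_splits * n_repeats scores, ordered repeat by repeat
        for r, repeat_scores in enumerate(scores.reshape(n_repeats, k)):
            rows.append({"K": k, "repeat": r + 1, scoring: repeat_scores.mean()})
    return pd.DataFrame(rows)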
# Basic Computations
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format='retina'
# Dynamic Markdowns
from IPython.display import Markdown as md
import warnings; warnings.simplefilter('ignore')
# Adding the module's sub-directory to Python's path
import os
import sys
sys.path.insert(0, os.path.abspath('../elie_wanko/modules'))
import helpers, summary
df_data = pd.read_csv("defaults_data.csv")
df_data.head()
In our analysis, we consider the range of fold counts shown below.
kf_splits = np.arange(2, 21)
print("\033[1m" + 'K-Folds : ' + str(set(kf_splits)) + "\033[0m")
cv_scores_summary = summary.cv_folds(data=df_data, sizes=kf_splits)
cv_scores_summary.style.apply(helpers.highlight_max)
plt.figure(figsize=(14, 7))
sns.lineplot(data=cv_scores_summary.filter(cv_scores_summary.columns[1:]))
plt.title("Variation of scores at different K-Folds splits")
plt.xlabel("K-Folds Sizes")
plt.ylabel("Scores")
From our observations, varying the number of K-folds has only a slight effect on the metrics of our model for this data set, most probably because the model was unable to converge. To improve on this, we could consider increasing the number of iterations and using some post-processing to save the model once our performance scores reach a maximum and stop changing after a given number of iterations.
Under the current results, our best model uses 2 K-folds. Indeed, the variation in our accuracy_score and precision_score is slightly above 0.1, whereas recall and f1 differ by more than 0.1.
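As a possible follow-up on the convergence point above, a hedged sketch of how one might scale the features and raise max_iter before re-running the cross-validation (the specific settings here are assumptions, not what was actually tuned for this data set):
# Hedged sketch: standardising features and raising max_iter is one common way to help
# LogisticRegression converge; the right values for this data set would still need tuning.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))
# clf could then be passed to the cross-validation helper in place of the bare estimator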