Distribute hard-to-classify datapoints over CV folds

‘Instance hardness’ refers to how difficult an instance is to classify. The way hard-to-classify instances are distributed over the training and test sets has a significant effect on test-set performance metrics. In this example, we show how to deal with this problem and compare the result with the standard StratifiedKFold cross-validation splitter.

# Authors: Frits Hermans, https://fritshermans.github.io
# License: MIT
print(__doc__)

Create an imbalanced dataset with instance hardness

We create an imbalanced dataset using scikit-learn’s make_blobs function, with a minority class that makes up 5% of the samples.

import numpy as np
from matplotlib import pyplot as plt
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=[950, 50], centers=((-3, 0), (3, 0)), random_state=10)
_ = plt.scatter(X[:, 0], X[:, 1], c=y)
[Figure: scatter plot of the imbalanced dataset, colored by class]
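As a quick sanity check on the class ratio, the class counts can be inspected directly. This is only a sketch that repeats the make_blobs call above; the names class_counts and minority_ratio are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs

# Same call as in the example: 950 samples for class 0, 50 for class 1.
X, y = make_blobs(n_samples=[950, 50], centers=((-3, 0), (3, 0)), random_state=10)

# Count the samples per class label and derive the minority share.
class_counts = np.bincount(y)
minority_ratio = class_counts[1] / class_counts.sum()
print(class_counts, minority_ratio)
```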

To introduce instance hardness into our dataset, we add ten hard-to-classify samples, drawn from blobs whose class centers are swapped relative to the original ones:

X_hard, y_hard = make_blobs(
    n_samples=10, centers=((3, 0), (-3, 0)), cluster_std=1, random_state=10
)
X, y = np.vstack((X, X_hard)), np.hstack((y, y_hard))
_ = plt.scatter(X[:, 0], X[:, 1], c=y)
[Figure: scatter plot of the dataset after adding the hard-to-classify samples, colored by class]
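One way to make "hardness" concrete is to score each sample by one minus the out-of-fold predicted probability of its true class. The sketch below assumes that definition, which is a common one but is not confirmed by this example to be the one used internally by the library; the names hardness and is_added are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Rebuild the dataset from the example: 1000 regular samples plus 10 added ones.
X, y = make_blobs(n_samples=[950, 50], centers=((-3, 0), (3, 0)), random_state=10)
X_hard, y_hard = make_blobs(
    n_samples=10, centers=((3, 0), (-3, 0)), cluster_std=1, random_state=10
)
X, y = np.vstack((X, X_hard)), np.hstack((y, y_hard))

# Out-of-fold class probabilities: each sample is scored by a model
# that did not see it during training.
probas = cross_val_predict(
    LogisticRegression(), X, y, cv=5, method="predict_proba"
)

# Assumed hardness definition: 1 - P(true class). Labels are 0 and 1 here,
# so they can be used directly as column indices.
hardness = 1 - probas[np.arange(len(y)), y]

# The added samples occupy the last 10 rows after stacking.
is_added = np.arange(len(y)) >= 1000
print("mean hardness, added samples:  ", hardness[is_added].mean())
print("mean hardness, regular samples:", hardness[~is_added].mean())
```

Under this definition, the added samples should score much higher on average than the regular ones, which is what makes their placement in the folds matter.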

Compare cross validation scores using StratifiedKFold and InstanceHardnessCV

Now we want to assess a linear predictive model, so we use cross-validation. The most important idea in cross-validation is to create training and test splits that are representative of the data seen in production, so that the resulting statistics reflect what one can expect in production.

With a standard StratifiedKFold cross-validation splitter, we do not control which folds the hard-to-classify samples end up in.
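To see this, one can count how many of the ten added samples land in each test fold. The sketch below rebuilds the dataset from above and relies on the added samples occupying the last ten rows; is_added and hard_per_fold are names introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import StratifiedKFold

# Rebuild the dataset from the example: 1000 regular samples plus 10 added ones.
X, y = make_blobs(n_samples=[950, 50], centers=((-3, 0), (3, 0)), random_state=10)
X_hard, y_hard = make_blobs(
    n_samples=10, centers=((3, 0), (-3, 0)), cluster_std=1, random_state=10
)
X, y = np.vstack((X, X_hard)), np.hstack((y, y_hard))

# The added samples occupy the last 10 rows after stacking.
is_added = np.arange(len(y)) >= 1000

# Stratification balances the class labels per fold, but it knows nothing
# about which samples are hard, so their placement depends on the shuffle.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=10)
hard_per_fold = [int(is_added[test_idx].sum()) for _, test_idx in cv.split(X, y)]
print(hard_per_fold)
```

Changing random_state generally changes these counts, which is one source of fold-to-fold variation in the scores.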

The InstanceHardnessCV splitter lets us control how the hard-to-classify samples are distributed over the folds.
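The idea can be illustrated with scikit-learn alone. The following is a minimal sketch, not the library's implementation: it assumes hardness is one minus the out-of-fold probability of the true class, sorts samples by class and then by hardness, and deals them out to folds in round-robin order. The helper hardness_balanced_folds is hypothetical and exists only for this illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import PredefinedSplit, cross_val_predict


def hardness_balanced_folds(X, y, n_splits=5):
    """Hypothetical helper: assign a test-fold id to every sample so that
    each fold receives a similar mix of classes and hardness levels."""
    probas = cross_val_predict(
        LogisticRegression(), X, y, cv=n_splits, method="predict_proba"
    )
    # Assumed hardness definition: 1 - P(true class); labels must be 0..K-1.
    hardness = 1 - probas[np.arange(len(y)), y]
    # Sort by class first, then by hardness within each class.
    order = np.lexsort((hardness, y))
    # Deal the sorted samples out to the folds in round-robin order.
    fold_ids = np.empty(len(y), dtype=int)
    fold_ids[order] = np.arange(len(y)) % n_splits
    return fold_ids


# Rebuild the dataset from the example: 1000 regular samples plus 10 added ones.
X, y = make_blobs(n_samples=[950, 50], centers=((-3, 0), (3, 0)), random_state=10)
X_hard, y_hard = make_blobs(
    n_samples=10, centers=((3, 0), (-3, 0)), cluster_std=1, random_state=10
)
X, y = np.vstack((X, X_hard)), np.hstack((y, y_hard))

fold_ids = hardness_balanced_folds(X, y, n_splits=5)
cv = PredefinedSplit(test_fold=fold_ids)  # usable wherever a cv object is accepted

# Count the added (hard) samples per test fold.
is_added = np.arange(len(y)) >= 1000
hard_per_fold = [int(is_added[test_idx].sum()) for _, test_idx in cv.split()]
print(hard_per_fold)
```

Because neighbouring samples in the hardness ranking go to different folds, the hardest samples of each class tend to be spread out rather than clustered in one fold.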

Let’s run an experiment to compare the results obtained with both splitters. We use a LogisticRegression classifier and skore.evaluate to compute the cross-validation scores.

import skore
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

from imblearn.model_selection import InstanceHardnessCV

logistic_regression = LogisticRegression()

splitters = {
    "StratifiedKFold": StratifiedKFold(n_splits=5, shuffle=True, random_state=10),
    "InstanceHardnessCV": InstanceHardnessCV(estimator=LogisticRegression()),
}

reports = {}
for name, cv in splitters.items():
    reports[name] = skore.evaluate(logistic_regression, X, y, splitter=cv)

import pandas as pd

results = {}
for name, report in reports.items():
    scores = report.metrics.summarize().frame()
    results[name] = scores
results = pd.concat(results)
results
                                                       LogisticRegression
Splitter            Metric            Label / Average      mean       std
StratifiedKFold     Accuracy                           0.987129  0.005644
                    Precision         0                0.991742  0.008597
                                      1                0.920476  0.088378
                    Recall            0                0.994764  0.006412
                                      1                0.854545  0.152120
                    ROC AUC                            0.954498  0.078860
                    Log loss                           0.059978  0.035528
                    Brier score                        0.011214  0.004682
                    Fit time (s)                       0.002103  0.001512
                    Predict time (s)                   0.000129  0.000136
InstanceHardnessCV  Accuracy                           0.988119  0.002711
                    Precision         0                0.992692  0.002838
                                      1                0.905455  0.004979
                    Recall            0                0.994764  0.000000
                                      1                0.872727  0.049793
                    ROC AUC                            0.951404  0.028972
                    Log loss                           0.056310  0.004036
                    Brier score                        0.010882  0.000819
                    Fit time (s)                       0.001204  0.000086
                    Predict time (s)                   0.000063  0.000004


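The InstanceHardnessCV splitter yields less fold-to-fold variation than the StratifiedKFold splitter: the standard deviation of every reported score is lower, for example 0.029 versus 0.079 for ROC AUC and 0.050 versus 0.152 for recall on class 1. When doing hyperparameter tuning or feature selection with a wrapper method (such as RFECV), this gives more stable results.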
