Benchmark over-sampling methods in a face recognition task

In this face recognition example, two faces from the LFW (Labeled Faces in the Wild) dataset are used. Several over-sampling methods are combined with a 3NN classifier to examine how much an over-sampler improves the quality of the classifier’s output.

# Authors: Christos Aridas
#          Guillaume Lemaitre <g.lemaitre58@gmail.com>
# License: MIT
print(__doc__)

import seaborn as sns

sns.set_context("poster")

Load the dataset

We will use a dataset containing images of known people and build a model that recognizes the person in an image. We turn this into a binary problem by keeping only the pictures of George W. Bush and Bill Clinton.

import numpy as np
from sklearn.datasets import fetch_lfw_people

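data = fetch_lfw_people()
george_bush_id = 1871  # Photos of George W. Bush
bill_clinton_id = 531  # Photos of Bill Clinton
classes = [george_bush_id, bill_clinton_id]
classes_name = np.array(["B. Clinton", "G.W. Bush"], dtype=object)
# Keep only the photos of the two selected people and name the targets.
mask_photos = np.isin(data.target, classes)
X, y = data.data[mask_photos], data.target[mask_photos]
y = classes_name[(y == george_bush_id).astype(np.int8)]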

We can check the ratio between the two classes.

import matplotlib.pyplot as plt
import pandas as pd

class_distribution = pd.Series(y).value_counts(normalize=True)
ax = class_distribution.plot.barh()
ax.set_title("Class distribution")
pos_label = class_distribution.idxmin()
plt.tight_layout()
print(f"The positive label considered as the minority class is {pos_label}")
[Figure: Class distribution (horizontal bar chart of the class proportions)]
The positive label considered as the minority class is B. Clinton

We see that we have an imbalanced classification problem with ~95% of the data belonging to the class G.W. Bush.
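With such a skew, accuracy alone is a weak signal: a model that always answers "G.W. Bush" is already right about 95% of the time while never finding the minority class. The sketch below illustrates this with hypothetical counts (95 and 5 samples, chosen only to mirror the ~95% ratio, not the real dataset sizes).

```python
from collections import Counter

# Hypothetical labels that mirror the ~95% / ~5% split (not the real LFW counts).
labels = ["G.W. Bush"] * 95 + ["B. Clinton"] * 5

counts = Counter(labels)
majority_label, majority_count = counts.most_common(1)[0]
minority_label = min(counts, key=counts.get)

# A trivial model that always predicts the majority class.
predictions = [majority_label] * len(labels)
accuracy = sum(p == t for p, t in zip(predictions, labels)) / len(labels)

# Recall on the minority class: share of minority samples that were found.
found = sum(p == t == minority_label for p, t in zip(predictions, labels))
minority_recall = found / counts[minority_label]

print(f"accuracy={accuracy:.2f}, minority recall={minority_recall:.2f}")
```

This is why the comparison further down also looks at precision, recall and ROC AUC for the minority class.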

Compare over-sampling approaches

We will combine several over-sampling approaches with a kNN classifier and check whether we can tell the two presidents apart. The evaluation is performed through cross-validation with skore.evaluate, and we will plot the resulting ROC curves.

We will create different pipelines and evaluate them.

import skore
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

from imblearn import FunctionSampler
from imblearn.over_sampling import ADASYN, SMOTE, RandomOverSampler
from imblearn.pipeline import make_pipeline

classifier = KNeighborsClassifier(n_neighbors=3)

pipelines = {
    "No resampling": make_pipeline(FunctionSampler(), classifier),
    "Random Over-Sampler": make_pipeline(
        RandomOverSampler(random_state=42), classifier
    ),
    "ADASYN": make_pipeline(ADASYN(random_state=42), classifier),
    "SMOTE": make_pipeline(SMOTE(random_state=42), classifier),
}
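To make the simplest of these strategies concrete, here is a minimal sketch of random over-sampling: minority samples are drawn with replacement until every class is as large as the majority class. The function name `random_over_sample` and the toy data are hypothetical, and this is a simplified illustration rather than imbalanced-learn's implementation.

```python
import random
from collections import Counter


def random_over_sample(samples, labels, seed=42):
    """Duplicate randomly chosen minority samples until all classes are balanced."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    new_samples, new_labels = list(samples), list(labels)
    for label, count in counts.items():
        pool = [s for s, l in zip(samples, labels) if l == label]
        # Draw with replacement: every added point is an exact copy.
        for _ in range(target - count):
            new_samples.append(rng.choice(pool))
            new_labels.append(label)
    return new_samples, new_labels


# Hypothetical 1-D toy data: 6 majority points and 2 minority points.
X_toy = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6], [2.0], [2.2]]
y_toy = ["maj"] * 6 + ["min"] * 2

X_res, y_res = random_over_sample(X_toy, y_toy)
print(Counter(y_res))
```

Because the added points are copies, no new region of the feature space is covered, which is the main difference from the interpolation-based samplers.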

We use skore.evaluate to evaluate each pipeline using a StratifiedKFold cross-validation and compare their performance.

cv = StratifiedKFold(n_splits=3)

reports = {}
for name, model in pipelines.items():
    reports[name] = skore.evaluate(model, X, y, splitter=cv, pos_label=pos_label)

# Gather each report's metric summary into one table (pandas is already imported as pd).

results = {name: r.metrics.summarize().frame() for name, r in reports.items()}
pd.concat(results)
                                        KNeighborsClassifier
                                             mean        std
                     Metric
No resampling        Accuracy            0.949926   0.011066
                     Precision           0.488889   0.269430
                     Recall              0.203704   0.170088
                     ROC AUC             0.695430   0.102237
                     Log loss           30.684276   0.465169
                     Brier score         0.048489   0.007454
                     Fit time (s)        0.001147   0.000171
                     Predict time (s)    0.020389   0.001263
Random Over-Sampler  Accuracy            0.905152   0.017512
                     Precision           0.271384   0.083465
                     Recall              0.477778   0.195316
                     ROC AUC             0.700296   0.094037
                     Log loss           31.329503   0.281080
                     Brier score         0.077141   0.009174
                     Fit time (s)        0.003459   0.000389
                     Predict time (s)    0.017630   0.001382
ADASYN               Accuracy            0.695839   0.043894
                     Precision           0.121987   0.027832
                     Recall              0.785185   0.198865
                     ROC AUC             0.806093   0.085753
                     Log loss           20.463838   1.912165
                     Brier score         0.220684   0.035568
                     Fit time (s)        0.042784   0.024095
                     Predict time (s)    0.016614   0.001087
SMOTE                Accuracy            0.713760   0.033024
                     Precision           0.116084   0.012309
                     Recall              0.685185   0.122894
                     ROC AUC             0.800637   0.080561
                     Log loss           20.476704   2.230097
                     Brier score         0.212727   0.027566
                     Fit time (s)        0.015151   0.005906
                     Predict time (s)    0.016362   0.001455
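The table reports precision and recall for the positive (minority) class. As a reminder of how the two move against each other, here is a small sketch with hypothetical confusion counts; the numbers are illustrative and are not taken from the benchmark.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) for the positive class from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical counts for 10 positive samples among many negatives.
# A cautious model: few positive predictions, half of them correct.
cautious = precision_recall(tp=2, fp=2, fn=8)
# An eager model: many positive predictions, so it finds most positives
# but also raises many false alarms.
eager = precision_recall(tp=8, fp=56, fn=2)

print(f"cautious: precision={cautious[0]:.3f}, recall={cautious[1]:.3f}")
print(f"eager:    precision={eager[0]:.3f}, recall={eager[1]:.3f}")
```

An over-sampler typically pushes a classifier from the first regime towards the second: more minority samples are found, and more majority samples are mislabelled as minority.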


We can also plot the ROC curves for each pipeline.

fig, ax = plt.subplots(figsize=(9, 9))
for name, report in reports.items():
    report.metrics.roc().plot()
plt.show()
[Figures: ROC curve for KNeighborsClassifier, one per pipeline. Positive label: B. Clinton. Data source: Test set.]
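When reading these curves, it helps to remember that a 3-nearest-neighbour classifier with uniform weights can only output four distinct scores for the positive class (0, 1/3, 2/3 and 1), so each curve has only a few steps. The sketch below computes ROC AUC directly from such scores as the share of (positive, negative) pairs that are ranked correctly, counting ties as one half. The labels and scores are hypothetical.

```python
def roc_auc(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    # A tie between a positive and a negative counts as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical scores: share of the 3 neighbours that vote for the positive class.
y_true = [1, 1, 0, 0, 0, 0]
scores = [2 / 3, 1 / 3, 1 / 3, 0.0, 0.0, 0.0]

auc = roc_auc(y_true, scores)
print(f"ROC AUC = {auc:.4f}")
```

With so few possible score values, ties between positive and negative samples are common, which is one reason the ROC AUC values vary noticeably across folds.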

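We see that, for this task, the methods that generate new samples by interpolation (i.e. ADASYN and SMOTE) perform better than random over-sampling or no resampling: their ROC AUC is about 0.80 against about 0.70, and their recall is higher. This comes at the cost of lower precision and accuracy.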
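The interpolation idea can be sketched in a few lines: pick a minority sample, pick one of its nearest minority neighbours, and create a point at a random position on the segment between them. The helper `interpolate_sample` and the toy data are hypothetical; this is a simplified illustration, not imbalanced-learn's implementation of SMOTE or ADASYN.

```python
import math
import random


def interpolate_sample(minority, k=2, rng=None):
    """Create one synthetic point between a minority sample and a near neighbour."""
    rng = rng or random.Random(0)
    base = rng.choice(minority)
    # k nearest minority neighbours of the base point (excluding itself).
    others = [p for p in minority if p is not base]
    neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
    neighbour = rng.choice(neighbours)
    gap = rng.random()  # position along the segment, in [0, 1)
    return [b + gap * (n - b) for b, n in zip(base, neighbour)]


# Hypothetical 2-D minority samples.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
rng = random.Random(42)
synthetic = [interpolate_sample(minority, k=2, rng=rng) for _ in range(3)]
for point in synthetic:
    print([round(v, 3) for v in point])
```

Unlike exact duplicates, these points fill the space between existing minority samples, which changes the neighbourhoods that a kNN classifier sees.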

