Fitting model on imbalanced datasets and how to fight bias#

This example illustrates the problems induced by learning on datasets with imbalanced classes. We then compare different approaches to alleviate these negative effects.

# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>
# License: MIT
print(__doc__)

Problem definition#

We are dropping the following features:

  • “fnlwgt”: this feature was created while studying the “adult” dataset; it was not acquired during the survey, so we do not use it.

  • “education-num”: it encodes the same information as “education”, so we keep only one of these two features.

from sklearn.datasets import fetch_openml

df, y = fetch_openml("adult", version=2, as_frame=True, return_X_y=True)
df = df.drop(columns=["fnlwgt", "education-num"])

The “adult” dataset has a class ratio of about 3:1.

classes_count = y.value_counts()
classes_count
class
<=50K    37155
>50K     11687
Name: count, dtype: int64

This dataset is only slightly imbalanced. To better highlight the effect of learning from an imbalanced dataset, we will increase the ratio to 30:1.

from imblearn.datasets import make_imbalance

ratio, pos_label = 30, ">50K"
df_res, y_res = make_imbalance(
    df,
    y,
    sampling_strategy={classes_count.idxmin(): classes_count.max() // ratio},
)
y_res.value_counts()
class
<=50K    37155
>50K      1238
Name: count, dtype: int64
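
The minority-class target passed to make_imbalance follows directly from the class counts. As a quick arithmetic check, here is a plain-Python sketch with the counts hard-coded from the output of the original dataset (it does not call imbalanced-learn):

```python
# Class counts copied from the output of the original "adult" dataset.
classes_count = {"<=50K": 37155, ">50K": 11687}
ratio = 30

majority_count = max(classes_count.values())
# Integer division mirrors `classes_count.max() // ratio` in the call above.
minority_target = majority_count // ratio

print(minority_target)                             # 1238
print(round(majority_count / minority_target, 2))  # 30.01
```

Because of the integer division, the resulting ratio is very slightly above 30:1.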

We will use skore.evaluate to get an estimate of the test scores using cross-validation.

As a baseline, we could use a classifier which will always predict the majority class independently of the features provided.

import skore
from sklearn.dummy import DummyClassifier

dummy_clf = DummyClassifier(strategy="most_frequent")
report = skore.evaluate(dummy_clf, df_res, y_res, splitter=5, pos_label=pos_label)
report.metrics.summarize().frame()
/Users/glemaitre/Documents/packages/python-stack/scikit-learn/scikit-learn/sklearn/metrics/_classification.py:1884: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", result.shape[0])


                 DummyClassifier
                     mean       std
Metric
Accuracy         0.967755  0.000069
Precision        0.000000  0.000000
Recall           0.000000  0.000000
ROC AUC          0.500000  0.000000
Log loss         1.162244  0.002488
Brier score      0.032245  0.000069
Fit time (s)     0.006850  0.000149
Predict time (s) 0.000107  0.000085
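
The accuracy, precision, and recall of this baseline follow from the class counts alone. The sketch below recomputes them in plain Python on the full resampled dataset, assuming the counts printed after make_imbalance. The cross-validated means can differ marginally from these whole-dataset values, since each fold has nearly, but not exactly, the same class proportions.

```python
# Resampled class counts, copied from the output of make_imbalance above.
n_negative, n_positive = 37155, 1238  # "<=50K" and ">50K"
n_total = n_negative + n_positive

# A majority-class predictor labels every sample as "<=50K".
true_positives = 0
false_positives = 0
true_negatives = n_negative
false_negatives = n_positive

accuracy = (true_positives + true_negatives) / n_total
# No sample is predicted positive, so precision is 0/0; scikit-learn reports
# 0.0 in that case and emits the UndefinedMetricWarning shown above.
predicted_positives = true_positives + false_positives
precision = true_positives / predicted_positives if predicted_positives else 0.0
recall = true_positives / (true_positives + false_negatives)

print(f"accuracy={accuracy:.6f} precision={precision} recall={recall}")
```

An accuracy close to 0.97 together with a recall of 0 shows why accuracy alone is misleading on an imbalanced dataset.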


Strategies to learn from an imbalanced dataset#

We will compare different strategies to learn from an imbalanced dataset by collecting all estimators and evaluating each using skore.evaluate to compute cross-validated metrics.

Dummy baseline#

Before training a real machine learning model, we store our DummyClassifier so that its results can serve as a reference.

estimators = [("Dummy classifier", dummy_clf)]

Linear classifier baseline#

We use skrub.tabular_pipeline to create a machine learning pipeline with proper preprocessing automatically adapted to the estimator. For a LogisticRegression, it will automatically handle missing values, encode categorical columns, and scale numerical columns.

from sklearn.linear_model import LogisticRegression
from skrub import tabular_pipeline

lr_clf = tabular_pipeline(LogisticRegression(max_iter=1000))
estimators.append(("Logistic regression", lr_clf))

We also include a tree-based model, RandomForestClassifier, to check whether it behaves similarly on the imbalanced data. tabular_pipeline will automatically adapt the preprocessing for tree-based models (e.g. no scaling needed).

from sklearn.ensemble import RandomForestClassifier

rf_clf = tabular_pipeline(RandomForestClassifier(random_state=42, n_jobs=2))
estimators.append(("Random forest", rf_clf))

Use class_weight#

Most of the models in scikit-learn have a class_weight parameter. It affects the computation of the loss in linear models, or of the criterion in tree-based models, so that misclassifying a sample from the minority class is penalized differently from misclassifying a sample from the majority class. We can set class_weight="balanced" so that the weight applied is inversely proportional to the class frequency. We test this setting with both the linear model and the tree-based model.

lr_clf_balanced = tabular_pipeline(
    LogisticRegression(max_iter=1000, class_weight="balanced")
)
estimators.append(("Logistic regression with balanced class weights", lr_clf_balanced))

rf_clf_balanced = tabular_pipeline(
    RandomForestClassifier(random_state=42, n_jobs=2, class_weight="balanced")
)
estimators.append(("Random forest with balanced class weights", rf_clf_balanced))
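
For class_weight="balanced", scikit-learn documents the weights as n_samples / (n_classes * np.bincount(y)). The sketch below evaluates that formula in plain Python for the resampled class counts; it only illustrates the arithmetic and does not call scikit-learn.

```python
# Resampled class counts, copied from the output of make_imbalance above.
class_counts = {"<=50K": 37155, ">50K": 1238}

n_samples = sum(class_counts.values())
n_classes = len(class_counts)

# Documented "balanced" heuristic: n_samples / (n_classes * count_of_class).
class_weights = {
    label: n_samples / (n_classes * count) for label, count in class_counts.items()
}

for label, weight in class_weights.items():
    # weight * count is identical for every class, so each class carries the
    # same total weight in the loss or splitting criterion.
    weighted_count = weight * class_counts[label]
    print(f"{label}: weight={weight:.3f}, weighted count={weighted_count:.1f}")
```

With a 30:1 imbalance, an error on a ">50K" sample therefore weighs roughly 30 times more than an error on a "<=50K" sample.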

Resample the training set during learning#

Another way is to resample the training set by under-sampling or over-sampling some of the samples. imbalanced-learn provides some samplers to do such processing.

We need to use the imbalanced-learn pipeline to properly handle the samplers within the pipeline. We insert the sampler before the final estimator in the pipeline created by skrub.tabular_pipeline.

from imblearn.pipeline import make_pipeline as make_pipeline_with_sampler
from imblearn.under_sampling import RandomUnderSampler

# We extract the preprocessing steps and the estimator from the tabular
# pipeline and insert the sampler before the estimator.
lr_clf_undersampled = make_pipeline_with_sampler(
    *lr_clf[:-1], RandomUnderSampler(random_state=42), lr_clf[-1]
)
estimators.append(("Under-sampling + Logistic regression", lr_clf_undersampled))

rf_clf_undersampled = make_pipeline_with_sampler(
    *rf_clf[:-1], RandomUnderSampler(random_state=42), rf_clf[-1]
)
estimators.append(("Under-sampling + Random forest", rf_clf_undersampled))
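
To make the idea of random under-sampling concrete, here is a minimal pure-Python sketch. The helper random_under_sample is hypothetical and written for illustration only; it is not the implementation of RandomUnderSampler, and it assumes the goal is to bring every class down to the size of the smallest one.

```python
import random
from collections import Counter, defaultdict


def random_under_sample(labels, seed=42):
    """Hypothetical helper: indices that keep each class at the minority size."""
    rng = random.Random(seed)
    indices_per_class = defaultdict(list)
    for index, label in enumerate(labels):
        indices_per_class[label].append(index)

    n_minority = min(len(indices) for indices in indices_per_class.values())
    kept = []
    for indices in indices_per_class.values():
        # Sample without replacement; the minority class is kept in full.
        kept.extend(rng.sample(indices, n_minority))
    return sorted(kept)


# Toy labels with the same 30:1 imbalance as the resampled dataset.
labels = ["<=50K"] * 300 + [">50K"] * 10
kept = random_under_sample(labels)
print(Counter(labels[i] for i in kept))
```

Inside an imbalanced-learn pipeline, samplers are applied only when fitting, so the test folds keep their original class distribution during evaluation.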

Use of specific balanced algorithms from imbalanced-learn#

Random under-sampling can be effective with decision trees. However, instead of under-sampling the dataset once, one could under-sample the original dataset before taking each bootstrap sample. This is the basis of imblearn.ensemble.BalancedRandomForestClassifier and BalancedBaggingClassifier.

from imblearn.ensemble import BalancedRandomForestClassifier

brf_clf = tabular_pipeline(
    BalancedRandomForestClassifier(
        sampling_strategy="all",
        replacement=True,
        bootstrap=False,
        random_state=42,
        n_jobs=2,
    )
)
estimators.append(("Balanced random forest", brf_clf))

from sklearn.ensemble import HistGradientBoostingClassifier

from imblearn.ensemble import BalancedBaggingClassifier

bag_clf = tabular_pipeline(
    BalancedBaggingClassifier(
        estimator=HistGradientBoostingClassifier(random_state=42),
        n_estimators=10,
        random_state=42,
        n_jobs=2,
    )
)
estimators.append(("Balanced bag of histogram gradient boosting", bag_clf))
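
The principle shared by these two ensembles can be sketched in a few lines of pure Python. The helper balanced_bags is hypothetical and deliberately simplified: it draws one balanced subset per ensemble member without replacement, whereas the estimators above expose their own replacement and bootstrap options.

```python
import random


def balanced_bags(labels, n_estimators=3, seed=42):
    """Hypothetical helper: one balanced index set per ensemble member."""
    rng = random.Random(seed)
    classes = sorted(set(labels))
    indices_per_class = {
        c: [i for i, label in enumerate(labels) if label == c] for c in classes
    }
    n_minority = min(len(indices) for indices in indices_per_class.values())
    for _ in range(n_estimators):
        bag = []
        for c in classes:
            # A fresh draw for every member: each one sees a different
            # subset of the majority class.
            bag.extend(rng.sample(indices_per_class[c], n_minority))
        yield sorted(bag)


labels = [0] * 300 + [1] * 10
bags = list(balanced_bags(labels))
majority_subsets = [frozenset(i for i in bag if labels[i] == 0) for bag in bags]
covered = set().union(*majority_subsets)
# Number of distinct majority samples seen by at least one member.
print(len(covered))
```

A single under-sampling step would expose the model to only 10 majority samples here; drawing a new subset per member lets the ensemble as a whole see more of the majority class.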

Now, we can use skore.evaluate to evaluate each estimator with cross-validation and collect all results in a single dataframe.

import pandas as pd

results = {}
for name, est in estimators:
    report = skore.evaluate(est, df_res, y_res, splitter=5, pos_label=pos_label)
    results[name] = report.metrics.summarize().frame()

df_results = pd.concat(results)
df_results
/Users/glemaitre/Documents/packages/python-stack/scikit-learn/scikit-learn/sklearn/metrics/_classification.py:1884: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", result.shape[0])


                                                                  DummyClassifier  LogisticRegression  RandomForestClassifier  BalancedRandomForestClassifier   BalancedBaggingClassifier
                                                                   mean       std      mean       std        mean         std            mean             std          mean           std
                                             Metric
Dummy classifier                             Accuracy          0.967755  0.000069       NaN       NaN         NaN         NaN             NaN             NaN           NaN           NaN
                                             Precision         0.000000  0.000000       NaN       NaN         NaN         NaN             NaN             NaN           NaN           NaN
                                             Recall            0.000000  0.000000       NaN       NaN         NaN         NaN             NaN             NaN           NaN           NaN
                                             ROC AUC           0.500000  0.000000       NaN       NaN         NaN         NaN             NaN             NaN           NaN           NaN
                                             Log loss          1.162244  0.002488       NaN       NaN         NaN         NaN             NaN             NaN           NaN           NaN
...                                          ...                    ...       ...       ...       ...         ...         ...             ...             ...           ...           ...
Balanced bag of histogram gradient boosting  ROC AUC                NaN       NaN       NaN       NaN         NaN         NaN             NaN             NaN      0.917346      0.005819
                                             Log loss               NaN       NaN       NaN       NaN         NaN         NaN             NaN             NaN      0.309216      0.004849
                                             Brier score            NaN       NaN       NaN       NaN         NaN         NaN             NaN             NaN      0.103621      0.002158
                                             Fit time (s)           NaN       NaN       NaN       NaN         NaN         NaN             NaN             NaN      6.692415      1.312802
                                             Predict time (s)       NaN       NaN       NaN       NaN         NaN         NaN             NaN             NaN      0.441978      0.007606

72 rows × 10 columns



This last approach is the most effective. Each under-sampled subset differs, which brings diversity to the individual GBDT learners and prevents them from focusing on a single portion of the majority class.

Total running time of the script: (2 minutes 0.982 seconds)

Estimated memory usage: 755 MB
