Bagging classifiers using sampler#

In this example, we show how BalancedBaggingClassifier can be used to create a large variety of classifiers by giving different samplers.

We will give several examples of methods that have been published over the past years.

# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>
# License: MIT
print(__doc__)

Generate an imbalanced dataset#

For this example, we will create a synthetic dataset using the function make_classification. The problem will be a toy classification problem with a ratio of 1:9 between the two classes.

from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=10_000,
    n_features=10,
    weights=[0.1, 0.9],
    class_sep=0.5,
    random_state=0,
)
import pandas as pd

pd.Series(y).value_counts(normalize=True)
1    0.8977
0    0.1023
Name: proportion, dtype: float64

In the following sections, we will show several algorithms that have been proposed over the years. We intend to illustrate how one can reuse the BalancedBaggingClassifier by passing different samplers.

We collect all estimators and use skore.evaluate to compare them with cross-validation.

from sklearn.ensemble import BaggingClassifier

from imblearn.ensemble import BalancedBaggingClassifier
from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

estimators = {}

estimators["Bagging"] = BaggingClassifier()

Exactly Balanced Bagging and Over-Bagging#

The BalancedBaggingClassifier can be used in conjunction with a RandomUnderSampler or a RandomOverSampler. These methods are referred to as Exactly Balanced Bagging and Over-Bagging, respectively, and were first proposed in [1].

# Exactly Balanced Bagging

estimators["Exactly Balanced Bagging"] = BalancedBaggingClassifier(
    sampler=RandomUnderSampler()
)

# Over-bagging
estimators["Over-Bagging"] = BalancedBaggingClassifier(sampler=RandomOverSampler())

SMOTE-Bagging#

Instead of using a RandomOverSampler that makes a bootstrap of the minority class, an alternative is to use SMOTE as an over-sampler. This is known as SMOTE-Bagging [2].

# SMOTE-Bagging

estimators["SMOTE-Bagging"] = BalancedBaggingClassifier(sampler=SMOTE())

Roughly Balanced Bagging#

While using a RandomUnderSampler or RandomOverSampler will create exactly the desired number of samples, it does not follow the statistical spirit of the bagging framework. The authors of [3] propose to use a negative binomial distribution to compute the number of samples of the majority class to be selected and then perform a random under-sampling.

Here, we illustrate this method by implementing a function in charge of resampling and wrapping it in a FunctionSampler so that it can be passed to BalancedBaggingClassifier as a sampler.

from collections import Counter

import numpy as np

from imblearn import FunctionSampler


def roughly_balanced_bagging(X, y, replace=False):
    """Implementation of Roughly Balanced Bagging for binary problem."""
    # find the minority and majority classes
    class_counts = Counter(y)
    majority_class = max(class_counts, key=class_counts.get)
    minority_class = min(class_counts, key=class_counts.get)

    # compute the number of samples to draw from the majority class using
    # a negative binomial distribution
    n_minority_class = class_counts[minority_class]
    n_majority_resampled = np.random.negative_binomial(n=n_minority_class, p=0.5)

    # draw randomly with or without replacement
    majority_indices = np.random.choice(
        np.flatnonzero(y == majority_class),
        size=n_majority_resampled,
        replace=replace,
    )
    minority_indices = np.random.choice(
        np.flatnonzero(y == minority_class),
        size=n_minority_class,
        replace=replace,
    )
    indices = np.hstack([majority_indices, minority_indices])

    return X[indices], y[indices]


# Roughly Balanced Bagging
estimators["Roughly Balanced Bagging"] = BalancedBaggingClassifier(
    sampler=FunctionSampler(func=roughly_balanced_bagging, kw_args={"replace": True})
)
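
To see why this draw yields roughly, rather than exactly, balanced bags, the sketch below simulates the negative binomial step on its own. The minority class size and the number of simulated bags are hypothetical values chosen for illustration. With NumPy's parametrization and p=0.5, the distribution has mean n and variance 2n, so the majority count fluctuates around the minority count.

```python
import numpy as np

rng = np.random.default_rng(0)
n_minority = 1_000  # hypothetical minority class size
n_bags = 10_000  # hypothetical number of simulated bags

# Number of majority samples that would be drawn for each simulated bag.
draws = rng.negative_binomial(n=n_minority, p=0.5, size=n_bags)

mean_draw = draws.mean()
std_draw = draws.std()
print(f"empirical mean: {mean_draw:.1f} (theoretical: {n_minority})")
print(f"empirical std:  {std_draw:.1f} (theoretical: {np.sqrt(2 * n_minority):.1f})")
```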

Now, we can use skore.evaluate to evaluate each estimator with cross-validation and compare the results.

import pandas as pd
import skore

results = {}
for name, est in estimators.items():
    report = skore.evaluate(est, X, y, splitter=5)
    results[name] = report.metrics.summarize().frame()

df_results = pd.concat(results)
df_results
                                                             BaggingClassifier     BalancedBaggingClassifier
                                                             mean       std        mean       std
                          Metric            Label / Average
Bagging                   Accuracy                           0.920700   0.003365   NaN        NaN
                          Precision         0                0.667368   0.020236   NaN        NaN
                                            1                0.939344   0.002817   NaN        NaN
                          Recall            0                0.447700   0.026981   NaN        NaN
                                            1                0.974602   0.001455   NaN        NaN
                          ROC AUC                            0.800696   0.019618   NaN        NaN
                          Log loss                           0.940768   0.089716   NaN        NaN
                          Brier score                        0.066303   0.003218   NaN        NaN
                          Fit time (s)                       0.522311   0.013803   NaN        NaN
                          Predict time (s)                   0.001850   0.000119   NaN        NaN
Exactly Balanced Bagging  Accuracy                           NaN        NaN        0.811200   0.003962
                          Precision         0                NaN        NaN        0.311032   0.011648
                                            1                NaN        NaN        0.959820   0.005335
                          Recall            0                NaN        NaN        0.696944   0.042534
                                            1                NaN        NaN        0.824217   0.003747
                          ROC AUC                            NaN        NaN        0.835951   0.016180
                          Log loss                           NaN        NaN        0.641244   0.052535
                          Brier score                        NaN        NaN        0.120250   0.002101
                          Fit time (s)                       NaN        NaN        0.104036   0.002062
                          Predict time (s)                   NaN        NaN        0.002073   0.000079
Over-Bagging              Accuracy                           NaN        NaN        0.925900   0.002881
                          Precision         0                NaN        NaN        0.734316   0.018370
                                            1                NaN        NaN        0.938188   0.003085
                          Recall            0                NaN        NaN        0.432042   0.031013
                                            1                NaN        NaN        0.982176   0.001810
                          ROC AUC                            NaN        NaN        0.792060   0.021648
                          Log loss                           NaN        NaN        1.005626   0.142207
                          Brier score                        NaN        NaN        0.065323   0.002697
                          Fit time (s)                       NaN        NaN        0.677937   0.016123
                          Predict time (s)                   NaN        NaN        0.002257   0.000191
SMOTE-Bagging             Accuracy                           NaN        NaN        0.878100   0.007701
                          Precision         0                NaN        NaN        0.431821   0.023860
                                            1                NaN        NaN        0.952220   0.003762
                          Recall            0                NaN        NaN        0.599235   0.033551
                                            1                NaN        NaN        0.909879   0.008846
                          ROC AUC                            NaN        NaN        0.833201   0.018703
                          Log loss                           NaN        NaN        0.645382   0.117393
                          Brier score                        NaN        NaN        0.089085   0.003554
                          Fit time (s)                       NaN        NaN        1.135440   0.014825
                          Predict time (s)                   NaN        NaN        0.002435   0.000080
Roughly Balanced Bagging  Accuracy                           NaN        NaN        0.842800   0.006301
                          Precision         0                NaN        NaN        0.355000   0.012089
                                            1                NaN        NaN        0.956622   0.004049
                          Recall            0                NaN        NaN        0.655887   0.035462
                                            1                NaN        NaN        0.864096   0.008578
                          ROC AUC                            NaN        NaN        0.839673   0.017463
                          Log loss                           NaN        NaN        0.580439   0.047084
                          Brier score                        NaN        NaN        0.104134   0.003359
                          Fit time (s)                       NaN        NaN        0.101331   0.009213
                          Predict time (s)                   NaN        NaN        0.001980   0.000037


