Bagging classifiers using sampler#
In this example, we show how
BalancedBaggingClassifier can be used to create a
large variety of classifiers by passing different samplers.
We will give several examples that have been published in past years.
# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>
# License: MIT
print(__doc__)
Generate an imbalanced dataset#
For this example, we will create a synthetic dataset using the function
make_classification. The problem will be a toy
classification problem with a ratio of 1:9 between the two classes.
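The generating code is not shown in this section. A minimal sketch of how such a dataset could be built follows. Only the roughly 1:9 class ratio comes from the text; the sample count, number of features, class separation, and random seed are assumptions chosen for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification

# Assumed parameters: only the roughly 1:9 class ratio is stated in the text.
X, y = make_classification(
    n_samples=10_000,
    n_features=10,
    weights=[0.1, 0.9],  # about 10% of class 0 and 90% of class 1
    class_sep=0.5,
    random_state=0,
)

# Relative frequency of each class
proportions = pd.Series(y).value_counts(normalize=True)
print(proportions)
```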
1 0.8977
0 0.1023
Name: proportion, dtype: float64
In the following sections, we will show a couple of algorithms that have
been proposed over the years. We intend to illustrate how one can reuse the
BalancedBaggingClassifier by passing different
samplers.
We collect all estimators and use skore.evaluate to compare them
with cross-validation.
from sklearn.ensemble import BaggingClassifier
from imblearn.ensemble import BalancedBaggingClassifier
from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
estimators = {}
estimators["Bagging"] = BaggingClassifier()
Exactly Balanced Bagging and Over-Bagging#
The BalancedBaggingClassifier can be used in
conjunction with a RandomUnderSampler or
RandomOverSampler. These methods are
referred to as Exactly Balanced Bagging and Over-Bagging, respectively, and
were first proposed in [1].
# Exactly Balanced Bagging
estimators["Exactly Balanced Bagging"] = BalancedBaggingClassifier(
    sampler=RandomUnderSampler()
)
# Over-bagging
estimators["Over-Bagging"] = BalancedBaggingClassifier(sampler=RandomOverSampler())
SMOTE-Bagging#
Instead of using a RandomOverSampler, which
draws a bootstrap sample, an alternative is to use
SMOTE as the over-sampler. This is known as
SMOTE-Bagging [2].
# SMOTE-Bagging
estimators["SMOTE-Bagging"] = BalancedBaggingClassifier(sampler=SMOTE())
Roughly Balanced Bagging#
While a RandomUnderSampler or
RandomOverSampler creates exactly the
desired number of samples, this does not follow the statistical spirit of
the bagging framework. The authors of [3] propose using a negative
binomial distribution to compute the number of majority-class samples
to select, and then performing random under-sampling.
Here, we illustrate this method by implementing a function in charge of
resampling and using the FunctionSampler to pass it as the
sampler of a BalancedBaggingClassifier.
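As a quick sanity check of why this is only "roughly" balanced, the sketch below draws many majority-class sample sizes from a negative binomial distribution with p=0.5. The minority size of 100 is a hypothetical value chosen for illustration. For this parameterization the mean is n(1 - p)/p = n and the variance is n(1 - p)/p^2 = 2n, so the drawn size matches the minority size on average but varies from one bag to the next.

```python
import numpy as np

rng = np.random.default_rng(0)
n_minority = 100  # hypothetical minority class size

# Number of majority samples to draw for each of 10,000 simulated bags
draws = rng.negative_binomial(n=n_minority, p=0.5, size=10_000)

mean_draw = draws.mean()  # expected to be close to n_minority
std_draw = draws.std()    # expected to be close to sqrt(2 * n_minority)
print(f"mean={mean_draw:.1f}, std={std_draw:.1f}")
```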
from collections import Counter
import numpy as np
from imblearn import FunctionSampler
def roughly_balanced_bagging(X, y, replace=False):
    """Implementation of Roughly Balanced Bagging for binary problem."""
    # find the minority and majority classes
    class_counts = Counter(y)
    majority_class = max(class_counts, key=class_counts.get)
    minority_class = min(class_counts, key=class_counts.get)
    # compute the number of samples to draw from the majority class using
    # a negative binomial distribution
    n_minority_class = class_counts[minority_class]
    n_majority_resampled = np.random.negative_binomial(n=n_minority_class, p=0.5)
    # draw randomly with or without replacement
    majority_indices = np.random.choice(
        np.flatnonzero(y == majority_class),
        size=n_majority_resampled,
        replace=replace,
    )
    minority_indices = np.random.choice(
        np.flatnonzero(y == minority_class),
        size=n_minority_class,
        replace=replace,
    )
    indices = np.hstack([majority_indices, minority_indices])
    return X[indices], y[indices]
# Roughly Balanced Bagging
estimators["Roughly Balanced Bagging"] = BalancedBaggingClassifier(
    sampler=FunctionSampler(func=roughly_balanced_bagging, kw_args={"replace": True})
)
Now, we can use skore.evaluate to evaluate each estimator with
cross-validation and compare the results.
import pandas as pd
import skore
results = {}
for name, est in estimators.items():
    report = skore.evaluate(est, X, y, splitter=5)
    results[name] = report.metrics.summarize().frame()
df_results = pd.concat(results)
df_results