
.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/ensemble/plot_bagging_classifier.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_ensemble_plot_bagging_classifier.py>`
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_ensemble_plot_bagging_classifier.py:


=================================
Bagging classifiers using sampler
=================================

In this example, we show how
:class:`~imblearn.ensemble.BalancedBaggingClassifier` can be used to create a
large variety of classifiers by giving different samplers.

We will give several examples that have been published in the past years.

.. GENERATED FROM PYTHON SOURCE LINES 12-16

.. code-block:: Python


    # Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>
    # License: MIT








.. GENERATED FROM PYTHON SOURCE LINES 17-19

.. code-block:: Python

    print(__doc__)








.. GENERATED FROM PYTHON SOURCE LINES 20-26

Generate an imbalanced dataset
------------------------------

For this example, we will create a synthetic dataset using the function
:func:`~sklearn.datasets.make_classification`. The problem will be a toy
classification problem with a ratio of 1:9 between the two classes.

.. GENERATED FROM PYTHON SOURCE LINES 28-38

.. code-block:: Python

    from sklearn.datasets import make_classification

    X, y = make_classification(
        n_samples=10_000,
        n_features=10,
        weights=[0.1, 0.9],
        class_sep=0.5,
        random_state=0,
    )








.. GENERATED FROM PYTHON SOURCE LINES 39-43

.. code-block:: Python

    import pandas as pd

    pd.Series(y).value_counts(normalize=True)





.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    1    0.8977
    0    0.1023
    Name: proportion, dtype: float64



.. GENERATED FROM PYTHON SOURCE LINES 44-51

In the following sections, we will show a couple of algorithms that have
been proposed over the years. We intend to illustrate how one can reuse the
:class:`~imblearn.ensemble.BalancedBaggingClassifier` by passing different
samplers.

We collect all estimators and use ``skore.evaluate`` to compare them
with cross-validation.

.. GENERATED FROM PYTHON SOURCE LINES 53-63

.. code-block:: Python

    from sklearn.ensemble import BaggingClassifier

    from imblearn.ensemble import BalancedBaggingClassifier
    from imblearn.over_sampling import SMOTE, RandomOverSampler
    from imblearn.under_sampling import RandomUnderSampler

    estimators = {}

    estimators["Bagging"] = BaggingClassifier()








.. GENERATED FROM PYTHON SOURCE LINES 64-72

Exactly Balanced Bagging and Over-Bagging
-----------------------------------------

The :class:`~imblearn.ensemble.BalancedBaggingClassifier` can be used in
conjunction with a :class:`~imblearn.under_sampling.RandomUnderSampler` or
:class:`~imblearn.over_sampling.RandomOverSampler`. These methods are
referred to as Exactly Balanced Bagging and Over-Bagging, respectively, and
were first proposed in [1]_.
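
To make the difference between the two resampling strategies concrete, the
snippet below is a minimal sketch, using only the standard library, of what
random under-sampling and random over-sampling do to the class counts of a toy
label vector. The helpers ``random_under_sample`` and ``random_over_sample``
are hypothetical names introduced for illustration; they are not part of
imbalanced-learn, whose samplers offer more options than shown here.

.. code-block:: Python

    import random
    from collections import Counter


    def random_under_sample(y, rng):
        """Return indices keeping as many samples per class as the smallest class."""
        counts = Counter(y)
        n_target = min(counts.values())
        indices = []
        for label in counts:
            candidates = [i for i, yi in enumerate(y) if yi == label]
            # draw without replacement: samples are dropped from larger classes
            indices.extend(rng.sample(candidates, n_target))
        return indices


    def random_over_sample(y, rng):
        """Return indices with every class repeated up to the largest class size."""
        counts = Counter(y)
        n_target = max(counts.values())
        indices = []
        for label in counts:
            candidates = [i for i, yi in enumerate(y) if yi == label]
            # keep every original sample, then draw the missing ones with replacement
            n_missing = n_target - len(candidates)
            indices.extend(candidates)
            indices.extend(rng.choices(candidates, k=n_missing))
        return indices


    rng = random.Random(0)
    y_toy = [0] * 10 + [1] * 90  # toy labels with a 1:9 imbalance

    under_counts = Counter(y_toy[i] for i in random_under_sample(y_toy, rng))
    over_counts = Counter(y_toy[i] for i in random_over_sample(y_toy, rng))
    print(under_counts)  # both classes have 10 samples
    print(over_counts)  # both classes have 90 samples

Under-sampling yields small balanced bags, which tends to make fitting faster,
while over-sampling keeps every original sample at the cost of larger bags.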

.. GENERATED FROM PYTHON SOURCE LINES 74-75

Exactly Balanced Bagging

.. GENERATED FROM PYTHON SOURCE LINES 75-82

.. code-block:: Python

    estimators["Exactly Balanced Bagging"] = BalancedBaggingClassifier(
        sampler=RandomUnderSampler()
    )

    # Over-bagging
    estimators["Over-Bagging"] = BalancedBaggingClassifier(sampler=RandomOverSampler())








.. GENERATED FROM PYTHON SOURCE LINES 83-90

SMOTE-Bagging
-------------

Instead of using a :class:`~imblearn.over_sampling.RandomOverSampler`, which
draws a bootstrap sample, an alternative is to use
:class:`~imblearn.over_sampling.SMOTE` as an over-sampler. This is known as
SMOTE-Bagging [2]_.
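
The key difference is that SMOTE creates new points instead of duplicating
existing ones. The snippet below is a simplified sketch of the interpolation
step, using only the standard library: a synthetic sample is placed at a
random position on the segment joining a minority sample and one of its
nearest minority neighbors. The function ``smote_like_sample`` is a
hypothetical helper and omits several details of
:class:`~imblearn.over_sampling.SMOTE`.

.. code-block:: Python

    import math
    import random


    def smote_like_sample(X_minority, k_neighbors, rng):
        """Generate one synthetic point by interpolating between two neighbors."""
        # pick a random minority sample as the starting point
        base = rng.choice(X_minority)
        # rank the other minority samples by Euclidean distance to the base
        others = [x for x in X_minority if x is not base]
        others.sort(key=lambda x: math.dist(base, x))
        neighbor = rng.choice(others[:k_neighbors])
        # interpolate: base + lam * (neighbor - base) with lam in [0, 1)
        lam = rng.random()
        synthetic = [b + lam * (n - b) for b, n in zip(base, neighbor)]
        return base, neighbor, synthetic


    rng = random.Random(0)
    X_minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
    base, neighbor, synthetic = smote_like_sample(X_minority, k_neighbors=2, rng=rng)
    print(base, neighbor, synthetic)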

.. GENERATED FROM PYTHON SOURCE LINES 92-93

SMOTE-Bagging

.. GENERATED FROM PYTHON SOURCE LINES 93-95

.. code-block:: Python

    estimators["SMOTE-Bagging"] = BalancedBaggingClassifier(sampler=SMOTE())








.. GENERATED FROM PYTHON SOURCE LINES 96-109

Roughly Balanced Bagging
------------------------

While using a :class:`~imblearn.under_sampling.RandomUnderSampler` or
:class:`~imblearn.over_sampling.RandomOverSampler` will create exactly the
desired number of samples, it does not follow the statistical spirit of the
bagging framework. The authors in [3]_ propose to use a negative binomial
distribution to compute the number of samples of the majority class to be
selected and then perform a random under-sampling.
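
To see why a negative binomial distribution leads to *roughly* balanced bags,
recall that with a success probability of 0.5, the expected number of failures
before ``n`` successes is ``n``. The sketch below checks this by simulation,
using only the standard library. The helper ``draw_negative_binomial`` is a
hypothetical name introduced for illustration; it counts failures before a
given number of successes, which is the quantity that NumPy's
``negative_binomial`` samples.

.. code-block:: Python

    import random
    import statistics


    def draw_negative_binomial(n_successes, p, rng):
        """Count the failures observed before reaching ``n_successes`` successes."""
        n_failures = 0
        n_seen = 0
        while n_seen < n_successes:
            if rng.random() < p:
                n_seen += 1
            else:
                n_failures += 1
        return n_failures


    rng = random.Random(0)
    n_minority = 100
    # each draw plays the role of the majority-class size of one bag
    majority_sizes = [draw_negative_binomial(n_minority, 0.5, rng) for _ in range(2_000)]

    print(statistics.mean(majority_sizes))  # close to n_minority on average
    print(min(majority_sizes), max(majority_sizes))  # but it varies from bag to bag

The majority class therefore matches the minority class in size only on
average, so the class ratio differs from one bag to the next.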

Here, we illustrate this method by implementing a function in charge of
resampling and use the :class:`~imblearn.FunctionSampler` to pass it as the
sampler of a :class:`~imblearn.ensemble.BalancedBaggingClassifier`.

.. GENERATED FROM PYTHON SOURCE LINES 111-151

.. code-block:: Python

    from collections import Counter

    import numpy as np

    from imblearn import FunctionSampler


    def roughly_balanced_bagging(X, y, replace=False):
        """Implementation of Roughly Balanced Bagging for binary problem."""
        # find the minority and majority classes
        class_counts = Counter(y)
        majority_class = max(class_counts, key=class_counts.get)
        minority_class = min(class_counts, key=class_counts.get)

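        # compute the number of samples to draw from the majority class using
        # a negative binomial distribution
        n_minority_class = class_counts[minority_class]
        n_majority_resampled = np.random.negative_binomial(n=n_minority_class, p=0.5)
        if not replace:
            # without replacement, we cannot draw more samples than available
            n_majority_resampled = min(n_majority_resampled, class_counts[majority_class])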

        # draw randomly with or without replacement
        majority_indices = np.random.choice(
            np.flatnonzero(y == majority_class),
            size=n_majority_resampled,
            replace=replace,
        )
        minority_indices = np.random.choice(
            np.flatnonzero(y == minority_class),
            size=n_minority_class,
            replace=replace,
        )
        indices = np.hstack([majority_indices, minority_indices])

        return X[indices], y[indices]


    # Roughly Balanced Bagging
    estimators["Roughly Balanced Bagging"] = BalancedBaggingClassifier(
        sampler=FunctionSampler(func=roughly_balanced_bagging, kw_args={"replace": True})
    )








.. GENERATED FROM PYTHON SOURCE LINES 152-154

Now, we can use ``skore.evaluate`` to evaluate each estimator with
cross-validation and compare the results.

.. GENERATED FROM PYTHON SOURCE LINES 156-168

.. code-block:: Python

    import pandas as pd
    import skore

    results = {}
    for name, est in estimators.items():
        report = skore.evaluate(est, X, y, splitter=5)
        results[name] = report.metrics.summarize().frame()

    df_results = pd.concat(results)
    df_results







.. raw:: html

    <div class="output_subarea output_html rendered_html output_result">
    <div>
    <style scoped>
        .dataframe tbody tr th:only-of-type {
            vertical-align: middle;
        }

        .dataframe tbody tr th {
            vertical-align: top;
        }

        .dataframe thead tr th {
            text-align: left;
        }

        .dataframe thead tr:last-of-type th {
            text-align: right;
        }
    </style>
    <table border="1" class="dataframe">
      <thead>
        <tr>
          <th></th>
          <th></th>
          <th></th>
          <th colspan="2" halign="left">BaggingClassifier</th>
          <th colspan="2" halign="left">BalancedBaggingClassifier</th>
        </tr>
        <tr>
          <th></th>
          <th></th>
          <th></th>
          <th>mean</th>
          <th>std</th>
          <th>mean</th>
          <th>std</th>
        </tr>
        <tr>
          <th></th>
          <th>Metric</th>
          <th>Label / Average</th>
          <th></th>
          <th></th>
          <th></th>
          <th></th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <th rowspan="10" valign="top">Bagging</th>
          <th>Accuracy</th>
          <th></th>
          <td>0.920700</td>
          <td>0.003365</td>
          <td>NaN</td>
          <td>NaN</td>
        </tr>
        <tr>
          <th rowspan="2" valign="top">Precision</th>
          <th>0</th>
          <td>0.667368</td>
          <td>0.020236</td>
          <td>NaN</td>
          <td>NaN</td>
        </tr>
        <tr>
          <th>1</th>
          <td>0.939344</td>
          <td>0.002817</td>
          <td>NaN</td>
          <td>NaN</td>
        </tr>
        <tr>
          <th rowspan="2" valign="top">Recall</th>
          <th>0</th>
          <td>0.447700</td>
          <td>0.026981</td>
          <td>NaN</td>
          <td>NaN</td>
        </tr>
        <tr>
          <th>1</th>
          <td>0.974602</td>
          <td>0.001455</td>
          <td>NaN</td>
          <td>NaN</td>
        </tr>
        <tr>
          <th>ROC AUC</th>
          <th></th>
          <td>0.800696</td>
          <td>0.019618</td>
          <td>NaN</td>
          <td>NaN</td>
        </tr>
        <tr>
          <th>Log loss</th>
          <th></th>
          <td>0.940768</td>
          <td>0.089716</td>
          <td>NaN</td>
          <td>NaN</td>
        </tr>
        <tr>
          <th>Brier score</th>
          <th></th>
          <td>0.066303</td>
          <td>0.003218</td>
          <td>NaN</td>
          <td>NaN</td>
        </tr>
        <tr>
          <th>Fit time (s)</th>
          <th></th>
          <td>0.522311</td>
          <td>0.013803</td>
          <td>NaN</td>
          <td>NaN</td>
        </tr>
        <tr>
          <th>Predict time (s)</th>
          <th></th>
          <td>0.001850</td>
          <td>0.000119</td>
          <td>NaN</td>
          <td>NaN</td>
        </tr>
        <tr>
          <th rowspan="10" valign="top">Exactly Balanced Bagging</th>
          <th>Accuracy</th>
          <th></th>
          <td>NaN</td>
          <td>NaN</td>
          <td>0.811200</td>
          <td>0.003962</td>
        </tr>
        <tr>
          <th rowspan="2" valign="top">Precision</th>
          <th>0</th>
          <td>NaN</td>
          <td>NaN</td>
          <td>0.311032</td>
          <td>0.011648</td>
        </tr>
        <tr>
          <th>1</th>
          <td>NaN</td>
          <td>NaN</td>
          <td>0.959820</td>
          <td>0.005335</td>
        </tr>
        <tr>
          <th rowspan="2" valign="top">Recall</th>
          <th>0</th>
          <td>NaN</td>
          <td>NaN</td>
          <td>0.696944</td>
          <td>0.042534</td>
        </tr>
        <tr>
          <th>1</th>
          <td>NaN</td>
          <td>NaN</td>
          <td>0.824217</td>
          <td>0.003747</td>
        </tr>
        <tr>
          <th>ROC AUC</th>
          <th></th>
          <td>NaN</td>
          <td>NaN</td>
          <td>0.835951</td>
          <td>0.016180</td>
        </tr>
        <tr>
          <th>Log loss</th>
          <th></th>
          <td>NaN</td>
          <td>NaN</td>
          <td>0.641244</td>
          <td>0.052535</td>
        </tr>
        <tr>
          <th>Brier score</th>
          <th></th>
          <td>NaN</td>
          <td>NaN</td>
          <td>0.120250</td>
          <td>0.002101</td>
        </tr>
        <tr>
          <th>Fit time (s)</th>
          <th></th>
          <td>NaN</td>
          <td>NaN</td>
          <td>0.104036</td>
          <td>0.002062</td>
        </tr>
        <tr>
          <th>Predict time (s)</th>
          <th></th>
          <td>NaN</td>
          <td>NaN</td>
          <td>0.002073</td>
          <td>0.000079</td>
        </tr>
        <tr>
          <th rowspan="10" valign="top">Over-Bagging</th>
          <th>Accuracy</th>
          <th></th>
          <td>NaN</td>
          <td>NaN</td>
          <td>0.925900</td>
          <td>0.002881</td>
        </tr>
        <tr>
          <th rowspan="2" valign="top">Precision</th>
          <th>0</th>
          <td>NaN</td>
          <td>NaN</td>
          <td>0.734316</td>
          <td>0.018370</td>
        </tr>
        <tr>
          <th>1</th>
          <td>NaN</td>
          <td>NaN</td>
          <td>0.938188</td>
          <td>0.003085</td>
        </tr>
        <tr>
          <th rowspan="2" valign="top">Recall</th>
          <th>0</th>
          <td>NaN</td>
          <td>NaN</td>
          <td>0.432042</td>
          <td>0.031013</td>
        </tr>
        <tr>
          <th>1</th>
          <td>NaN</td>
          <td>NaN</td>
          <td>0.982176</td>
          <td>0.001810</td>
        </tr>
        <tr>
          <th>ROC AUC</th>
          <th></th>
          <td>NaN</td>
          <td>NaN</td>
          <td>0.792060</td>
          <td>0.021648</td>
        </tr>
        <tr>
          <th>Log loss</th>
          <th></th>
          <td>NaN</td>
          <td>NaN</td>
          <td>1.005626</td>
          <td>0.142207</td>
        </tr>
        <tr>
          <th>Brier score</th>
          <th></th>
          <td>NaN</td>
          <td>NaN</td>
          <td>0.065323</td>
          <td>0.002697</td>
        </tr>
        <tr>
          <th>Fit time (s)</th>
          <th></th>
          <td>NaN</td>
          <td>NaN</td>
          <td>0.677937</td>
          <td>0.016123</td>
        </tr>
        <tr>
          <th>Predict time (s)</th>
          <th></th>
          <td>NaN</td>
          <td>NaN</td>
          <td>0.002257</td>
          <td>0.000191</td>
        </tr>
        <tr>
          <th rowspan="10" valign="top">SMOTE-Bagging</th>
          <th>Accuracy</th>
          <th></th>
          <td>NaN</td>
          <td>NaN</td>
          <td>0.878100</td>
          <td>0.007701</td>
        </tr>
        <tr>
          <th rowspan="2" valign="top">Precision</th>
          <th>0</th>
          <td>NaN</td>
          <td>NaN</td>
          <td>0.431821</td>
          <td>0.023860</td>
        </tr>
        <tr>
          <th>1</th>
          <td>NaN</td>
          <td>NaN</td>
          <td>0.952220</td>
          <td>0.003762</td>
        </tr>
        <tr>
          <th rowspan="2" valign="top">Recall</th>
          <th>0</th>
          <td>NaN</td>
          <td>NaN</td>
          <td>0.599235</td>
          <td>0.033551</td>
        </tr>
        <tr>
          <th>1</th>
          <td>NaN</td>
          <td>NaN</td>
          <td>0.909879</td>
          <td>0.008846</td>
        </tr>
        <tr>
          <th>ROC AUC</th>
          <th></th>
          <td>NaN</td>
          <td>NaN</td>
          <td>0.833201</td>
          <td>0.018703</td>
        </tr>
        <tr>
          <th>Log loss</th>
          <th></th>
          <td>NaN</td>
          <td>NaN</td>
          <td>0.645382</td>
          <td>0.117393</td>
        </tr>
        <tr>
          <th>Brier score</th>
          <th></th>
          <td>NaN</td>
          <td>NaN</td>
          <td>0.089085</td>
          <td>0.003554</td>
        </tr>
        <tr>
          <th>Fit time (s)</th>
          <th></th>
          <td>NaN</td>
          <td>NaN</td>
          <td>1.135440</td>
          <td>0.014825</td>
        </tr>
        <tr>
          <th>Predict time (s)</th>
          <th></th>
          <td>NaN</td>
          <td>NaN</td>
          <td>0.002435</td>
          <td>0.000080</td>
        </tr>
        <tr>
          <th rowspan="10" valign="top">Roughly Balanced Bagging</th>
          <th>Accuracy</th>
          <th></th>
          <td>NaN</td>
          <td>NaN</td>
          <td>0.842800</td>
          <td>0.006301</td>
        </tr>
        <tr>
          <th rowspan="2" valign="top">Precision</th>
          <th>0</th>
          <td>NaN</td>
          <td>NaN</td>
          <td>0.355000</td>
          <td>0.012089</td>
        </tr>
        <tr>
          <th>1</th>
          <td>NaN</td>
          <td>NaN</td>
          <td>0.956622</td>
          <td>0.004049</td>
        </tr>
        <tr>
          <th rowspan="2" valign="top">Recall</th>
          <th>0</th>
          <td>NaN</td>
          <td>NaN</td>
          <td>0.655887</td>
          <td>0.035462</td>
        </tr>
        <tr>
          <th>1</th>
          <td>NaN</td>
          <td>NaN</td>
          <td>0.864096</td>
          <td>0.008578</td>
        </tr>
        <tr>
          <th>ROC AUC</th>
          <th></th>
          <td>NaN</td>
          <td>NaN</td>
          <td>0.839673</td>
          <td>0.017463</td>
        </tr>
        <tr>
          <th>Log loss</th>
          <th></th>
          <td>NaN</td>
          <td>NaN</td>
          <td>0.580439</td>
          <td>0.047084</td>
        </tr>
        <tr>
          <th>Brier score</th>
          <th></th>
          <td>NaN</td>
          <td>NaN</td>
          <td>0.104134</td>
          <td>0.003359</td>
        </tr>
        <tr>
          <th>Fit time (s)</th>
          <th></th>
          <td>NaN</td>
          <td>NaN</td>
          <td>0.101331</td>
          <td>0.009213</td>
        </tr>
        <tr>
          <th>Predict time (s)</th>
          <th></th>
          <td>NaN</td>
          <td>NaN</td>
          <td>0.001980</td>
          <td>0.000037</td>
        </tr>
      </tbody>
    </table>
    </div>
    </div>
    <br />
    <br />

.. GENERATED FROM PYTHON SOURCE LINES 169-181

.. topic:: References:

   .. [1] R. Maclin, and D. Opitz. "An empirical evaluation of bagging and
          boosting." AAAI/IAAI 1997 (1997): 546-551.

   .. [2] S. Wang, and X. Yao. "Diversity analysis on imbalanced data sets by
          using ensemble models." 2009 IEEE symposium on computational
          intelligence and data mining. IEEE, 2009.

   .. [3] S. Hido, H. Kashima, and Y. Takahashi. "Roughly balanced bagging
          for imbalanced data." Statistical Analysis and Data Mining: The ASA
          Data Science Journal 2.5‐6 (2009): 412-426.


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 37.482 seconds)

**Estimated memory usage:**  276 MB


.. _sphx_glr_download_auto_examples_ensemble_plot_bagging_classifier.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_bagging_classifier.ipynb <plot_bagging_classifier.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_bagging_classifier.py <plot_bagging_classifier.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: plot_bagging_classifier.zip <plot_bagging_classifier.zip>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_
