
.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/applications/plot_impact_imbalanced_classes.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_applications_plot_impact_imbalanced_classes.py>`
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_applications_plot_impact_imbalanced_classes.py:


============================================================
Fitting models on imbalanced datasets and how to fight bias
============================================================

This example illustrates the problems induced by learning on datasets with
imbalanced classes. We then compare different approaches to alleviate these
negative effects.

.. GENERATED FROM PYTHON SOURCE LINES 10-14

.. code-block:: Python


    # Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>
    # License: MIT








.. GENERATED FROM PYTHON SOURCE LINES 15-17

.. code-block:: Python

    print(__doc__)








.. GENERATED FROM PYTHON SOURCE LINES 18-27

Problem definition
------------------

We drop the following features:

- "fnlwgt": this feature was created while studying the "adult" dataset.
  Since it was not acquired during the survey, we do not use it.
- "education-num": it encodes the same information as "education".
  Thus, we remove one of these two features.

.. GENERATED FROM PYTHON SOURCE LINES 29-34

.. code-block:: Python

    from sklearn.datasets import fetch_openml

    df, y = fetch_openml("adult", version=2, as_frame=True, return_X_y=True)
    df = df.drop(columns=["fnlwgt", "education-num"])








.. GENERATED FROM PYTHON SOURCE LINES 35-36

The "adult" dataset has a class ratio of about 3:1.

.. GENERATED FROM PYTHON SOURCE LINES 38-41

.. code-block:: Python

    classes_count = y.value_counts()
    classes_count





.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    class
    <=50K    37155
    >50K     11687
    Name: count, dtype: int64



.. GENERATED FROM PYTHON SOURCE LINES 42-44

This dataset is only slightly imbalanced. To better highlight the effect of
learning from an imbalanced dataset, we will increase its ratio to 30:1.
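
A minimal sketch of the arithmetic behind this target ratio, using the class
counts printed above: if the majority class is left untouched, a 30:1 ratio
corresponds to keeping about 1238 minority samples. The variable names are
illustrative and are not part of the example's source.

.. code-block:: Python

    # Class counts reported by ``y.value_counts()`` above.
    majority_count = 37155  # "<=50K"
    minority_count = 11687  # ">50K"

    # Imbalance of the original dataset, roughly 3:1.
    original_ratio = majority_count / minority_count

    # Number of minority samples to keep for a target ratio of 30:1.
    target_ratio = 30
    target_minority_count = majority_count // target_ratio

    # Ratio actually obtained after the integer division.
    achieved_ratio = majority_count / target_minority_count

    print(f"original ratio: {original_ratio:.2f}:1")
    print(f"minority samples kept: {target_minority_count}")
    print(f"achieved ratio: {achieved_ratio:.2f}:1")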

.. GENERATED FROM PYTHON SOURCE LINES 46-56

.. code-block:: Python

    from imblearn.datasets import make_imbalance

    ratio, pos_label = 30, ">50K"
    df_res, y_res = make_imbalance(
        df,
        y,
        sampling_strategy={classes_count.idxmin(): classes_count.max() // ratio},
    )
    y_res.value_counts()





.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    class
    <=50K    37155
    >50K      1238
    Name: count, dtype: int64



.. GENERATED FROM PYTHON SOURCE LINES 57-62

We will use `skore.evaluate` to get an estimate of the test scores using
cross-validation.

As a baseline, we use a classifier that always predicts the majority class,
regardless of the features provided.
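
To see why such a baseline can look deceptively good, the following sketch
computes by hand the accuracy and recall of a majority-class predictor for
the class counts of the resampled dataset. It is an illustration in plain
Python computed on the whole dataset at once, not the cross-validated
computation performed by `skore`.

.. code-block:: Python

    n_majority = 37155  # "<=50K" samples after resampling
    n_minority = 1238  # ">50K" samples after resampling (positive class)

    # A majority-class predictor labels every sample as "<=50K".
    true_negatives = n_majority
    false_negatives = n_minority
    true_positives = 0

    accuracy = (true_positives + true_negatives) / (n_majority + n_minority)
    recall = true_positives / (true_positives + false_negatives)

    print(f"accuracy: {accuracy:.4f}")
    print(f"recall: {recall:.4f}")

The accuracy is high only because the majority class dominates, while no
sample of the positive class is ever retrieved.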

.. GENERATED FROM PYTHON SOURCE LINES 64-71

.. code-block:: Python

    import skore
    from sklearn.dummy import DummyClassifier

    dummy_clf = DummyClassifier(strategy="most_frequent")
    report = skore.evaluate(dummy_clf, df_res, y_res, splitter=5, pos_label=pos_label)
    report.metrics.summarize().frame()





.. rst-class:: sphx-glr-script-out

 .. code-block:: none



    /Users/glemaitre/Documents/packages/python-stack/scikit-learn/scikit-learn/sklearn/metrics/_classification.py:1884: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.
      _warn_prf(average, modifier, f"{metric.capitalize()} is", result.shape[0])




.. raw:: html

    <div class="output_subarea output_html rendered_html output_result">
    <div>
    <style scoped>
        .dataframe tbody tr th:only-of-type {
            vertical-align: middle;
        }

        .dataframe tbody tr th {
            vertical-align: top;
        }

        .dataframe thead tr th {
            text-align: left;
        }

        .dataframe thead tr:last-of-type th {
            text-align: right;
        }
    </style>
    <table border="1" class="dataframe">
      <thead>
        <tr>
          <th></th>
          <th colspan="2" halign="left">DummyClassifier</th>
        </tr>
        <tr>
          <th></th>
          <th>mean</th>
          <th>std</th>
        </tr>
        <tr>
          <th>Metric</th>
          <th></th>
          <th></th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <th>Accuracy</th>
          <td>0.967755</td>
          <td>0.000069</td>
        </tr>
        <tr>
          <th>Precision</th>
          <td>0.000000</td>
          <td>0.000000</td>
        </tr>
        <tr>
          <th>Recall</th>
          <td>0.000000</td>
          <td>0.000000</td>
        </tr>
        <tr>
          <th>ROC AUC</th>
          <td>0.500000</td>
          <td>0.000000</td>
        </tr>
        <tr>
          <th>Log loss</th>
          <td>1.162244</td>
          <td>0.002488</td>
        </tr>
        <tr>
          <th>Brier score</th>
          <td>0.032245</td>
          <td>0.000069</td>
        </tr>
        <tr>
          <th>Fit time (s)</th>
          <td>0.006850</td>
          <td>0.000149</td>
        </tr>
        <tr>
          <th>Predict time (s)</th>
          <td>0.000107</td>
          <td>0.000085</td>
        </tr>
      </tbody>
    </table>
    </div>
    </div>
    <br />
    <br />

.. GENERATED FROM PYTHON SOURCE LINES 72-84

Strategies to learn from an imbalanced dataset
-----------------------------------------------

We will compare different strategies to learn from an imbalanced dataset by
collecting all estimators and evaluating each using `skore.evaluate` to
compute cross-validated metrics.

Dummy baseline
..............

Before training a real machine learning model, we store our
:class:`~sklearn.dummy.DummyClassifier` baseline in the list of estimators
to evaluate.

.. GENERATED FROM PYTHON SOURCE LINES 86-88

.. code-block:: Python

    estimators = [("Dummy classifier", dummy_clf)]








.. GENERATED FROM PYTHON SOURCE LINES 89-97

Linear classifier baseline
..........................

We use `skrub.tabular_pipeline` to create a machine learning pipeline with
proper preprocessing automatically adapted to the estimator. For a
:class:`~sklearn.linear_model.LogisticRegression`, it will automatically
handle missing values, encode categorical columns, and scale numerical
columns.

.. GENERATED FROM PYTHON SOURCE LINES 99-105

.. code-block:: Python

    from sklearn.linear_model import LogisticRegression
    from skrub import tabular_pipeline

    lr_clf = tabular_pipeline(LogisticRegression(max_iter=1000))
    estimators.append(("Logistic regression", lr_clf))








.. GENERATED FROM PYTHON SOURCE LINES 106-110

We also include a tree-based model such as
:class:`~sklearn.ensemble.RandomForestClassifier` to verify whether it is
affected in a similar way. `tabular_pipeline` will automatically adapt the
preprocessing for tree-based models (e.g. no scaling needed).

.. GENERATED FROM PYTHON SOURCE LINES 112-117

.. code-block:: Python

    from sklearn.ensemble import RandomForestClassifier

    rf_clf = tabular_pipeline(RandomForestClassifier(random_state=42, n_jobs=2))
    estimators.append(("Random forest", rf_clf))








.. GENERATED FROM PYTHON SOURCE LINES 118-128

Use `class_weight`
..................

Many classifiers in `scikit-learn` have a `class_weight` parameter. This
parameter affects the computation of the loss in linear models or of the
criterion in tree-based models, so that misclassifications of the minority
and majority classes are penalized differently. Setting
`class_weight="balanced"` makes the weight applied to each class inversely
proportional to its frequency. We test this parametrization with both a
linear model and a tree-based model.
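
As an illustration of the "balanced" heuristic, the sketch below applies the
formula documented by scikit-learn, `n_samples / (n_classes * class_count)`,
to the class counts of the resampled dataset. The helper name
`balanced_class_weights` is hypothetical and only serves this illustration.

.. code-block:: Python

    def balanced_class_weights(class_counts):
        """Return weights inversely proportional to class frequencies."""
        n_samples = sum(class_counts.values())
        n_classes = len(class_counts)
        return {
            label: n_samples / (n_classes * count)
            for label, count in class_counts.items()
        }


    class_counts = {"<=50K": 37155, ">50K": 1238}
    weights = balanced_class_weights(class_counts)

    for label, weight in weights.items():
        total = weight * class_counts[label]
        print(f"{label}: weight={weight:.3f}, total weight={total:.1f}")

With these weights, both classes contribute the same total weight, so errors
on the minority class are no longer negligible during training.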

.. GENERATED FROM PYTHON SOURCE LINES 130-140

.. code-block:: Python

    lr_clf_balanced = tabular_pipeline(
        LogisticRegression(max_iter=1000, class_weight="balanced")
    )
    estimators.append(("Logistic regression with balanced class weights", lr_clf_balanced))

    rf_clf_balanced = tabular_pipeline(
        RandomForestClassifier(random_state=42, n_jobs=2, class_weight="balanced")
    )
    estimators.append(("Random forest with balanced class weights", rf_clf_balanced))








.. GENERATED FROM PYTHON SOURCE LINES 141-151

Resample the training set during learning
.........................................

Another way is to resample the training set by under-sampling or
over-sampling some of the samples. `imbalanced-learn` provides samplers
for this purpose.

We need to use the `imbalanced-learn` pipeline to properly handle the
samplers within the pipeline: it applies them only when fitting, not when
predicting. We insert the sampler before the final estimator in the pipeline
created by `skrub.tabular_pipeline`.
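
Conceptually, random under-sampling keeps every minority sample and draws an
equally sized random subset of the majority class. The sketch below is a
simplified pure-Python illustration of that idea on toy labels. It is not
the implementation of `RandomUnderSampler`, and the function name
`random_under_sample` is hypothetical.

.. code-block:: Python

    import random
    from collections import Counter


    def random_under_sample(labels, seed=0):
        """Return sorted indices of a class-balanced random subset."""
        rng = random.Random(seed)
        indices_per_class = {}
        for index, label in enumerate(labels):
            indices_per_class.setdefault(label, []).append(index)

        # Every class is reduced to the size of the smallest one.
        n_keep = min(len(indices) for indices in indices_per_class.values())
        selected = []
        for indices in indices_per_class.values():
            # Sampling without replacement within each class.
            selected.extend(rng.sample(indices, n_keep))
        return sorted(selected)


    labels = ["<=50K"] * 300 + [">50K"] * 10
    kept = random_under_sample(labels)
    print(Counter(labels[i] for i in kept))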

.. GENERATED FROM PYTHON SOURCE LINES 153-168

.. code-block:: Python

    from imblearn.pipeline import make_pipeline as make_pipeline_with_sampler
    from imblearn.under_sampling import RandomUnderSampler

    # We extract the preprocessing steps and the estimator from the tabular
    # pipeline and insert the sampler before the estimator.
    lr_clf_undersampled = make_pipeline_with_sampler(
        *lr_clf[:-1], RandomUnderSampler(random_state=42), lr_clf[-1]
    )
    estimators.append(("Under-sampling + Logistic regression", lr_clf_undersampled))

    rf_clf_undersampled = make_pipeline_with_sampler(
        *rf_clf[:-1], RandomUnderSampler(random_state=42), rf_clf[-1]
    )
    estimators.append(("Under-sampling + Random forest", rf_clf_undersampled))








.. GENERATED FROM PYTHON SOURCE LINES 169-177

Use of specific balanced algorithms from imbalanced-learn
.........................................................

Random under-sampling can be effective with decision trees. However, instead
of under-sampling the dataset once, one could under-sample the original
dataset before taking each bootstrap sample. This is the basis of the
:class:`~imblearn.ensemble.BalancedRandomForestClassifier` and
:class:`~imblearn.ensemble.BalancedBaggingClassifier`.
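
The idea can be sketched in plain Python: each member of the ensemble
receives its own balanced subset, so that, taken together, the members see
more of the majority class than a single under-sampled dataset contains.
This is a simplified illustration on toy indices with hypothetical names,
not the implementation used by `imbalanced-learn`.

.. code-block:: Python

    import random

    rng = random.Random(42)
    majority_indices = list(range(1000))  # toy majority class
    minority_indices = list(range(1000, 1050))  # toy minority class
    n_estimators = 10

    subsets = []
    for _ in range(n_estimators):
        # A new random draw of the majority class for every estimator.
        sampled_majority = rng.sample(majority_indices, len(minority_indices))
        subsets.append(sampled_majority + minority_indices)

    # Majority samples used by at least one estimator of the ensemble.
    majority_seen = {i for subset in subsets for i in subset if i < 1000}
    print(f"majority samples in one subset: {len(minority_indices)}")
    print(f"distinct majority samples seen overall: {len(majority_seen)}")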

.. GENERATED FROM PYTHON SOURCE LINES 179-192

.. code-block:: Python

    from imblearn.ensemble import BalancedRandomForestClassifier

    brf_clf = tabular_pipeline(
        BalancedRandomForestClassifier(
            sampling_strategy="all",
            replacement=True,
            bootstrap=False,
            random_state=42,
            n_jobs=2,
        )
    )
    estimators.append(("Balanced random forest", brf_clf))








.. GENERATED FROM PYTHON SOURCE LINES 193-207

.. code-block:: Python

    from sklearn.ensemble import HistGradientBoostingClassifier

    from imblearn.ensemble import BalancedBaggingClassifier

    bag_clf = tabular_pipeline(
        BalancedBaggingClassifier(
            estimator=HistGradientBoostingClassifier(random_state=42),
            n_estimators=10,
            random_state=42,
            n_jobs=2,
        )
    )
    estimators.append(("Balanced bag of histogram gradient boosting", bag_clf))








.. GENERATED FROM PYTHON SOURCE LINES 208-210

Now, we can use `skore.evaluate` to evaluate each estimator with
cross-validation and collect all results in a single dataframe.

.. GENERATED FROM PYTHON SOURCE LINES 212-222

.. code-block:: Python

    import pandas as pd

    results = {}
    for name, est in estimators:
        report = skore.evaluate(est, df_res, y_res, splitter=5, pos_label=pos_label)
        results[name] = report.metrics.summarize().frame()

    df_results = pd.concat(results)
    df_results





.. rst-class:: sphx-glr-script-out

 .. code-block:: none



    /Users/glemaitre/Documents/packages/python-stack/scikit-learn/scikit-learn/sklearn/metrics/_classification.py:1884: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.
      _warn_prf(average, modifier, f"{metric.capitalize()} is", result.shape[0])


.. raw:: html

    <div class="output_subarea output_html rendered_html output_result">
    <div>
    <style scoped>
        .dataframe tbody tr th:only-of-type {
            vertical-align: middle;
        }

        .dataframe tbody tr th {
            vertical-align: top;
        }

        .dataframe thead tr th {
            text-align: left;
        }

        .dataframe thead tr:last-of-type th {
            text-align: right;
        }
    </style>
    <table border="1" class="dataframe">
      <thead>
        <tr>
          <th></th>
          <th></th>
          <th colspan="2" halign="left">DummyClassifier</th>
          <th colspan="2" halign="left">LogisticRegression</th>
          <th colspan="2" halign="left">RandomForestClassifier</th>
          <th colspan="2" halign="left">BalancedRandomForestClassifier</th>
          <th colspan="2" halign="left">BalancedBaggingClassifier</th>
        </tr>
        <tr>
          <th></th>
          <th></th>
          <th>mean</th>
          <th>std</th>
          <th>mean</th>
          <th>std</th>
          <th>mean</th>
          <th>std</th>
          <th>mean</th>
          <th>std</th>
          <th>mean</th>
          <th>std</th>
        </tr>
        <tr>
          <th></th>
          <th>Metric</th>
          <th></th>
          <th></th>
          <th></th>
          <th></th>
          <th></th>
          <th></th>
          <th></th>
          <th></th>
          <th></th>
          <th></th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <th rowspan="5" valign="top">Dummy classifier</th>
          <th>Accuracy</th>
          <td>0.967755</td>
          <td>0.000069</td>
          <td>NaN</td>
          <td>NaN</td>
          <td>NaN</td>
          <td>NaN</td>
          <td>NaN</td>
          <td>NaN</td>
          <td>NaN</td>
          <td>NaN</td>
        </tr>
        <tr>
          <th>Precision</th>
          <td>0.000000</td>
          <td>0.000000</td>
          <td>NaN</td>
          <td>NaN</td>
          <td>NaN</td>
          <td>NaN</td>
          <td>NaN</td>
          <td>NaN</td>
          <td>NaN</td>
          <td>NaN</td>
        </tr>
        <tr>
          <th>Recall</th>
          <td>0.000000</td>
          <td>0.000000</td>
          <td>NaN</td>
          <td>NaN</td>
          <td>NaN</td>
          <td>NaN</td>
          <td>NaN</td>
          <td>NaN</td>
          <td>NaN</td>
          <td>NaN</td>
        </tr>
        <tr>
          <th>ROC AUC</th>
          <td>0.500000</td>
          <td>0.000000</td>
          <td>NaN</td>
          <td>NaN</td>
          <td>NaN</td>
          <td>NaN</td>
          <td>NaN</td>
          <td>NaN</td>
          <td>NaN</td>
          <td>NaN</td>
        </tr>
        <tr>
          <th>Log loss</th>
          <td>1.162244</td>
          <td>0.002488</td>
          <td>NaN</td>
          <td>NaN</td>
          <td>NaN</td>
          <td>NaN</td>
          <td>NaN</td>
          <td>NaN</td>
          <td>NaN</td>
          <td>NaN</td>
        </tr>
        <tr>
          <th>...</th>
          <th>...</th>
          <td>...</td>
          <td>...</td>
          <td>...</td>
          <td>...</td>
          <td>...</td>
          <td>...</td>
          <td>...</td>
          <td>...</td>
          <td>...</td>
          <td>...</td>
        </tr>
        <tr>
          <th rowspan="5" valign="top">Balanced bag of histogram gradient boosting</th>
          <th>ROC AUC</th>
          <td>NaN</td>
          <td>NaN</td>
          <td>NaN</td>
          <td>NaN</td>
          <td>NaN</td>
          <td>NaN</td>
          <td>NaN</td>
          <td>NaN</td>
          <td>0.917346</td>
          <td>0.005819</td>
        </tr>
        <tr>
          <th>Log loss</th>
          <td>NaN</td>
          <td>NaN</td>
          <td>NaN</td>
          <td>NaN</td>
          <td>NaN</td>
          <td>NaN</td>
          <td>NaN</td>
          <td>NaN</td>
          <td>0.309216</td>
          <td>0.004849</td>
        </tr>
        <tr>
          <th>Brier score</th>
          <td>NaN</td>
          <td>NaN</td>
          <td>NaN</td>
          <td>NaN</td>
          <td>NaN</td>
          <td>NaN</td>
          <td>NaN</td>
          <td>NaN</td>
          <td>0.103621</td>
          <td>0.002158</td>
        </tr>
        <tr>
          <th>Fit time (s)</th>
          <td>NaN</td>
          <td>NaN</td>
          <td>NaN</td>
          <td>NaN</td>
          <td>NaN</td>
          <td>NaN</td>
          <td>NaN</td>
          <td>NaN</td>
          <td>6.692415</td>
          <td>1.312802</td>
        </tr>
        <tr>
          <th>Predict time (s)</th>
          <td>NaN</td>
          <td>NaN</td>
          <td>NaN</td>
          <td>NaN</td>
          <td>NaN</td>
          <td>NaN</td>
          <td>NaN</td>
          <td>NaN</td>
          <td>0.441978</td>
          <td>0.007606</td>
        </tr>
      </tbody>
    </table>
    <p>72 rows × 10 columns</p>
    </div>
    </div>
    <br />
    <br />

.. GENERATED FROM PYTHON SOURCE LINES 223-226

This last approach is the most effective. The different under-sampled subsets
bring diversity to the individual gradient-boosted decision tree (GBDT)
models, which therefore do not all focus on the same portion of the majority
class.


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (2 minutes 0.982 seconds)

**Estimated memory usage:**  755 MB


.. _sphx_glr_download_auto_examples_applications_plot_impact_imbalanced_classes.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_impact_imbalanced_classes.ipynb <plot_impact_imbalanced_classes.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_impact_imbalanced_classes.py <plot_impact_imbalanced_classes.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: plot_impact_imbalanced_classes.zip <plot_impact_imbalanced_classes.zip>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_
