
.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/model_selection/plot_instance_hardness_cv.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_model_selection_plot_instance_hardness_cv.py>`
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_model_selection_plot_instance_hardness_cv.py:


====================================================
Distribute hard-to-classify datapoints over CV folds
====================================================

'Instance hardness' refers to how difficult it is to classify an instance. The way
hard-to-classify instances are distributed over train and test sets has a
significant effect on the test set performance metrics. In this example, we
show how to deal with this problem and compare the result with the standard
:class:`~sklearn.model_selection.StratifiedKFold` cross-validation splitter.
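
As a rough illustration, instance hardness can be estimated as one minus the
out-of-fold predicted probability of the true class. The sketch below assumes
this definition; the toy data and the names ``X_demo``, ``y_demo`` and
``hardness`` are illustrative and not part of the example that follows.

.. code-block:: Python

    import numpy as np
    from sklearn.datasets import make_blobs
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_predict

    # Small illustrative dataset (assumption: not the data used later on).
    X_demo, y_demo = make_blobs(
        n_samples=[200, 20], centers=((-3, 0), (3, 0)), random_state=0
    )

    # Out-of-fold class probabilities for every sample.
    proba = cross_val_predict(
        LogisticRegression(), X_demo, y_demo, cv=5, method="predict_proba"
    )

    # Probability assigned to the true class of each sample.
    proba_true = proba[np.arange(len(y_demo)), y_demo]

    # Assumed hardness definition: low true-class probability means "hard".
    hardness = 1 - proba_true

    # Indices of the five hardest samples under this definition.
    hardest = np.argsort(hardness)[::-1][:5]
    print(hardest, hardness[hardest])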

.. GENERATED FROM PYTHON SOURCE LINES 12-16

.. code-block:: Python


    # Authors: Frits Hermans, https://fritshermans.github.io
    # License: MIT








.. GENERATED FROM PYTHON SOURCE LINES 17-19

.. code-block:: Python

    print(__doc__)








.. GENERATED FROM PYTHON SOURCE LINES 20-26

Create an imbalanced dataset with instance hardness
---------------------------------------------------

We create an imbalanced dataset using scikit-learn's
:func:`~sklearn.datasets.make_blobs` function, with 950 samples in the majority
class and 50 in the minority class (5% of the data).

.. GENERATED FROM PYTHON SOURCE LINES 26-33

.. code-block:: Python

    import numpy as np
    from matplotlib import pyplot as plt
    from sklearn.datasets import make_blobs

    X, y = make_blobs(n_samples=[950, 50], centers=((-3, 0), (3, 0)), random_state=10)
    _ = plt.scatter(X[:, 0], X[:, 1], c=y)




.. image-sg:: /auto_examples/model_selection/images/sphx_glr_plot_instance_hardness_cv_001.png
   :alt: plot instance hardness cv
   :srcset: /auto_examples/model_selection/images/sphx_glr_plot_instance_hardness_cv_001.png
   :class: sphx-glr-single-img





.. GENERATED FROM PYTHON SOURCE LINES 34-35

To introduce instance hardness in our dataset, we add 10 hard-to-classify samples
whose labels are swapped relative to the two class centers:

.. GENERATED FROM PYTHON SOURCE LINES 35-41

.. code-block:: Python

    X_hard, y_hard = make_blobs(
        n_samples=10, centers=((3, 0), (-3, 0)), cluster_std=1, random_state=10
    )
    X, y = np.vstack((X, X_hard)), np.hstack((y, y_hard))
    _ = plt.scatter(X[:, 0], X[:, 1], c=y)




.. image-sg:: /auto_examples/model_selection/images/sphx_glr_plot_instance_hardness_cv_002.png
   :alt: plot instance hardness cv
   :srcset: /auto_examples/model_selection/images/sphx_glr_plot_instance_hardness_cv_002.png
   :class: sphx-glr-single-img





.. GENERATED FROM PYTHON SOURCE LINES 42-60

Compare cross validation scores using `StratifiedKFold` and `InstanceHardnessCV`
--------------------------------------------------------------------------------

Now, we want to assess a linear predictive model, so we use cross-validation.
The key idea of cross-validation is to create training and test splits that are
representative of the data seen in production, so that the resulting statistics
reflect what one can expect in production.

By applying a standard :class:`~sklearn.model_selection.StratifiedKFold`
cross-validation splitter, we do not control in which fold the hard-to-classify
samples will be.
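
To see this, one can count how many of the injected samples land in each test
fold. The sketch below rebuilds the same data and relies on the injected samples
being the last 10 rows; the names ``X_all``, ``y_all`` and ``is_injected`` are
illustrative.

.. code-block:: Python

    import numpy as np
    from sklearn.datasets import make_blobs
    from sklearn.model_selection import StratifiedKFold

    X_easy, y_easy = make_blobs(
        n_samples=[950, 50], centers=((-3, 0), (3, 0)), random_state=10
    )
    X_hard, y_hard = make_blobs(
        n_samples=10, centers=((3, 0), (-3, 0)), cluster_std=1, random_state=10
    )
    X_all = np.vstack((X_easy, X_hard))
    y_all = np.hstack((y_easy, y_hard))

    # The injected hard samples occupy the last rows of the stacked arrays.
    is_injected = np.arange(len(y_all)) >= len(y_easy)

    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=10)
    counts = [
        int(is_injected[test_idx].sum()) for _, test_idx in skf.split(X_all, y_all)
    ]
    # Number of injected samples in each of the five test folds.
    print(counts)

Stratification only balances the class labels, so nothing forces these counts to
be equal across folds.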

The :class:`~imblearn.model_selection.InstanceHardnessCV` splitter makes it
possible to control the distribution of the hard-to-classify samples over the folds.
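
One plausible way to achieve such control is to sort the samples by hardness and
deal them into folds in round-robin order. The sketch below illustrates that idea
only; it is not taken from the imbalanced-learn implementation, it ignores class
stratification, and ``assign_folds_by_hardness`` is a hypothetical helper.

.. code-block:: Python

    import numpy as np


    def assign_folds_by_hardness(hardness, n_splits=5):
        """Hypothetical helper: deal samples into folds by decreasing hardness."""
        hardness = np.asarray(hardness)
        # Sample indices ordered from hardest to easiest.
        order = np.argsort(hardness)[::-1]
        folds = np.empty(len(hardness), dtype=int)
        # The k-th hardest sample goes to fold k modulo n_splits.
        folds[order] = np.arange(len(hardness)) % n_splits
        return folds


    # Made-up hardness values, for illustration only.
    hardness_demo = np.array([0.9, 0.1, 0.8, 0.05, 0.7, 0.2, 0.95, 0.6, 0.85, 0.3])
    folds = assign_folds_by_hardness(hardness_demo, n_splits=5)
    print(folds)

With this assignment, the five hardest samples end up in five different folds
instead of clustering in one of them.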

Let's run an experiment to compare the results obtained with both splitters.
We use a :class:`~sklearn.linear_model.LogisticRegression` classifier and
`skore.evaluate` to compute the cross-validation scores.

.. GENERATED FROM PYTHON SOURCE LINES 60-77

.. code-block:: Python

    import skore
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import StratifiedKFold

    from imblearn.model_selection import InstanceHardnessCV

    logistic_regression = LogisticRegression()

    splitters = {
        "StratifiedKFold": StratifiedKFold(n_splits=5, shuffle=True, random_state=10),
        "InstanceHardnessCV": InstanceHardnessCV(estimator=LogisticRegression()),
    }

    reports = {}
    for name, cv in splitters.items():
        reports[name] = skore.evaluate(logistic_regression, X, y, splitter=cv)








.. GENERATED FROM PYTHON SOURCE LINES 78-87

.. code-block:: Python

    import pandas as pd

    results = {}
    for name, report in reports.items():
        scores = report.metrics.summarize().frame()
        results[name] = scores
    results = pd.concat(results)
    results






.. raw:: html

    <div class="output_subarea output_html rendered_html output_result">
    <div>
    <style scoped>
        .dataframe tbody tr th:only-of-type {
            vertical-align: middle;
        }

        .dataframe tbody tr th {
            vertical-align: top;
        }

        .dataframe thead tr th {
            text-align: left;
        }

        .dataframe thead tr:last-of-type th {
            text-align: right;
        }
    </style>
    <table border="1" class="dataframe">
      <thead>
        <tr>
          <th></th>
          <th></th>
          <th></th>
          <th colspan="2" halign="left">LogisticRegression</th>
        </tr>
        <tr>
          <th></th>
          <th></th>
          <th></th>
          <th>mean</th>
          <th>std</th>
        </tr>
        <tr>
          <th></th>
          <th>Metric</th>
          <th>Label / Average</th>
          <th></th>
          <th></th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <th rowspan="10" valign="top">StratifiedKFold</th>
          <th>Accuracy</th>
          <th></th>
          <td>0.987129</td>
          <td>0.005644</td>
        </tr>
        <tr>
          <th rowspan="2" valign="top">Precision</th>
          <th>0</th>
          <td>0.991742</td>
          <td>0.008597</td>
        </tr>
        <tr>
          <th>1</th>
          <td>0.920476</td>
          <td>0.088378</td>
        </tr>
        <tr>
          <th rowspan="2" valign="top">Recall</th>
          <th>0</th>
          <td>0.994764</td>
          <td>0.006412</td>
        </tr>
        <tr>
          <th>1</th>
          <td>0.854545</td>
          <td>0.152120</td>
        </tr>
        <tr>
          <th>ROC AUC</th>
          <th></th>
          <td>0.954498</td>
          <td>0.078860</td>
        </tr>
        <tr>
          <th>Log loss</th>
          <th></th>
          <td>0.059978</td>
          <td>0.035528</td>
        </tr>
        <tr>
          <th>Brier score</th>
          <th></th>
          <td>0.011214</td>
          <td>0.004682</td>
        </tr>
        <tr>
          <th>Fit time (s)</th>
          <th></th>
          <td>0.002103</td>
          <td>0.001512</td>
        </tr>
        <tr>
          <th>Predict time (s)</th>
          <th></th>
          <td>0.000129</td>
          <td>0.000136</td>
        </tr>
        <tr>
          <th rowspan="10" valign="top">InstanceHardnessCV</th>
          <th>Accuracy</th>
          <th></th>
          <td>0.988119</td>
          <td>0.002711</td>
        </tr>
        <tr>
          <th rowspan="2" valign="top">Precision</th>
          <th>0</th>
          <td>0.992692</td>
          <td>0.002838</td>
        </tr>
        <tr>
          <th>1</th>
          <td>0.905455</td>
          <td>0.004979</td>
        </tr>
        <tr>
          <th rowspan="2" valign="top">Recall</th>
          <th>0</th>
          <td>0.994764</td>
          <td>0.000000</td>
        </tr>
        <tr>
          <th>1</th>
          <td>0.872727</td>
          <td>0.049793</td>
        </tr>
        <tr>
          <th>ROC AUC</th>
          <th></th>
          <td>0.951404</td>
          <td>0.028972</td>
        </tr>
        <tr>
          <th>Log loss</th>
          <th></th>
          <td>0.056310</td>
          <td>0.004036</td>
        </tr>
        <tr>
          <th>Brier score</th>
          <th></th>
          <td>0.010882</td>
          <td>0.000819</td>
        </tr>
        <tr>
          <th>Fit time (s)</th>
          <th></th>
          <td>0.001204</td>
          <td>0.000086</td>
        </tr>
        <tr>
          <th>Predict time (s)</th>
          <th></th>
          <td>0.000063</td>
          <td>0.000004</td>
        </tr>
      </tbody>
    </table>
    </div>
    </div>
    <br />
    <br />

.. GENERATED FROM PYTHON SOURCE LINES 88-93

The :class:`~imblearn.model_selection.InstanceHardnessCV`
splitter yields a lower standard deviation across folds than the
:class:`~sklearn.model_selection.StratifiedKFold` splitter for every reported
metric; for example, the standard deviation of the precision for class 1 drops
from about 0.088 to about 0.005. When doing hyperparameter tuning or feature
selection using a wrapper method (like
:class:`~sklearn.feature_selection.RFECV`), this will give more stable results.
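
The fold-to-fold spread can also be inspected with scikit-learn alone. The sketch
below assumes that the ``mean`` and ``std`` columns summarize per-fold scores and
that the standard deviation uses ``ddof=1``; both are assumptions about the
report, and ``X_all`` and ``y_all`` are illustrative names for the same data.

.. code-block:: Python

    import numpy as np
    from sklearn.datasets import make_blobs
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import StratifiedKFold, cross_validate

    X_easy, y_easy = make_blobs(
        n_samples=[950, 50], centers=((-3, 0), (3, 0)), random_state=10
    )
    X_hard, y_hard = make_blobs(
        n_samples=10, centers=((3, 0), (-3, 0)), cluster_std=1, random_state=10
    )
    X_all = np.vstack((X_easy, X_hard))
    y_all = np.hstack((y_easy, y_hard))

    cv_results = cross_validate(
        LogisticRegression(),
        X_all,
        y_all,
        cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=10),
        scoring=["precision", "recall"],  # scored for the positive class (label 1)
    )

    for metric in ("precision", "recall"):
        fold_scores = cv_results[f"test_{metric}"]
        # ddof=1 is an assumption about how the summary computes "std".
        print(metric, fold_scores.mean(), fold_scores.std(ddof=1))

Passing a different splitter through ``cv`` allows the same per-fold comparison
for any cross-validation strategy.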


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 17.899 seconds)

**Estimated memory usage:**  286 MB


.. _sphx_glr_download_auto_examples_model_selection_plot_instance_hardness_cv.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_instance_hardness_cv.ipynb <plot_instance_hardness_cv.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_instance_hardness_cv.py <plot_instance_hardness_cv.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: plot_instance_hardness_cv.zip <plot_instance_hardness_cv.zip>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_
