Example of topic classification in text documents#

This example shows how to balance text data before training a classifier.

Note that the data in this example are only slightly imbalanced; in other data sets, the imbalance ratio can be much more significant.

# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>
# License: MIT
print(__doc__)

Setting the data set#

We use part of the 20 newsgroups data set by loading 4 topics. The scikit-learn loader provides the data already split into a training set and a testing set.

Note that class #3 is the minority class: it has about 36% fewer training samples than the majority class (377 versus 593).

from sklearn.datasets import fetch_20newsgroups

categories = [
    "alt.atheism",
    "talk.religion.misc",
    "comp.graphics",
    "sci.space",
]
newsgroups_train = fetch_20newsgroups(subset="train", categories=categories)
newsgroups_test = fetch_20newsgroups(subset="test", categories=categories)

import numpy as np

X_train = np.array(newsgroups_train.data)
X_test = np.array(newsgroups_test.data)

y_train = newsgroups_train.target
y_test = newsgroups_test.target
from collections import Counter

print(f"Training class distributions summary: {Counter(y_train)}")
print(f"Test class distributions summary: {Counter(y_test)}")
Training class distributions summary: Counter({np.int64(2): 593, np.int64(1): 584, np.int64(0): 480, np.int64(3): 377})
Test class distributions summary: Counter({np.int64(2): 394, np.int64(1): 389, np.int64(0): 319, np.int64(3): 251})
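
The class counts printed above can be turned into a single imbalance ratio. The following is a minimal sketch using only the standard library; the counts are copied by hand from the training summary, and the variable names are illustrative rather than part of any library API.

```python
from collections import Counter

# Training class counts copied from the summary printed above.
train_counts = Counter({2: 593, 1: 584, 0: 480, 3: 377})

# Largest and smallest classes.
majority_label, majority_count = train_counts.most_common(1)[0]
minority_label, minority_count = min(train_counts.items(), key=lambda kv: kv[1])

# Ratio of majority to minority class size; 1.0 would mean perfectly balanced.
imbalance_ratio = majority_count / minority_count

print(f"majority class: {majority_label} ({majority_count} samples)")
print(f"minority class: {minority_label} ({minority_count} samples)")
print(f"imbalance ratio: {imbalance_ratio:.2f}")
```

A ratio close to 1.6 confirms that the imbalance here is mild compared with many real-world data sets.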

The usual scikit-learn pipeline#

A typical scikit-learn approach is a pipeline that combines a TF-IDF vectorizer with a multinomial naive Bayes classifier. A classification report summarizes the results on the testing set.

As expected, the recall of class #3 is low, mainly due to the class imbalance.

import skore
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(X_train, y_train)
Pipeline(steps=[('tfidfvectorizer', TfidfVectorizer()),
                ('multinomialnb', MultinomialNB())])


report = skore.evaluate(model, X_test, y_test, splitter="prefit")
report.metrics.summarize().frame()
Metric            Label / Average   MultinomialNB
Accuracy          -                 0.837398
Precision         0                 0.674888
Precision         1                 0.964865
Precision         2                 0.867117
Precision         3                 0.967742
Recall            0                 0.943574
Recall            1                 0.917738
Recall            2                 0.977157
Recall            3                 0.358566
ROC AUC           0                 0.960087
ROC AUC           1                 0.992411
ROC AUC           2                 0.993738
ROC AUC           3                 0.944064
Log loss          -                 0.536984
Fit time (s)      -                 NaN
Predict time (s)  -                 0.288406
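
To make the pattern in this report concrete (very high precision but low recall for class #3), here is a small sketch that computes per-class precision and recall from raw label lists. The helper name `precision_recall_per_class` and the toy labels are hypothetical and chosen only for illustration; they are not taken from the data set.

```python
from collections import Counter


def precision_recall_per_class(y_true, y_pred):
    """Return {label: (precision, recall)} computed from two label lists."""
    true_counts = Counter(y_true)   # how many samples truly belong to each class
    pred_counts = Counter(y_pred)   # how many samples were predicted as each class
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # correct predictions

    scores = {}
    for label in sorted(true_counts):
        precision = hits[label] / pred_counts[label] if pred_counts[label] else 0.0
        recall = hits[label] / true_counts[label]
        scores[label] = (precision, recall)
    return scores


# Toy labels: the classifier sends most class-3 samples to class 0.
y_true = [0, 0, 1, 1, 3, 3, 3, 3]
y_pred = [0, 0, 1, 1, 3, 0, 0, 0]

scores = precision_recall_per_class(y_true, y_pred)
for label, (precision, recall) in scores.items():
    print(f"class {label}: precision={precision:.2f} recall={recall:.2f}")
```

In this toy case, class 3 is rarely predicted, so its few predictions are correct (high precision) while most of its samples are missed (low recall), and the class that absorbs the errors loses precision.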


Balancing the class before classification#

To improve the predictions for class #3, we can balance the classes before training the naive Bayes classifier. We use a RandomUnderSampler to equalize the number of samples in all classes before training.

Note that we use the make_pipeline function implemented in imbalanced-learn so that the sampler is handled properly: resampling is applied when fitting and skipped when predicting.

from imblearn.pipeline import make_pipeline as make_pipeline_imb
from imblearn.under_sampling import RandomUnderSampler

model = make_pipeline_imb(TfidfVectorizer(), RandomUnderSampler(), MultinomialNB())

model.fit(X_train, y_train)
Pipeline(steps=[('tfidfvectorizer', TfidfVectorizer()),
                ('randomundersampler', RandomUnderSampler()),
                ('multinomialnb', MultinomialNB())])
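
To illustrate what the sampler step does during fitting, here is a simplified pure-Python sketch of random under-sampling: every class is randomly reduced, without replacement, to the size of the smallest class. The function `random_undersample` is a hypothetical helper written for this illustration, not the imbalanced-learn implementation, and the placeholder documents only stand in for the real texts.

```python
import random
from collections import Counter, defaultdict


def random_undersample(X, y, seed=0):
    """Randomly drop samples so every class matches the smallest class size."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    indices_by_class = defaultdict(list)
    for idx, label in enumerate(y):
        indices_by_class[label].append(idx)

    # Target size is the size of the minority class.
    n_min = min(len(indices) for indices in indices_by_class.values())

    # Sample n_min indices per class without replacement.
    kept = []
    for indices in indices_by_class.values():
        kept.extend(rng.sample(indices, n_min))
    kept.sort()  # preserve the original sample order

    return [X[i] for i in kept], [y[i] for i in kept]


# Placeholder data with the same class counts as the training set above.
y = [0] * 480 + [1] * 584 + [2] * 593 + [3] * 377
X = [f"doc-{i}" for i in range(len(y))]

X_res, y_res = random_undersample(X, y)
print(Counter(y_res))
```

After resampling, each class contributes 377 samples, so the classifier is trained on fewer documents overall but no class dominates.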


Although the overall results are similar, resampling corrects the poor recall of class #3 at the cost of lower recall for the other classes and lower precision for class #3. Overall, the accuracy is slightly better.

report = skore.evaluate(model, X_test, y_test, splitter="prefit")
report.metrics.summarize().frame()
Metric            Label / Average   MultinomialNB
Accuracy          -                 0.850702
Precision         0                 0.697561
Precision         1                 0.979351
Precision         2                 0.945946
Precision         3                 0.782051
Recall            0                 0.896552
Recall            1                 0.853470
Recall            2                 0.888325
Recall            3                 0.729084
ROC AUC           0                 0.962858
ROC AUC           1                 0.987576
ROC AUC           2                 0.989366
ROC AUC           3                 0.938616
Log loss          -                 0.629846
Fit time (s)      -                 NaN
Predict time (s)  -                 0.273553
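
The trade-off between the two reports can be quantified with a short comparison. This sketch copies the per-class recall values by hand from the two tables on this page and computes the change per class plus the macro-averaged recall; because the under-sampling is random, the exact values may differ from run to run.

```python
# Per-class recall copied from the two reports on this page.
recall_before = {0: 0.943574, 1: 0.917738, 2: 0.977157, 3: 0.358566}
recall_after = {0: 0.896552, 1: 0.853470, 2: 0.888325, 3: 0.729084}

# Change in recall for each class after under-sampling.
recall_change = {
    label: recall_after[label] - recall_before[label] for label in recall_before
}

# Macro average: unweighted mean over classes, so the minority class counts equally.
macro_before = sum(recall_before.values()) / len(recall_before)
macro_after = sum(recall_after.values()) / len(recall_after)

for label, change in recall_change.items():
    print(f"class {label}: recall change {change:+.3f}")
print(f"macro recall: {macro_before:.3f} -> {macro_after:.3f}")
```

The minority class gains far more recall than the other classes lose, which is why the macro-averaged recall improves even though three of the four classes get slightly worse.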

