Example of topic classification in text documents#
This example shows how to balance text data before training a classifier.
Note that in this example the data are only slightly imbalanced; in other data sets, the imbalance ratio can be much larger.
# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>
# License: MIT
print(__doc__)
Setting the data set#
We use part of the 20 newsgroups data set by loading 4 topics. The scikit-learn loader provides the data already split into a training set and a testing set.
Note that class #3 is the minority class: it has roughly two-thirds as many samples as the majority class (377 versus 593 in the training set).
from sklearn.datasets import fetch_20newsgroups
categories = [
"alt.atheism",
"talk.religion.misc",
"comp.graphics",
"sci.space",
]
newsgroups_train = fetch_20newsgroups(subset="train", categories=categories)
newsgroups_test = fetch_20newsgroups(subset="test", categories=categories)
import numpy as np
X_train = np.array(newsgroups_train.data)
X_test = np.array(newsgroups_test.data)
y_train = newsgroups_train.target
y_test = newsgroups_test.target
Training class distributions summary: Counter({np.int64(2): 593, np.int64(1): 584, np.int64(0): 480, np.int64(3): 377})
Test class distributions summary: Counter({np.int64(2): 394, np.int64(1): 389, np.int64(0): 319, np.int64(3): 251})
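The summaries above can be reproduced with `collections.Counter`. The following is a minimal sketch, not the example's own code: to stay runnable offline, it rebuilds a label list from the training counts reported above instead of downloading the data, and the names `reported_train_counts` and `y_train_like` are illustrative assumptions.

```python
from collections import Counter

# Assumption: labels rebuilt from the training counts reported above,
# standing in for newsgroups_train.target so no download is needed.
reported_train_counts = {0: 480, 1: 584, 2: 593, 3: 377}
y_train_like = [
    label for label, n in reported_train_counts.items() for _ in range(n)
]

distribution = Counter(y_train_like)
majority = max(distribution.values())
minority = min(distribution.values())
imbalance_ratio = majority / minority

print(f"Training class distributions summary: {distribution}")
print(f"Imbalance ratio (majority / minority): {imbalance_ratio:.2f}")
```

With the real data, `Counter(y_train)` can be used directly in place of the rebuilt list.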
The usual scikit-learn pipeline#
A common approach is a scikit-learn pipeline that chains a TF-IDF vectorizer with a multinomial naive Bayes classifier. A classification report summarizes the results on the testing set.
As expected, the recall of class #3 is low, mainly due to the class imbalance.
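The pipeline described above can be sketched as follows. This is an illustration under stated assumptions rather than the example's original code: the tiny corpus, its labels, and the names `docs_train`, `docs_test`, and `baseline` are hypothetical stand-ins for the newsgroups data, chosen so the block runs without a download.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for the newsgroups documents.
docs_train = [
    "the rocket launch reached orbit",
    "nasa plans a new moon mission",
    "the satellite orbit was adjusted",
    "rendering polygons with the graphics card",
    "image rendering and 3d graphics software",
    "graphics drivers for image display",
]
labels_train = [1, 1, 1, 0, 0, 0]  # 1 = space, 0 = graphics

# TF-IDF features feed a multinomial naive Bayes classifier.
baseline = make_pipeline(TfidfVectorizer(), MultinomialNB())
baseline.fit(docs_train, labels_train)

docs_test = ["a mission to orbit the moon", "graphics rendering software"]
predictions = baseline.predict(docs_test)
print(predictions)
```

On the real data, the same two-step pipeline would be fitted on `X_train`, `y_train` and evaluated on `X_test`, `y_test`.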
Pipeline(steps=[('tfidfvectorizer', TfidfVectorizer()),
                ('multinomialnb', MultinomialNB())])
| Metric | Label / Average | MultinomialNB |
|---|---|---|
| Accuracy | | 0.837398 |
| Precision | 0 | 0.674888 |
| | 1 | 0.964865 |
| | 2 | 0.867117 |
| | 3 | 0.967742 |
| Recall | 0 | 0.943574 |
| | 1 | 0.917738 |
| | 2 | 0.977157 |
| | 3 | 0.358566 |
| ROC AUC | 0 | 0.960087 |
| | 1 | 0.992411 |
| | 2 | 0.993738 |
| | 3 | 0.944064 |
| Log loss | | 0.536984 |
| Fit time (s) | | NaN |
| Predict time (s) | | 0.288406 |
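To see how a class can combine high precision with low recall, as class #3 does in the report above, per-class precision and recall can be computed by hand. The sketch below uses plain Python and hypothetical labels; the helper `precision_recall` is an illustrative assumption and not part of scikit-learn or imbalanced-learn.

```python
def precision_recall(y_true, y_pred, label):
    """Return (precision, recall) for a single class label."""
    true_pos = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    predicted = sum(p == label for p in y_pred)  # times the label was predicted
    actual = sum(t == label for t in y_true)     # times the label was correct
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall


# Hypothetical labels: class 3 is rarely predicted, but always correctly.
y_true_toy = [0, 0, 0, 3, 3, 3, 3, 3]
y_pred_toy = [0, 0, 0, 0, 0, 0, 3, 3]

p3, r3 = precision_recall(y_true_toy, y_pred_toy, 3)
p0, r0 = precision_recall(y_true_toy, y_pred_toy, 0)
print(f"class 3: precision={p3:.2f}, recall={r3:.2f}")
print(f"class 0: precision={p0:.2f}, recall={r0:.2f}")
```

The samples of class 3 that are missed are absorbed by class 0, which lowers the precision of class 0 while its recall stays high. The report above shows the same pattern for classes #0 and #3.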
Balancing the class before classification#
To improve the prediction of class #3, we can balance the classes before training the naive Bayes classifier. To do so, we use a RandomUnderSampler to equalize the number of samples in all the classes before training.
It is also important to note that we use the make_pipeline function implemented in imbalanced-learn, which properly handles the samplers.
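from imblearn.pipeline import make_pipeline as make_pipeline_imb
from imblearn.under_sampling import RandomUnderSampler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
# Vectorize the text, under-sample the classes, then train the classifier.
model = make_pipeline_imb(TfidfVectorizer(), RandomUnderSampler(), MultinomialNB())
model.fit(X_train, y_train)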
Pipeline(steps=[('tfidfvectorizer', TfidfVectorizer()),
                ('randomundersampler', RandomUnderSampler()),
                ('multinomialnb', MultinomialNB())])
RandomUnderSampler parameters
| Name | Value |
|---|---|
| sampling_strategy | 'auto' |
| random_state | None |
| replacement | False |
RandomUnderSampler fitted attributes
| Name | Type | Value |
|---|---|---|
| n_features_in_ | int | 34118 |
| sample_indices_ | ndarray[int64](1508,) | [1360,1543, 493,...,2011,2012,2014] |
| sampling_strategy_ | OrderedDict | OrderedDict({...p.int64(377)}) |
Although the overall scores are close, resampling corrects the poor recall of class #3 (from 0.36 to 0.73) at the cost of lower recall for the other classes and lower precision for class #3. The overall accuracy is slightly better (0.851 versus 0.837), while the log loss is slightly worse (0.630 versus 0.537).
| Metric | Label / Average | MultinomialNB |
|---|---|---|
| Accuracy | | 0.850702 |
| Precision | 0 | 0.697561 |
| | 1 | 0.979351 |
| | 2 | 0.945946 |
| | 3 | 0.782051 |
| Recall | 0 | 0.896552 |
| | 1 | 0.853470 |
| | 2 | 0.888325 |
| | 3 | 0.729084 |
| ROC AUC | 0 | 0.962858 |
| | 1 | 0.987576 |
| | 2 | 0.989366 |
| | 3 | 0.938616 |
| Log loss | | 0.629846 |
| Fit time (s) | | NaN |
| Predict time (s) | | 0.273553 |
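One way to summarize the trade-off in a single number is the macro-averaged recall, the unweighted mean of the per-class recalls. The sketch below simply averages the recall values reported in the two tables. Macro recall is a summary added here for illustration; the reports themselves do not include it.

```python
# Per-class recall values copied from the two reports above.
recall_baseline = {0: 0.943574, 1: 0.917738, 2: 0.977157, 3: 0.358566}
recall_resampled = {0: 0.896552, 1: 0.853470, 2: 0.888325, 3: 0.729084}


def macro_average(per_class):
    """Unweighted mean over classes, so each class counts equally."""
    return sum(per_class.values()) / len(per_class)


macro_baseline = macro_average(recall_baseline)
macro_resampled = macro_average(recall_resampled)

print(f"Macro recall without resampling: {macro_baseline:.3f}")
print(f"Macro recall with under-sampling: {macro_resampled:.3f}")
```

Because every class has the same weight in this average, the large gain on class #3 outweighs the smaller losses on the other three classes.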
Total running time of the script: (0 minutes 24.071 seconds)
Estimated memory usage: 1033 MB