Clustering-based over-sampling¶
A practical guide¶
One way to fight the imbalanced learning problem is to generate new samples for the under-represented classes. Many algorithms have been proposed for this task, but they tend to generate unnecessary noise and to ignore the within-class imbalance problem. The cluster-over-sampling package extends the functionality of imbalanced-learn’s over-samplers by introducing the clusterer and distributor parameters:
>>> from collections import Counter
>>> from sklearn.datasets import make_classification
>>> from sklearn.cluster import KMeans
>>> from clover.over_sampling import SMOTE
>>> from clover.distribution import DensityDistributor
>>> X, y = make_classification(n_classes=3, weights=[0.10, 0.10, 0.80], random_state=0, n_informative=10)
>>> kmeans_smote = SMOTE(clusterer=KMeans(random_state=1), distributor=DensityDistributor())
>>> X_resampled, y_resampled = kmeans_smote.fit_resample(X, y)
>>> print(sorted(Counter(y_resampled).items()))
[(0, 80), (1, 80), (2, 80)]
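For comparison, the class counts before resampling can be inspected in the same way. With the weights above, make_classification draws roughly 10, 10 and 80 samples per class; the exact counts depend on its label-flipping step, so no output is asserted here:
>>> print(sorted(Counter(y).items()))  # doctest: +SKIP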
The augmented data set should be used instead of the original data set to train a classifier:
>>> from sklearn.tree import DecisionTreeClassifier
>>> clf = DecisionTreeClassifier()
>>> clf.fit(X_resampled, y_resampled)  # doctest: +ELLIPSIS
DecisionTreeClassifier(...)
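Note that resampling should be applied to the training data only, since resampling before a train/test split leaks synthetic information into the evaluation set. A minimal sketch, assuming clover’s over-samplers plug into imbalanced-learn’s Pipeline like any other sampler:
>>> from imblearn.pipeline import Pipeline
>>> from sklearn.model_selection import cross_val_score
>>> pipeline = Pipeline([('sampler', kmeans_smote), ('classifier', DecisionTreeClassifier())])
>>> scores = cross_val_score(pipeline, X, y, cv=5)  # resampling runs inside each training fold only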
Parameter clusterer¶
The parameter clusterer is used to define the clustering algorithm that is applied to the input matrix. All of scikit-learn’s clusterers are supported. For example, if we select SMOTE [CBHK2002] as the over-sampler and KMeans as the clustering algorithm, then the clustering-based over-sampling algorithm described in [DB2018] is created:
>>> from collections import Counter
>>> from sklearn.datasets import make_classification
>>> from sklearn.cluster import KMeans
>>> from clover.over_sampling import SMOTE
>>> from clover.distribution import DensityDistributor
>>> X, y = make_classification(n_classes=3, weights=[0.10, 0.10, 0.80], random_state=0, n_informative=10)
>>> kmeans_smote = SMOTE(clusterer=KMeans(random_state=2), distributor=DensityDistributor(), random_state=3)
>>> X_resampled, y_resampled = kmeans_smote.fit_resample(X, y)
>>> print(sorted(Counter(y_resampled).items()))
[(0, 80), (1, 80), (2, 80)]
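The clusterer’s own hyperparameters are passed as usual; for instance, the number of clusters can be set explicitly. The value below is illustrative only, and extreme values may leave the density filter with no eligible clusters:
>>> kmeans_smote = SMOTE(clusterer=KMeans(n_clusters=10, random_state=2), distributor=DensityDistributor(), random_state=3)
>>> X_resampled, y_resampled = kmeans_smote.fit_resample(X, y)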
Similarly, any other combination of an over-sampler and clusterer can be selected:
>>> from clover.over_sampling import RandomOverSampler
>>> from sklearn.cluster import AffinityPropagation
>>> affinity_ros = RandomOverSampler(clusterer=AffinityPropagation(), distributor=DensityDistributor(), random_state=4)
>>> X_resampled, y_resampled = affinity_ros.fit_resample(X, y)
>>> print(sorted(Counter(y_resampled).items()))
[(0, 80), (1, 80), (2, 80)]
Additionally, if the clusterer supports a neighboring structure for the clusters through a neighbors_ attribute, then it can be used to generate inter-cluster artificial data, as suggested in [DB2017].
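A self-organizing map provides such a topology. A sketch, assuming the som-learn package is installed and that its SOM clusterer exposes the neighbors_ attribute expected here:
>>> from somlearn import SOM  # assumed import, provided by the som-learn package
>>> som_smote = SMOTE(clusterer=SOM(random_state=6), distributor=DensityDistributor(), random_state=7)
>>> X_resampled, y_resampled = som_smote.fit_resample(X, y)  # generates intra- and inter-cluster samples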
Parameter distributor¶
The parameter distributor is used to define the distribution of the generated samples to the clusters. The class DensityDistributor is provided, but any other distributor can be defined by extending the BaseDistributor class:
>>> distributor = DensityDistributor()
>>> clusterer = KMeans(n_clusters=6, random_state=1).fit(X, y)
>>> labels = clusterer.labels_
>>> intra_distribution, inter_distribution = distributor.fit_distribute(X, y, labels, neighbors=None)
>>> print(distributor.filtered_clusters_)
[(3, 0), (3, 1)]
>>> print(distributor.clusters_density_)
{(3, 0): 6.0, (3, 1): 6.0}
>>> print(intra_distribution)
{(3, 0): 1.0, (3, 1): 1.0}
>>> print(inter_distribution)
{}
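As an illustration of the extension point, the sketch below defines a toy distributor that assigns equal intra-cluster weight to every cluster containing a minority class. The import path of BaseDistributor and the choice of fit_distribute as the override hook are assumptions, so check the clover source for the intended subclassing contract:
>>> import numpy as np
>>> from clover.distribution import BaseDistributor  # import path assumed
>>> class UniformDistributor(BaseDistributor):
...     """Toy distributor: equal weight to each cluster of a minority class."""
...     def fit_distribute(self, X, y, labels, neighbors):
...         majority = Counter(y).most_common(1)[0][0]
...         intra_distribution = {}
...         for class_label in set(y) - {majority}:
...             clusters = np.unique(labels[y == class_label])
...             for cluster_label in clusters:
...                 # Keys are (cluster_label, class_label) pairs, as in DensityDistributor
...                 intra_distribution[(cluster_label, class_label)] = 1.0 / len(clusters)
...         return intra_distribution, {}  # empty inter-cluster distribution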
Compatibility¶
Any over-sampler from cluster-over-sampling that does not use clustering, i.e. when clusterer=None, is equivalent to the corresponding imbalanced-learn over-sampler:
>>> import numpy as np
>>> from imblearn.over_sampling import BorderlineSMOTE
>>> X_res_im, y_res_im = BorderlineSMOTE(random_state=5).fit_resample(X, y)
>>> from clover.over_sampling import BorderlineSMOTE
>>> X_res_cl, y_res_cl = BorderlineSMOTE(random_state=5).fit_resample(X, y)
>>> np.testing.assert_equal(X_res_im, X_res_cl)
>>> np.testing.assert_equal(y_res_im, y_res_cl)
References¶
[DB2017] Douzas, G., & Bacao, F. (2017). “Self-Organizing Map Oversampling for imbalanced data set learning”, Expert Systems with Applications, 82, 40-52. https://doi.org/10.1016/j.eswa.2017.03.073
[DB2018] Douzas, G., & Bacao, F. (2018). “Improving imbalanced learning through a heuristic oversampling method based on k-means and SMOTE”, Information Sciences, 465, 1-20. https://doi.org/10.1016/j.ins.2018.06.056
[CBHK2002] Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). “SMOTE: synthetic minority over-sampling technique”, Journal of Artificial Intelligence Research, 16, 321-357.