clover.over_sampling.ClusterOverSampler

class clover.over_sampling.ClusterOverSampler(oversampler, clusterer=None, distributor=None, raise_error=True, random_state=None, n_jobs=None)

A class that handles clustering-based over-sampling.

Any combination of over-sampler, clusterer and distributor can be used.

Read more in the user guide.

Parameters:
oversampler : oversampler estimator

Over-sampler to apply to each selected cluster.

clusterer : clusterer estimator, default=None

Clusterer to apply to input space before over-sampling.

  • When None, all samples are assigned to a single cluster, i.e. no clustering is applied.
  • When a clusterer is given, it is used to cluster the input space. Over-sampling is then applied inside each cluster and between clusters.
distributor : distributor estimator, default=None

Distributor to distribute the generated samples per cluster label.

  • When None and a clusterer is provided, the density distributor is used. If clusterer is also None, then the distributor does not affect the over-sampling procedure.
  • When a distributor object is provided, it is used to distribute the generated samples to the clusters.
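To make the role of a distributor concrete, the following is a minimal sketch of splitting a fixed number of generated samples across cluster labels in proportion to per-cluster weights. The helper `allocate_samples` and the weights are hypothetical and serve only as an illustration; this is not the weighting rule of the density distributor.

```python
def allocate_samples(n_samples, weights):
    """Split n_samples across clusters in proportion to weights (hypothetical helper)."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("At least one cluster must have a positive weight.")
    # Ideal (fractional) share of the generated samples for each cluster label.
    shares = {label: n_samples * w / total for label, w in weights.items()}
    # Start from the integer part of each share ...
    counts = {label: int(share) for label, share in shares.items()}
    # ... then give the leftover samples to the largest fractional remainders.
    leftover = n_samples - sum(counts.values())
    by_remainder = sorted(
        shares, key=lambda label: shares[label] - counts[label], reverse=True
    )
    for label in by_remainder[:leftover]:
        counts[label] += 1
    return counts


# Made-up weights for three cluster labels.
counts = allocate_samples(80, {0: 5, 1: 3, 2: 2})
print(counts)
```

The counts always sum to the requested number of samples, so the overall sampling strategy is preserved while the per-cluster split changes.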
raise_error : bool, default=True

Raise an error when no samples are generated.

  • If True, it raises an error when no filtered clusters are identified and therefore no samples are generated.
  • If False, it displays a warning.
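The two behaviours of raise_error can be sketched with a small stand-alone helper. The name `report_no_samples`, the message text and the choice of `ValueError` are assumptions made for illustration; the exception and warning types actually used by the class are not stated here.

```python
import warnings


def report_no_samples(n_generated, raise_error=True):
    """Hypothetical helper mirroring the documented raise_error switch."""
    if n_generated > 0:
        return
    msg = "No clusters were selected, so no samples were generated."
    if raise_error:
        # raise_error=True: stop with an exception (type assumed here).
        raise ValueError(msg)
    # raise_error=False: continue and only emit a warning.
    warnings.warn(msg)


# With raise_error=False the call returns normally and a warning is recorded.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    report_no_samples(0, raise_error=False)
print(len(caught))
```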
random_state : int, RandomState instance, default=None

Control the randomization of the algorithm.

  • If int, random_state is the seed used by the random number generator;
  • If RandomState instance, random_state is the random number generator;
  • If None, the random number generator is the RandomState instance used by np.random.
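The difference between passing an int and passing a RandomState instance can be seen with NumPy alone. This sketch only illustrates NumPy's behaviour; how the sampler forwards random_state to its oversampler and clusterer is not shown here.

```python
import numpy as np

# int: each call seeds a fresh generator, so equal seeds give equal draws.
draws_a = np.random.RandomState(5).uniform(size=3)
draws_b = np.random.RandomState(5).uniform(size=3)
same_seed_same_draws = np.array_equal(draws_a, draws_b)

# RandomState instance: the object itself is the generator, so its state
# advances every time it is used.
rng = np.random.RandomState(5)
first = rng.uniform(size=3)
second = rng.uniform(size=3)
first_matches_seeded = np.array_equal(first, draws_a)
state_advanced = not np.array_equal(first, second)

# None: draws come from the global generator behind np.random, so results
# are reproducible only if that global state was seeded elsewhere.
print(same_seed_same_draws, first_matches_seeded, state_advanced)
```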
n_jobs : int, default=None

Number of CPU cores used when over-sampling the clusters. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.

Examples

>>> from collections import Counter
>>> from clover.over_sampling import ClusterOverSampler
>>> from sklearn.datasets import make_classification
>>> from sklearn.cluster import KMeans
>>> from imblearn.over_sampling import SMOTE
>>> X, y = make_classification(random_state=0, n_classes=2, weights=[0.9, 0.1])
>>> print('Original dataset shape %s' % Counter(y))
Original dataset shape Counter({0: 90, 1: 10})
>>> cluster_oversampler = ClusterOverSampler(
... oversampler=SMOTE(random_state=5),
... clusterer=KMeans(random_state=10))
>>> X_res, y_res = cluster_oversampler.fit_resample(X, y)
>>> print('Resampled dataset shape %s' % Counter(y_res))
Resampled dataset shape Counter({0: 90, 1: 90})
Attributes:
clusterer_ : object

A fitted clone of the clusterer parameter or None when a clusterer is not given.

distributor_ : object

A fitted clone of the distributor parameter or a fitted instance of the BaseDistributor when a distributor is not given.

labels_ : array, shape (n_samples,)

Cluster labels of each sample.

neighbors_ : array, (n_neighboring_pairs, 2) or None

An array that contains all neighboring pairs with each row being a unique neighboring pair. It is None when the clusterer does not support this attribute.

oversampler_ : object

A fitted clone of the oversampler parameter.

random_state_ : object

An instance of RandomState class.

sampling_strategy_ : dict

Actual sampling strategy.

__init__(self, oversampler, clusterer=None, distributor=None, raise_error=True, random_state=None, n_jobs=None)

Initialize self. See help(type(self)) for accurate signature.

fit(self, X, y)

Check inputs and statistics of the sampler.

You should use fit_resample in all cases.

Parameters:
X : {array-like, dataframe, sparse matrix} of shape (n_samples, n_features)

Data array.

y : array-like of shape (n_samples,)

Target array.

Returns:
self : object

Return the instance itself.

fit_resample(self, X, y, **fit_params)

Resample the dataset.

Parameters:
X : {array-like, dataframe, sparse matrix} of shape (n_samples, n_features)

Matrix containing the data which have to be sampled.

y : array-like of shape (n_samples,)

Corresponding label for each sample in X.

Returns:
X_resampled : {array-like, dataframe, sparse matrix} of shape (n_samples_new, n_features)

The array containing the resampled data.

y_resampled : array-like of shape (n_samples_new,)

The corresponding label of X_resampled.

fit_sample(self, X, y)

Resample the dataset.

Parameters:
X : {array-like, dataframe, sparse matrix} of shape (n_samples, n_features)

Matrix containing the data which have to be sampled.

y : array-like of shape (n_samples,)

Corresponding label for each sample in X.

Returns:
X_resampled : {array-like, dataframe, sparse matrix} of shape (n_samples_new, n_features)

The array containing the resampled data.

y_resampled : array-like of shape (n_samples_new,)

The corresponding label of X_resampled.

get_params(self, deep=True)

Get parameters for this estimator.

Parameters:
deep : bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params : mapping of string to any

Parameter names mapped to their values.

set_params(self, **params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:
**params : dict

Estimator parameters.

Returns:
self : object

Estimator instance.
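The `<component>__<parameter>` naming convention can be illustrated with a short stand-alone sketch that groups flat keys by component. The helper `split_nested_params` is hypothetical and is not the library's implementation; the keys assume a KMeans clusterer (`n_clusters`) and a SMOTE over-sampler (`k_neighbors`), as in the example above.

```python
def split_nested_params(params):
    """Group flat <component>__<parameter> keys by component (illustrative only)."""
    own, nested = {}, {}
    for key, value in params.items():
        # Split at the first double underscore; keys without one belong
        # to the estimator itself.
        component, sep, sub_key = key.partition("__")
        if sep:
            nested.setdefault(component, {})[sub_key] = value
        else:
            own[key] = value
    return own, nested


own, nested = split_nested_params(
    {
        "raise_error": False,
        "clusterer__n_clusters": 5,
        "oversampler__k_neighbors": 3,
    }
)
print(own)
print(nested)
```

Under this convention, a call such as `set_params(clusterer__n_clusters=5)` would target the `n_clusters` parameter of the clusterer component rather than a parameter of the sampler itself.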
