clover.over_sampling.ClusterOverSampler
class clover.over_sampling.ClusterOverSampler(oversampler, clusterer=None, distributor=None, raise_error=True, random_state=None, n_jobs=None)
A class that handles clustering-based over-sampling.
Any combination of over-sampler, clusterer and distributor can be used.
Read more in the user guide.
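The interplay of clusterer and distributor can be pictured with a small sketch. This is not clover's implementation: the proportional allocation rule below is a hypothetical stand-in for a distributor, and `n_clusters`, `minority_per_cluster` and `allocation` are illustrative names. It only shows the general idea of clustering the input space and then deciding how many generated samples each cluster receives.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification

# Toy imbalanced dataset, generated with the same call as the Examples section.
X, y = make_classification(random_state=0, n_classes=2, weights=[0.9, 0.1])

# Identify the minority class and how many samples would balance the classes.
class_counts = np.bincount(y)
minority = int(np.argmin(class_counts))
n_to_generate = int(class_counts.max() - class_counts.min())

# Step 1: cluster the input space (the role of the clusterer).
n_clusters = 3  # illustrative choice
labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=10).fit_predict(X)

# Step 2 (hypothetical rule, NOT clover's density distributor):
# weight each cluster by its share of the minority samples.
minority_per_cluster = np.array(
    [np.sum((labels == k) & (y == minority)) for k in range(n_clusters)]
)
weights = minority_per_cluster / minority_per_cluster.sum()

# Number of samples an over-sampler would be asked to generate in each cluster.
# Rounding means the total may differ slightly from n_to_generate.
allocation = {k: int(round(w * n_to_generate)) for k, w in enumerate(weights)}
print(allocation)
```

In the real class, the over-sampler is then applied per cluster according to the distribution produced by the distributor.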
Parameters:
- oversampler : oversampler estimator
Over-sampler to apply to each selected cluster.
- clusterer : clusterer estimator, default=None
Clusterer to apply to input space before over-sampling.
  - When None, it corresponds to a clusterer that assigns a single cluster to all the samples, i.e. no clustering is applied.
  - When a clusterer is provided, it applies clustering to the input space. Over-sampling is then applied inside each cluster and between clusters.
- distributor : distributor estimator, default=None
Distributor to distribute the generated samples per cluster label.
  - When None and a clusterer is provided, it corresponds to the density distributor. If clusterer is also None, then the distributor does not affect the over-sampling procedure.
  - When a distributor object is provided, it is used to distribute the generated samples to the clusters.
- raise_error : bool, default=True
Raise an error when no samples are generated.
  - If True, it raises an error when no filtered clusters are identified and therefore no samples are generated.
  - If False, it displays a warning.
- random_state : int, RandomState instance, default=None
Control the randomization of the algorithm.
  - If int, random_state is the seed used by the random number generator.
  - If RandomState instance, random_state is the random number generator.
  - If None, the random number generator is the RandomState instance used by np.random.
- n_jobs : int, default=None
Number of CPU cores used during the cross-validation loop. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.
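The n_jobs convention described above comes from joblib. The following sketch uses joblib directly rather than ClusterOverSampler, and the `square` helper is a made-up function for illustration. It shows the three situations the description mentions: the default of None, a surrounding joblib.parallel_backend context, and -1.

```python
from joblib import Parallel, delayed, parallel_backend


def square(v):
    """Toy task used only for illustration."""
    return v * v


# n_jobs left at its default (None), outside any context: sequential execution.
sequential = Parallel()(delayed(square)(i) for i in range(4))

# Same call inside a parallel_backend context: the context supplies n_jobs.
with parallel_backend("threading", n_jobs=2):
    in_context = Parallel()(delayed(square)(i) for i in range(4))

# n_jobs=-1 requests all available processors.
all_cores = Parallel(n_jobs=-1, backend="threading")(
    delayed(square)(i) for i in range(4)
)

print(sequential, in_context, all_cores)
```

The results are identical in all three cases; only the number of workers used to compute them changes.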
Examples
>>> from collections import Counter
>>> from clover.over_sampling import ClusterOverSampler
>>> from sklearn.datasets import make_classification
>>> from sklearn.cluster import KMeans
>>> from imblearn.over_sampling import SMOTE
>>> X, y = make_classification(random_state=0, n_classes=2, weights=[0.9, 0.1])
>>> print('Original dataset shape %s' % Counter(y))
Original dataset shape Counter({0: 90, 1: 10})
>>> cluster_oversampler = ClusterOverSampler(
...     oversampler=SMOTE(random_state=5),
...     clusterer=KMeans(random_state=10))
>>> X_res, y_res = cluster_oversampler.fit_resample(X, y)
>>> print('Resampled dataset shape %s' % Counter(y_res))
Resampled dataset shape Counter({0: 90, 1: 90})
Attributes:
- clusterer_ : object
A fitted clone of the clusterer parameter or None when a clusterer is not given.
- distributor_ : object
A fitted clone of the distributor parameter or a fitted instance of the BaseDistributor when a distributor is not given.
- labels_ : array, shape (n_samples,)
Labels of each sample.
- neighbors_ : array, shape (n_neighboring_pairs, 2) or None
An array that contains all neighboring pairs with each row being a unique neighboring pair. It is None when the clusterer does not support this attribute.
- oversampler_ : object
A fitted clone of the oversampler parameter.
- random_state_ : object
An instance of the RandomState class.
- sampling_strategy_ : dict
Actual sampling strategy.
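The relationship between the random_state parameter and the random_state_ attribute follows the usual scikit-learn convention. A minimal sketch, assuming the conversion behaves like sklearn.utils.check_random_state (this page does not confirm which helper the class calls internally):

```python
import numpy as np
from sklearn.utils import check_random_state

# int: used as the seed of a new RandomState.
rng_from_int = check_random_state(42)

# RandomState instance: returned unchanged.
existing = np.random.RandomState(42)
rng_passthrough = check_random_state(existing)

# None: the global RandomState instance used by np.random.
rng_global = check_random_state(None)

# Two generators built from the same seed produce the same draws.
same_draws = np.allclose(
    check_random_state(42).rand(3), np.random.RandomState(42).rand(3)
)
print(type(rng_from_int).__name__, rng_passthrough is existing, same_draws)
```

Passing an int is the simplest way to make repeated runs reproducible.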
__init__(self, oversampler, clusterer=None, distributor=None, raise_error=True, random_state=None, n_jobs=None)
Initialize self. See help(type(self)) for accurate signature.
fit(self, X, y)
Check inputs and statistics of the sampler.
You should use fit_resample in all cases.
Parameters:
- X : {array-like, dataframe, sparse matrix} of shape (n_samples, n_features)
Data array.
- y : array-like of shape (n_samples,)
Target array.
Returns:
- self : object
Return the instance itself.
fit_resample(self, X, y, **fit_params)
Resample the dataset.
Parameters:
- X : {array-like, dataframe, sparse matrix} of shape (n_samples, n_features)
Matrix containing the data to be sampled.
- y : array-like of shape (n_samples,)
Corresponding label for each sample in X.
Returns:
- X_resampled : {array-like, dataframe, sparse matrix} of shape (n_samples_new, n_features)
The array containing the resampled data.
- y_resampled : array-like of shape (n_samples_new,)
The corresponding label of X_resampled.
fit_sample(self, X, y)
Resample the dataset.
Parameters:
- X : {array-like, dataframe, sparse matrix} of shape (n_samples, n_features)
Matrix containing the data to be sampled.
- y : array-like of shape (n_samples,)
Corresponding label for each sample in X.
Returns:
- X_resampled : {array-like, dataframe, sparse matrix} of shape (n_samples_new, n_features)
The array containing the resampled data.
- y_resampled : array-like of shape (n_samples_new,)
The corresponding label of X_resampled.
get_params(self, deep=True)
Get parameters for this estimator.
Parameters:
- deep : bool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns:
- params : mapping of string to any
Parameter names mapped to their values.
set_params(self, **params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.
Parameters:
- **params : dict
Estimator parameters.
Returns:
- self : object
Estimator instance.
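The <component>__<parameter> form used by get_params and set_params can be illustrated with a plain scikit-learn Pipeline. This sketch uses a Pipeline as a stand-in nested object; the step names "scale" and "clf" are arbitrary choices for the example, and the same naming convention is assumed to apply to the estimators nested inside ClusterOverSampler.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])

# Update parameters of nested components; set_params returns the estimator itself.
returned = pipe.set_params(clf__C=0.5, scale__with_mean=False)

# deep=True also lists the parameters of contained estimators.
deep_params = pipe.get_params(deep=True)

# deep=False lists only the parameters of the outer object.
shallow_params = pipe.get_params(deep=False)

print(deep_params["clf__C"], "clf__C" in shallow_params)
```

Because set_params returns the estimator, calls can be chained or used inside parameter searches.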