clover.distribution.DensityDistributor
class clover.distribution.DensityDistributor(filtering_threshold='auto', distances_exponent='auto', sparsity_based=True, distribution_ratio=1.0)
Class to perform density-based distribution.
Samples are distributed based on the density of clusters.
Read more in the user guide.
Parameters:
- filtering_threshold : float or 'auto', default='auto'
  The threshold of a filtered cluster. It can be any non-negative number or 'auto' to be calculated automatically.
  - If 'auto', the filtering threshold is calculated from the imbalance ratio of the target for the binary case or the maximum of the target's imbalance ratios for the multiclass case.
  - If float, then it is manually set to this number.
  Any cluster that has an imbalance ratio smaller than the filtering threshold is identified as a filtered cluster and can potentially be used to generate minority class instances. Higher values increase the number of filtered clusters.
- distances_exponent : float or 'auto', default='auto'
  The exponent of the mean distance in the density calculation. It can be any non-negative number or 'auto' to be calculated automatically.
  - If 'auto', then it is set equal to the number of features. Higher values make the calculation of density more sensitive to the cluster's size, i.e. clusters with a large mean Euclidean distance between samples are penalized.
  - If float, then it is manually set to this number.
- sparsity_based : bool, default=True
  Whether sparse clusters receive more generated samples.
  - When True, clusters receive generated samples that are inversely proportional to their density.
  - When False, clusters receive generated samples that are proportional to their density.
- distribution_ratio : float, default=1.0
  The ratio of intra-cluster to inter-cluster generated samples. It is a number in the [0.0, 1.0] range. The default value is 1.0, a case corresponding to only intra-cluster generation. As the number decreases, fewer intra-cluster samples are generated. Inter-cluster generation, i.e. when distribution_ratio is less than 1.0, requires a neighborhood structure for the clusters: a neighbors_ attribute should be created after fitting, and an error is raised when it is not found.
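The interplay of distances_exponent and sparsity_based can be sketched in plain Python. This is a minimal illustration, not clover's implementation. The density formula (sample count divided by the mean pairwise Euclidean distance raised to the exponent), the helper names cluster_density and intra_weights, and the toy points are all assumptions made for the example.

```python
from itertools import combinations
from math import dist


def cluster_density(points, exponent):
    """Assumed density: sample count / (mean pairwise distance ** exponent)."""
    pairs = list(combinations(points, 2))
    mean_distance = sum(dist(a, b) for a, b in pairs) / len(pairs)
    return len(points) / mean_distance ** exponent


def intra_weights(densities, sparsity_based=True):
    """Normalize densities (or their inverses) so the weights sum to 1."""
    raw = {key: (1.0 / d if sparsity_based else d) for key, d in densities.items()}
    total = sum(raw.values())
    return {key: value / total for key, value in raw.items()}


# Two hypothetical minority clusters in 2D: one tight, one spread out.
tight = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
spread = [(0.0, 0.0), (0.0, 4.0), (4.0, 0.0)]

exponent = 2  # mirrors the 'auto' rule: the number of features
densities = {
    (0, 1): cluster_density(tight, exponent),
    (1, 1): cluster_density(spread, exponent),
}

# Keys follow the (cluster_label, class_label) convention.
sparse_first = intra_weights(densities, sparsity_based=True)
dense_first = intra_weights(densities, sparsity_based=False)
```

Under these assumptions the spread-out cluster (1, 1) gets the larger weight when sparsity_based=True, and the tight cluster (0, 1) gets the larger weight when sparsity_based=False.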
Examples
>>> from clover.distribution import DensityDistributor
>>> from sklearn.datasets import load_iris
>>> from sklearn.cluster import KMeans
>>> X, y = load_iris(return_X_y=True)
>>> labels = KMeans(random_state=0).fit_predict(X, y)
>>> density_distributor = DensityDistributor().fit(X, y, labels)
>>> density_distributor.filtered_clusters_
[(7, 1), (4, 1), (3, 1), (1, 1), (6, 2), (1, 2), (2, 2), (4, 2)]
>>> density_distributor.intra_distribution_
{(7, 1): 0.066096796165... (4, 2): 0.0911085147...}
>>> density_distributor.inter_distribution_
{}
Attributes:
- clusters_density_ : dict
  Each dict key is a multi-label tuple of shape (cluster_label, class_label), while the values correspond to the density.
- distances_exponent_ : float
  Actual exponent of the mean distance used in the calculations.
- distribution_ratio_ : float
  A copy of the parameter in the constructor.
- filtered_clusters_ : list
  Each element is a (cluster_label, class_label) tuple.
- filtering_threshold_ : float
  Actual filtering threshold used in the calculations.
- inter_distribution_ : dict
  Each dict key is a multi-label tuple of shape ((cluster_label1, cluster_label2), class_label).
- intra_distribution_ : dict
  Each dict key is a multi-label tuple of shape (cluster_label, class_label).
- labels_ : array, shape (n_samples,)
  Labels of each sample.
- majority_class_label_ : int
  The majority class label.
- n_samples_ : int
  The number of samples.
- neighbors_ : array, shape (n_neighboring_pairs, 2)
  An array that contains all neighboring pairs. Each row is a unique neighboring pair.
- sparsity_based_ : bool
  A copy of the parameter in the constructor.
- unique_class_labels_ : array, shape (n_classes,)
  An array of unique class labels.
- unique_cluster_labels_ : array, shape (n_clusters,)
  An array of unique cluster labels.
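How filtering_threshold_ relates to filtered_clusters_ can be illustrated with a small stand-alone sketch. It is not the library's code. It assumes that the imbalance ratio is the majority count divided by the minority count, and the helper names auto_threshold and filtered_clusters, as well as the toy labels, are hypothetical.

```python
from collections import Counter


def auto_threshold(y, majority_label):
    """Assumed 'auto' rule: the largest majority/minority ratio in the target."""
    class_counts = Counter(y)
    n_majority = class_counts[majority_label]
    return max(
        n_majority / n for cls, n in class_counts.items() if cls != majority_label
    )


def filtered_clusters(y, labels, majority_label, threshold):
    """Return (cluster_label, class_label) pairs with imbalance ratio < threshold."""
    counts = Counter(zip(labels, y))
    selected = []
    for (cluster, cls), n_minority in sorted(counts.items()):
        if cls == majority_label:
            continue
        n_majority = counts.get((cluster, majority_label), 0)
        if n_majority / n_minority < threshold:
            selected.append((cluster, cls))
    return selected


# Toy binary target (0 is the majority class) and cluster labels.
y = [0, 0, 0, 0, 0, 0, 1, 1, 1]
labels = [0, 0, 0, 0, 1, 1, 0, 1, 1]

threshold = auto_threshold(y, majority_label=0)
strict = filtered_clusters(y, labels, majority_label=0, threshold=threshold)
lenient = filtered_clusters(y, labels, majority_label=0, threshold=5.0)
```

With the toy data, the automatic threshold keeps only cluster 1 for class 1, while the larger threshold of 5.0 also admits cluster 0, in line with the note that higher values increase the number of filtered clusters.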
__init__(self, filtering_threshold='auto', distances_exponent='auto', sparsity_based=True, distribution_ratio=1.0)
Initialize self. See help(type(self)) for accurate signature.
fit(self, X, y, labels=None, neighbors=None)
Generate the intra-label and inter-label distribution.
Parameters:
- X : {array-like, sparse matrix}, shape (n_samples, n_features)
  Matrix containing the data to be sampled.
- y : array-like, shape (n_samples,)
  Corresponding label for each sample in X.
- labels : array-like, shape (n_samples,)
  Cluster labels of each sample.
- neighbors : array-like, shape (n_neighboring_pairs, 2)
  An array that contains all neighboring pairs. Each row is a unique neighboring pair.
Returns:
- self : object
  Return self.
fit_distribute(self, X, y, labels=None, neighbors=None)
Return the intra-label and inter-label distribution.
Parameters:
- X : {array-like, sparse matrix}, shape (n_samples, n_features)
  Matrix containing the data to be sampled.
- y : array-like, shape (n_samples,)
  Corresponding label for each sample in X.
- labels : array-like, shape (n_samples,)
  Cluster labels of each sample.
- neighbors : array-like, shape (n_neighboring_pairs, 2)
  An array that contains all neighboring pairs. Each row is a unique neighboring pair.
Returns:
- distributions : tuple of (intra_distribution, inter_distribution) arrays
  A tuple with the two distributions.
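A caller would typically unpack the returned tuple and turn the weights into per-cluster sample counts. The sketch below is illustrative only. It assumes the two distributions are dicts keyed as described under Attributes, with proportion-like values, and both the allocate helper and the numbers are hypothetical stand-ins rather than real output.

```python
def allocate(distribution, n_samples):
    """Scale proportion-like weights to integer sample counts."""
    return {key: round(weight * n_samples) for key, weight in distribution.items()}


# Hypothetical stand-in for the tuple returned by fit_distribute.
distributions = ({(0, 1): 0.25, (1, 1): 0.75}, {})

intra_distribution, inter_distribution = distributions
intra_counts = allocate(intra_distribution, n_samples=8)
inter_counts = allocate(inter_distribution, n_samples=8)
```

An empty inter-cluster distribution, as produced with the default distribution_ratio=1.0 in the Examples section, simply yields no inter-cluster counts.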
get_params(self, deep=True)
Get parameters for this estimator.
Parameters:
- deep : bool, default=True
  If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns:
- params : mapping of string to any
  Parameter names mapped to their values.
set_params(self, **params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.
Parameters:
- **params : dict
  Estimator parameters.
Returns:
- self : object
  Estimator instance.
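The <component>__<parameter> naming convention can be illustrated with a toy class. ToyEstimator is a hypothetical stand-in written for this example. It mimics only the naming convention and is neither scikit-learn's nor clover's implementation.

```python
class ToyEstimator:
    """Toy stand-in that mimics the nested-parameter naming convention only."""

    def __init__(self, **params):
        self._params = dict(params)

    def get_params(self, deep=True):
        params = dict(self._params)
        if deep:
            # Expose parameters of nested estimators as <component>__<parameter>.
            for name, value in self._params.items():
                if isinstance(value, ToyEstimator):
                    for sub_name, sub_value in value.get_params(deep=True).items():
                        params[f"{name}__{sub_name}"] = sub_value
        return params

    def set_params(self, **params):
        for key, value in params.items():
            name, sep, sub_key = key.partition("__")
            if sep:
                # Delegate to the nested component.
                self._params[name].set_params(**{sub_key: value})
            else:
                self._params[name] = value
        return self


distributor = ToyEstimator(filtering_threshold="auto", distribution_ratio=1.0)
wrapper = ToyEstimator(distributor=distributor, n_jobs=1)

# Update a parameter of the nested component through the wrapper.
wrapper.set_params(distributor__distribution_ratio=0.5)
nested_params = wrapper.get_params(deep=True)
```

Because set_params returns the estimator itself, calls can be chained, for example wrapper.set_params(n_jobs=2).get_params().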