PKBC#
- class QuadratiK.spherical_clustering.PKBC(num_clust, max_iter=300, stopping_rule='loglik', init_method='sampledata', num_init=10, tol=1e-07, random_state=None, n_jobs=4)#
Poisson kernel-based clustering on the sphere. The class fits a mixture of Poisson kernel-based densities to data on the sphere and estimates the parameters of that mixture. The resulting estimates are then used to assign each data point its final cluster membership.
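For orientation, the density of a single mixture component can be evaluated directly. The sketch below assumes the form given by Golzy and Markatou (2020), f(x | rho, mu) = (1 - rho^2) / (omega_d * ||x - rho * mu||^d), where omega_d is the surface area of the unit sphere in d dimensions. The helper names `sphere_surface_area` and `pkb_density` are illustrative and are not part of QuadratiK.

```python
import math


def sphere_surface_area(d):
    """Surface area of the unit sphere S^(d-1) embedded in R^d."""
    return 2.0 * math.pi ** (d / 2.0) / math.gamma(d / 2.0)


def pkb_density(x, mu, rho):
    """Poisson kernel-based density at the unit vector x (illustrative helper).

    mu is a unit-length centroid and 0 <= rho < 1 is the concentration.
    """
    d = len(x)
    dot = sum(xi * mi for xi, mi in zip(x, mu))
    # For unit vectors, ||x - rho * mu||^2 = 1 + rho^2 - 2 * rho * <x, mu>.
    sq_dist = 1.0 + rho * rho - 2.0 * rho * dot
    return (1.0 - rho * rho) / (sphere_surface_area(d) * sq_dist ** (d / 2.0))


mu = [0.0, 0.0, 1.0]
uniform = pkb_density([1.0, 0.0, 0.0], mu, rho=0.0)    # rho = 0: uniform on the sphere
at_mode = pkb_density(mu, mu, rho=0.9)                 # density at the centroid
opposite = pkb_density([0.0, 0.0, -1.0], mu, rho=0.9)  # density at the antipode
```

With rho = 0 the density is constant over the sphere, and as rho approaches 1 the mass concentrates around mu, which is why rho_ is described as a concentration parameter.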
Parameters#
- num_clust : int
Number of clusters.
- max_iter : int, optional
Maximum number of iterations before a run is terminated. Defaults to 300.
- stopping_rule : str, optional
String describing the stopping rule to be used within each run. Currently must be either ‘max’, ‘membership’, or ‘loglik’. Defaults to ‘loglik’.
- init_method : str, optional
String describing the initialization method to be used. Currently must be ‘sampledata’.
- num_init : int, optional
Number of initializations.
- tol : float, optional
Threshold that the change in log-likelihood must exceed for iterations to continue, if applicable. Defaults to 1e-7.
- random_state : int, None, optional
Seed for random number generation. Defaults to None.
- n_jobs : int, optional
Maximum number of concurrently running joblib workers; used only to compute the WCSS efficiently. If 1 is given, no joblib parallelism is used at all, which is useful for debugging. For more information on n_jobs, see https://joblib.readthedocs.io/en/latest/generated/joblib.Parallel.html. Defaults to 4.
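The interaction between stopping_rule, max_iter, and tol can be pictured with a small helper. This is only one plausible reading of the three rule names, not the library's code: `should_stop` and its arguments are hypothetical, and the behaviour assumed for ‘max’ and ‘membership’ is an assumption based on their names.

```python
def should_stop(rule, iteration, max_iter, loglik_change=None,
                membership_changed=None, tol=1e-7):
    """Return True when an E-M run should terminate (illustrative only)."""
    if iteration >= max_iter:      # max_iter always caps a run
        return True
    if rule == "max":              # assumed: keep going until max_iter
        return False
    if rule == "membership":       # assumed: stop once labels no longer change
        return membership_changed is False
    if rule == "loglik":           # stop once the log-likelihood change is below tol
        return loglik_change is not None and abs(loglik_change) < tol
    raise ValueError(f"unknown stopping rule: {rule!r}")


# A run using the 'loglik' rule continues while the change exceeds tol.
keeps_going = not should_stop("loglik", iteration=5, max_iter=300, loglik_change=1e-3)
converged = should_stop("loglik", iteration=6, max_iter=300, loglik_change=1e-9)
```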
Attributes#
- alpha_ : numpy.ndarray of shape (n_clusters,)
Estimated mixing proportions.
- labels_ : numpy.ndarray of shape (n_samples,)
Final cluster membership assigned by the algorithm to each observation.
- log_lik_vec : numpy.ndarray of shape (num_init,)
Array of log-likelihood values for each initialization.
- loglik_ : float
Maximum value of the log-likelihood function.
- mu_ : numpy.ndarray of shape (n_clusters, n_features)
Estimated centroids.
- num_iter_per_run : numpy.ndarray of shape (num_init,)
Number of E-M iterations per run.
- post_probs_ : numpy.ndarray of shape (n_samples, n_clusters)
Posterior probabilities of each observation belonging to each cluster.
- rho_ : numpy.ndarray of shape (n_clusters,)
Estimated concentration parameters rho.
- euclidean_wcss_ : float
Value of the within-cluster sum of squares computed with Euclidean distance.
- cosine_wcss_ : float
Value of the within-cluster sum of squares computed with cosine similarity.
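To make the relationship between alpha_, mu_, rho_, post_probs_, and labels_ concrete, the sketch below computes posterior probabilities by Bayes' rule for a mixture of Poisson kernel-based densities and takes the most probable cluster as the label. The density form is the one assumed from Golzy and Markatou (2020), the fitted values are made up for illustration, and the argmax assignment is an assumption about how memberships are derived rather than a description of the library's internals.

```python
import math


def pkb_density(x, mu, rho):
    """Poisson kernel-based density at unit vector x (illustrative helper)."""
    d = len(x)
    dot = sum(a * b for a, b in zip(x, mu))
    area = 2.0 * math.pi ** (d / 2.0) / math.gamma(d / 2.0)
    return (1.0 - rho ** 2) / (area * (1.0 + rho ** 2 - 2.0 * rho * dot) ** (d / 2.0))


def posterior_probs(X, alpha, mu, rho):
    """Rows are observations, columns are clusters; each row sums to 1."""
    probs = []
    for x in X:
        weighted = [a * pkb_density(x, m, r) for a, m, r in zip(alpha, mu, rho)]
        total = sum(weighted)
        probs.append([w / total for w in weighted])
    return probs


# Hypothetical fitted values for two clusters on the unit circle.
alpha = [0.5, 0.5]
mu = [[1.0, 0.0], [-1.0, 0.0]]
rho = [0.9, 0.9]
X = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]]

post = posterior_probs(X, alpha, mu, rho)
# Assign each observation to its most probable cluster (ties go to the first).
labels = [max(range(len(row)), key=row.__getitem__) for row in post]
```

The third point is equidistant from both centroids, so its posterior probabilities are equal and the tie is broken arbitrarily.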
References#
Golzy M. & Markatou M. (2020) Poisson Kernel-Based Clustering on the Sphere: Convergence Properties, Identifiability, and a Method of Sampling, Journal of Computational and Graphical Statistics, 29:4, 758-770, DOI: 10.1080/10618600.2020.1740713.
Examples#
>>> from QuadratiK.datasets import load_wireless_data
>>> from QuadratiK.spherical_clustering import PKBC
>>> from sklearn.preprocessing import LabelEncoder
>>> X, y = load_wireless_data(return_X_y=True)
>>> le = LabelEncoder()
>>> le.fit(y)
>>> y = le.transform(y)
>>> cluster_fit = PKBC(num_clust=4, random_state=42).fit(X)
>>> ari, macro_precision, macro_recall, avg_silhouette_Score = cluster_fit.validation(y)
>>> print("Estimated mixing proportions :", cluster_fit.alpha_)
>>> print("Estimated concentration parameters: ", cluster_fit.rho_)
>>> print("Adjusted Rand Index:", ari)
>>> print("Macro Precision:", macro_precision)
>>> print("Macro Recall:", macro_recall)
>>> print("Average Silhouette Score:", avg_silhouette_Score)
... Estimated mixing proportions : [0.23590339 0.24977919 0.25777522 0.25654219]
... Estimated concentration parameters: [0.97773265 0.98348976 0.98226901 0.98572597]
... Adjusted Rand Index: 0.9403086353805835
... Macro Precision: 0.9771870612442508
... Macro Recall: 0.9769999999999999
... Average Silhouette Score: 0.3803089203572107
Methods
fit(dat) | Performs Poisson Kernel-based Clustering.
predict(X) | Predict the cluster membership for each sample in X.
stats() | Function to generate descriptive statistics per variable (and per group if available).
validation([y_true]) | Computes validation metrics such as ARI, Macro Precision and Macro Recall when true labels are provided.
- PKBC.fit(dat)#
Performs Poisson Kernel-based Clustering.
Parameters#
- dat : numpy.ndarray, pandas.DataFrame
A numeric array of data values.
Returns#
- self : object
Fitted estimator.
- PKBC.predict(X)#
Predict the cluster membership for each sample in X.
Parameters#
- X : numpy.ndarray, pandas.DataFrame
New data for which to predict cluster membership.
Returns#
- (Cluster Probabilities, Membership) : tuple
The first element of the tuple is the cluster probabilities of the input samples. The second element of the tuple is the predicted cluster membership of the new data.
- PKBC.stats()#
Function to generate descriptive statistics per variable (and per group if available).
Returns#
- summary_stats_df : pandas.DataFrame
Dataframe of descriptive statistics.
- PKBC.validation(y_true=None)#
Computes validation metrics such as ARI, Macro Precision and Macro Recall when true labels are provided.
Parameters#
- y_true : numpy.ndarray, optional
Array of true cluster memberships. Defaults to None.
Returns#
- validation metrics : tuple
The tuple consists of the following:
- Adjusted Rand Index : float (returned only when y_true is provided)
Adjusted Rand Index computed between the true and predicted cluster memberships.
- Macro Precision : float (returned only when y_true is provided)
Macro Precision computed between the true and predicted cluster memberships.
- Macro Recall : float (returned only when y_true is provided)
Macro Recall computed between the true and predicted cluster memberships.
- Average Silhouette Score : float
Mean Silhouette Coefficient of all samples.
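As a reminder of what the macro-averaged metrics measure, the sketch below computes precision and recall per class and takes their unweighted mean. It uses the standard definitions and assumes the predicted labels have already been matched to the true class labels; `macro_precision_recall` is an illustrative helper, not the function used by validation.

```python
def macro_precision_recall(y_true, y_pred):
    """Unweighted mean of per-class precision and recall (illustrative)."""
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls = [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        n_pred = sum(p == c for p in y_pred)   # true positives + false positives
        n_true = sum(t == c for t in y_true)   # true positives + false negatives
        precisions.append(tp / n_pred if n_pred else 0.0)
        recalls.append(tp / n_true if n_true else 0.0)
    return sum(precisions) / len(classes), sum(recalls) / len(classes)


# Toy labels: one observation of class 1 is assigned to class 2.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]
precision, recall = macro_precision_recall(y_true, y_pred)
```

Because every class carries the same weight, a small cluster that is poorly recovered lowers the macro scores as much as a large one.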
References#
Rousseeuw, P.J. (1987) Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20, 53–65.
Notes#
We have taken a naive approach to map the predicted cluster labels to the true class labels (if provided). This might not work in cases where num_clust is large. Please use sklearn.metrics for computing metrics in such cases, and provide the correctly matched labels.
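Cluster labels are arbitrary, so predicted labels must be matched to true classes before precision or recall is meaningful. One simple way to do this, shown here as a sketch and not necessarily the mapping used by validation, is to try every assignment of clusters to classes and keep the one with the most agreements. The helper name `best_label_mapping` is hypothetical. The search visits up to k! assignments for k clusters, which illustrates why simple matching schemes become impractical as num_clust grows.

```python
from itertools import permutations


def best_label_mapping(y_true, y_pred):
    """Map predicted cluster ids to true class ids by maximizing agreement.

    Exhaustive search; assumes there are at least as many classes as
    clusters and is only practical for a handful of clusters.
    """
    clusters = sorted(set(y_pred))
    classes = sorted(set(y_true))
    best_map, best_hits = None, -1
    for perm in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, perm))
        hits = sum(mapping[p] == t for t, p in zip(y_true, y_pred))
        if hits > best_hits:
            best_map, best_hits = mapping, hits
    return best_map


# The clustering is perfect, but the cluster ids are permuted.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [2, 2, 0, 0, 1, 1]
mapping = best_label_mapping(y_true, y_pred)
relabeled = [mapping[p] for p in y_pred]
```

The relabeled predictions can then be passed to the functions in sklearn.metrics alongside the true labels.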
See also#
sklearn.metrics : Scikit-learn's metrics module supports a wide range of metrics.