PKBC

class QuadratiK.spherical_clustering.PKBC(num_clust, max_iter=300, stopping_rule='loglik', init_method='sampledata', num_init=10, tol=1e-07, random_state=None, n_jobs=4)

Poisson kernel-based clustering on the sphere. The class fits a mixture of Poisson kernel-based densities to data on the sphere and estimates the parameters of that mixture. The resulting estimates are then used to assign each data point its final cluster membership.
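For orientation, the mixture model can be written out explicitly. The following is a sketch that assumes the Poisson kernel-based density form given in the Golzy and Markatou (2020) reference listed under References. The symbols $d$ (dimension of the ambient space), $M$ (number of clusters), and $\omega_d$ (surface area of the unit sphere) are notation introduced here; $\alpha_j$, $\mu_j$, and $\rho_j$ correspond to the alpha_, mu_, and rho_ attributes.

```latex
f(x \mid \Theta) \;=\; \sum_{j=1}^{M} \alpha_j \,
    \frac{1 - \rho_j^{2}}{\omega_d \,\lVert x - \rho_j \mu_j \rVert^{d}},
\qquad
\omega_d = \frac{2\pi^{d/2}}{\Gamma(d/2)},
\qquad
\lVert x \rVert = \lVert \mu_j \rVert = 1,\quad
0 < \rho_j < 1,\quad
\alpha_j \ge 0,\quad
\sum_{j=1}^{M} \alpha_j = 1 .
```

Under this form, a value of $\rho_j$ close to 1 concentrates the component tightly around its centroid $\mu_j$, while $\rho_j$ close to 0 approaches the uniform density on the sphere.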

Parameters

num_clust : int

Number of clusters.

max_iter : int, optional

Maximum number of iterations before a run is terminated. Defaults to 300.

stopping_rule : str, optional

String describing the stopping rule to be used within each run. Must be one of 'max', 'membership', or 'loglik'. Defaults to 'loglik'.

init_method : str, optional

String describing the initialization method to be used. Currently must be 'sampledata'.

num_init : int, optional

Number of initializations. Defaults to 10.

tol : float, optional

Threshold by which the log-likelihood must change for iterations to continue, if applicable. Defaults to 1e-7.

random_state : int or None, optional

Seed for random number generation. Defaults to None.

n_jobs : int, optional

Used only for computing the WCSS efficiently. n_jobs specifies the maximum number of concurrently running workers. If 1 is given, no joblib parallelism is used at all, which is useful for debugging. For more information on joblib's n_jobs, see https://joblib.readthedocs.io/en/latest/generated/joblib.Parallel.html. Defaults to 4.
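The interplay of stopping_rule, max_iter, and tol can be illustrated with a small sketch. This is not the library's implementation: the helper `should_stop` and its arguments are hypothetical, and the reading of 'membership' (stop once no observation changes cluster) and 'max' (always run to max_iter) is an assumption based on the option names.

```python
def should_stop(rule, iteration, max_iter, prev_loglik, loglik,
                prev_labels, labels, tol=1e-7):
    """Hypothetical helper: decide whether a single run should terminate."""
    if iteration >= max_iter:
        # Every rule is capped by the maximum number of iterations.
        return True
    if rule == "max":
        # Assumed meaning: never stop early.
        return False
    if rule == "membership":
        # Assumed meaning: stop when cluster assignments no longer change.
        return prev_labels is not None and list(prev_labels) == list(labels)
    if rule == "loglik":
        # Stop when the log-likelihood changes by less than tol.
        return prev_loglik is not None and abs(loglik - prev_loglik) < tol
    raise ValueError("rule must be 'max', 'membership', or 'loglik'")


# The log-likelihood moved by about 1e-9, which is below tol, so the run stops.
stop_now = should_stop("loglik", 5, 300, -10.0, -10.0 + 1e-9, None, [0, 1])
```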

Attributes

alpha_ : numpy.ndarray of shape (n_clusters,)

Estimated mixing proportions.

labels_ : numpy.ndarray of shape (n_samples,)

Final cluster membership assigned by the algorithm to each observation.

log_lik_vec : numpy.ndarray of shape (num_init,)

Array of log-likelihood values for each initialization.

loglik_ : float

Maximum value of the log-likelihood function.

mu_ : numpy.ndarray of shape (n_clusters, n_features)

Estimated centroids.

num_iter_per_run : numpy.ndarray of shape (num_init,)

Number of EM iterations per run.

post_probs_ : numpy.ndarray of shape (n_samples, n_clusters)

Posterior probability of each observation belonging to each cluster.

rho_ : numpy.ndarray of shape (n_clusters,)

Estimated concentration parameters.

euclidean_wcss_ : float

Within-cluster sum of squares computed with Euclidean distance.

cosine_wcss_ : float

Within-cluster sum of squares computed with cosine similarity.
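To show how the fitted attributes relate to one another, here is a minimal sketch of the posterior computation for a single observation. It assumes the Poisson kernel-based density of the cited Golzy and Markatou (2020) reference and the usual mixture-model posterior weighting. The function names are hypothetical and this is not the library's code.

```python
import math


def pkbd_density(x, mu, rho):
    """Poisson kernel-based density on the unit sphere in R^d (assumed form)."""
    d = len(x)
    # Surface area of the unit sphere in R^d.
    surface_area = 2 * math.pi ** (d / 2) / math.gamma(d / 2)
    dist = math.sqrt(sum((xi - rho * mi) ** 2 for xi, mi in zip(x, mu)))
    return (1 - rho ** 2) / (surface_area * dist ** d)


def posterior(x, alpha, mu, rho):
    """Return (posterior probabilities, most likely cluster) for one unit vector."""
    weighted = [a * pkbd_density(x, m, r) for a, m, r in zip(alpha, mu, rho)]
    total = sum(weighted)
    probs = [w / total for w in weighted]
    label = max(range(len(probs)), key=probs.__getitem__)
    return probs, label


# Two clusters on the unit circle, centered at (1, 0) and (-1, 0).
alpha = [0.5, 0.5]
mu = [[1.0, 0.0], [-1.0, 0.0]]
rho = [0.9, 0.9]

# An observation close to the first centroid.
x = [math.cos(0.1), math.sin(0.1)]
probs, label = posterior(x, alpha, mu, rho)
```

In this sketch, one row of post_probs_ corresponds to `probs` and the matching entry of labels_ to `label`.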

References

Golzy M. & Markatou M. (2020) Poisson Kernel-Based Clustering on the Sphere: Convergence Properties, Identifiability, and a Method of Sampling, Journal of Computational and Graphical Statistics, 29:4, 758-770, DOI: 10.1080/10618600.2020.1740713.

Examples

>>> from QuadratiK.datasets import load_wireless_data
>>> from QuadratiK.spherical_clustering import PKBC
>>> from sklearn.preprocessing import LabelEncoder
>>> X, y = load_wireless_data(return_X_y=True)
>>> # Encode the class labels as integers
>>> y = LabelEncoder().fit_transform(y)
>>> cluster_fit = PKBC(num_clust=4, random_state=42).fit(X)
>>> ari, macro_precision, macro_recall, avg_silhouette_score = cluster_fit.validation(y)
>>> print("Estimated mixing proportions :", cluster_fit.alpha_)
Estimated mixing proportions : [0.23590339 0.24977919 0.25777522 0.25654219]
>>> print("Estimated concentration parameters: ", cluster_fit.rho_)
Estimated concentration parameters:  [0.97773265 0.98348976 0.98226901 0.98572597]
>>> print("Adjusted Rand Index:", ari)
Adjusted Rand Index: 0.9403086353805835
>>> print("Macro Precision:", macro_precision)
Macro Precision: 0.9771870612442508
>>> print("Macro Recall:", macro_recall)
Macro Recall: 0.9769999999999999
>>> print("Average Silhouette Score:", avg_silhouette_score)
Average Silhouette Score: 0.3803089203572107

Methods

PKBC.fit(dat)

Performs Poisson Kernel-based Clustering.

PKBC.predict(X)

Predict the cluster membership for each sample in X.

PKBC.stats()

Function to generate descriptive statistics per variable (and per group if available).

PKBC.validation([y_true])

Computes validation metrics such as ARI, Macro Precision and Macro Recall when true labels are provided.


PKBC.fit(dat)

Performs Poisson Kernel-based Clustering.

Parameters

dat : numpy.ndarray, pandas.DataFrame

A numeric array of data values.

Returns

self : object

Fitted estimator.
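Because the model is defined on the sphere, each observation is interpreted as a direction. A minimal sketch of projecting rows onto the unit sphere is shown here. Whether fit performs this step internally is not stated in this section, so treat the helper `to_unit_sphere` as a hypothetical preprocessing step rather than part of the API.

```python
import math


def to_unit_sphere(rows):
    """Scale each row to unit Euclidean norm (hypothetical helper)."""
    projected = []
    for row in rows:
        norm = math.sqrt(sum(v * v for v in row))
        if norm == 0.0:
            # A zero vector has no direction, so it cannot be projected.
            raise ValueError("zero vector cannot be projected onto the sphere")
        projected.append([v / norm for v in row])
    return projected


unit_rows = to_unit_sphere([[3.0, 4.0], [0.0, 2.0]])
```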

PKBC.predict(X)

Predict the cluster membership for each sample in X.

Parameters

X : numpy.ndarray, pandas.DataFrame

New data for which to predict cluster membership.

Returns

(Cluster Probabilities, Membership) : tuple

The first element of the tuple is the cluster probabilities of the input samples. The second element of the tuple is the predicted cluster membership of the new data.

PKBC.stats()

Function to generate descriptive statistics per variable (and per group if available).

Returns

summary_stats_df : pandas.DataFrame

Dataframe of descriptive statistics.

PKBC.validation(y_true=None)

Computes validation metrics such as ARI, Macro Precision and Macro Recall when true labels are provided.

Parameters

y_true : numpy.ndarray, optional

Array of true cluster memberships. Defaults to None.

Returns

validation metrics : tuple

The tuple consists of the following:

  • Adjusted Rand Index : float (returned only when y_true is provided)

    Adjusted Rand Index computed between the true and predicted cluster memberships.

  • Macro Precision : float (returned only when y_true is provided)

    Macro Precision computed between the true and predicted cluster memberships.

  • Macro Recall : float (returned only when y_true is provided)

    Macro Recall computed between the true and predicted cluster memberships.

  • Average Silhouette Score : float

    Mean Silhouette Coefficient of all samples.
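Macro averaging computes a metric separately for each class and then takes the unweighted mean, so every class counts equally regardless of its size. A small sketch of that definition follows. The function name is hypothetical, and the sketch assumes the predicted labels have already been matched to the true labels.

```python
def macro_precision_recall(y_true, y_pred):
    """Unweighted mean of per-class precision and recall."""
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls = [], []
    for c in classes:
        true_pos = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        predicted = sum(p == c for p in y_pred)
        actual = sum(t == c for t in y_true)
        # A class that is never predicted (or never occurs) contributes 0.
        precisions.append(true_pos / predicted if predicted else 0.0)
        recalls.append(true_pos / actual if actual else 0.0)
    n = len(classes)
    return sum(precisions) / n, sum(recalls) / n


precision, recall = macro_precision_recall([0, 0, 1, 1], [0, 1, 1, 1])
```

Here class 0 has precision 1 and recall 1/2, while class 1 has precision 2/3 and recall 1, so the macro averages are 5/6 and 3/4.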

References

Rousseeuw, P.J. (1987) Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20, 53–65.

Notes

A naive approach is used to map the predicted cluster labels to the true class labels (if provided). This mapping might not work when num_clust is large. In such cases, match the predicted labels to the true labels yourself and compute the metrics with sklearn.metrics.
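One simple way to align cluster labels with class labels before computing metrics yourself is an exhaustive search over one-to-one mappings. The sketch below is an illustration under that assumption, not necessarily the approach taken inside validation, and the function name `match_labels` is hypothetical.

```python
from itertools import permutations


def match_labels(y_true, y_pred):
    """Relabel y_pred with the one-to-one mapping that best agrees with y_true."""
    true_labels = sorted(set(y_true))
    pred_labels = sorted(set(y_pred))
    if len(pred_labels) > len(true_labels):
        raise ValueError("more predicted clusters than true classes")
    best_mapping, best_hits = None, -1
    # Try every assignment of distinct true labels to the predicted labels.
    for candidate in permutations(true_labels, len(pred_labels)):
        mapping = dict(zip(pred_labels, candidate))
        hits = sum(mapping[p] == t for t, p in zip(y_true, y_pred))
        if hits > best_hits:
            best_mapping, best_hits = mapping, hits
    return [best_mapping[p] for p in y_pred]


# Cluster ids are arbitrary: cluster 2 corresponds to class 0, and so on.
matched = match_labels([0, 0, 1, 1, 2, 2], [2, 2, 0, 0, 1, 1])
```

An exhaustive search like this is only practical for a small number of clusters, because the number of candidate mappings grows factorially.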

See also

sklearn.metrics : Scikit-learn's metrics module supports a wide range of metrics.