pyemb package

Submodules

pyemb.preprocessing module

pyemb.preprocessing.find_connected_components(A, attributes, n_components=None)

Find connected components of a multipartite graph.

Parameters:
  • A (scipy.sparse.csr_matrix) – The adjacency matrix of the graph.

  • attributes (list of lists) – The attributes of the nodes. The first list contains the attributes of the nodes in rows. The second list contains the attributes of the nodes in the columns.

  • n_components (int) – The number of components to be found.

Returns:

  • cc_As (list of scipy.sparse.csr_matrix) – The adjacency matrices of the connected components.

  • cc_attributes (list of lists) – The attributes of the nodes of the connected components. The first list contains the attributes of the nodes in the rows. The second list contains the attributes of the nodes in the columns.
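
A minimal usage sketch; the table and column names are illustrative, and graph_from_dataframes (documented below) is used to build the inputs:

    import pandas as pd

    from pyemb.preprocessing import find_connected_components, graph_from_dataframes

    # An illustrative table with two disconnected customer/product groups.
    purchases = pd.DataFrame({
        "customer": ["alice", "alice", "bob", "carol"],
        "product": ["apples", "bread", "cheese", "cheese"],
    })
    A, attributes = graph_from_dataframes([purchases], [["customer", "product"]])

    # Split the multipartite graph into its connected components.
    cc_As, cc_attributes = find_connected_components(A, attributes)
    print(len(cc_As))  # number of components found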

pyemb.preprocessing.find_subgraph(A, attributes, subgraph_attributes)

Find a subgraph of a multipartite graph.

Parameters:
  • A (scipy.sparse.csr_matrix) – The adjacency matrix of the multipartite graph.

  • attributes (list of lists) – The attributes of the nodes. The first list contains the attributes of the nodes in rows. The second list contains the attributes of the nodes in the columns.

  • subgraph_attributes (list of lists) – The attributes of the nodes wanted in the subgraph. The first list contains the attributes of the nodes wanted in the rows. The second list contains the attributes of the nodes wanted in the columns.

Returns:

  • subgraph_A (scipy.sparse.csr_matrix) – The adjacency matrix of the subgraph.

  • subgraph_attributes (list of lists) – The attributes of the nodes of the subgraph. The first list contains the attributes of the nodes in the rows. The second list contains the attributes of the nodes in the columns.

pyemb.preprocessing.graph_from_dataframes(tables, relationship_cols, same_attribute=False, dynamic_col=None, weight_col=None, join_token='::')

Create a graph from a list of tables and relationships.

Parameters:
  • tables (list of pandas.DataFrame) – The list of tables.

  • relationship_cols (list of lists) – The list of relationships. Either each relationship is a list of two lists, each containing the column names from the corresponding table, or a single list of column-name pairs, where each pair is looked for in every table.

  • same_attribute (bool) – Whether the entities in the columns are from the same attribute.

  • dynamic_col (list of str) – The list of dynamic columns.

  • weight_col (list of str) – The list of weight columns.

  • join_token (str) – The token used to join the names of the partitions and the names of the nodes.

Returns:

  • A (scipy.sparse.csr_matrix) – The adjacency matrix of the graph.

  • attributes (list of lists) – The attributes of the nodes. The first list contains the attributes of the nodes in the rows. The second list contains the attributes of the nodes in the columns.
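
A minimal sketch with two illustrative tables, where each column-name pair in relationship_cols is looked for in every table:

    import pandas as pd

    from pyemb.preprocessing import graph_from_dataframes

    # Illustrative tables; only the columns named in relationship_cols are used.
    purchases = pd.DataFrame({
        "customer": ["alice", "bob", "carol"],
        "product": ["apples", "bread", "apples"],
    })
    reviews = pd.DataFrame({
        "customer": ["alice", "carol"],
        "product": ["bread", "apples"],
    })

    A, attributes = graph_from_dataframes(
        tables=[purchases, reviews],
        relationship_cols=[["customer", "product"]],
    )
    print(A.shape)  # sparse adjacency matrix of the resulting multipartite graph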

pyemb.preprocessing.largest_cc_of(A, attributes, partition, dynamic=False)

Find the connected component containing the most nodes from a partition.

Parameters:
  • A (scipy.sparse.csr_matrix) – The adjacency matrix of the graph.

  • attributes (list of lists) – The attributes of the nodes. The first list contains the attributes of the nodes in rows. The second list contains the attributes of the nodes in the columns.

  • partition (str) – The partition to be searched.

  • dynamic (bool) – Whether to find the connected component containing the most nodes from the dynamic part of the graph.

Returns:

  • cc_A (scipy.sparse.csr_matrix) – The adjacency matrix of the connected component.

  • cc_attributes (list of lists) – The attributes of the nodes of the connected component. The first list contains the attributes of the nodes in the rows. The second list contains the attributes of the nodes in the columns.

pyemb.preprocessing.text_matrix_and_attributes(data, column_name, remove_stopwords=True, clean_text=True, remove_email_addresses=False, update_stopwords=None, **kwargs)

Create a matrix from a column of text data.

Parameters:
  • data (pandas.DataFrame) – The data to be used to create the matrix.

  • column_name (str) – The name of the column containing the text data.

  • remove_stopwords (bool) – Whether to remove stopwords.

  • clean_text (bool) – Whether to clean the text data.

  • remove_email_addresses (bool) – Whether to remove email addresses.

  • update_stopwords (list of str) – The list of additional stopwords to be removed.

  • kwargs (dict) – Other arguments to be passed to sklearn.feature_extraction.text.TfidfVectorizer.

Returns:

  • Y (numpy.ndarray) – The matrix created from the text data.

  • attributes (list of lists) – The attributes of the nodes. The first list contains the attributes of the nodes in rows. The second list contains the attributes of the nodes in the columns.
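
A minimal sketch on an illustrative DataFrame; any extra keyword arguments (here min_df) are passed straight through to TfidfVectorizer:

    import pandas as pd

    from pyemb.preprocessing import text_matrix_and_attributes

    docs = pd.DataFrame({
        "text": [
            "networks and their embeddings",
            "spectral embedding of networks",
            "turning text into a matrix",
        ]
    })

    # Rows of Y correspond to documents, columns to vocabulary terms.
    Y, attributes = text_matrix_and_attributes(docs, "text", min_df=1)
    print(Y.shape)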

pyemb.preprocessing.time_series_matrix_and_attributes(data, time_col, drop_nas=True)

Create a matrix from a time series.

Parameters:
  • data (pandas.DataFrame) – The data to be used to create the matrix.

  • time_col (str) – The name of the column containing the time information.

  • drop_nas (bool) – Whether to drop rows with missing values.

Returns:

  • Y (numpy.ndarray) – The matrix created from the time series.

  • attributes (list of lists) – The attributes of the nodes. The first list contains the attributes of the nodes in rows. The second list contains the attributes of the nodes in the columns.

pyemb.preprocessing.to_networkx(A, attributes, symmetric=None)

Convert a multipartite graph to a networkx graph.

pyemb.embedding module

pyemb.embedding.ISE(As, d, flat=True, procrustes=False, consistent_orientation=True)

Computes the spectral embedding (ISE) for each adjacency snapshot.

Parameters:
  • As (numpy.ndarray) – An adjacency matrix series of shape (T, n, n).

  • d (int) – Embedding dimension.

  • flat (bool, optional) – Whether to return a flat embedding (n*T, d) or a 3D embedding (T, n, d). Default is True.

  • procrustes (bool, optional) – Whether to align each embedding with the previous embedding. Default is False.

  • consistent_orientation (bool, optional) – Whether to ensure the eigenvector orientation is consistent. Default is True.

Returns:

Dynamic embedding of shape (n*T, d) or (T, n, d).

Return type:

numpy.ndarray

pyemb.embedding.OMNI(As, d, flat=True, sparse_matrix=False)

Computes the omnibus dynamic spectral embedding. For more details, see: https://arxiv.org/abs/1705.09355

Parameters:
  • As (numpy.ndarray) – Adjacency matrices of shape (T, n, n).

  • d (int) – Embedding dimension.

  • flat (bool, optional) – Whether to return a flat embedding (n*T, d) or a 3D embedding (T, n, d). Default is True.

  • sparse_matrix (bool, optional) – Whether to use sparse matrices. Default is False.

Returns:

Dynamic embedding of shape (n*T, d) or (T, n, d).

Return type:

numpy.ndarray

pyemb.embedding.UASE(As, d, flat=True, sparse_matrix=False, return_left=False)

Computes the unfolded adjacency spectral embedding (UASE). For more details, see: https://arxiv.org/abs/2007.10455 https://arxiv.org/abs/2106.01282

Parameters:
  • As (numpy.ndarray) – An adjacency matrix series of shape (T, n, n).

  • d (int) – Embedding dimension.

  • flat (bool, optional) – Whether to return a flat embedding (n*T, d) or a 3D embedding (T, n, d). Default is True.

  • sparse_matrix (bool, optional) – Whether the adjacency matrices are sparse. Default is False.

  • return_left (bool, optional) – Whether to return the left (anchor) embedding as well as the right (dynamic) embedding. Default is False.

Returns:

  • numpy.ndarray – Dynamic embedding of shape (n*T, d) or (T, n, d).

  • numpy.ndarray, optional – Anchor embedding of shape (n, d) if return_left is True.
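
A minimal sketch on an illustrative series of symmetric random snapshots:

    import numpy as np

    from pyemb.embedding import UASE

    rng = np.random.default_rng(0)
    T, n = 4, 100

    # Build a (T, n, n) series of symmetric adjacency snapshots.
    As = np.zeros((T, n, n))
    for t in range(T):
        upper = np.triu((rng.random((n, n)) < 0.05).astype(float), k=1)
        As[t] = upper + upper.T

    # Dynamic (right) embedding; set return_left=True to also obtain the
    # anchor embedding described above.
    embedding = UASE(As, d=2)
    print(embedding.shape)  # (n*T, 2)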

pyemb.embedding.dyn_embed(As, d=50, method='UASE', regulariser='auto', flat=True)

Computes the dynamic embedding using a specified method.

Parameters:
  • As (numpy.ndarray or list) – An adjacency matrix series which is either a numpy array of shape (T, n, n), a list of numpy arrays of shape (n, n), or a series of CSR matrices.

  • d (int, optional) – Embedding dimension. Default is 50.

  • method (str, optional) – The embedding method to use. Options are “ISE”, “ISE PROCRUSTES”, “UASE”, “OMNI”, “ULSE”, “URLSE”, “RANDOM”. Default is “UASE”.

  • regulariser (float or "auto", optional) – Regularisation parameter for the Laplacian matrix. If “auto”, the regulariser is set to the average node degree. Default is “auto”.

  • flat (bool, optional) – Whether to return a flat embedding (n*T, d) or a 3D embedding (T, n, d). Default is True.

Returns:

Dynamic embedding of shape (n*T, d) or (T, n, d).

Return type:

numpy.ndarray

Raises:

Exception – If the specified method is not recognized.
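
A short sketch showing how the same interface dispatches to the different methods; the adjacency series is illustrative:

    import numpy as np

    from pyemb.embedding import dyn_embed

    rng = np.random.default_rng(1)
    T, n = 3, 80
    As = np.zeros((T, n, n))
    for t in range(T):
        upper = np.triu((rng.random((n, n)) < 0.1).astype(float), k=1)
        As[t] = upper + upper.T

    ya = dyn_embed(As, d=10, method="UASE")             # shape (n*T, 10)
    yb = dyn_embed(As, d=10, method="URLSE")            # unfolded regularised Laplacian
    yc = dyn_embed(As, d=10, method="ISE", flat=False)  # shape (T, n, 10)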

pyemb.embedding.eigen_decomp(A, dim=None)

Perform eigenvalue decomposition of a matrix.

Parameters:
  • A (numpy.ndarray) – The matrix to be decomposed.

  • dim (int) – The number of dimensions to be returned.

Returns:

  • eigenvalues (numpy.ndarray) – The eigenvalues.

  • eigenvectors (numpy.ndarray) – The eigenvectors.

pyemb.embedding.embed(Y, d=50, version='sqrt', return_right=False, flat=True, make_laplacian=False, regulariser=0)

Embed a matrix.

Parameters:
  • Y (numpy.ndarray or list of numpy.ndarray) – The matrix to embed.

  • d (int) – The number of dimensions to embed into.

  • version (str) – The version of the embedding. Options are ‘full’ or ‘sqrt’ (default).

  • return_right (bool) – Whether to return the right embedding.

  • flat (bool) – Whether to return a flat embedding (n*T, d) or a 3D embedding (T, n, d).

  • make_laplacian (bool) – Whether to use the Laplacian matrix.

  • regulariser (float) – The regulariser to be added to the degrees of the nodes. (only used if make_laplacian=True)

Returns:

  • left_embedding (numpy.ndarray) – The left embedding.

  • right_embedding (numpy.ndarray) – The right embedding.
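
A minimal sketch on an illustrative rectangular matrix; by default only the left embedding is returned:

    import numpy as np

    from pyemb.embedding import embed

    rng = np.random.default_rng(2)
    Y = rng.random((100, 40))  # e.g. a document-term or biadjacency matrix

    left = embed(Y, d=5)
    left, right = embed(Y, d=5, return_right=True)
    print(left.shape, right.shape)  # (100, 5) (40, 5)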

pyemb.embedding.regularised_ULSE(As, d, regulariser='auto', flat=True, sparse_matrix=False, return_left=False)

Computes the regularised unfolded Laplacian spectral embedding (regularised ULSE).

Parameters:
  • As (numpy.ndarray) – An adjacency matrix series of shape (T, n, n).

  • d (int) – Embedding dimension.

  • regulariser (float or ‘auto’, optional) – Regularisation parameter for the Laplacian matrix. By default, this is the average node degree.

  • flat (bool, optional) – Whether to return a flat embedding (n*T, d) or a 3D embedding (T, n, d). Default is True.

  • sparse_matrix (bool, optional) – Whether the adjacency matrices are sparse. Default is False.

  • return_left (bool, optional) – Whether to return the left (anchor) embedding as well as the right (dynamic) embedding. Default is False.

Returns:

  • numpy.ndarray – Dynamic embedding of shape (n*T, d) or (T, n, d).

  • numpy.ndarray, optional – Anchor embedding of shape (n, d) if return_left is True.

pyemb.embedding.wasserstein_dimension_select(Y, dims, split=0.5)

Select the number of dimensions using Wasserstein distances.

Parameters:
  • Y (numpy.ndarray) – The array of matrices.

  • dims (list of int) – The dimensions to be considered.

  • split (float) – The proportion of the data to be used for training.

Returns:

ws – The Wasserstein distances between the training and test data for each number of dimensions. The dimension recommended is the one with the smallest Wasserstein distance.

Return type:

list of numpy.ndarray
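
A minimal sketch on an illustrative noisy low-rank matrix:

    import numpy as np

    from pyemb.embedding import wasserstein_dimension_select

    rng = np.random.default_rng(3)
    Y = rng.random((200, 3)) @ rng.random((3, 50)) + 0.01 * rng.random((200, 50))

    dims = list(range(1, 11))
    ws = wasserstein_dimension_select(Y, dims, split=0.5)

    # The recommended dimension is the one with the smallest Wasserstein distance.
    d_hat = dims[int(np.argmin(ws))]
    print(d_hat)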

pyemb.tools module

pyemb.tools.degree_correction(embedding)

Perform degree correction.

Parameters:

embedding (numpy.ndarray) – The embedding of the graph, either 2D or 3D.

Returns:

embedding_dc – The degree-corrected embedding.

Return type:

numpy.ndarray
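
A minimal call sketch on an illustrative embedding:

    import numpy as np

    from pyemb.tools import degree_correction

    rng = np.random.default_rng(4)
    embedding = rng.random((100, 5))  # 2D embedding; a (T, n, d) array also works

    embedding_dc = degree_correction(embedding)
    print(embedding_dc.shape)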

pyemb.tools.recover_subspaces(embedding, attributes)

Recover the subspaces for each partition from an embedding.

Parameters:
  • embedding (numpy.ndarray) – The embedding of the graph.

  • attributes (list of lists) – The attributes of the nodes. The first list contains the attributes of the nodes in rows. The second list contains the attributes of the nodes in the columns.

Returns:

  • partition_embeddings (dict) – The embeddings of the partitions.

  • partition_attributes (dict) – The attributes of the nodes in the partitions.

pyemb.tools.select(embedding, attributes, select_attributes)

Select the portion of the embedding and attributes associated with a set of attributes.

Parameters:
  • embedding (numpy.ndarray) – The embedding of the graph.

  • attributes (list of lists) – The attributes of the nodes. The first list contains the attributes of the nodes in rows. The second list contains the attributes of the nodes in the columns.

  • select_attributes (dict or list of dicts) – The attributes to select by. If a list of dicts is provided, the intersection of the nodes satisfying each dict is selected.

Returns:

  • selected_X (numpy.ndarray) – The selected embedding.

  • selected_attributes (list of lists) – The attributes of the selected nodes.

pyemb.tools.to_laplacian(A, regulariser=0)

Convert an adjacency matrix to a Laplacian matrix.

Parameters:
  • A (scipy.sparse.csr_matrix) – The adjacency matrix.

  • regulariser (float or ‘auto’) – The regulariser to be added to the degrees of the nodes. If ‘auto’, the regulariser is set to the mean of the degrees.

Returns:

L – The Laplacian matrix.

Return type:

scipy.sparse.csr_matrix
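
A minimal sketch on an illustrative sparse symmetric adjacency matrix:

    import numpy as np
    from scipy import sparse

    from pyemb.tools import to_laplacian

    rng = np.random.default_rng(5)
    upper = np.triu((rng.random((50, 50)) < 0.1).astype(float), k=1)
    A = sparse.csr_matrix(upper + upper.T)

    # The regulariser is added to the node degrees.
    L = to_laplacian(A, regulariser=10)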

pyemb.tools.varimax(Phi, gamma=1, q=20, tol=1e-06)

Perform varimax rotation.

Parameters:
  • Phi (numpy.ndarray) – The matrix to rotate.

  • gamma (float, optional) – The gamma parameter.

  • q (int, optional) – The number of iterations.

  • tol (float, optional) – The tolerance.

Returns:

The rotated matrix.

Return type:

numpy.ndarray
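
A minimal call sketch on an illustrative loadings matrix:

    import numpy as np

    from pyemb.tools import varimax

    rng = np.random.default_rng(6)
    Phi = rng.random((100, 4))

    Phi_rotated = varimax(Phi)
    print(Phi_rotated.shape)  # same shape as Phi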

pyemb.plotting module

pyemb.plotting.quick_plot(embedding, n, T=1, node_labels=None, **kwargs)

Produces an interactive plot of an embedding. If the embedding is dynamic (i.e. T > 1), the embedding will be animated over time.

Parameters:
  • embedding (numpy.ndarray (n*T, d) or (T, n, d)) – The dynamic embedding.

  • n (int) – The number of nodes.

  • T (int) – The number of time points (> 1 animates the embedding).

  • node_labels (list of length n) – The labels of the nodes (time-invariant).

  • return_df (bool (optional)) – Option to return the plotting dataframe.

  • title (str (optional)) – The title of the plot.
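
A minimal sketch, assuming a flat dynamic embedding such as the one returned by dyn_embed; the data and node labels are illustrative:

    import numpy as np

    from pyemb.embedding import dyn_embed
    from pyemb.plotting import quick_plot

    rng = np.random.default_rng(7)
    T, n = 3, 60
    As = np.zeros((T, n, n))
    for t in range(T):
        upper = np.triu((rng.random((n, n)) < 0.1).astype(float), k=1)
        As[t] = upper + upper.T

    embedding = dyn_embed(As, d=2, method="UASE")
    node_labels = ["group A" if i < n // 2 else "group B" for i in range(n)]

    # With T > 1 the scatter plot is animated over the time points.
    quick_plot(embedding, n, T, node_labels)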

pyemb.plotting.snapshot_plot(embedding, n, node_labels, points_of_interest, point_labels=[], max_cols=4, add_legend=False, legend_adjust=0, max_legend_cols=5, **kwargs)

Plots the selected embedding snapshots as a grid of scatter plots.

Parameters:
  • embedding (numpy.ndarray (T, n, d) or (n*T, d)) – The dynamic embedding.

  • n (int) – The number of nodes.

  • node_labels (list of length n) – The labels of the nodes (time-invariant).

  • points_of_interest (list of int) – The time point indices to plot.

  • point_labels (list of str (optional)) – The labels of the points of interest.

  • max_cols (int (optional)) – The maximum number of columns in the scatter plot grid.

Returns:

fig – The figure object.

Return type:

matplotlib.figure.Figure

pyemb.hc module

class pyemb.hc.ConstructTree(point_cloud=None, model=None, epsilon=0.25)

Bases: object

Construct a condensed tree from a hierarchical clustering model.

Parameters:
  • model (AgglomerativeClustering, optional) – The fitted model.

  • point_cloud (ndarray, optional) – The data points.

  • epsilon (float, optional) – The threshold for condensing the tree.

  • **kwargs (dict, optional) – Additional keyword arguments.

model

The fitted model.

Type:

AgglomerativeClustering

point_cloud

The data points.

Type:

ndarray

epsilon

The threshold for condensing the tree.

Type:

float

linkage

The linkage matrix.

Type:

ndarray

tree

The condensed tree.

Type:

nx.Graph

collapsed_branches

The collapsed branches.

Type:

dict

fit(**kwargs)

Fit the condensed tree.

plot(labels=None, colours=None, colour_threshold=0.5, prog='sfdp', forceatlas_iter=250, node_size=10, scaling_node_size=1, **kwargs)

Plot the condensed tree.
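
A minimal sketch of the intended workflow on an illustrative point cloud; either the points or an already-fitted clustering model can be supplied:

    import numpy as np

    from pyemb.hc import ConstructTree, DotProductAgglomerativeClustering

    rng = np.random.default_rng(8)
    point_cloud = rng.random((50, 3))  # e.g. an embedding

    # Either pass the points directly ...
    tree = ConstructTree(point_cloud=point_cloud, epsilon=0.25)
    tree.fit()

    # ... or pass an already-fitted clustering model.
    model = DotProductAgglomerativeClustering()
    model.fit(point_cloud)
    tree = ConstructTree(model=model, epsilon=0.25)
    tree.fit()

    # tree.plot()  # draw the condensed tree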

class pyemb.hc.DotProductAgglomerativeClustering(metric='dot_product', linkage='average', distance_threshold=0, n_clusters=None)

Bases: object

Perform hierarchical clustering using dot product as the metric.

Parameters:
  • metric (str, optional) – The metric to use for clustering.

  • linkage (str, optional) – The linkage criterion to use.

  • distance_threshold (float, optional) – The linkage distance threshold above which clusters will not be merged.

  • n_clusters (int, optional) – The number of clusters to find.

distances_

The distances between the clusters.

Type:

ndarray

children_

The children of each non-leaf node.

Type:

ndarray

labels_

The labels of each point.

Type:

ndarray

n_clusters_

The number of clusters.

Type:

int

n_connected_components_

The number of connected components.

Type:

int

n_leaves_

The number of leaves.

Type:

int

n_features_in_

The number of features seen during fit.

Type:

int

fit(X)

Fit the clustering to the data X.

pyemb.hc.branch_lengths(Z, point_cloud=None)

Calculate branch lengths for a hierarchical clustering dendrogram.

Parameters:
  • Z (ndarray) – The linkage matrix.

  • point_cloud (ndarray, optional) – The data points.

Returns:

The matrix of branch lengths.

Return type:

ndarray

pyemb.hc.cophenetic_distances(Z)

Calculate the cophenetic distances between each observation and internal nodes.

Parameters:

Z (ndarray) – The linkage matrix.

Returns:

d – The full distance matrix (2n-1) x (2n-1).

Return type:

ndarray

pyemb.hc.find_descendents(Z, node, desc=None, just_leaves=True)

Find all descendants of a given node in a hierarchical clustering tree.

Parameters:
  • Z (ndarray) – The linkage matrix.

  • node (int) – The node whose descendants are to be found.

  • desc (dict, optional) – The descendants found so far (used for recursion).

  • just_leaves (bool, optional) – Whether to include only leaf nodes.

Returns:

The list of descendants.

Return type:

list

pyemb.hc.get_ranking(model)

Get the ranking of the samples.

Parameters:

model (AgglomerativeClustering) – The fitted model.

Returns:

mh_rank – The ranking of the samples.

Return type:

numpy.ndarray

pyemb.hc.kendalltau_similarity(model, true_ranking)

Calculate the Kendall’s tau similarity between the model and true ranking.

Parameters:
  • model (AgglomerativeClustering) – The fitted model.

  • true_ranking (array-like, shape (n_samples, n_samples)) – The true ranking of the samples.

Returns:

The mean Kendall’s tau similarity between the model and true ranking.

Return type:

float

pyemb.hc.linkage_matrix(model)

Convert a hierarchical clustering model to a linkage matrix.

Parameters:
  • model (AgglomerativeClustering) – The fitted model.

  • get_heights (bool, optional) – Whether to return heights or counts.

  • max_height (float, optional) – The maximum height of the tree.

Returns:

The linkage matrix.

Return type:

ndarray

pyemb.hc.plot_dendrogram(model, dot_product_clustering=True, rescale=False, **kwargs)

Create the linkage matrix and plot the dendrogram.

Parameters:
  • model (AgglomerativeClustering) – The fitted model to plot.

  • **kwargs (dict) – Keyword arguments for dendrogram function.

Return type:

None
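
A minimal sketch, pairing the function with DotProductAgglomerativeClustering on illustrative data:

    import numpy as np

    from pyemb.hc import DotProductAgglomerativeClustering, plot_dendrogram

    rng = np.random.default_rng(9)
    X = rng.random((30, 4))

    model = DotProductAgglomerativeClustering()
    model.fit(X)

    # Extra keyword arguments are passed on to the underlying dendrogram function.
    plot_dendrogram(model, dot_product_clustering=True)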

pyemb.hc.sample_hyperbolicity(data, metric='dot_products', num_samples=5000)

Calculate the hyperbolicity of the data.

Parameters:
  • data (numpy.ndarray) – The data for which to calculate the hyperbolicity.

  • metric (str) – The metric to use. Options are ‘dot_products’, ‘cosine_similarity’, ‘precomputed’ or any metric supported by scikit-learn.

  • num_samples (int) – The number of samples used to estimate the hyperbolicity.

Returns:

The hyperbolicity of the data.

Return type:

float
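
A minimal sketch on an illustrative embedding:

    import numpy as np

    from pyemb.hc import sample_hyperbolicity

    rng = np.random.default_rng(10)
    data = rng.random((200, 10))

    delta = sample_hyperbolicity(data, metric="dot_products", num_samples=1000)
    print(delta)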

pyemb.simulation module

pyemb.simulation.SBM(n=200, B=array([[0.5, 0.5], [0.5, 0.4]]), pi=array([0.5, 0.5]))

Sample an adjacency matrix from a stochastic block model with block probability matrix B and community proportions pi.

pyemb.simulation.iid_SBM(n=200, T=2, B=array([[0.5, 0.5], [0.5, 0.4]]), pi=array([0.5, 0.5]))

Sample a series of T independent adjacency matrix snapshots from a stochastic block model.

pyemb.simulation.symmetrises(A, diag=False)

Symmetrise an adjacency matrix.

Module contents