API
Preprocessing
- pyemb.preprocessing.find_connected_components(A, attributes, n_components=None)
Find connected components of a multipartite graph.
- Parameters:
A (scipy.sparse.csr_matrix) – The adjacency matrix of the graph.
attributes (list of lists) – The attributes of the nodes. The first list contains the attributes of the nodes in rows. The second list contains the attributes of the nodes in the columns.
n_components (int) – The number of components to be found.
- Returns:
cc_As (list of scipy.sparse.csr_matrix) – The adjacency matrices of the connected components.
cc_attributes (list of lists) – The attributes of the nodes of the connected components. The first list contains the attributes of the nodes in the rows. The second list contains the attributes of the nodes in the columns.
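As a rough illustration of the idea (not pyemb's implementation), a rectangular adjacency matrix can be read as a bipartite graph in which row node i and column node j are linked whenever A[i, j] is non-zero. The sketch below labels the components of such a graph with a small union-find. The helper name `bipartite_components` is hypothetical, and a dense NumPy array stands in for the CSR matrix.

```python
import numpy as np

def bipartite_components(A):
    """Label connected components of a graph given as a rectangular adjacency matrix.

    Row node i and column node j are linked whenever A[i, j] != 0.
    Returns (row_labels, col_labels), with labels numbered from 0.
    """
    A = np.asarray(A)
    n_rows, n_cols = A.shape
    # Column node j is stored at index n_rows + j.
    parent = list(range(n_rows + n_cols))

    def find(x):
        # Walk up to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the two endpoints of every edge.
    for i, j in zip(*np.nonzero(A)):
        ri, rj = find(int(i)), find(n_rows + int(j))
        if ri != rj:
            parent[rj] = ri

    # Relabel roots as 0, 1, 2, ... in order of first appearance.
    roots = [find(x) for x in range(n_rows + n_cols)]
    relabel = {}
    labels = [relabel.setdefault(r, len(relabel)) for r in roots]
    return labels[:n_rows], labels[n_rows:]

A = np.array([[1, 0, 0],
              [1, 1, 0],
              [0, 0, 1]])
row_labels, col_labels = bipartite_components(A)
print(row_labels, col_labels)
```

Rows 0 and 1 share column 0, so they fall in the same component together with columns 0 and 1; row 2 and column 2 form a second component.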
- pyemb.preprocessing.find_subgraph(A, attributes, subgraph_attributes)
Find a subgraph of a multipartite graph.
- Parameters:
A (scipy.sparse.csr_matrix) – The adjacency matrix of the multipartite graph.
attributes (list of lists) – The attributes of the nodes. The first list contains the attributes of the nodes in rows. The second list contains the attributes of the nodes in the columns.
subgraph_attributes (list of lists) – The attributes of the nodes wanted in the subgraph. The first list contains the attributes of the nodes wanted in the rows. The second list contains the attributes of the nodes wanted in the columns.
- Returns:
subgraph_A (scipy.sparse.csr_matrix) – The adjacency matrix of the subgraph.
subgraph_attributes (list of lists) – The attributes of the nodes of the subgraph. The first list contains the attributes of the nodes in the rows. The second list contains the attributes of the nodes in the columns.
- pyemb.preprocessing.graph_from_dataframes(tables, relationship_cols, same_attribute=False, dynamic_col=None, weight_col=None, join_token='::')
Create a graph from a list of tables and relationships.
- Parameters:
tables (list of pandas.DataFrame) – The list of tables.
relationship_cols (list of lists) – The list of relationships, in one of two forms. Either each relationship is a list of two lists, each of which contains the names of the columns in the corresponding table; or it is a list of lists of column names, and each pair is looked for in each table.
same_attribute (bool) – Whether the entities in the columns are from the same attribute.
dynamic_col (list of str) – The list of dynamic columns.
weight_col (list of str) – The list of weight columns.
join_token (str) – The token used to join the names of the partitions and the names of the nodes.
- Returns:
A (scipy.sparse.csr_matrix) – The adjacency matrix of the graph.
attributes (list of lists) – The attributes of the nodes. The first list contains the attributes of the nodes in the rows. The second list contains the attributes of the nodes in the columns.
- pyemb.preprocessing.largest_cc_of(A, attributes, partition, dynamic=False)
Find the connected component containing the most nodes from a partition.
- Parameters:
A (scipy.sparse.csr_matrix) – The adjacency matrix of the graph.
attributes (list of lists) – The attributes of the nodes. The first list contains the attributes of the nodes in rows. The second list contains the attributes of the nodes in the columns.
partition (str) – The partition to be searched.
dynamic (bool) – Whether to find the connected component containing the most nodes from the dynamic part.
- Returns:
cc_A (scipy.sparse.csr_matrix) – The adjacency matrix of the connected component.
cc_attributes (list of lists) – The attributes of the nodes of the connected component. The first list contains the attributes of the nodes in the rows. The second list contains the attributes of the nodes in the columns.
- pyemb.preprocessing.text_matrix_and_attributes(data, column_name, remove_stopwords=True, clean_text=True, remove_email_addresses=False, update_stopwords=None, **kwargs)
Create a matrix from a column of text data.
- Parameters:
data (pandas.DataFrame) – The data to be used to create the matrix.
column_name (str) – The name of the column containing the text data.
remove_stopwords (bool) – Whether to remove stopwords.
clean_text (bool) – Whether to clean the text data.
remove_email_addresses (bool) – Whether to remove email addresses.
update_stopwords (list of str) – The list of additional stopwords to be removed.
kwargs (dict) – Other arguments to be passed to sklearn.feature_extraction.text.TfidfVectorizer.
- Returns:
Y (numpy.ndarray) – The matrix created from the text data.
attributes (list of lists) – The attributes of the nodes. The first list contains the attributes of the nodes in rows. The second list contains the attributes of the nodes in the columns.
- pyemb.preprocessing.time_series_matrix_and_attributes(data, time_col, drop_nas=True)
Create a matrix from a time series.
- Parameters:
data (pandas.DataFrame) – The data to be used to create the matrix.
time_col (str) – The name of the column containing the time information.
drop_nas (bool) – Whether to drop rows with missing values.
- Returns:
Y (numpy.ndarray) – The matrix created from the time series.
attributes (list of lists) – The attributes of the nodes. The first list contains the attributes of the nodes in rows. The second list contains the attributes of the nodes in the columns.
- pyemb.preprocessing.to_networkx(A, attributes, symmetric=None)
Convert a multipartite graph to a networkx graph.
Embedding
- pyemb.embedding.ISE(As, d, flat=True, procrustes=False, consistent_orientation=True)
Computes the independent spectral embedding (ISE), that is, a separate spectral embedding for each adjacency snapshot.
- Parameters:
As (numpy.ndarray) – An adjacency matrix series of shape (T, n, n).
d (int) – Embedding dimension.
flat (bool, optional) – Whether to return a flat embedding (n*T, d) or a 3D embedding (T, n, d). Default is True.
procrustes (bool, optional) – Whether to align each embedding with the previous embedding. Default is False.
consistent_orientation (bool, optional) – Whether to ensure the eigenvector orientation is consistent. Default is True.
- Returns:
Dynamic embedding of shape (n*T, d) or (T, n, d).
- Return type:
numpy.ndarray
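The following is a minimal sketch of per-snapshot spectral embedding, assuming symmetric adjacency matrices and scaling eigenvectors by the square root of the eigenvalue magnitudes. The helper names `spectral_embed` and `independent_embed` are hypothetical, and the `procrustes` and `consistent_orientation` options are not reproduced. It is mainly meant to show how the flat (n*T, d) and 3D (T, n, d) layouts relate, assuming the flat layout stacks snapshots in time order.

```python
import numpy as np

def spectral_embed(A, d):
    """Embed one symmetric adjacency matrix into d dimensions."""
    vals, vecs = np.linalg.eigh(A)
    top = np.argsort(np.abs(vals))[::-1][:d]   # d largest eigenvalues by magnitude
    return vecs[:, top] * np.sqrt(np.abs(vals[top]))

def independent_embed(As, d, flat=True):
    """Embed each snapshot of a (T, n, n) series separately."""
    X = np.stack([spectral_embed(A, d) for A in As])   # (T, n, d)
    T, n, _ = X.shape
    return X.reshape(T * n, d) if flat else X

# Toy series: T random symmetric graphs on n nodes without self-loops.
rng = np.random.default_rng(0)
T, n, d = 3, 20, 2
upper = np.triu(rng.random((T, n, n)) < 0.3, k=1).astype(float)
As = upper + upper.transpose(0, 2, 1)

X_flat = independent_embed(As, d)              # shape (n*T, d)
X_3d = independent_embed(As, d, flat=False)    # shape (T, n, d)
print(X_flat.shape, X_3d.shape)
```

Because each snapshot is embedded on its own, the resulting embeddings are only comparable across time up to an orthogonal transformation, which is the kind of ambiguity an alignment step such as Procrustes addresses.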
- pyemb.embedding.OMNI(As, d, flat=True, sparse_matrix=False)
Computes the omnibus dynamic spectral embedding. For more details, see: https://arxiv.org/abs/1705.09355
- Parameters:
As (numpy.ndarray) – Adjacency matrices of shape (T, n, n).
d (int) – Embedding dimension.
flat (bool, optional) – Whether to return a flat embedding (n*T, d) or a 3D embedding (T, n, d). Default is True.
sparse_matrix (bool, optional) – Whether to use sparse matrices. Default is False.
- Returns:
Dynamic embedding of shape (n*T, d) or (T, n, d).
- Return type:
numpy.ndarray
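For orientation, here is a hedged sketch of the omnibus construction from the linked paper: the series is assembled into one (n*T, n*T) matrix whose (s, t) block is the average of snapshots s and t, and that matrix is then embedded spectrally. The names `omnibus_matrix` and `omni_embed` are hypothetical, selecting eigenvalues by magnitude is a choice made here, and this is not necessarily how pyemb implements it.

```python
import numpy as np

def omnibus_matrix(As):
    """Build the (n*T, n*T) matrix whose (s, t) block is (A_s + A_t) / 2."""
    As = np.asarray(As, dtype=float)
    T, n, _ = As.shape
    blocks = (As[:, None] + As[None, :]) / 2   # (T, T, n, n)
    # Reorder axes to (s, i, t, j) so that reshape lays the blocks out in a grid.
    return blocks.transpose(0, 2, 1, 3).reshape(T * n, T * n)

def omni_embed(As, d, flat=True):
    """Spectrally embed the omnibus matrix into d dimensions."""
    As = np.asarray(As, dtype=float)
    T, n, _ = As.shape
    vals, vecs = np.linalg.eigh(omnibus_matrix(As))
    top = np.argsort(np.abs(vals))[::-1][:d]   # d largest eigenvalues by magnitude
    X = vecs[:, top] * np.sqrt(np.abs(vals[top]))   # (n*T, d)
    return X if flat else X.reshape(T, n, d)

# Toy series of symmetric graphs.
rng = np.random.default_rng(0)
T, n, d = 3, 10, 2
upper = np.triu(rng.random((T, n, n)) < 0.3, k=1).astype(float)
As = upper + upper.transpose(0, 2, 1)

M = omnibus_matrix(As)
X = omni_embed(As, d)
print(M.shape, X.shape)
```

The omnibus matrix grows as (n*T) x (n*T), so this dense sketch is only practical for small series.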
- pyemb.embedding.UASE(As, d, flat=True, sparse_matrix=False, return_left=False)
Computes the unfolded adjacency spectral embedding (UASE). For more details, see: https://arxiv.org/abs/2007.10455 https://arxiv.org/abs/2106.01282
- Parameters:
As (numpy.ndarray) – An adjacency matrix series of shape (T, n, n).
d (int) – Embedding dimension.
flat (bool, optional) – Whether to return a flat embedding (n*T, d) or a 3D embedding (T, n, d). Default is True.
sparse_matrix (bool, optional) – Whether the adjacency matrices are sparse. Default is False.
return_left (bool, optional) – Whether to return the left (anchor) embedding as well as the right (dynamic) embedding. Default is False.
- Returns:
numpy.ndarray – Dynamic embedding of shape (n*T, d) or (T, n, d).
numpy.ndarray, optional – Anchor embedding of shape (n, d) if return_left is True.
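The shapes above can be understood from a small sketch of the unfolding idea, assuming the snapshots are placed side by side into an n x (n*T) matrix and a truncated SVD is taken. The function name `unfolded_embed` is hypothetical, the square-root scaling of singular values is an assumption, and dense NumPy arrays are used throughout.

```python
import numpy as np

def unfolded_embed(As, d, flat=True):
    """Return (dynamic, anchor) embeddings from the column-wise unfolding of As."""
    As = np.asarray(As, dtype=float)
    T, n, _ = As.shape
    unfolded = np.hstack(list(As))                     # (n, n*T): [A_1 | ... | A_T]
    U, s, Vt = np.linalg.svd(unfolded, full_matrices=False)
    anchor = U[:, :d] * np.sqrt(s[:d])                 # left embedding, (n, d)
    dynamic = Vt[:d].T * np.sqrt(s[:d])                # right embedding, (n*T, d)
    return (dynamic if flat else dynamic.reshape(T, n, d)), anchor

# Toy series of symmetric graphs.
rng = np.random.default_rng(0)
T, n, d = 3, 10, 2
upper = np.triu(rng.random((T, n, n)) < 0.3, k=1).astype(float)
As = upper + upper.transpose(0, 2, 1)

dynamic, anchor = unfolded_embed(As, d)
print(dynamic.shape, anchor.shape)
```

The left factor has one row per node, which matches the (n, d) anchor embedding, while the right factor has one row per node per time point, which matches the (n*T, d) dynamic embedding.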
- pyemb.embedding.dyn_embed(As, d=50, method='UASE', regulariser='auto', flat=True)
Computes the dynamic embedding using a specified method.
- Parameters:
As (numpy.ndarray or list) – An adjacency matrix series which is either a numpy array of shape (T, n, n), a list of numpy arrays of shape (n, n), or a series of CSR matrices.
d (int, optional) – Embedding dimension. Default is 50.
method (str, optional) – The embedding method to use. Options are “ISE”, “ISE PROCRUSTES”, “UASE”, “OMNI”, “ULSE”, “URLSE”, “RANDOM”. Default is “UASE”.
regulariser (float or "auto", optional) – Regularisation parameter for the Laplacian matrix. If “auto”, the regulariser is set to the average node degree. Default is “auto”.
flat (bool, optional) – Whether to return a flat embedding (n*T, d) or a 3D embedding (T, n, d). Default is True.
- Returns:
Dynamic embedding of shape (n*T, d) or (T, n, d).
- Return type:
numpy.ndarray
- Raises:
Exception – If the specified method is not recognized.
- pyemb.embedding.eigen_decomp(A, dim=None)
Perform eigenvalue decomposition of a matrix.
- Parameters:
A (numpy.ndarray) – The matrix to be decomposed.
dim (int) – The number of dimensions to be returned.
- Returns:
eigenvalues (numpy.ndarray) – The eigenvalues.
eigenvectors (numpy.ndarray) – The eigenvectors.
- pyemb.embedding.embed(Y, d=50, version='sqrt', return_right=False, flat=True, make_laplacian=False, regulariser=0)
Embed a matrix.
- Parameters:
Y (numpy.ndarray or list of numpy.ndarray) – The matrix to embed.
d (int) – The number of dimensions to embed into.
version (str) – The version of the embedding. Options are ‘full’ or ‘sqrt’ (default).
return_right (bool) – Whether to return the right embedding.
flat (bool) – Whether to return a flat embedding (n*T, d) or a 3D embedding (T, n, d).
make_laplacian (bool) – Whether to use the Laplacian matrix.
regulariser (float) – The regulariser to be added to the degrees of the nodes. (only used if make_laplacian=True)
- Returns:
left_embedding (numpy.ndarray) – The left embedding.
right_embedding (numpy.ndarray) – The right embedding.
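The reference does not say what distinguishes the two `version` options. One plausible reading, stated here purely as an assumption, is that ‘sqrt’ scales the singular vectors by the square roots of the singular values while ‘full’ scales them by the singular values themselves. The sketch below illustrates that reading with a hypothetical helper `svd_embed`; it is not taken from pyemb.

```python
import numpy as np

def svd_embed(Y, d, version="sqrt"):
    """Left and right d-dimensional embeddings of Y from a truncated SVD."""
    U, s, Vt = np.linalg.svd(np.asarray(Y, dtype=float), full_matrices=False)
    if version == "sqrt":
        scale = np.sqrt(s[:d])   # assumption: split the singular values evenly
    elif version == "full":
        scale = s[:d]            # assumption: apply the full singular values
    else:
        raise ValueError("version must be 'full' or 'sqrt'")
    return U[:, :d] * scale, Vt[:d].T * scale

rng = np.random.default_rng(0)
Y = rng.random((6, 4))
left_sqrt, right_sqrt = svd_embed(Y, 4, version="sqrt")
left_full, right_full = svd_embed(Y, 2, version="full")
print(left_sqrt.shape, right_sqrt.shape)
```

Under the ‘sqrt’ reading, the product of the left embedding and the transposed right embedding approximates Y, and reproduces it exactly when d equals the rank.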
- pyemb.embedding.regularised_ULSE(As, d, regulariser='auto', flat=True, sparse_matrix=False, return_left=False)
Computes the regularised unfolded Laplacian spectral embedding (regularised ULSE).
- Parameters:
As (numpy.ndarray) – An adjacency matrix series of shape (T, n, n).
d (int) – Embedding dimension.
regulariser (float, optional) – Regularisation parameter for the Laplacian matrix. By default, this is the average node degree.
flat (bool, optional) – Whether to return a flat embedding (n*T, d) or a 3D embedding (T, n, d). Default is True.
sparse_matrix (bool, optional) – Whether the adjacency matrices are sparse. Default is False.
return_left (bool, optional) – Whether to return the left (anchor) embedding as well as the right (dynamic) embedding. Default is False.
- Returns:
numpy.ndarray – Dynamic embedding of shape (n*T, d) or (T, n, d).
numpy.ndarray, optional – Anchor embedding of shape (n, d) if return_left is True.
- pyemb.embedding.wasserstein_dimension_select(Y, dims, split=0.5)
Select the number of dimensions using Wasserstein distances.
- Parameters:
Y (numpy.ndarray) – The array of matrices.
dims (list of int) – The dimensions to be considered.
split (float) – The proportion of the data to be used for training.
- Returns:
ws – The Wasserstein distances between the training and test data for each number of dimensions. The dimension recommended is the one with the smallest Wasserstein distance.
- Return type:
list of numpy.ndarray
Plotting
- pyemb.plotting.get_fig_legend_handles_labels(fig)
Get the legend handles and labels from a figure.
- pyemb.plotting.quick_plot(embedding, n, T=1, node_labels=None, **kwargs)
Produces an interactive plot of an embedding. If the embedding is dynamic (i.e. T > 1), then the embedding will be animated over time.
- Parameters:
embedding (numpy.ndarray (n*T, d) or (T, n, d)) – The dynamic embedding.
n (int) – The number of nodes.
T (int (optional)) – The number of time points (> 1 animates the embedding).
node_labels (list of length n (optional)) – The labels of the nodes (time-invariant).
return_df (bool (optional)) – Option to return the plotting dataframe.
title (str (optional)) – The title of the plot.
- pyemb.plotting.snapshot_plot(embedding, n=None, node_labels=None, c=None, idx_of_interest=None, max_cols=4, add_legend=False, title=None, sharex=False, sharey=False, figsize_scale=5, figsize=None, bbox_to_anchor=(0.5, -0.1), loc='lower center', max_legend_cols=4, **kwargs)
Plot a snapshot of an embedding at a given time point.
- Parameters:
embedding (np.ndarray or list of np.ndarray) – The embedding to plot.
n (int (optional)) – The number of nodes in the graph. Should be provided if the embedding is a single numpy array and n is not the first dimension of the array.
node_labels (list (optional)) – The labels of the nodes. Default is None.
c (list or dict (optional)) – The colors of the nodes. If a list is provided, it should be a list of length n. If a dictionary is provided, it should map each unique label to a colour.
idx_of_interest (list (optional)) – The indices which to plot. For example if embedding is a list, idx_of_interest can be used to plot only a subset of the embeddings. By default, all embeddings are plotted.
max_cols (int (optional)) – The maximum number of columns in the plot. Default is 4.
add_legend (bool (optional)) – Whether to add a legend to the plot. Default is False.
title (str or list of str (optional)) – The title of the plot. If a list is provided, each element will be the title of a subplot. Default is None.
sharex (bool (optional)) – Whether to share the x-axis across subplots. Default is False.
sharey (bool (optional)) – Whether to share the y-axis across subplots. Default is False.
figsize_scale (int (optional)) – The scale of the figure size. Default is 5.
figsize (tuple (optional)) – The figure size. Default is None.
bbox_to_anchor (tuple (optional)) – The bbox_to_anchor parameter for the legend. Default is (0.5, -0.1).
loc (str (optional)) – The location of the legend. Default is ‘lower center’.
max_legend_cols (int (optional)) – The maximum number of columns in the legend. Default is 4.
kwargs (dict (optional)) – Additional keyword arguments for the scatter plot.
- Returns:
fig – The figure object.
- Return type:
matplotlib.figure.Figure
Hierarchical Clustering
- class pyemb.hc.ConstructTree(point_cloud=None, model=None, epsilon=0.25)
Bases:
object
Construct a condensed tree from a hierarchical clustering model.
- Parameters:
model (AgglomerativeClustering, optional) – The fitted model.
point_cloud (ndarray, optional) – The data points.
epsilon (float, optional) – The threshold for condensing the tree.
**kwargs (dict, optional) – Additional keyword arguments.
- model
The fitted model.
- Type:
AgglomerativeClustering
- point_cloud
The data points.
- Type:
ndarray
- epsilon
The threshold for condensing the tree.
- Type:
float
- linkage
The linkage matrix.
- Type:
ndarray
- tree
The condensed tree.
- Type:
nx.Graph
- collapsed_branches
The collapsed branches.
- Type:
dict
- fit(**kwargs)
Fit the condensed tree.
- plot(labels=None, colours=None, colour_threshold=0.5, prog='sfdp', forceatlas_iter=250, node_size=10, scaling_node_size=1, **kwargs)
Plot the condensed tree.
- class pyemb.hc.DotProductAgglomerativeClustering(metric='dot_product', linkage='average', distance_threshold=0, n_clusters=None)
Bases:
object
Perform hierarchical clustering using dot product as the metric.
- Parameters:
metric (str, optional) – The metric to use for clustering.
linkage (str, optional) – The linkage criterion to use.
distance_threshold (float, optional) – The linkage distance threshold above which clusters will not be merged.
n_clusters (int, optional) – The number of clusters to find.
- distances_
The distances between the clusters.
- Type:
ndarray
- children_
The children of each non-leaf node.
- Type:
ndarray
- labels_
The labels of each point.
- Type:
ndarray
- n_clusters_
The number of clusters.
- Type:
int
- n_connected_components_
The number of connected components.
- Type:
int
- n_leaves_
The number of leaves.
- Type:
int
- n_features_in_
The number of features seen during fit.
- Type:
int
- fit(X)
- pyemb.hc.branch_lengths(Z, point_cloud=None)
Calculate branch lengths for a hierarchical clustering dendrogram.
- Parameters:
Z (ndarray)
point_cloud (ndarray)
- Returns:
Matrix of branch lengths.
- Return type:
ndarray
- pyemb.hc.cophenetic_distances(Z)
Calculate the cophenetic distances between each observation and internal nodes.
- Parameters:
Z (ndarray) – The linkage matrix.
- Returns:
d – The full distance matrix, of shape (2n-1) x (2n-1).
- Return type:
ndarray
- pyemb.hc.find_descendents(Z, node, desc=None, just_leaves=True)
Find all descendants of a given node in a hierarchical clustering tree.
- Parameters:
Z (ndarray)
node (int)
desc (dict, optional)
just_leaves (bool, optional) – Whether to include only leaf nodes.
- Returns:
List of descendants.
- Return type:
list
- pyemb.hc.get_ranking(model)
Get the ranking of the samples.
- Parameters:
model (AgglomerativeClustering) – The fitted model.
- Returns:
mh_rank – The ranking of the samples.
- Return type:
numpy.ndarray
- pyemb.hc.kendalltau_similarity(model, true_ranking)
Calculate the Kendall’s tau similarity between the model and true ranking.
- Parameters:
model (AgglomerativeClustering) – The fitted model.
true_ranking (array-like, shape (n_samples, n_samples)) – The true ranking of the samples.
- Returns:
The mean Kendall’s tau similarity between the model and true ranking.
- Return type:
float
- pyemb.hc.linkage_matrix(model)
Convert a hierarchical clustering model to a linkage matrix.
- Parameters:
model (AgglomerativeClustering) – The fitted model.
get_heights (bool, optional) – Whether to return heights or counts.
max_height (float, optional) – The maximum height of the tree.
- Returns:
The linkage matrix.
- Return type:
ndarray
- pyemb.hc.plot_dendrogram(model, dot_product_clustering=True, rescale=False, **kwargs)
Create the linkage matrix and then plot the dendrogram.
- Parameters:
model (AgglomerativeClustering) – The fitted model to plot.
**kwargs (dict) – Keyword arguments for dendrogram function.
- Return type:
None
- pyemb.hc.sample_hyperbolicity(data, metric='dot_products', num_samples=5000)
Calculate the hyperbolicity of the data.
- Parameters:
data (numpy.ndarray) – The data to calculate the hyperbolicity.
metric (str) – The metric to use. Options are ‘dot_products’, ‘cosine_similarity’, ‘precomputed’ or any metric supported by scikit-learn.
num_samples (int) – The number of samples to calculate the hyperbolicity.
- Returns:
The hyperbolicity of the data.
- Return type:
float
Matrix and Graph Tools
- pyemb.tools.degree_correction(embedding)
Perform degree correction.
- Parameters:
embedding (numpy.ndarray) – The embedding of the graph, either 2D or 3D.
- Returns:
embedding_dc – The degree-corrected embedding.
- Return type:
numpy.ndarray
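A common form of degree correction for spectral embeddings is to project each node's embedding vector onto the unit sphere, so that only its direction is kept. The sketch below assumes that form; the name `normalise_rows` is hypothetical and the handling of all-zero rows (left at zero) is a choice made here rather than documented behaviour.

```python
import numpy as np

def normalise_rows(embedding):
    """Scale every embedding vector (last axis) to unit length; zero vectors stay zero."""
    X = np.asarray(embedding, dtype=float)
    norms = np.linalg.norm(X, axis=-1, keepdims=True)
    out = np.zeros_like(X)
    # Divide only where the norm is non-zero to avoid division by zero.
    np.divide(X, norms, out=out, where=norms > 0)
    return out

X = np.array([[3.0, 4.0],
              [0.0, 0.0],
              [1.0, 0.0]])
X_dc = normalise_rows(X)
print(X_dc)
```

Since the normalisation acts on the last axis, the same helper accepts both 2D (n, d) and 3D (T, n, d) embeddings.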
- pyemb.tools.recover_subspaces(embedding, attributes)
Recover the subspaces for each partition from an embedding.
- Parameters:
embedding (numpy.ndarray) – The embedding of the graph.
attributes (list of lists) – The attributes of the nodes. The first list contains the attributes of the nodes in rows. The second list contains the attributes of the nodes in the columns.
- Returns:
partition_embeddings (dict) – The embeddings of the partitions.
partition_attributes (dict) – The attributes of the nodes in the partitions.
- pyemb.tools.select(embedding, attributes, select_attributes)
Select the portion of the embedding, and the corresponding attributes, associated with a set of attributes.
- Parameters:
embedding (numpy.ndarray) – The embedding of the graph.
attributes (list of lists) – The attributes of the nodes. The first list contains the attributes of the nodes in rows. The second list contains the attributes of the nodes in the columns.
select_attributes (dict or list of dicts) – The attributes to select by. If a list of dicts is provided, the intersection of the nodes satisfying each dict is selected.
- Returns:
selected_X (numpy.ndarray) – The selected embedding.
selected_attributes (list of lists) – The attributes of the selected nodes.
- pyemb.tools.to_laplacian(A, regulariser=0)
Convert an adjacency matrix to a Laplacian matrix.
- Parameters:
A (scipy.sparse.csr_matrix) – The adjacency matrix.
regulariser (float or ‘auto’) – The regulariser to be added to the degrees of the nodes. If ‘auto’, the regulariser is set to the mean of the degrees.
- Returns:
L – The Laplacian matrix.
- Return type:
scipy.sparse.csr_matrix
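As a sketch of what such a conversion can look like, the code below computes the symmetric-normalised form D_r^{-1/2} A D_c^{-1/2}, with the regulariser added to the row and column degrees before inverting. This is an assumed definition rather than a statement of pyemb's code: the name `normalised_laplacian` is hypothetical, a dense array replaces the CSR matrix, and ‘auto’ is taken to mean the mean over row and column degrees together.

```python
import numpy as np

def normalised_laplacian(A, regulariser=0.0):
    """Return D_r^{-1/2} A D_c^{-1/2} with the regulariser added to all degrees."""
    A = np.asarray(A, dtype=float)
    row_deg = A.sum(axis=1)
    col_deg = A.sum(axis=0)
    if isinstance(regulariser, str):
        if regulariser != "auto":
            raise ValueError("regulariser must be a number or 'auto'")
        # Assumption: 'auto' uses the mean of all row and column degrees.
        regulariser = np.concatenate([row_deg, col_deg]).mean()

    def inv_sqrt(x):
        # 1 / sqrt(x), leaving entries with x == 0 at zero.
        out = np.zeros_like(x)
        np.divide(1.0, np.sqrt(x), out=out, where=x > 0)
        return out

    left = inv_sqrt(row_deg + regulariser)
    right = inv_sqrt(col_deg + regulariser)
    return left[:, None] * A * right[None, :]

A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]])
L_plain = normalised_laplacian(A)
L_reg = normalised_laplacian(A, regulariser=1.0)
print(L_plain)
```

Adding a positive regulariser shrinks every entry and keeps low-degree nodes from dominating the normalisation.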
- pyemb.tools.varimax(Phi, gamma=1, q=20, tol=1e-06)
Perform varimax rotation.
- Parameters:
Phi (numpy.ndarray) – The matrix to rotate.
gamma (float, optional) – The gamma parameter.
q (int, optional) – The number of iterations.
tol (float, optional) – The tolerance.
- Returns:
The rotated matrix.
- Return type:
numpy.ndarray
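For readers unfamiliar with the parameters, the following is a sketch of the widely used iterative varimax algorithm with the same argument names, where `q` caps the number of iterations and `tol` is the relative improvement below which iteration stops. It is offered as an illustration under those assumptions, with the hypothetical name `varimax_rotate`, and may differ in detail from pyemb's function.

```python
import numpy as np

def varimax_rotate(Phi, gamma=1.0, q=20, tol=1e-6):
    """Rotate the columns of Phi to maximise the varimax criterion."""
    Phi = np.asarray(Phi, dtype=float)
    p, k = Phi.shape
    R = np.eye(k)          # current orthogonal rotation
    crit = 0.0
    for _ in range(q):
        crit_old = crit
        L = Phi @ R
        # Gradient-like matrix of the criterion at the current rotation.
        G = Phi.T @ (L**3 - (gamma / p) * L @ np.diag(np.sum(L**2, axis=0)))
        U, s, Vt = np.linalg.svd(G)
        R = U @ Vt         # closest orthogonal matrix to G
        crit = s.sum()
        # Stop once the criterion no longer improves by more than tol.
        if crit_old != 0 and crit / crit_old < 1 + tol:
            break
    return Phi @ R

rng = np.random.default_rng(0)
Phi = rng.normal(size=(30, 3))
rotated = varimax_rotate(Phi)
print(rotated.shape)
```

The result is Phi multiplied by an orthogonal matrix, so row norms and pairwise inner products between rows are unchanged.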
Simulation
- pyemb.simulation.SBM(n=200, B=array([[0.5, 0.5], [0.5, 0.4]]), pi=array([0.5, 0.5]))
- pyemb.simulation.iid_SBM(n=200, T=2, B=array([[0.5, 0.5], [0.5, 0.4]]), pi=array([0.5, 0.5]))
- pyemb.simulation.symmetrises(A, diag=False)
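The simulation functions are listed without descriptions. Assuming `SBM` refers to the standard stochastic block model, with `B` the matrix of between-community edge probabilities and `pi` the community proportions, a draw from that model can be sketched as follows. The function name `sample_sbm`, the `seed` argument, the undirected and loop-free output, and the returned community labels are all assumptions made for this illustration.

```python
import numpy as np

def sample_sbm(n=200, B=None, pi=None, seed=None):
    """Sample an undirected stochastic block model graph; returns (A, z)."""
    rng = np.random.default_rng(seed)
    B = np.array([[0.5, 0.5], [0.5, 0.4]]) if B is None else np.asarray(B, dtype=float)
    pi = np.array([0.5, 0.5]) if pi is None else np.asarray(pi, dtype=float)
    z = rng.choice(len(pi), size=n, p=pi)          # community of each node
    P = B[z][:, z]                                 # (n, n) edge probabilities
    upper = np.triu(rng.random((n, n)) < P, k=1)   # sample each node pair once
    A = (upper | upper.T).astype(int)              # symmetric, no self-loops
    return A, z

A, z = sample_sbm(n=50, seed=0)
print(A.shape, np.bincount(z, minlength=2))
```

Sampling only the strict upper triangle and mirroring it guarantees a symmetric adjacency matrix with an empty diagonal.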