molecular_simulations.analysis.autocluster module

Automated clustering module for molecular dynamics data.

This module provides tools for automatic clustering of molecular dynamics trajectory data using KMeans++ with dimensionality reduction.

class molecular_simulations.analysis.autocluster.GenericDataloader(data_files)[source]

Bases: object

Loads generic data stored in numpy arrays.

Stores the full dataset. Input files may have different numbers of rows, but all must share the same number of columns.

files

List of paths to the loaded data files.

data_array

The concatenated data array.

shapes

List of shapes of the original data files.

Parameters:

data_files (list[Union[Path, str]]) – List of paths to input data files (.npy format).

Example

>>> loader = GenericDataloader(['data1.npy', 'data2.npy'])
>>> print(loader.data.shape)

Initialize the dataloader with a list of data files.

Parameters:

data_files (list[Union[Path, str]]) – List of paths to input data files (.npy format).

__init__(data_files)[source]

Initialize the dataloader with a list of data files.

Parameters:

data_files (list[Union[Path, str]]) – List of paths to input data files (.npy format).

load_data()[source]

Load and concatenate data from all files into one array.

Stacks the data from all files vertically into one large array. If the resulting array has more than 2 dimensions, it is reshaped to 2D.

Return type:

None
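The stacking behavior described above can be sketched as follows. This is a minimal illustration of the documented logic, not the module's actual implementation; the arrays stand in for loaded `.npy` files.

```python
import numpy as np

# Files may have different row counts but must share the column count,
# so they can be vertically stacked into one array.
chunks = [np.zeros((5, 3)), np.ones((8, 3))]  # stand-ins for loaded .npy files
data = np.vstack(chunks)                      # shape (13, 3)

# Per the docstring, arrays with more than 2 dimensions are reshaped to 2D.
if data.ndim > 2:
    data = data.reshape(data.shape[0], -1)

print(data.shape)  # (13, 3)
```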

property data: numpy.ndarray

Return the internal data array.

Returns:

The concatenated and reshaped data array.

property shape: tuple[int]

Return the shape(s) of the input data.

Returns:

If all input files have the same shape, returns that shape. Otherwise, returns a list of shapes in the order the files were provided.

class molecular_simulations.analysis.autocluster.PeriodicDataloader(data_files)[source]

Bases: GenericDataloader

Dataloader that decomposes periodic data using sin and cos.

Extends GenericDataloader to handle periodic data by decomposing each feature into sin and cos components, effectively doubling the number of features.

Parameters:

data_files (list[Union[Path, str]]) – List of paths to input data files containing periodic data (e.g., dihedral angles).

Example

>>> loader = PeriodicDataloader(['dihedrals.npy'])
>>> # Original 10 features become 20 features

Initialize the periodic dataloader.

Parameters:

data_files (list[Union[Path, str]]) – List of paths to input data files.

__init__(data_files)[source]

Initialize the periodic dataloader.

Parameters:

data_files (list[Union[Path, str]]) – List of paths to input data files.

load_data()[source]

Load periodic data and remove periodicity.

Loads each file, applies the periodicity removal transformation, and stores the results.

Return type:

None

remove_periodicity(arr)[source]

Remove periodicity from each feature using sin and cos.

Each column i is expanded into two columns at indices 2*i and 2*i + 1. This preserves the circular nature of periodic variables such as angles.

Parameters:

arr (ndarray) – Data array with periodic features. Shape should be (n_samples, n_features).

Return type:

ndarray

Returns:

New array with shape (arr.shape[0], arr.shape[1] * 2) where each original feature is replaced by its cos and sin values.
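The sin/cos decomposition described above can be sketched with NumPy. This is an illustrative standalone function, not the module's implementation; it assumes the documented layout in which feature i maps to columns 2*i (cos) and 2*i + 1 (sin).

```python
import numpy as np

def remove_periodicity(arr: np.ndarray) -> np.ndarray:
    """Expand each periodic feature into cos/sin components."""
    out = np.empty((arr.shape[0], arr.shape[1] * 2))
    out[:, 0::2] = np.cos(arr)  # even columns: cos of feature i
    out[:, 1::2] = np.sin(arr)  # odd columns: sin of feature i
    return out

# Angles that wrap around (e.g. -179° vs. 181°) end up close in this space.
angles = np.array([[0.0, np.pi / 2]])
print(remove_periodicity(angles))  # approximately [[1., 0., 0., 1.]]
```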

class molecular_simulations.analysis.autocluster.AutoKMeans(data_directory, pattern='', dataloader=<class 'molecular_simulations.analysis.autocluster.GenericDataloader'>, max_clusters=10, stride=1, reduction_algorithm='PCA', reduction_kws={'n_components': 2})[source]

Bases: object

Automatic clustering using KMeans++ with dimensionality reduction.

Performs automatic clustering, including dimensionality reduction of the feature space and a parameter sweep over the number of clusters using silhouette-score optimization.

data

The loaded data array.

shape

Shape of the input data.

reduced

Dimensionality-reduced data.

centers

Cluster centers in reduced space.

labels

Cluster assignments for each data point.

cluster_centers

Mapping of cluster index to (replica, frame) tuple.

Parameters:
  • data_directory (Union[Path, str]) – Directory where data files can be found.

  • pattern (str) – Optional filename pattern to select a subset of .npy files using glob. Defaults to an empty string (all .npy files).

  • dataloader (Type[TypeVar(_T)]) – Which dataloader class to use. Defaults to GenericDataloader.

  • max_clusters (int) – Maximum number of clusters to test during parameter sweep. Defaults to 10.

  • stride (int) – Linear stride of number of clusters during parameter sweep. Helps avoid testing too many values. Defaults to 1.

  • reduction_algorithm (str) – Which dimensionality reduction algorithm to use. Currently only ‘PCA’ is supported. Defaults to ‘PCA’.

  • reduction_kws (dict[str, Any]) – Keyword arguments for the reduction algorithm. Defaults to {‘n_components’: 2}.

Example

>>> clusterer = AutoKMeans('data/', max_clusters=15)
>>> clusterer.run()
>>> print(clusterer.cluster_centers)

Initialize the automatic clustering workflow.

Parameters:
  • data_directory (Union[Path, str]) – Directory where data files can be found.

  • pattern (str) – Optional filename pattern for glob matching.

  • dataloader (Type[TypeVar(_T)]) – Dataloader class to use for loading data.

  • max_clusters (int) – Maximum number of clusters to test.

  • stride (int) – Step size for cluster number sweep.

  • reduction_algorithm (str) – Dimensionality reduction method.

  • reduction_kws (dict[str, Any]) – Arguments for the reduction algorithm.

__init__(data_directory, pattern='', dataloader=<class 'molecular_simulations.analysis.autocluster.GenericDataloader'>, max_clusters=10, stride=1, reduction_algorithm='PCA', reduction_kws={'n_components': 2})[source]

Initialize the automatic clustering workflow.

Parameters:
  • data_directory (Union[Path, str]) – Directory where data files can be found.

  • pattern (str) – Optional filename pattern for glob matching.

  • dataloader (Type[TypeVar(_T)]) – Dataloader class to use for loading data.

  • max_clusters (int) – Maximum number of clusters to test.

  • stride (int) – Step size for cluster number sweep.

  • reduction_algorithm (str) – Dimensionality reduction method.

  • reduction_kws (dict[str, Any]) – Arguments for the reduction algorithm.

run()[source]

Run the complete automated clustering workflow.

Executes dimensionality reduction, parameter sweep, center mapping, and saves results to disk.

Return type:

None

reduce_dimensionality()[source]

Perform dimensionality reduction on the data.

Uses the configured decomposition algorithm to reduce the feature space dimensionality.

Return type:

None

sweep_n_clusters(n_clusters)[source]

Sweep over number of clusters to find optimal clustering.

Uses the silhouette score to perform a parameter sweep over the number of clusters. Stores the cluster centers and labels for the best-performing parameterization.

Parameters:

n_clusters (list[int]) – List of cluster numbers to test.

Return type:

None
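A silhouette-based sweep like the one described above can be sketched with scikit-learn. This is a hedged illustration of the technique, not the class's actual implementation; the toy data and candidate k values are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Toy data: three well-separated 2D blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.1, size=(50, 2)) for loc in (0.0, 5.0, 10.0)])

# Fit KMeans++ for each candidate k and keep the highest silhouette score.
best_score, best_model = -1.0, None
for k in [2, 3, 4, 5]:
    model = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=0).fit(X)
    score = silhouette_score(X, model.labels_)
    if score > best_score:
        best_score, best_model = score, model

print(best_model.n_clusters)  # 3 should win for this toy data
```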

map_centers_to_frames()[source]

Map cluster centers to the closest actual data points.

Finds and stores the data point which lies closest to each cluster center, recording the replica and frame indices.

Return type:

None
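Mapping centers back to frames, as described above, amounts to a nearest-neighbor lookup followed by converting the flat row index into replica and frame indices. The sketch below assumes equal-length replicas; the variable names are illustrative, not the class's internals.

```python
import numpy as np

# Reduced-space data: two replicas of two frames each, flattened row-wise.
reduced = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
centers = np.array([[0.05, 0.0], [5.05, 5.0]])
frames_per_replica = 2

# Distance from every center to every data point; pick the closest row.
dists = np.linalg.norm(reduced[None, :, :] - centers[:, None, :], axis=-1)
nearest = dists.argmin(axis=1)  # flat row index per cluster center

# Convert flat index to (replica, frame), assuming equal-length replicas.
mapping = {i: divmod(int(idx), frames_per_replica) for i, idx in enumerate(nearest)}
print(mapping)  # {0: (0, 0), 1: (1, 0)}
```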

save_centers()[source]

Save cluster centers to a JSON file.

Writes the cluster_centers dictionary to ‘cluster_centers.json’ in the data directory.

Return type:

None

save_labels()[source]

Save cluster labels to a Parquet file.

Generates a Polars DataFrame containing system, frame, and cluster label assignments and saves to ‘cluster_assignments.parquet’.

Return type:

None

class molecular_simulations.analysis.autocluster.Decomposition(algorithm, **kwargs)[source]

Bases: object

Wrapper for dimensionality reduction algorithms.

Provides a thin wrapper around various dimensionality reduction algorithms with a scikit-learn-style fit/transform interface.

decomposer

The underlying decomposition algorithm instance.

Parameters:
  • algorithm (str) – Which algorithm to use. Options are ‘PCA’, ‘TICA’, and ‘UMAP’. Currently only ‘PCA’ is fully supported.

  • **kwargs – Algorithm-specific keyword arguments passed to the decomposer constructor.

Example

>>> decomp = Decomposition('PCA', n_components=3)
>>> reduced_data = decomp.fit_transform(data)

Initialize the decomposition wrapper.

Parameters:
  • algorithm (str) – Name of the decomposition algorithm.

  • **kwargs – Arguments passed to the algorithm constructor.

Raises:

KeyError – If an unsupported algorithm is specified.

__init__(algorithm, **kwargs)[source]

Initialize the decomposition wrapper.

Parameters:
  • algorithm (str) – Name of the decomposition algorithm.

  • **kwargs – Arguments passed to the algorithm constructor.

Raises:

KeyError – If an unsupported algorithm is specified.

fit(X)[source]

Fit the decomposer with data.

Parameters:

X (ndarray) – Array of input data with shape (n_samples, n_features).

Return type:

None

transform(X)[source]

Transform data using the fitted decomposer.

Parameters:

X (ndarray) – Array of input data with shape (n_samples, n_features).

Return type:

ndarray

Returns:

Reduced dimension data with shape (n_samples, n_components).

Raises:

sklearn.exceptions.NotFittedError – If called before fit().

fit_transform(X)[source]

Fit the decomposer and transform data in one step.

Parameters:

X (ndarray) – Array of input data with shape (n_samples, n_features).

Return type:

ndarray

Returns:

Reduced dimension data with shape (n_samples, n_components).
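The fit/transform pattern wrapped by this class can be sketched directly with scikit-learn's PCA (standing in here for the wrapped decomposer); the data shapes are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))

# fit() learns the components, transform() projects onto them;
# fit_transform() combines both steps in one call.
pca = PCA(n_components=2)
reduced = pca.fit_transform(X)
print(reduced.shape)  # (100, 2)
```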