molecular_simulations.analysis.autocluster module¶
Automated clustering module for molecular dynamics data.
This module provides tools for automatic clustering of molecular dynamics trajectory data using KMeans++ with dimensionality reduction.
- class molecular_simulations.analysis.autocluster.GenericDataloader(data_files)[source]¶
Bases: object

Loads generic data stored in numpy arrays.

Stores the full dataset; input files may have varying numbers of rows, but all must be consistent in the columnar dimension.
- files¶
List of paths to the loaded data files.
- data_array¶
The concatenated data array.
- shapes¶
List of shapes of the original data files.
Example
>>> loader = GenericDataloader(['data1.npy', 'data2.npy'])
>>> print(loader.data.shape)
Initialize the dataloader with a list of data files.
- load_data()[source]¶
Load and concatenate data from all files into one array.
Lumps data into one large array by vertical stacking. If the resulting array has more than 2 dimensions, it is reshaped to 2D.
- Return type:
  None
- property data: numpy.ndarray¶
Return the internal data array.
- Returns:
The concatenated and reshaped data array.
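The loading behavior described above — vertical stacking of per-file arrays that share a column count, with anything above 2D flattened — can be sketched in plain numpy. The file names and shapes here are hypothetical, and this is an illustration of the technique, not the module's internals:

```python
import tempfile
from pathlib import Path

import numpy as np

# Create two hypothetical .npy files with different row counts
# but the same number of columns, as GenericDataloader expects.
tmp = Path(tempfile.mkdtemp())
np.save(tmp / "data1.npy", np.random.rand(100, 5))
np.save(tmp / "data2.npy", np.random.rand(250, 5))

files = sorted(tmp.glob("*.npy"))
arrays = [np.load(f) for f in files]
shapes = [a.shape for a in arrays]   # bookkeeping, analogous to `shapes`

# Vertically stack into one large array; flatten to 2D if needed.
data = np.vstack(arrays)
if data.ndim > 2:
    data = data.reshape(-1, data.shape[-1])

print(data.shape)  # (350, 5)
```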
- class molecular_simulations.analysis.autocluster.PeriodicDataloader(data_files)[source]¶
Bases: GenericDataloader

Dataloader that decomposes periodic data using sin and cos.

Extends GenericDataloader to handle periodic data by decomposing each feature into sin and cos components, effectively doubling the number of features.
- Parameters:
  data_files (list[Union[Path, str]]) – List of paths to input data files containing periodic data (e.g., dihedral angles).
Example
>>> loader = PeriodicDataloader(['dihedrals.npy'])
>>> # Original 10 features become 20 features
Initialize the periodic dataloader.
- load_data()[source]¶
Load periodic data and remove periodicity.
Loads each file, applies the periodicity removal transformation, and stores the results.
- Return type:
  None
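The sin/cos decomposition is straightforward to illustrate with numpy. This sketch, using a hypothetical array of dihedral angles in radians, shows the feature doubling:

```python
import numpy as np

# Hypothetical dihedral angles: 100 frames x 10 periodic features (radians).
angles = np.random.uniform(-np.pi, np.pi, size=(100, 10))

# Decompose each periodic feature into sin and cos components.
# This removes the discontinuity at +/- pi and doubles the feature count.
decomposed = np.hstack([np.sin(angles), np.cos(angles)])

print(decomposed.shape)  # (100, 20)
```

Because sin and cos are continuous in the angle, points near -π and +π end up close together in the decomposed space, which is what makes the transformed features safe to cluster.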
- class molecular_simulations.analysis.autocluster.AutoKMeans(data_directory, pattern='', dataloader=<class 'molecular_simulations.analysis.autocluster.GenericDataloader'>, max_clusters=10, stride=1, reduction_algorithm='PCA', reduction_kws={'n_components': 2})[source]¶
Bases: object

Automatic clustering using KMeans++ with dimensionality reduction.

Performs automatic clustering, including dimensionality reduction of the feature space and a parameter sweep over the number of clusters using silhouette-score optimization.
- data¶
The loaded data array.
- shape¶
Shape of the input data.
- reduced¶
Dimensionality-reduced data.
- centers¶
Cluster centers in reduced space.
- labels¶
Cluster assignments for each data point.
- cluster_centers¶
Mapping of cluster index to (replica, frame) tuple.
- Parameters:
  data_directory (Union[Path, str]) – Directory where data files can be found.
  pattern (str) – Optional filename pattern to select a subset of .npy files using glob. Defaults to an empty string (all .npy files).
  dataloader (Type[_T]) – Which dataloader class to use. Defaults to GenericDataloader.
  max_clusters (int) – Maximum number of clusters to test during the parameter sweep. Defaults to 10.
  stride (int) – Linear stride over the number of clusters during the parameter sweep; helps avoid testing too many values. Defaults to 1.
  reduction_algorithm (str) – Which dimensionality reduction algorithm to use. Currently only 'PCA' is supported. Defaults to 'PCA'.
  reduction_kws (dict[str, Any]) – Keyword arguments for the reduction algorithm. Defaults to {'n_components': 2}.
Example
>>> clusterer = AutoKMeans('data/', max_clusters=15)
>>> clusterer.run()
>>> print(clusterer.cluster_centers)
Initialize the automatic clustering workflow.
- Parameters:
  data_directory (Union[Path, str]) – Directory where data files can be found.
  pattern (str) – Optional filename pattern for glob matching.
  dataloader (Type[_T]) – Dataloader class to use for loading data.
  max_clusters (int) – Maximum number of clusters to test.
  stride (int) – Step size for the cluster-number sweep.
  reduction_algorithm (str) – Dimensionality reduction method.
  reduction_kws (dict[str, Any]) – Arguments for the reduction algorithm.
- __init__(data_directory, pattern='', dataloader=<class 'molecular_simulations.analysis.autocluster.GenericDataloader'>, max_clusters=10, stride=1, reduction_algorithm='PCA', reduction_kws={'n_components': 2})[source]¶
Initialize the automatic clustering workflow.
- Parameters:
  data_directory (Union[Path, str]) – Directory where data files can be found.
  pattern (str) – Optional filename pattern for glob matching.
  dataloader (Type[_T]) – Dataloader class to use for loading data.
  max_clusters (int) – Maximum number of clusters to test.
  stride (int) – Step size for the cluster-number sweep.
  reduction_algorithm (str) – Dimensionality reduction method.
  reduction_kws (dict[str, Any]) – Arguments for the reduction algorithm.
- run()[source]¶
Run the complete automated clustering workflow.
Executes dimensionality reduction, parameter sweep, center mapping, and saves results to disk.
- Return type:
  None
- reduce_dimensionality()[source]¶
Perform dimensionality reduction on the data.
Uses the configured decomposition algorithm to reduce the feature space dimensionality.
- Return type:
  None
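For intuition, here is what a PCA reduction to two components does, sketched directly with numpy's SVD rather than this module's internals. The data matrix below is hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 12))   # hypothetical feature matrix

# PCA via SVD: center the data, then project onto the top-2
# right singular vectors (the leading principal components).
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
reduced = Xc @ Vt[:2].T

print(reduced.shape)  # (500, 2)
```

The first component captures at least as much variance as the second, since singular values are returned in descending order.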
- sweep_n_clusters(n_clusters)[source]¶
Sweep over the number of clusters to find the optimal clustering.
Uses the silhouette score to perform a parameter sweep over the number of clusters, storing the cluster centers and labels for the best-performing parameterization.
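The sweep logic can be sketched with scikit-learn's KMeans (whose default init is k-means++) and silhouette_score. The toy data and the scoring loop below are illustrative, not this module's code:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Three well-separated hypothetical clusters in 2D.
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 2))
               for c in ([0, 0], [10, 0], [0, 10])])

best_k, best_score, best_model = None, -1.0, None
for k in range(2, 7):   # sweep n_clusters with stride 1
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    score = silhouette_score(X, model.labels_)
    if score > best_score:
        best_k, best_score, best_model = k, score, model

print(best_k)  # 3 for this toy data
```

The best model's `cluster_centers_` and `labels_` would then be kept, mirroring how the winning parameterization's centers and labels are stored here.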
- map_centers_to_frames()[source]¶
Map cluster centers to the closest actual data points.
Finds and stores the data point which lies closest to each cluster center, recording the replica and frame indices.
- Return type:
  None
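Mapping each center to its nearest data point reduces to an argmin over Euclidean distances. In this numpy sketch the replica/frame bookkeeping (equal-length replicas laid out contiguously) is a hypothetical layout chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
reduced = rng.normal(size=(200, 2))   # reduced data: 200 frames total
centers = rng.normal(size=(3, 2))     # 3 hypothetical cluster centers

# Assume frames 0-99 belong to replica 0, frames 100-199 to replica 1.
frames_per_replica = 100

cluster_centers = {}
for i, c in enumerate(centers):
    # Index of the data point closest to this center.
    nearest = int(np.argmin(np.linalg.norm(reduced - c, axis=1)))
    replica, frame = divmod(nearest, frames_per_replica)
    cluster_centers[i] = (replica, frame)

print(cluster_centers)
```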
- class molecular_simulations.analysis.autocluster.Decomposition(algorithm, **kwargs)[source]¶
Bases: object

Wrapper for dimensionality reduction algorithms.

Provides a thin wrapper around various dimensionality reduction algorithms with scikit-learn-style methods.
- decomposer¶
The underlying decomposition algorithm instance.
- Parameters:
  algorithm (str) – Which algorithm to use. Options are 'PCA', 'TICA', and 'UMAP'; currently only 'PCA' is fully supported.
  **kwargs – Algorithm-specific keyword arguments passed to the decomposer constructor.
Example
>>> decomp = Decomposition('PCA', n_components=3)
>>> reduced_data = decomp.fit_transform(data)
Initialize the decomposition wrapper.
- Parameters:
  algorithm (str) – Name of the decomposition algorithm.
  **kwargs – Arguments passed to the algorithm constructor.
- Raises:
KeyError – If an unsupported algorithm is specified.
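A thin wrapper of this kind can be sketched as a lookup from algorithm name to a scikit-learn-style class. The registry below is illustrative — it only includes PCA, so any other name raises KeyError, matching the behavior documented above:

```python
import numpy as np
from sklearn.decomposition import PCA

class SimpleDecomposition:
    """Illustrative wrapper: dispatch on an algorithm name."""

    _ALGORITHMS = {"PCA": PCA}   # 'TICA'/'UMAP' would register here too

    def __init__(self, algorithm, **kwargs):
        # Dict lookup raises KeyError naturally for unsupported names.
        self.decomposer = self._ALGORITHMS[algorithm](**kwargs)

    def fit_transform(self, X):
        return self.decomposer.fit_transform(X)

decomp = SimpleDecomposition("PCA", n_components=3)
reduced = decomp.fit_transform(np.random.rand(50, 8))
print(reduced.shape)  # (50, 3)
```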