k-Means NANI: An Improved Clustering Algorithm for Molecular Dynamics Simulations.

Lexin Chen, Daniel R. Roe, Matthew Kochert, Carlos L. Simmerling, Ramón Alain Miranda-Quintana

Published in: Journal of Chemical Theory and Computation (2024)

One of the key challenges of k-means clustering is the seed selection or the initial centroid estimation since the clustering result depends heavily on this choice. Alternatives such as k-means++ have mitigated this limitation by estimating the centroids using an empirical probability distribution. However, with high-dimensional and complex data sets such as those obtained from molecular simulation, k-means++ fails to partition the data in an optimal manner. Furthermore, stochastic elements in all flavors of k-means++ will lead to a lack of reproducibility. K-means N-Ary Natural Initiation (NANI) is presented as an alternative to tackle this challenge by using efficient n-ary comparisons to both identify high-density regions in the data and select a diverse set of initial conformations. Centroids generated from NANI are not only representative of the data and different from one another, helping k-means to partition the data accurately, but also deterministic, providing consistent cluster populations across replicates. From peptide and protein folding molecular simulations, NANI was able to create compact and well-separated clusters as well as accurately find the metastable states that agree with the literature. NANI can cluster diverse data sets and be used as a standalone tool or as part of our MDANCE clustering package.
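
The idea of deterministic seeding that combines a density filter with a diversity selection can be illustrated with a minimal sketch. This is not the authors' implementation and does not use the MDANCE API. The function names (`deterministic_seeds`, `kmeans`) and the `dense_fraction` parameter are hypothetical. The mean squared distance to all other points, computed from column sums in a single pass, is assumed here as a simple stand-in for the n-ary comparison, and a max-min (farthest-point) rule is assumed for the diversity step.

```python
import numpy as np


def deterministic_seeds(X, k, dense_fraction=0.1):
    """Pick k seeds without randomness: filter to central points, then spread out.

    Assumption: centrality is scored by mean squared distance to all points,
    a stand-in for the n-ary comparison described in the abstract.
    """
    n = X.shape[0]
    # mean_j ||x_i - x_j||^2 = ||x_i||^2 - 2 x_i . mu + mean_j ||x_j||^2,
    # so all N scores cost O(N d) instead of O(N^2 d).
    mu = X.mean(axis=0)
    sq = np.einsum("ij,ij->i", X, X)
    msd = sq - 2.0 * (X @ mu) + sq.mean()

    # Keep the most central fraction of points (at least k of them).
    n_keep = max(k, int(np.ceil(dense_fraction * n)))
    order = np.argsort(msd, kind="stable")
    cand = X[order[:n_keep]]

    # Max-min diversity selection, starting from the most central candidate.
    chosen = [0]
    min_d2 = ((cand - cand[0]) ** 2).sum(axis=1)
    for _ in range(k - 1):
        nxt = int(min_d2.argmax())  # ties resolve to the lowest index
        chosen.append(nxt)
        min_d2 = np.minimum(min_d2, ((cand - cand[nxt]) ** 2).sum(axis=1))
    return cand[chosen]


def kmeans(X, seeds, n_iter=100):
    """Plain Lloyd iterations started from the given seeds."""
    centers = seeds.copy()
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(len(centers))
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```

Because neither step draws random numbers, repeated runs on the same data return the same seeds and therefore the same cluster populations, which is the reproducibility property the abstract contrasts with k-means++.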

Keyphrases