
k-Means NANI: an improved clustering algorithm for Molecular Dynamics simulations.

Lexin Chen, Daniel R Roe, Matthew Kochert, Carlos L Simmerling, Ramón Alain Miranda-Quintana
Published in: bioRxiv: the preprint server for biology (2024)
One of the key challenges of k-means clustering is the seed selection or the initial centroid estimation since the clustering result depends heavily on this choice. Alternatives such as k-means++ have mitigated this limitation by estimating the centroids using an empirical probability distribution. However, with high-dimensional and complex datasets such as those obtained from molecular simulation, k-means++ fails to partition the data in an optimal manner. Furthermore, stochastic elements in all flavors of k-means++ will lead to a lack of reproducibility. K-means N-Ary Natural Initiation (NANI) is presented as an alternative to tackle this challenge by using efficient n-ary comparisons to both identify high-density regions in the data and select a diverse set of initial conformations. Centroids generated from NANI are not only representative of the data and different from one another, helping k-means to partition the data accurately, but also deterministic, providing consistent cluster populations across replicates. From peptide and protein folding molecular simulations, NANI was able to create compact and well-separated clusters as well as accurately find the metastable states that agree with the literature. NANI can cluster diverse datasets and be used as a standalone tool or as part of our MDANCE clustering package.
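The reproducibility point in the abstract can be illustrated with a small sketch. The code below is not NANI and does not use the MDANCE API; it is a simplified stand-in that assumes a deterministic, diversity-based seeding rule (first seed closest to the data mean, each later seed farthest from the seeds already chosen) followed by standard Lloyd iterations. The function names `deterministic_seeds` and `lloyd_kmeans` and the toy 2D data are hypothetical and chosen only for illustration.

```python
import numpy as np

def deterministic_seeds(X, k):
    """Pick k seeds without randomness (illustrative stand-in, not NANI)."""
    # First seed: the data point closest to the overall mean.
    first = int(np.argmin(np.linalg.norm(X - X.mean(axis=0), axis=1)))
    chosen = [first]
    # d_min[i] = distance from point i to its nearest chosen seed.
    d_min = np.linalg.norm(X - X[first], axis=1)
    while len(chosen) < k:
        # Next seed: the point farthest from all seeds chosen so far.
        nxt = int(np.argmax(d_min))
        chosen.append(nxt)
        d_min = np.minimum(d_min, np.linalg.norm(X - X[nxt], axis=1))
    return X[chosen].copy()

def lloyd_kmeans(X, centroids, n_iter=100):
    """Standard Lloyd iterations starting from the given centroids."""
    centroids = centroids.copy()
    for _ in range(n_iter):
        # Assign every point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centroids; keep the old one if a cluster is empty.
        new = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(len(centroids))
        ])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids

# Toy data (assumption): three well-separated 2D blobs of 50 points each.
rng = np.random.default_rng(0)
centres = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = (centres[:, None, :] + 0.5 * rng.standard_normal((3, 50, 2))).reshape(-1, 2)

# Two independent "replicates" use identical seeds, so the results coincide.
labels_a, _ = lloyd_kmeans(X, deterministic_seeds(X, 3))
labels_b, _ = lloyd_kmeans(X, deterministic_seeds(X, 3))
populations = np.bincount(labels_a, minlength=3)
print(populations, np.array_equal(labels_a, labels_b))
```

Because the seeding contains no random draws, repeated runs on the same data give the same cluster populations, which is the property the abstract contrasts with k-means++. The actual NANI seeding relies on n-ary comparisons to locate high-density regions, which this sketch does not attempt to reproduce.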
Keyphrases
  • molecular dynamics simulations
  • rna seq
  • electronic health record
  • high density
  • single cell
  • big data
  • systematic review
  • machine learning
  • single molecule
  • deep learning
  • artificial intelligence
  • low cost