Application of the mol2vec Technology to Large-size Data Visualization and Analysis.

Shojiro Shibayama, Gilles Marcou, Dragos Horvath, Igor I Baskin, Kimito Funatsu, Alexander Varnek
Published in: Molecular informatics (2020)
Generative Topographic Mapping (GTM) is a dimensionality reduction method that is widely used for both data visualization and structure-activity modeling. A high-dimensional initial data space may require significant computational resources and slow down GTM construction. Therefore, it may be useful to reduce the number of descriptors used for encoding molecular structures. Principal Component Analysis (PCA), a standard preprocessing tool, suffers from information loss upon dimensionality reduction. As an alternative, we propose to use the substructure vector embedding provided by the mol2vec technique. In addition to reducing data dimensionality, this technology also accounts for the proximity of substructures in molecular graphs. In this study, the dimensionality of large descriptor spaces of ISIDA fragment descriptors or Morgan fingerprints was reduced using either PCA or the mol2vec method. The latter significantly speeds up GTM training without compromising its predictive power in bioactivity classification tasks.
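As a rough illustration of the embedding idea, the sketch below builds a low-dimensional molecule vector by summing per-substructure embedding vectors, which is how mol2vec is commonly described as composing compound vectors. It is a minimal sketch and not the authors' pipeline: the identifiers `frag_A` and `frag_B`, the 3-dimensional vectors, the `UNSEEN` fallback entry, and the function name `molecule_vector` are all assumptions made up for this example. In practice the embeddings would come from a trained model and have far more dimensions.

```python
from typing import Dict, List, Sequence

# Hypothetical 3-dimensional embeddings keyed by substructure identifier.
# The values are invented for illustration only; a real table would be
# learned from a large corpus of molecules.
EMBEDDINGS: Dict[str, List[float]] = {
    "frag_A": [0.5, -1.0, 0.25],
    "frag_B": [1.5, 0.0, -0.25],
    "UNSEEN": [0.0, 0.0, 0.0],  # assumed fallback for unknown substructures
}


def molecule_vector(
    substructures: Sequence[str],
    embeddings: Dict[str, List[float]] = EMBEDDINGS,
) -> List[float]:
    """Sum the embedding vectors of all substructures found in one molecule."""
    dim = len(embeddings["UNSEEN"])
    total = [0.0] * dim
    for identifier in substructures:
        # Substructures missing from the table map to the fallback vector.
        vector = embeddings.get(identifier, embeddings["UNSEEN"])
        for i, value in enumerate(vector):
            total[i] += value
    return total


# A toy molecule containing frag_A twice and frag_B once.
example = molecule_vector(["frag_A", "frag_B", "frag_A"])
```

The resulting vector has the fixed length of the embedding regardless of how many distinct fragments the descriptor space contains, which is what makes it a candidate replacement for PCA as a preprocessing step before GTM training.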
Keyphrases
  • electronic health record
  • big data
  • high resolution
  • data analysis
  • working memory
  • single molecule