Predicting the absence of an unknown compound in a mass spectral database.

Andrey S Samokhin, Ksenia Sotnezova, Igor Revelsky
Published in: European journal of mass spectrometry (Chichester, England) (2019)
Only a small subset of known organic compounds (amenable to gas chromatography/mass spectrometry) is present in the largest mass spectral databases (such as NIST or Wiley). Nevertheless, commercially available library search algorithms cannot predict the absence of a compound in the database. In the present work, we have tried to implement such a prediction by means of supervised classification. The training and validation sets contained 1500 and 750 compounds, respectively. Two prediction sets (containing 750 and about 3000 mass spectra) were considered. The easiest-to-use models were built with only one input variable: the match factor of the best candidate or the InLib factor (both parameters were calculated within the MS Search (NIST) software). Multivariate classification models were built by partial least squares discriminant analysis (PLS-DA); match factors of the top n candidates were used as input variables. PLS-DA was found to be the most effective approach. The prediction efficiency strongly depended on the 'uniqueness' of the mass spectra in the test set. The PLS-DA model correctly predicted the absence of a compound in the database in 29.9% of cases for prediction set #1 and in 74.4% of cases for prediction set #2 (only 1.3% and 2.5% of compounds actually present in the database were wrongly classified).
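As a rough illustration of the single-variable models described above, the sketch below classifies a spectrum as "absent from the database" when the match factor of its best candidate falls below a cutoff. The abstract does not say how the cutoff was chosen, so picking the value that maximizes training accuracy is an assumption, as are the function names and the toy match factors, which are invented for illustration and are not data from the study.

```python
from typing import List, Tuple


def fit_cutoff(match_factors: List[float], in_library: List[bool]) -> float:
    """Return the cutoff that maximizes training accuracy (assumed criterion).

    A spectrum is predicted "present" when its best match factor is >= cutoff.
    Ties are resolved in favor of the lowest cutoff.
    """
    values = sorted(set(match_factors))
    # Candidate cutoffs: below the minimum, midpoints between neighbors, above the maximum.
    cutoffs = [values[0] - 1.0]
    cutoffs += [(a + b) / 2 for a, b in zip(values, values[1:])]
    cutoffs.append(values[-1] + 1.0)

    best_cutoff, best_correct = cutoffs[0], -1
    for cutoff in cutoffs:
        correct = sum(
            (mf >= cutoff) == present
            for mf, present in zip(match_factors, in_library)
        )
        if correct > best_correct:
            best_cutoff, best_correct = cutoff, correct
    return best_cutoff


def absence_rates(
    match_factors: List[float], in_library: List[bool], cutoff: float
) -> Tuple[float, float]:
    """Return (share of absent compounds correctly flagged as absent,
    share of present compounds wrongly flagged as absent)."""
    absent = [mf for mf, present in zip(match_factors, in_library) if not present]
    present = [mf for mf, present in zip(match_factors, in_library) if present]
    correctly_flagged = sum(mf < cutoff for mf in absent) / len(absent)
    wrongly_flagged = sum(mf < cutoff for mf in present) / len(present)
    return correctly_flagged, wrongly_flagged


# Toy data (hypothetical): best-candidate match factors and whether the
# compound is actually in the library.
toy_match_factors = [920.0, 880.0, 850.0, 790.0, 700.0, 650.0, 600.0, 820.0]
toy_in_library = [True, True, True, True, False, False, False, False]

toy_cutoff = fit_cutoff(toy_match_factors, toy_in_library)
toy_rates = absence_rates(toy_match_factors, toy_in_library, toy_cutoff)
```

The two values returned by `absence_rates` correspond to the two kinds of figures quoted in the abstract: the share of absent compounds correctly recognized and the share of present compounds misclassified. A multivariate model such as PLS-DA would replace the single match factor with the match factors of the top n candidates.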
Keyphrases
  • machine learning
  • gas chromatography mass spectrometry
  • adverse drug
  • optical coherence tomography
  • ms ms
  • data analysis
  • magnetic resonance imaging
  • density functional theory
  • electronic health record
  • water soluble