Machine learning for identification of silylated derivatives from mass spectra.

Milka Ljoncheva, Tomaž Stepišnik, Tina Kosjek, Sašo Džeroski
Published in: Journal of Cheminformatics (2022)
This study presents a successful application of the CSI:IOKR machine learning method to the identification of environmental contaminants from GC-MS spectra. We use CSI:IOKR as an alternative to exhaustive search of MS libraries, independent of instrumental platform and data processing software. We use a comprehensive dataset of GC-MS spectra of trimethylsilyl derivatives and their molecular structures, derived from a large commercially available MS library, to train a model that maps spectra to molecular structures. We test the learned model on a different dataset of GC-MS spectra of trimethylsilyl derivatives of environmental contaminants, generated in-house and made publicly available. The results show that 37% (resp. 50%) of the tested compounds are correctly ranked among the top 10 (resp. 20) candidate compounds suggested by the model. Even though spectral comparisons with reference standards or de novo structural elucidations are necessary to validate the predictions, machine learning provides efficient candidate prioritization and reduces the time spent on compound annotation.
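The top-10 and top-20 figures above are top-k ranking accuracies: the share of test spectra whose true structure appears among the first k candidates returned by the model. A minimal sketch of that metric follows, assuming each query yields an ordered list of candidate identifiers. The function name `top_k_accuracy` and the toy identifiers are hypothetical and are not taken from the study's code or data.

```python
from typing import Sequence


def top_k_accuracy(
    ranked_candidates: Sequence[Sequence[str]],
    true_ids: Sequence[str],
    k: int,
) -> float:
    """Fraction of queries whose true structure is among the first k candidates."""
    if len(ranked_candidates) != len(true_ids):
        raise ValueError("need exactly one ranked candidate list per query")
    if k < 1:
        raise ValueError("k must be at least 1")
    if not true_ids:
        return 0.0
    hits = sum(
        1
        for candidates, truth in zip(ranked_candidates, true_ids)
        if truth in candidates[:k]
    )
    return hits / len(true_ids)


# Toy, made-up rankings for four query spectra (not data from the study).
ranked = [
    ["A", "B", "C"],
    ["D", "E", "F"],
    ["G", "H", "I"],
    ["J", "K", "L"],
]
truth = ["A", "F", "X", "K"]  # "X" never appears among the candidates

print(top_k_accuracy(ranked, truth, 1))  # 0.25
print(top_k_accuracy(ranked, truth, 2))  # 0.5
print(top_k_accuracy(ranked, truth, 3))  # 0.75
```

A query whose true structure is missing from the candidate set counts as a miss at every k, so the candidate source bounds the best achievable score.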