Linking in silico MS/MS spectra with chemistry data to improve identification of unknowns.
Andrew D McEachran, Ilya Balabin, Tommy Cathey, Thomas R Transue, Hussein Al-Ghoul, Chris Grulke, Jon R Sobus, Antony J Williams
Published in: Scientific Data (2019)
Confident identification of unknown chemicals in high resolution mass spectrometry (HRMS) screening studies requires cohesive workflows and complementary data, tools, and software. Chemistry databases, screening libraries, and chemical metadata have become fixtures in identification workflows. To increase confidence in compound identifications, the use of structural fragmentation data collected via tandem mass spectrometry (MS/MS or MS2) is vital. However, the availability of empirically collected MS/MS data for identification of unknowns is limited. Researchers have therefore turned to in silico generation of MS/MS data for use in HRMS-based screening studies. This paper describes the generation en masse of predicted MS/MS spectra for the entirety of the US EPA's DSSTox database using competitive fragmentation modelling and a freely available open source tool, CFM-ID. The generated dataset comprises predicted MS/MS spectra for ~700,000 structures, and mappings between predicted spectra, structures, associated substances, and chemical metadata. Together, these resources facilitate improved compound identifications in HRMS screening studies. These data are accessible via an SQL database, a comma-separated export file (.csv), and EPA's CompTox Chemicals Dashboard.
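As a rough illustration of how predicted MS/MS spectra can support identification, the sketch below ranks candidate structures by the similarity of their predicted spectrum to an observed one. It is a minimal example under stated assumptions, not the workflow described in the paper: the cosine similarity score, the fixed-width m/z binning, the function names, and the candidate identifiers and peak values are all hypothetical and do not come from the dataset or from CFM-ID.

```python
from math import sqrt


def bin_peaks(peaks, bin_width=0.01):
    """Map (m/z, intensity) peaks to integer bins, summing intensity per bin."""
    binned = {}
    for mz, intensity in peaks:
        key = round(mz / bin_width)
        binned[key] = binned.get(key, 0.0) + intensity
    return binned


def cosine_similarity(peaks_a, peaks_b, bin_width=0.01):
    """Cosine similarity of two binned spectra (0.0 when either is empty)."""
    a = bin_peaks(peaks_a, bin_width)
    b = bin_peaks(peaks_b, bin_width)
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(observed, predicted_by_id, bin_width=0.01):
    """Return (candidate_id, score) pairs sorted from best to worst match."""
    scores = [
        (candidate_id, cosine_similarity(observed, peaks, bin_width))
        for candidate_id, peaks in predicted_by_id.items()
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical data: one observed spectrum and two predicted spectra.
observed = [(91.05, 100.0), (119.05, 40.0)]
predicted_by_id = {
    "candidate_A": [(91.05, 90.0), (119.05, 50.0)],
    "candidate_B": [(77.04, 100.0), (105.03, 30.0)],
}

ranked = rank_candidates(observed, predicted_by_id)
for candidate_id, score in ranked:
    print(f"{candidate_id}: {score:.3f}")
```

Fixed-width binning is a simplification: peaks that fall on either side of a bin edge will not match, so a real workflow would more likely match peaks within a mass tolerance and combine the spectral score with the chemical metadata mentioned above.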
Keyphrases
- ms ms
- high resolution mass spectrometry
- ultra high performance liquid chromatography
- electronic health record
- liquid chromatography
- big data
- tandem mass spectrometry
- high performance liquid chromatography
- liquid chromatography tandem mass spectrometry
- mass spectrometry
- gas chromatography
- high resolution
- data analysis
- emergency department
- multiple sclerosis
- machine learning
- drinking water
- adverse drug
- deep learning