Improving predictions of compound amenability for liquid chromatography-mass spectrometry to enhance non-targeted analysis.
Nathaniel Charest, Charles N Lowe, Christian Ramsland, Brian Meyer, Vicente Samano, Antony J Williams
Published in: Analytical and Bioanalytical Chemistry (2024)
Mass-spectrometry-based non-targeted analysis (NTA), in which mass spectrometric signals are assigned chemical identities based on a systematic collation of evidence, is a growing area of interest for toxicological risk assessment. Successful NTA results in better identification of potentially hazardous pollutants within the environment, facilitating the development of targeted analytical strategies to best characterize risks to human and ecological health. A supporting component of the NTA process involves assessing whether suspected chemicals are amenable to the mass spectrometric method, which is necessary to assign an observed signal to a chemical structure. Prior work from this group involved the development of a random forest model for predicting the amenability of 5517 unique chemical structures to liquid chromatography-mass spectrometry (LC-MS). This work improves the interpretability of the group's prior model of the same endpoint and integrates 1348 additional data points across negative and positive ionization modes. We enhance interpretability by feature engineering, a machine learning practice that reduces the input dimensionality while attempting to preserve performance statistics. We emphasize the importance of interpretable machine learning models within the context of building confidence in NTA identification. The new data were curated by experts, who labeled compounds as amenable or unamenable, resulting in an enhanced set of chemical compounds that expands the applicability domain of the prior model. The balanced accuracy (BA) of the newly developed model is comparable to the performance previously reported (mean cross-validation BA of 0.84 vs. 0.82 in positive mode, and 0.85 vs. 0.82 in negative mode), while on a novel external set, derived from this work's data, the Matthews correlation coefficients (MCC) for the new models are 0.66 and 0.68 for positive and negative mode, respectively. Our group's prior published models scored MCC of 0.55 and 0.54 on the same external sets. This demonstrates appreciable improvement over the chemical space captured by the expanded dataset. This work forms part of our ongoing efforts to develop models with higher interpretability and higher performance to support NTA efforts.
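For readers less familiar with the two statistics quoted above, the following is a minimal sketch of how balanced accuracy and the Matthews correlation coefficient are computed for a binary amenable/unamenable classifier. The labels are invented toy values for illustration only and are not taken from the study's data; the function and variable names are likewise hypothetical, and the authors' actual evaluation code is not described in the abstract.

```python
from math import sqrt


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels: 1 = amenable, 0 = unamenable."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity; assumes both classes occur in y_true."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn)  # recall on the amenable class
    specificity = tn / (tn + fp)  # recall on the unamenable class
    return (sensitivity + specificity) / 2


def matthews_corrcoef(y_true, y_pred):
    """MCC in [-1, 1]; returns 0.0 when the denominator is zero (a common convention)."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Toy labels (hypothetical, NOT from the paper): 4 amenable, 6 unamenable compounds.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

ba = balanced_accuracy(y_true, y_pred)
mcc = matthews_corrcoef(y_true, y_pred)
print(f"BA = {ba:.3f}, MCC = {mcc:.3f}")  # BA = 0.708, MCC = 0.408
```

Both metrics are less sensitive to class imbalance than plain accuracy, which is presumably why they are reported for a dataset in which amenable and unamenable compounds need not be equally represented.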
Keyphrases
- liquid chromatography
- mass spectrometry
- machine learning
- high resolution mass spectrometry
- tandem mass spectrometry
- gas chromatography
- risk assessment
- high resolution
- big data
- high performance liquid chromatography
- human health
- simultaneous determination
- capillary electrophoresis
- solid phase extraction
- artificial intelligence
- public health
- deep learning
- data analysis