ChloroDBPFinder: Machine Learning-Guided Recognition of Chlorinated Disinfection Byproducts from Nontargeted LC-HRMS Analysis.
Tingting Zhao, Nicholas J P Wawryk, Shipei Xing, Brian Low, Gigi Li, Huaxu Yu, Yukai Wang, Qiming Shen, Xing-Fang Li, Tao Huan. Published in: Analytical Chemistry (2024)
High-resolution mass spectrometry (HRMS) is a prominent analytical tool for characterizing chlorinated disinfection byproducts (Cl-DBPs) in an unbiased manner. Because of the diversity of chemicals, complex background signals, and the inherent analytical fluctuations of HRMS, conventional approaches based on the isotopic pattern (³⁷Cl/³⁵Cl), mass defect, and direct molecular formula (MF) prediction are insufficient for accurate recognition of the diverse Cl-DBPs in real environmental samples. This work proposes a novel machine learning strategy to recognize Cl-containing chemicals. The hierarchical framework consists of two random forest-based models: the first layer is a binary classifier that recognizes Cl-containing chemicals, and the second layer is a multiclass classifier that annotates the number of Cl atoms present. The model was trained on ∼1.4 million distinct MFs from PubChem. Evaluated on over 14,000 unique MFs from NIST20, it achieved 93.3% accuracy in recognizing Cl-containing MFs (Cl-MFs) and 92.9% accuracy in annotating the number of Cl atoms in Cl-MFs. The trained model was then integrated into ChloroDBPFinder, a standalone R package that streamlines the processing of LC-HRMS data and annotates both known and unknown Cl-containing compounds. Tested on existing Cl-DBP data sets from aspartame chlorination in tap water, ChloroDBPFinder efficiently extracted 159 Cl-containing DBP features and tentatively annotated the structures of 10 Cl-DBPs via molecular networking. In a second application, to a chlorinated humic substance, ChloroDBPFinder extracted 79 high-quality Cl-DBPs and tentatively annotated six compounds. In summary, the proposed machine learning strategy and ChloroDBPFinder provide an advanced solution for identifying Cl-containing compounds in nontargeted analysis of water samples. ChloroDBPFinder is freely available on GitHub (https://github.com/HuanLab/ChloroDBPFinder).
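For readers unfamiliar with the conventional isotopic pattern cue that the abstract contrasts against, the following is a minimal Python sketch of it. It is not the authors' random forest model and not the ChloroDBPFinder API. It assumes natural chlorine abundances of about 75.76% (³⁵Cl) and 24.24% (³⁷Cl), it ignores contributions from other elements (for example ¹³C, ³⁴S, or Br), and the function names `cl_isotope_pattern` and `guess_cl_count` are hypothetical.

```python
from math import comb

# Natural isotopic abundances of chlorine (approximate representative values).
P_CL35 = 0.7576
P_CL37 = 0.2424
# Mass spacing between 37Cl and 35Cl isotopologue peaks, in Da (about 1.99705).
DELTA_CL = 36.96590 - 34.96885


def cl_isotope_pattern(n_cl):
    """Relative intensities of M, M+2, M+4, ... arising from chlorine alone.

    Uses a binomial distribution over the n_cl chlorine atoms and normalizes
    to the monoisotopic (all-35Cl) peak. Other elements are ignored.
    """
    if n_cl < 0:
        raise ValueError("n_cl must be non-negative")
    probs = [
        comb(n_cl, k) * P_CL37 ** k * P_CL35 ** (n_cl - k)
        for k in range(n_cl + 1)
    ]
    return [p / probs[0] for p in probs]


def guess_cl_count(observed_m2_ratio, max_cl=6):
    """Naive rule: pick the Cl count whose theoretical M+2/M ratio is closest.

    This is the kind of fixed-rule matching that can fail on noisy data;
    max_cl is an arbitrary illustrative cap.
    """
    def theoretical_ratio(n):
        pattern = cl_isotope_pattern(n)
        return pattern[1] if len(pattern) > 1 else 0.0

    return min(
        range(max_cl + 1),
        key=lambda n: abs(theoretical_ratio(n) - observed_m2_ratio),
    )


if __name__ == "__main__":
    for n in range(1, 4):
        print(n, [round(x, 3) for x in cl_isotope_pattern(n)])
    print("Peak spacing (Da):", round(DELTA_CL, 5))
```

Because the theoretical M+2/M ratio grows by only about 0.32 per chlorine atom, modest intensity errors or overlapping background signals can shift a rule like this to the wrong count, which is consistent with the abstract's motivation for a learned classifier.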