DFT-Machine Learning Approach for Accurate Prediction of pKa.
Robin Lawler, Yao-Hao Liu, Nessa Majaya, Omar Allam, Hyunchul Ju, Jin Young Kim, Seung Soon Jang. Published in: The Journal of Physical Chemistry A (2021)
In this study, we propose a novel method for predicting pKa across a diverse set of acids that combines density functional theory (DFT) with machine learning (ML). First, DFT at the B3LYP/6-31++G**/SM8 level is used to predict pKa, yielding a mean absolute error (MAE) of 1.85 pKa units. The DFT-predicted pKa values are then used as one of 10 molecular descriptors for ML models trained on experimental data. Three models, Kernel Ridge Regression (KRR), Gaussian Process Regression, and an Artificial Neural Network, are optimized using three pipelines: Pipeline 1 involves only hyperparameter optimization (HPO); Pipeline 2 involves HPO followed by relative contribution analysis (RCA) and recursive feature elimination (RFE); and Pipeline 3 involves HPO followed by RCA and RFE on an expanded set of composite features. KRR with Pipeline 3 yields the best pKa prediction, with an MAE of 0.60 pKa units. This model was then used to predict the pKa of 37 novel acids. The two most important features were found to be the number of hydrogen atoms in the molecule and the degree of oxidation of the acid. The predicted pKa values are documented for future reference.
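The core idea of Pipeline 1, a kernel ridge regression that takes the DFT-predicted pKa as one descriptor and is tuned by hyperparameter search, can be illustrated with a minimal sketch. This is not the authors' implementation: the data are synthetic, the two descriptors (a scaled DFT pKa and a scaled hydrogen count), the relation used to generate the "experimental" values, the RBF kernel, the single hold-out split, and all function names are assumptions made for illustration. The RCA and RFE steps of Pipelines 2 and 3 are not shown.

```python
import math
import random


def mean_absolute_error(y_true, y_pred):
    """MAE, the error metric quoted in the abstract."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rbf_kernel(a, b, gamma):
    """Gaussian (RBF) kernel; the kernel choice is an assumption."""
    return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))


def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def krr_fit(X, y, alpha, gamma):
    """Kernel ridge regression: solve (K + alpha * I) coef = y."""
    n = len(X)
    K = [[rbf_kernel(X[i], X[j], gamma) + (alpha if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    return solve_linear_system(K, y)


def krr_predict(X_train, coef, X_new, gamma):
    """Predict as a kernel-weighted sum over the training points."""
    return [sum(c * rbf_kernel(x, xt, gamma) for c, xt in zip(coef, X_train))
            for x in X_new]


def grid_search(X_tr, y_tr, X_val, y_val, alphas, gammas):
    """Exhaustive HPO over (alpha, gamma), scored by validation MAE."""
    best = None
    for alpha in alphas:
        for gamma in gammas:
            coef = krr_fit(X_tr, y_tr, alpha, gamma)
            mae = mean_absolute_error(
                y_val, krr_predict(X_tr, coef, X_val, gamma))
            if best is None or mae < best[0]:
                best = (mae, alpha, gamma)
    return best


def make_synthetic_acids(n, rng):
    """Hypothetical data: the relation below is invented, not from the paper."""
    X, y = [], []
    for _ in range(n):
        dft_pka = rng.uniform(0.0, 14.0)
        n_hydrogens = rng.randint(1, 12)
        exp_pka = 0.85 * dft_pka + 0.1 * n_hydrogens + rng.gauss(0.0, 0.3)
        X.append([dft_pka / 14.0, n_hydrogens / 12.0])  # scale to [0, 1]
        y.append(exp_pka)
    return X, y


if __name__ == "__main__":
    rng = random.Random(0)
    X, y = make_synthetic_acids(40, rng)
    X_tr, y_tr, X_val, y_val = X[:30], y[:30], X[30:], y[30:]
    mae, alpha, gamma = grid_search(
        X_tr, y_tr, X_val, y_val,
        alphas=[1e-3, 1e-2, 1e-1, 1.0], gammas=[0.1, 1.0, 10.0])
    print(f"best validation MAE = {mae:.2f} (alpha={alpha}, gamma={gamma})")
```

The sketch is written in pure Python so that it runs without dependencies. In practice a library implementation of kernel ridge regression with cross-validated hyperparameter search would likely be preferable to a single hold-out split.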