Development and validation of machine learning models for the prediction of SH-2 containing protein tyrosine phosphatase 2 inhibitors.

Published in: Molecular diversity (2023)

Bringing a new drug to market is a highly challenging and resource-intensive process. Although modern drug discovery technologies have enabled the rapid identification of lead compounds, translating those leads into successful clinical candidates remains a major challenge. In recent years, the availability of massive structural and biological data on diverse small molecules and macromolecules has allowed researchers to mine these multidimensional data with artificial intelligence-based predictive tools and to draw useful insights into structural features of biological or therapeutic significance. The aim of this study was to use the available data on small-molecule inhibitors of Src homology 2 (SH2) domain-containing protein tyrosine phosphatase 2 (SHP2) to build machine learning (ML) models that can predict the SHP2 inhibitory potential of new compounds. The dataset contained 2739 unique small-molecule SHP2 inhibitors obtained from BindingDB, ChEMBL, and the recent literature. After data curation, predictive models including XGBoost, k-nearest neighbours, and neural networks were developed and validated through tenfold cross-validation. Of the seven models developed, the XGBoost model showed excellent performance, with a ROC AUC of 0.96 and an accuracy of 0.97 on the test data. Moreover, the Shapley Additive Explanations (SHAP) method was applied to gain a more in-depth understanding of how the input variables influence the model's predictions. In summary, the XGBoost model developed in this study can be useful for identifying novel SHP2 inhibitors and can therefore accelerate the discovery of novel therapeutics for cancer therapy.
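The validation workflow named in the abstract (tenfold cross-validation scored by ROC AUC and accuracy) can be illustrated with a minimal, standard-library-only sketch. This is not the authors' pipeline: the fingerprints are synthetic, a small k-nearest-neighbours classifier with Tanimoto similarity stands in for XGBoost, and every function name and parameter below (`make_toy_data`, `knn_score`, `cross_validate`, `k=5`, 64-bit fingerprints) is a hypothetical choice made for illustration.

```python
import random


def tanimoto(a, b):
    """Tanimoto similarity between two sets of 'on' fingerprint bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def knn_score(train, query, k=5):
    """Fraction of the k most similar training compounds labelled active."""
    ranked = sorted(train, key=lambda item: tanimoto(item[0], query), reverse=True)
    top = ranked[:k]
    return sum(label for _, label in top) / len(top)


def roc_auc(labels, scores):
    """ROC AUC: chance that a random active outranks a random inactive (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def stratified_folds(labels, n_folds, rng):
    """Split sample indices into folds with roughly equal class balance."""
    folds = [[] for _ in range(n_folds)]
    for cls in (0, 1):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % n_folds].append(i)
    return folds


def make_toy_data(n, rng, n_bits=64):
    """Synthetic (hypothetical) data: actives favour bits 0-15, inactives bits 16-63."""
    data = []
    for i in range(n):
        label = i % 2
        pool = range(0, 16) if label else range(16, n_bits)
        bits = set(rng.sample(pool, 6))            # class-specific bits
        bits |= set(rng.sample(range(n_bits), 4))  # random noise bits
        data.append((bits, label))
    return data


def cross_validate(data, n_folds=10, k=5, seed=0):
    """Return mean ROC AUC and mean accuracy over n_folds held-out folds."""
    rng = random.Random(seed)
    labels = [y for _, y in data]
    aucs, accs = [], []
    for fold in stratified_folds(labels, n_folds, rng):
        held_out = set(fold)
        train = [d for i, d in enumerate(data) if i not in held_out]
        y_true = [data[i][1] for i in fold]
        scores = [knn_score(train, data[i][0], k) for i in fold]
        aucs.append(roc_auc(y_true, scores))
        correct = sum((s >= 0.5) == bool(y) for s, y in zip(scores, y_true))
        accs.append(correct / len(fold))
    return sum(aucs) / n_folds, sum(accs) / n_folds


if __name__ == "__main__":
    toy = make_toy_data(200, random.Random(42))
    mean_auc, mean_acc = cross_validate(toy)
    print(f"mean ROC AUC = {mean_auc:.3f}, mean accuracy = {mean_acc:.3f}")
```

Each compound is scored only by a model that never saw it during training, which is what makes the averaged ROC AUC and accuracy an estimate of performance on new compounds. In practice a gradient-boosted model and real molecular descriptors would replace the toy classifier and synthetic fingerprints used here.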

Keyphrases