Ligand-based approaches to activity prediction for the early stage of structure-activity-relationship progression.
Itsuki MaedaAkinori SatoShunsuke TamuraTomoyuki MiyaoPublished in: Journal of computer-aided molecular design (2022)
The retrospective evaluation of virtual screening approaches and activity prediction models are important for methodological development. However, for fair comparison, evaluation data sets must be carefully prepared. In this research, we compiled structure-activity-relationship matrix-based data sets for 15 biological targets along with many diverse inactive compounds, assuming the early stage of structure-activity-relationship progression. To use a large number of diverse inactive compounds and a limited number of active compounds, similarity profiles (SPs) are proposed as a set of molecular descriptors. Using these highly imbalanced data sets, we evaluated various approaches including SPs, under-sampling, support vector machine (SVM), and message passing neural networks. We found that for the under-sampling approaches, cluster-based sampling is better than random sampling. For virtual screening, SPs with inactive reference compounds and the under-sampling SVM also perform well. For classification, SPs with many inactive references performed as well as the under-sampling SVM trained on a balanced data set. Although the performance of SPs and the under-sampling SVM were comparable, SPs with many inactive references were preferable for selecting structurally distinct compounds from the active training compounds.