Investigation of Machine Intelligence in Compound Cell Activity Classification.

Yuanrong Fan, Haichun Liu, Yi Hua, Yuchen Wang, Lu Zhu, Junnan Zhao, Yan Yang, Xingye Chen, Shuai Lu, Tao Lu, Ya-Dong Chen, Haichun Liu

Published in: Molecular Pharmaceutics (2019)

Machine intelligence has advanced greatly over the past decades and is now widely used in many fields. In recent years, many reports have shown its effectiveness in drug discovery. In this study, machine intelligence methods were explored to assist cell activity prediction. Several methods, including support vector machine, decision tree, random forest, extra trees, gradient boosting machine, convolutional neural network, long short-term memory network, and gated recurrent unit network, were employed to classify compounds by their cell activity. Unlike some previously reported classification models, compounds were represented as strings in the simplified molecular input line entry system (SMILES) and used directly as input in place of chemical descriptors, an approach that mimics natural language processing. Both a single cell strain and the whole data set were examined under balanced and imbalanced data distributions. Different activity cutoffs were set for the single cell strain (Z-score = 3) and the whole data set (Z-score = 5 and 6). Nine metrics were used to evaluate the models: accuracy, precision, recall, F1-score, area under the receiver operating characteristic curve, Cohen's κ, Brier score, Matthews correlation coefficient, and balanced accuracy. The results show that the gradient boosting machine performs well on the balanced data distribution, and the convolutional neural network is well suited to the imbalanced one. The results demonstrate that both classic machine learning methods and deep learning methods have potential in the classification of compound cell activity.
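To make the nine evaluation metrics concrete, the following is a minimal plain-Python sketch of how each can be computed for a binary active/inactive classifier. It is not the authors' code. The function names, the 0.5 decision threshold, and the toy labels and probabilities are illustrative assumptions and do not come from the study.

```python
import math

def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = active)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def roc_auc(y_true, y_prob):
    """AUC as the fraction of (active, inactive) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, y_prob) if t == 1]
    neg = [s for t, s in zip(y_true, y_prob) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def nine_metrics(y_true, y_prob, threshold=0.5):
    """Return the nine metrics named in the abstract (threshold is an assumption)."""
    y_pred = [1 if s >= threshold else 0 for s in y_prob]
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    n = tp + tn + fp + fn

    accuracy = (tp + tn) / n
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    balanced_accuracy = (recall + specificity) / 2

    # Cohen's kappa: observed agreement corrected for chance agreement p_e.
    p_e = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    kappa = (accuracy - p_e) / (1 - p_e) if p_e != 1 else 0.0

    # Matthews correlation coefficient from the confusion matrix.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    # Brier score: mean squared error of the predicted probabilities.
    brier = sum((s - t) ** 2 for t, s in zip(y_true, y_prob)) / n

    return {
        "accuracy": accuracy, "precision": precision, "recall": recall,
        "f1": f1, "roc_auc": roc_auc(y_true, y_prob), "cohen_kappa": kappa,
        "brier": brier, "mcc": mcc, "balanced_accuracy": balanced_accuracy,
    }

# Toy, made-up data: 3 actives and 5 inactives (an imbalanced set).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_prob = [0.9, 0.8, 0.3, 0.6, 0.2, 0.1, 0.4, 0.2]
scores = nine_metrics(y_true, y_prob)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

On imbalanced data, accuracy alone can look favorable for a model that rarely predicts the minority class, which is one reason a study of this kind would also report chance-corrected and class-balanced measures such as Cohen's κ, the Matthews correlation coefficient, and balanced accuracy.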
