Systematic Comparison and Comprehensive Evaluation of 80 Amino Acid Descriptors in Peptide QSAR Modeling.

Peng Zhou, Qian Liu, Ting Wu, Qingqing Miao, Shuyong Shang, Heyi Wang, Zheng Chen, Shaozhou Wang, Heyan Wang
Published in: Journal of Chemical Information and Modeling (2021)
The peptide quantitative structure-activity relationship (QSAR), also known as the quantitative sequence-activity model (QSAM), has attracted much attention in the bio- and chemoinformatics communities and is a well-developed computational peptidology strategy for statistically correlating the sequence/structure of functional peptides with their activity/property. Amino acid descriptors (AADs) are one of the most widely used methods to characterize peptide structures: the peptide is decomposed into its residue building blocks, and each building block is parametrized in sequence order with a vector of amino acid principal properties. Considering that various AADs have been proposed over the past decades and new AADs are still emerging today, we ask: is it necessary to develop so many AADs, and do we need to keep developing new ones? In this study, we exhaustively collect 80 published AADs and comprehensively evaluate their modeling performance (including fitting ability, internal stability, and predictive power) on 8 QSAR-oriented peptide sample sets (QPSs) by employing 2 sophisticated machine learning methods (MLMs), building and systematically comparing a total of 1280 (80 AADs × 8 QPSs × 2 MLMs) peptide QSAR models. The following is revealed: (i) No AAD works best on all or most peptide sets; an AAD usually performs well for some peptides but poorly for others. (ii) Modeling performance is primarily determined by the peptide samples and then the MLMs used, while AADs have only a moderate influence on the performance. (iii) There is no essential difference between the modeling performances of different AAD types (physicochemical, topological, 3D-structural, etc.). (iv) Two random descriptors, generated separately from the standard normal distribution N(0, 1) and the uniform distribution U(-1, +1), do not perform significantly worse than these carefully developed AADs. (v) A secondary descriptor, which carries the major information contained in the 80 (primary) AADs, does not perform significantly better than these AADs. Overall, we conclude that since various AADs are available to date and they already cover numerous amino acid properties, further development of new AADs is not essential for improving peptide QSAR modeling; the traditional AAD methodology is believed to have almost reached its theoretical limit. In addition, the AADs act more as vector symbols than as informative data; they mark and distinguish the 20 amino acids but do not bring much original property information to them.
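The residue-wise encoding and the two random descriptors mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the authors' code: the function names `random_descriptor` and `encode_peptide`, the choice of 3 components per amino acid, the seed, and the example tripeptide are all assumptions made for illustration, since the abstract does not state them.

```python
import random

# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def random_descriptor(n_components, kind, seed=0):
    """Build a hypothetical random AAD: one vector per amino acid.

    kind="normal"  draws each component from N(0, 1)
    kind="uniform" draws each component from U(-1, +1)
    n_components is an assumption; the abstract does not give a value.
    """
    rng = random.Random(seed)
    if kind == "normal":
        draw = lambda: rng.gauss(0.0, 1.0)
    elif kind == "uniform":
        draw = lambda: rng.uniform(-1.0, 1.0)
    else:
        raise ValueError("kind must be 'normal' or 'uniform'")
    return {aa: [draw() for _ in range(n_components)] for aa in AMINO_ACIDS}


def encode_peptide(sequence, descriptor):
    """Concatenate the per-residue descriptor vectors in sequence order."""
    features = []
    for residue in sequence:
        features.extend(descriptor[residue])
    return features


# Usage: a tripeptide encoded with a 3-component random descriptor
# yields 3 residues x 3 components = 9 features.
aad = random_descriptor(3, "uniform", seed=42)
print(len(encode_peptide("VAP", aad)))  # 9

# Model count stated in the abstract: 80 AADs x 8 QPSs x 2 MLMs.
print(80 * 8 * 2)  # 1280
```

Because a random table still assigns each of the 20 amino acids a distinct vector, it can serve as a label for the residues, which is consistent with the abstract's remark that AADs mainly mark and distinguish amino acids. The feature vectors produced this way would then be passed to whichever regression method is used; that step is omitted here because the abstract does not name the two MLMs.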