Structure and Protein Interaction-Based Gene Ontology Annotations Reveal Likely Functions of Uncharacterized Proteins on Human Chromosome 17.
Chengxin Zhang, Xiaoqiong Wei, Gilbert S Omenn, Yang Zhang. Published in: Journal of Proteome Research (2018)
Understanding the function of human proteins is essential for deciphering the molecular mechanisms of human diseases and phenotypes. Of the 17 470 human protein-coding genes in the neXtProt 2018-01-17 database with unequivocal protein existence evidence (PE1), 1260 encode proteins without characterized functions. To reveal the function of poorly annotated human proteins, we developed a hybrid pipeline that predicts protein structure with I-TASSER and infers the function of the target protein from the functional templates recognized by COFACTOR. As a case study, the pipeline was applied to all 66 PE1 proteins with unknown or insufficiently specific function (uPE1) on human chromosome 17 as of neXtProt 2017-07-01. Benchmark testing on a control set of 100 well-characterized proteins randomly selected from the same chromosome shows high Gene Ontology (GO) term prediction accuracies of 0.69, 0.57, and 0.67 for molecular function (MF), biological process (BP), and cellular component (CC), respectively. COFACTOR draws on three function annotation pipelines: homology detection, protein-protein interaction network inference, and structure template identification. Detailed analyses show that structure template detection based on low-resolution protein structure prediction made the largest contribution to the sensitivity and precision of the annotation predictions, especially for proteins that have no sequence-level homologous templates. For the chromosome 17 uPE1 proteins, the I-TASSER/COFACTOR pipeline confidently assigned MF, BP, and CC terms to 13, 33, and 49 proteins, respectively, with predicted functions ranging from sphingosine N-acyltransferase activity and sugar transmembrane transporter to cytoskeleton constitution. We highlight the 13 proteins with confident MF predictions; 11 of these are among the 33 proteins with confident BP predictions, and 12 are among the 49 proteins with confident CC predictions.
This study demonstrates a novel computational approach to systematically annotate protein function in the human proteome and provides useful insights to guide experimental design and follow-up validation studies of these uncharacterized proteins.
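The abstract reports GO term prediction accuracies for the control set but does not say how they were computed. A common choice in GO prediction benchmarks such as the CAFA assessments is the protein-centric maximum F-measure (F-max) over confidence cutoffs. The sketch below illustrates that metric under this assumption; the function name `f_max`, the data layout, and the toy proteins and scores are hypothetical and are not taken from the study.

```python
from typing import Dict, Set


def f_max(predictions: Dict[str, Dict[str, float]],
          truth: Dict[str, Set[str]],
          steps: int = 100) -> float:
    """Protein-centric maximum F-measure over confidence cutoffs.

    predictions: protein -> {GO term -> confidence score in [0, 1]}
    truth:       protein -> set of experimentally annotated GO terms
    """
    best = 0.0
    for i in range(1, steps + 1):
        cutoff = i / steps
        precisions = []  # only proteins with at least one call at this cutoff
        recalls = []     # every benchmark protein
        for protein, gold in truth.items():
            scored = predictions.get(protein, {})
            called = {term for term, score in scored.items() if score >= cutoff}
            hits = len(called & gold)
            if called:
                precisions.append(hits / len(called))
            recalls.append(hits / len(gold) if gold else 0.0)
        if not precisions:
            continue  # nothing predicted at this cutoff
        precision = sum(precisions) / len(precisions)
        recall = sum(recalls) / len(recalls)
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best


# Toy data (hypothetical proteins and scores, for illustration only).
toy_predictions = {
    "P1": {"GO:0003824": 0.9, "GO:0005515": 0.4},
    "P2": {"GO:0005215": 0.7},
}
toy_truth = {
    "P1": {"GO:0003824"},
    "P2": {"GO:0005215", "GO:0022857"},
}

print(round(f_max(toy_predictions, toy_truth), 3))  # prints 0.857
```

In this toy case the best cutoff lies between 0.4 and 0.7, where precision is 1.0 and recall is 0.75. Under this reading, each of the three GO aspects (MF, BP, CC) would be scored separately with its own set of terms.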