Scalable Text Mining Assisted Curation of Post-Translationally Modified Proteoforms in the Protein Ontology.
Karen E Ross, Darren A Natale, Cecilia Arighi, Sheng-Chih Chen, Hongzhan Huang, Gang Li, Jia Ren, Michael Wang, K Vijay-Shanker, Cathy H Wu
Published in: CEUR workshop proceedings (2016)
The Protein Ontology (PRO) defines protein classes and their interrelationships from the family to the protein form (proteoform) level within and across species. One of the unique contributions of PRO is its representation of post-translationally modified (PTM) proteoforms. However, progress in adding PTM proteoform classes to PRO has been relatively slow due to the extensive manual curation effort required. Here we report an automated pipeline for creation of PTM proteoform classes that leverages two phosphorylation-focused text mining tools (RLIMS-P, which detects mentions of kinases, substrates, and phosphorylation sites, and eFIP, which detects phosphorylation-dependent protein-protein interactions (PPIs)) and our integrated PTM database, iPTMnet. By applying this pipeline, we obtained a set of ~820 substrate-site pairs that are suitable for automated PRO term generation with literature-based evidence attribution. Inclusion of these terms in PRO will increase PRO coverage of species-specific PTM proteoforms by 50%. Many of these new proteoforms also have associated kinase and/or PPI information. Finally, we show a phosphorylation network for the human and mouse peptidyl-prolyl cis-trans isomerase (PIN1/Pin1) derived from our dataset that demonstrates the biological complexity of the information we have extracted. Our approach addresses scalability in PRO curation and will be further expanded to advance PRO representation of phosphorylated proteoforms.
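The merging step the abstract describes (grouping text-mined phosphorylation mentions by substrate-site pair, then attaching kinase, PPI, and literature evidence) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the record types `PhosphoMention` and `PpiMention`, the `build_candidates` function, and all identifiers in the sample data are hypothetical placeholders, and the actual RLIMS-P, eFIP, and iPTMnet output formats are not assumed here.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Optional, Set, Tuple


@dataclass(frozen=True)
class PhosphoMention:
    """Hypothetical record for one text-mined phosphorylation mention."""
    substrate: str          # placeholder substrate identifier
    site: str               # placeholder site label, e.g. "S10"
    kinase: Optional[str]   # None when no kinase is mentioned
    pmid: str               # source article identifier


@dataclass(frozen=True)
class PpiMention:
    """Hypothetical record for one phosphorylation-dependent PPI mention."""
    substrate: str
    site: str
    partner: str
    pmid: str


@dataclass
class ProteoformCandidate:
    """Aggregated evidence for one substrate-site pair."""
    substrate: str
    site: str
    kinases: Set[str] = field(default_factory=set)
    partners: Set[str] = field(default_factory=set)
    pmids: Set[str] = field(default_factory=set)


def build_candidates(
    phospho: Iterable[PhosphoMention],
    ppis: Iterable[PpiMention],
) -> Dict[Tuple[str, str], ProteoformCandidate]:
    """Group mentions by (substrate, site) and collect supporting evidence."""
    candidates: Dict[Tuple[str, str], ProteoformCandidate] = {}
    for m in phospho:
        key = (m.substrate, m.site)
        cand = candidates.setdefault(key, ProteoformCandidate(m.substrate, m.site))
        if m.kinase is not None:
            cand.kinases.add(m.kinase)
        cand.pmids.add(m.pmid)
    for p in ppis:
        cand = candidates.get((p.substrate, p.site))
        if cand is None:
            # Assumption: a PPI is kept only if its substrate-site pair
            # already has a phosphorylation mention.
            continue
        cand.partners.add(p.partner)
        cand.pmids.add(p.pmid)
    return candidates


# Placeholder data; none of these identifiers refer to real proteins or papers.
phospho = [
    PhosphoMention("PROT_A", "S10", "KIN_X", "PMID:1"),
    PhosphoMention("PROT_A", "S10", None, "PMID:2"),
    PhosphoMention("PROT_B", "T5", "KIN_Y", "PMID:3"),
]
ppis = [
    PpiMention("PROT_A", "S10", "PROT_C", "PMID:4"),
    PpiMention("PROT_Z", "Y1", "PROT_C", "PMID:5"),
]
candidates = build_candidates(phospho, ppis)
```

Keying on the (substrate, site) pair means every candidate carries the set of article identifiers that support it, which is one way the literature-based evidence attribution mentioned above could be kept alongside each generated term.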