An integrative proteogenomics approach reveals peptides encoded by annotated lincRNA in the mouse kidney inner medulla.
Cameron T Flower, Lihe Chen, Hyun Jun Jung, Viswanathan Raghuram, Mark A Knepper, Chin-Rang Yang
Published in: Physiological genomics (2020)
Long noncoding RNAs (lncRNAs) are intracellular transcripts that are longer than 200 nucleotides and lack protein-coding information. Long intergenic noncoding RNAs (lincRNAs), a subclass of lncRNAs, are transcribed from genomic regions that share no overlap with annotated protein-coding genes. Increasing evidence shows that some annotated lincRNA transcripts do in fact contain open reading frames (ORFs) that encode functional short peptides in the cell. Few robust methods for identifying lincRNA-encoded peptides have been reported, and the tissue-specific expression of these peptides remains largely unexplored. Here we propose an integrative workflow for lincRNA-encoded peptide discovery and test it on the mouse kidney inner medulla (IM). In brief, low-molecular-weight protein fractions were enriched from IM homogenate and trypsinized into shorter peptides, which were sequenced by high-resolution liquid chromatography-tandem mass spectrometry (LC-MS/MS). To curate a database of hypothetical lincRNA-encoded peptides for peptide-spectrum matching following LC-MS/MS, we performed RNA-Seq on IMs, computationally removed reads overlapping annotated protein-coding genes, and remapped the remaining reads to a database of mouse noncoding transcripts to infer lincRNA expression. Expressed lincRNAs were searched for ORFs with an existing rule-based algorithm, and the translated ORFs were used for peptide-spectrum matching. Peptides identified by LC-MS/MS were further evaluated using several quality control criteria and bioinformatics methods. We discovered three novel lincRNA-encoded peptides, which are conserved in mouse, rat, and human. The workflow can be adapted to discover small protein-coding genes in any species or tissue for which noncoding transcriptome information is available.
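The database-building step described above (search expressed lincRNAs for ORFs, translate them, and compare the result against tryptic peptides) can be illustrated with a minimal Python sketch. The abstract does not name the rule-based ORF algorithm that was used, so the rule here (first ATG to the first in-frame stop codon, forward strand only, with a minimum length) is an assumption, as are the function names `find_orfs` and `tryptic_digest`, the length thresholds, and the example sequence.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def find_orfs(transcript, min_aa=10):
    """Translate ATG-to-stop ORFs in the three forward frames.

    Assumed rule (not the paper's algorithm): an ORF starts at the first
    ATG of an open stretch and ends at the first in-frame stop codon.
    ORFs without a stop codon are discarded.
    """
    seq = transcript.upper().replace("U", "T")
    peptides = []
    for frame in range(3):
        aa = []  # residues of the ORF currently being read
        for i in range(frame, len(seq) - 2, 3):
            residue = CODON_TABLE.get(seq[i:i + 3], "X")  # X = ambiguous codon
            if not aa:
                if residue == "M":
                    aa.append("M")
            elif residue == "*":
                if len(aa) >= min_aa:
                    peptides.append("".join(aa))
                aa = []
            else:
                aa.append(residue)
    return peptides


def tryptic_digest(protein, min_len=6):
    """In silico trypsin digest: cleave after K or R unless followed by P."""
    pieces, start = [], 0
    for i, residue in enumerate(protein):
        next_res = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_res != "P":
            pieces.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        pieces.append(protein[start:])
    return [p for p in pieces if len(p) >= min_len]


if __name__ == "__main__":
    # Made-up toy transcript, for illustration only.
    toy = "GGATGGCTAAAGGTCGTCCTGAATAACC"
    for orf in find_orfs(toy, min_aa=5):
        print(orf, tryptic_digest(orf, min_len=1))
```

In a real search, the translated ORFs would be written to a FASTA file and passed to a peptide-spectrum matching engine, which performs its own digestion and scoring; the digest function here only shows which peptides such a database could yield.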
Keyphrases
- amino acid
- rna seq
- single cell
- liquid chromatography tandem mass spectrometry
- binding protein
- high resolution
- protein protein
- genome wide
- poor prognosis
- quality control
- bioinformatics analysis
- endothelial cells
- stem cells
- machine learning
- gene expression
- genome wide identification
- healthcare
- simultaneous determination
- transcription factor
- emergency department
- ms ms
- working memory
- adverse drug
- genetic diversity