The Integration of Proteogenomics and Ribosome Profiling Circumvents Key Limitations to Increase the Coverage and Confidence of Novel Microproteins.

Eduardo Vieira de SouzaAngie L BookoutChristopher A Barnes Brendan MillerPablo MachadoLuiz A BassoCristiano Valim Bizarro Alan Saghatelian

Published in: bioRxiv : the preprint server for biology (2023)

There has been a dramatic increase in the identification of non-conical translation and a significant expansion of the protein-coding genome and proteome. Among the strategies used to identify novel small ORFs (smORFs), Ribosome profiling (Ribo-Seq) is the gold standard for the annotation of novel coding sequences by reporting on smORF translation. In Ribo-Seq, ribosome-protected footprints (RPFs) that map to multiple sites in the genome are computationally removed since they cannot unambiguously be assigned to a specific genomic location, or to a specific transcript in the case of multiple isoforms. Furthermore, RPFs necessarily result in short (25-34 nucleotides) reads, increasing the chance of ambiguous and multi-mapping alignments, such that smORFs that reside in these regions cannot be identified by Ribo-Seq. Here, we show that the inclusion of proteogenomics to create a Ribosome Profiling and Proteogenomics Pipeline (RP3) bypasses this limitation to identify a group of microprotein-encoding smORFs that are missed by current Ribo-Seq pipelines. Moreover, we show that the microproteins identified by RP3 have different sequence compositions from Ribo-Seq only pipelines, which can affect proteomics identification. In aggregate, the creation of RP3 maximizes the detection and confidence of protein-encoding smORFs and microproteins.

Keyphrases