Understanding the causes of errors in eukaryotic protein-coding gene prediction: a case study of primate proteomes.
Corentin Meyer, Nicolas Scalzitti, Anne Jeannin-Girardon, Pierre Collet, Olivier Poch, Julie Dawn Thompson
Published in: BMC Bioinformatics (2020)
Gene prediction errors in primate proteomes affect up to 50% of the sequences. Major causes of errors include undetermined genome regions, genome sequencing or assembly issues, and limitations in the models used to represent gene exon-intron structures. Nevertheless, existing genome sequences can still be exploited to improve protein sequence quality. Future work includes characterizing other types of gene prediction errors and developing a more comprehensive algorithm for protein sequence error correction.