sORFPred: A Method Based on Comprehensive Features and Ensemble Learning to Predict the sORFs in Plant LncRNAs.
Ziwei Chen, Jun Meng, Siyuan Zhao, Chao Yin, Yushi Luan
Published in: Interdisciplinary Sciences: Computational Life Sciences (2023)
Long non-coding RNAs (lncRNAs) are important regulators of biological processes. It has recently been shown that some lncRNAs contain small open reading frames (sORFs) that can encode small peptides of no more than 100 amino acids. However, existing prediction methods have mostly been applied to human and animal datasets and still suffer from limited feature representation capability. Accurate and credible prediction of sORFs with coding ability in plant lncRNAs is therefore needed. This paper proposes a new method termed sORFPred, in which we design a model named MCSEN that combines multi-scale convolution with Squeeze-and-Excitation Networks to mine the distinct information embedded in sORFs, integrate and optimize multiple sequence-based and physicochemical feature descriptors, and build a two-layer prediction classifier based on a Bayesian optimization algorithm and Extra Trees. sORFPred has been evaluated on sORF datasets from three species and on a dataset of experimentally validated sORFs. Results indicate that sORFPred outperforms existing methods and achieves 97.28% accuracy, 97.06% precision, 97.52% recall, and a 97.29% F1-score on Arabidopsis thaliana, a significant improvement in prediction performance over various conventional shallow machine learning and deep learning models.
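To make the sORF definition above concrete, the following is a minimal sketch of how candidate sORFs could be enumerated in a lncRNA sequence before any classifier is applied. It is not the sORFPred pipeline: the function name `find_sorfs`, the restriction to the forward strand, the canonical ATG start codon, and the three standard stop codons are all assumptions made for illustration. Only the 100-amino-acid limit comes from the abstract.

```python
# Illustrative sketch only; not the authors' implementation.
STOP_CODONS = {"TAA", "TAG", "TGA"}   # assumption: standard stop codons
MAX_PEPTIDE_LEN = 100                 # from the abstract: at most 100 amino acids


def find_sorfs(seq, max_aa=MAX_PEPTIDE_LEN):
    """Return (start, end, frame) tuples for forward-strand ORFs that begin
    with ATG, end with an in-frame stop codon, and encode at most `max_aa`
    amino acids. `end` is exclusive and includes the stop codon."""
    seq = seq.upper().replace("U", "T")  # accept RNA or DNA alphabet
    hits = []
    for frame in range(3):
        i = frame
        while i + 3 <= len(seq):
            if seq[i:i + 3] != "ATG":
                i += 3
                continue
            # Walk codon by codon until an in-frame stop codon or the end.
            j = i + 3
            while j + 3 <= len(seq) and seq[j:j + 3] not in STOP_CODONS:
                j += 3
            if j + 3 > len(seq):
                break  # no in-frame stop downstream of this start codon
            n_aa = (j - i) // 3  # codons before the stop, including Met
            if n_aa <= max_aa:
                hits.append((i, j + 3, frame))
            # Resume after the stop; ATGs nested inside this ORF are skipped.
            i = j + 3
    return hits


if __name__ == "__main__":
    print(find_sorfs("CCATGAAATTTTAGGG"))
```

Candidates found this way would still need to be scored for coding ability, which is the step sORFPred addresses with its learned features and two-layer classifier.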
Keyphrases
- machine learning
- deep learning
- long non-coding RNA
- neural network
- Arabidopsis thaliana
- amino acid
- artificial intelligence
- network analysis
- convolutional neural network
- genome wide analysis
- genome wide identification
- big data
- cell wall
- genetic diversity