Capturing large genomic contexts for accurately predicting enhancer-promoter interactions.
Ken Chen, Huiying Zhao, Yuedong Yang
Published in: Briefings in Bioinformatics (2022)
Enhancer-promoter interaction (EPI) is a key mechanism underlying gene regulation. EPI prediction remains a challenging task because enhancers can regulate the promoters of distant target genes. Although many machine learning models have been developed, they use only the features within enhancers and promoters, or simply add the average genomic signals in the regions between them, without using detailed features between or outside enhancers and promoters. Lacking such large-scale features, existing methods achieve only moderate performance, especially when predicting EPIs across different cell types. Here, we present TransEPI, a Transformer-based model that predicts EPIs by capturing large genomic contexts. TransEPI was developed on EPI datasets derived from Hi-C or ChIA-PET data in six cell lines. To guard against over-fitting, we evaluated TransEPI on independent test datasets in which both the cell line and the chromosome differ from those in the training data. TransEPI not only achieved consistent performance across the cross-validation and test datasets from different cell types but also outperformed state-of-the-art machine learning and deep learning models. In addition, we found that the improved performance of TransEPI was attributable to the integration of large genomic contexts. Lastly, we extended TransEPI to study non-coding mutations associated with brain disorders or neural diseases, and found that it was also useful for predicting the target genes of non-coding mutations.
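The held-out evaluation described in the abstract, where test examples come from a cell line and a chromosome that both differ from the training data, can be illustrated with a minimal sketch. This is not the authors' code: the `Pair` record, the `split_holdout` function, and the cell-line names `cellA` and `cellB` are hypothetical placeholders, and the rule of discarding pairs that share only one held-out attribute is an assumption made here to keep the two sets fully disjoint.

```python
from typing import Iterable, List, NamedTuple, Set, Tuple


class Pair(NamedTuple):
    """One candidate enhancer-promoter pair (hypothetical record layout)."""
    cell_line: str
    chrom: str
    enhancer_start: int
    promoter_start: int
    label: int  # 1 = interacting, 0 = non-interacting


def split_holdout(
    pairs: Iterable[Pair],
    test_cell_lines: Set[str],
    test_chroms: Set[str],
) -> Tuple[List[Pair], List[Pair]]:
    """Split pairs so that test differs from train in cell line AND chromosome."""
    train, test = [], []
    for p in pairs:
        in_test_cell = p.cell_line in test_cell_lines
        in_test_chrom = p.chrom in test_chroms
        if in_test_cell and in_test_chrom:
            test.append(p)
        elif not in_test_cell and not in_test_chrom:
            train.append(p)
        # Assumption: pairs matching only one held-out attribute are dropped,
        # so no cell line or chromosome appears in both sets.
    return train, test


# Toy data with placeholder cell lines; coordinates are made up.
pairs = [
    Pair("cellA", "chr1", 1_000, 50_000, 1),
    Pair("cellA", "chr2", 2_000, 90_000, 0),
    Pair("cellB", "chr1", 3_000, 70_000, 1),
    Pair("cellB", "chr2", 4_000, 60_000, 0),
]
train_set, test_set = split_holdout(pairs, {"cellB"}, {"chr2"})
```

Splitting on both attributes at once is stricter than holding out by cell line or chromosome alone, at the cost of discarding the pairs that fall in neither set.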
Keyphrases
- machine learning
- copy number
- deep learning
- transcription factor
- big data
- single cell
- genome wide
- rna seq
- dna methylation
- artificial intelligence
- cell therapy
- electronic health record
- gene expression
- computed tomography
- binding protein
- genome wide identification
- pet ct
- stem cells
- white matter
- bioinformatics analysis
- mesenchymal stem cells
- positron emission tomography
- bone marrow
- blood brain barrier
- high intensity
- data analysis
- brain injury
- genome wide analysis
- virtual reality