DNA breathing integration with deep learning foundational model advances genome-wide binding prediction of human transcription factors.
Anowarul Kabir, Manish Bhattarai, Selma Peterson, Yonatan Najman-Licht, Kim Ø Rasmussen, Amarda Shehu, Alan R Bishop, Boian S Alexandrov, Anny Usheva
Published in: Nucleic acids research (2024)
It was previously shown that DNA breathing, thermodynamic stability, transcriptional activity, and transcription factor (TF) binding are functionally correlated. To ascertain the precise relationship between TF binding and DNA breathing, we developed the multi-modal deep learning model EPBDxDNABERT-2, which is based on the Extended Peyrard-Bishop-Dauxois (EPBD) nonlinear DNA dynamics model. To train EPBDxDNABERT-2, we used chromatin immunoprecipitation sequencing (ChIP-seq) data from 690 ChIP-seq experiments covering 161 distinct TFs and 91 human cell types. EPBDxDNABERT-2 significantly improves TF-DNA binding prediction in over 660 cases, with an increase in the area under the receiver operating characteristic (AUROC) metric of up to 9.6% compared to the baseline model that does not leverage DNA biophysical properties. We expanded our analysis to an in vitro high-throughput Systematic Evolution of Ligands by Exponential enrichment (HT-SELEX) dataset of 215 TFs from 27 families, comparing EPBD with established frameworks. Integrating the DNA breathing features with the DNABERT-2 foundational model greatly enhanced TF-binding predictions. Notably, EPBDxDNABERT-2, trained on large-scale multi-species genomes with a cross-attention mechanism, improved predictive power, shedding light on the mechanisms underlying disease-related non-coding variants discovered in genome-wide association studies.
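For readers unfamiliar with the AUROC metric used above, the following is a minimal sketch of how it can be computed and compared between two classifiers. It is not the authors' evaluation code. The function name `auroc`, the labels, and both score lists are hypothetical toy values chosen only for illustration, and they bear no relation to the results reported in the paper.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive (bound) example
    receives a higher score than a randomly chosen negative (unbound) one,
    with ties counted as one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = TF bound, 0 = not bound.
labels = [1, 1, 1, 0, 0, 0]
# Made-up scores standing in for a sequence-only baseline ...
baseline_scores = [0.9, 0.4, 0.35, 0.5, 0.3, 0.2]
# ... and for a model that also uses extra features (assumed, for illustration).
fused_scores = [0.9, 0.6, 0.55, 0.5, 0.3, 0.2]

baseline_auc = auroc(labels, baseline_scores)
fused_auc = auroc(labels, fused_scores)
print(f"baseline AUROC: {baseline_auc:.3f}")
print(f"fused AUROC:    {fused_auc:.3f}")
print(f"absolute gain:  {fused_auc - baseline_auc:.3f}")
```

The pairwise form is quadratic in the number of examples, which is fine for a toy illustration; genome-wide evaluations would normally use a rank-based implementation from an established library.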
Keyphrases
- circulating tumor
- transcription factor
- cell free
- genome wide
- single molecule
- high throughput
- deep learning
- single cell
- circulating tumor cells
- endothelial cells
- nucleic acid
- dna binding
- gene expression
- dna methylation
- stem cells
- artificial intelligence
- binding protein
- dna damage
- machine learning
- high resolution
- big data
- oxidative stress
- induced pluripotent stem cells
- body composition
- convolutional neural network
- high intensity