Advancing Transcription Factor Binding Site Prediction Using DNA Breathing Dynamics and Sequence Transformers via Cross Attention.
Anowarul Kabir, Manish Bhattarai, Kim Ø Rasmussen, Amarda Shehu, Alan R Bishop, Boian S Alexandrov, Anny Usheva
Published in: bioRxiv : the preprint server for biology (2024)
Understanding the impact of genomic variants on transcription factor binding and gene regulation remains a key area of research, with implications for unraveling the complex mechanisms underlying various functional effects. Our study examines the role of DNA's biophysical properties, including thermodynamic stability, shape, and flexibility, in transcription factor (TF) binding. We developed a multi-modal deep learning model that integrates these properties with DNA sequence data. Trained on in vivo ChIP-Seq (chromatin immunoprecipitation sequencing) data covering 690 TF-DNA binding events in the human genome, our model significantly improves prediction performance for more than 660 of these binding events, with up to a 9.6% increase in AUROC over a baseline model that uses no explicit DNA biophysical properties. We further extended our analysis to in vitro high-throughput Systematic Evolution of Ligands by Exponential enrichment (SELEX) and Protein Binding Microarray (PBM) datasets, comparing our model with established frameworks. The inclusion of DNA breathing features consistently improved TF binding predictions across different cell lines in these datasets. Notably, for complex ChIP-Seq datasets, integrating DNABERT2 with a cross-attention mechanism provided greater predictive capabilities and insights into the mechanisms of disease-related non-coding variants found in genome-wide association studies. This work highlights the importance of DNA biophysical characteristics in TF binding and the effectiveness of multi-modal deep learning models in gene regulation studies.
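To make the cross-attention idea in the abstract concrete, the following is a minimal sketch of scaled dot-product cross-attention between two modalities. It is not the authors' implementation. The abstract does not say which modality supplies the queries, so the choice here (sequence token embeddings as queries, DNA breathing feature vectors as keys and values) is an assumption, as are the function names and the toy numbers. A practical model would typically add learned projection matrices and multiple heads, both omitted here for brevity.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention without learned projections.

    queries: vectors from one modality (assumed here: sequence token embeddings)
    keys, values: vectors from the other modality (assumed here: breathing features)
    Returns (outputs, weights); weights[i][j] is how strongly query i attends to key j.
    """
    d = len(queries[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        # Each output is a weighted average of the value vectors.
        out = [
            sum(w[j] * values[j][c] for j in range(len(values)))
            for c in range(len(values[0]))
        ]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


# Hypothetical toy inputs: 3 sequence tokens and 2 breathing feature vectors, all 2-dimensional.
seq_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
breathing = [[0.5, 0.1], [0.2, 0.9]]

fused, attn = cross_attention(seq_tokens, breathing, breathing)
```

Each row of `attn` is a probability distribution over the breathing feature vectors, and each row of `fused` is the corresponding weighted average, so every sequence token ends up with a representation informed by the biophysical features it attends to most.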
Keyphrases
- dna binding
- transcription factor
- circulating tumor
- cell free
- high throughput
- single molecule
- deep learning
- rna seq
- single cell
- genome wide
- circulating tumor cells
- binding protein
- randomized controlled trial
- working memory
- nucleic acid
- endothelial cells
- copy number
- dna damage
- electronic health record
- gene expression
- artificial intelligence
- small molecule
- systematic review
- convolutional neural network