DeepG4: A deep learning approach to predict cell-type specific active G-quadruplex regions.
Vincent Rocher, Matthieu Genais, Elissar Nassereddine, Raphaël Mourad
Published in: PLoS computational biology (2021)
DNA is a complex molecule carrying the instructions an organism needs to develop, live and reproduce. In 1953, Watson and Crick discovered that DNA is composed of two chains forming a double helix. Later on, other DNA structures were discovered and shown to play important roles in the cell, in particular the G-quadruplex (G4). Following genome sequencing, several bioinformatic algorithms were developed to map G4s in vitro based on a canonical sequence motif, G-richness and G-skewness, or alternatively on sequence features including k-mers, and more recently on machine/deep learning. Recently, new sequencing techniques were developed to map G4s in vitro (G4-seq) and in vivo (G4 ChIP-seq) at a resolution of a few hundred bases. Here, we propose a novel convolutional neural network (DeepG4) to map cell-type specific active G4 regions (i.e. regions within which G4s form both in vitro and in vivo). DeepG4 predicts active G4 regions in different cell types with high accuracy. Moreover, DeepG4 identifies key DNA motifs that are predictive of G4 region activity. We found that such motifs do not follow the very flexible sequence pattern that current algorithms search for. Instead, active G4 regions are determined by numerous specific motifs. Among those motifs, we identified known transcription factors (TFs) that could play important roles in G4 activity, either by contributing directly to the G4 structures themselves or indirectly by participating in G4 formation in the vicinity. In addition, we used DeepG4 to predict active G4 regions in a large number of tissues and cancers, thereby providing a comprehensive resource for researchers. Availability: https://github.com/morphos30/DeepG4.
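To make the sequence-based baselines mentioned in the abstract concrete, the following is a minimal Python sketch of a canonical-motif scan together with G-richness and G-skewness scores. It is not taken from DeepG4 or its repository. The motif form (four runs of at least three G separated by loops of 1 to 7 bases), the skewness definition (G - C)/(G + C), and all function names are assumptions chosen for illustration, and only the given strand is scanned.

```python
import re

# Assumed canonical G4 motif: four runs of >= 3 guanines separated by
# loops of 1 to 7 bases. Published tools differ in these bounds.
G4_MOTIF = re.compile(r"G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}")


def find_canonical_g4(seq):
    """Return (start, end) spans of non-overlapping motif matches on the given strand."""
    return [m.span() for m in G4_MOTIF.finditer(seq.upper())]


def g_richness(seq):
    """Fraction of bases that are G (assumed definition); 0.0 for an empty sequence."""
    seq = seq.upper()
    return seq.count("G") / len(seq) if seq else 0.0


def g_skewness(seq):
    """Strand asymmetry (G - C) / (G + C) (assumed definition); 0.0 if neither base occurs."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    return (g - c) / (g + c) if (g + c) else 0.0


# Illustrative input: four telomere-like TTAGGG repeats plus two trailing bases.
example = "TTAGGGTTAGGGTTAGGGTTAGGGTT"
spans = find_canonical_g4(example)
richness = g_richness(example)
skewness = g_skewness(example)
```

A fixed pattern like this is the kind of flexible motif search that the abstract contrasts with the numerous specific motifs learned by the network.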
Keyphrases
- deep learning
- single cell
- convolutional neural network
- circulating tumor
- single molecule
- cell free
- machine learning
- genome wide
- rna seq
- artificial intelligence
- high throughput
- transcription factor
- high resolution
- cell therapy
- gene expression
- nucleic acid
- circulating tumor cells
- stem cells
- mesenchymal stem cells
- high density
- dna methylation
- coronavirus disease
- dna binding
- bone marrow