DARDN: A Deep-Learning Approach for CTCF Binding Sequence Classification and Oncogenic Regulatory Feature Discovery.
Hyun Jae ChoZhenjia WangYidan CongStefan BekiranovAidong ZhangChongzhi ZangPublished in: Genes (2024)
Characterization of gene regulatory mechanisms in cancer is a key task in cancer genomics. CCCTC-binding factor (CTCF), a DNA binding protein, exhibits specific binding patterns in the genome of cancer cells and has a non-canonical function to facilitate oncogenic transcription programs by cooperating with transcription factors bound at flanking distal regions. Identification of DNA sequence features from a broad genomic region that distinguish cancer-specific CTCF binding sites from regular CTCF binding sites can help find oncogenic transcription factors in a cancer type. However, the presence of long DNA sequences without localization information makes it difficult to perform conventional motif analysis. Here, we present DNAResDualNet (DARDN), a computational method that utilizes convolutional neural networks (CNNs) for predicting cancer-specific CTCF binding sites from long DNA sequences and employs DeepLIFT, a method for interpretability of deep learning models that explains the model's output in terms of the contributions of its input features. The method is used for identifying DNA sequence features associated with cancer-specific CTCF binding. Evaluation on DNA sequences associated with CTCF binding sites in T-cell acute lymphoblastic leukemia (T-ALL) and other cancer types demonstrates DARDN's ability in classifying DNA sequences surrounding cancer-specific CTCF binding from control constitutive CTCF binding and identifying sequence motifs for transcription factors potentially active in each specific cancer type. We identify potential oncogenic transcription factors in T-ALL, acute myeloid leukemia (AML), breast cancer (BRCA), colorectal cancer (CRC), lung adenocarcinoma (LUAD), and prostate cancer (PRAD). Our work demonstrates the power of advanced machine learning and feature discovery approach in finding biologically meaningful information from complex high-throughput sequencing data.
Keyphrases
- papillary thyroid
- transcription factor
- deep learning
- machine learning
- prostate cancer
- squamous cell
- acute myeloid leukemia
- acute lymphoblastic leukemia
- circulating tumor
- healthcare
- single molecule
- cell free
- dna binding
- gene expression
- high throughput
- small molecule
- minimally invasive
- public health
- big data
- allogeneic hematopoietic stem cell transplantation
- risk assessment
- genome wide
- young adults
- social media
- human health