Integration of Multimodal Data from Disparate Sources for Identifying Disease Subtypes
Kaiyue Zhou, Bhagya Shree Kottoori, Seeya Awadhut Munj, Zhewei Zhang, Sorin Draghici, Suzan Arslanturk. Published in: Biology (2022)
Studies over the past decade have generated a wealth of molecular data that can be leveraged to better understand cancer risk, progression, and outcomes. However, understanding progression risk and differentiating long- and short-term survivors cannot be achieved by analyzing data from a single modality, owing to the heterogeneity of the disease. A rigorously developed and tested deep-learning approach that leverages aggregate information collected from multiple repositories across multiple modalities (e.g., mRNA, DNA methylation, miRNA) could lead to more accurate and robust predictions of disease progression. Here, we propose an autoencoder-based multimodal data fusion system in which a fusion encoder flexibly integrates the collective information available across multiple studies with partially coupled data. Our results on a fully controlled, simulation-based study show that inferring the missing data through the proposed data fusion pipeline yields a predictor superior to baseline predictors trained with missing modalities. The results further show that short- and long-term survivors of glioblastoma multiforme, acute myeloid leukemia, and pancreatic adenocarcinoma can be successfully differentiated, with AUCs of 0.94, 0.75, and 0.96, respectively.
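The core architectural idea described in the abstract (per-modality encoders feeding a shared fused representation from which every modality, including a missing one, can be decoded) can be sketched in a few lines. Below is a minimal, hypothetical PyTorch illustration, not the authors' implementation: the class name, layer sizes, and the mean-pooling fusion rule are assumptions made for clarity, and the paper's fusion encoder may be structured differently.

```python
# Hypothetical sketch of an autoencoder-based multimodal fusion model with
# missing-modality imputation, loosely following the idea in the abstract.
# Names, layer sizes, and the mean-pooling fusion rule are illustrative
# assumptions, not the authors' exact architecture.
import torch
import torch.nn as nn

class FusionAutoencoder(nn.Module):
    def __init__(self, modality_dims, latent_dim=64):
        super().__init__()
        # One encoder and one decoder per modality (e.g., mRNA, methylation, miRNA).
        self.encoders = nn.ModuleDict({
            name: nn.Sequential(nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, latent_dim))
            for name, dim in modality_dims.items()
        })
        self.decoders = nn.ModuleDict({
            name: nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, dim))
            for name, dim in modality_dims.items()
        })

    def forward(self, inputs):
        # Encode whichever modalities are present for this sample/study.
        codes = [self.encoders[name](x) for name, x in inputs.items()]
        # Fuse by averaging the per-modality embeddings (one simple choice).
        fused = torch.stack(codes, dim=0).mean(dim=0)
        # Decode *all* modalities from the fused code, so missing ones are imputed.
        return {name: dec(fused) for name, dec in self.decoders.items()}, fused

# Usage: train with reconstruction loss on the observed modalities only; the
# fused code can later feed a downstream survival classifier.
dims = {"mrna": 2000, "methylation": 1500, "mirna": 500}
model = FusionAutoencoder(dims)
batch = {"mrna": torch.randn(8, 2000), "mirna": torch.randn(8, 500)}  # methylation missing
recon, fused = model(batch)
loss = sum(nn.functional.mse_loss(recon[m], batch[m]) for m in batch)
loss.backward()
```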
Keyphrases
- big data
- dna methylation
- acute myeloid leukemia
- deep learning
- data analysis
- copy number