Precious2GPT: the combination of multiomics pretrained transformer and conditional diffusion for artificial multi-omics multi-species multi-tissue sample generation.

Denis Sidorenko, Stefan Pushkov, Akhmed Sakip, Geoffrey Ho Duen Leung, Sarah Wing Yan Lok, Anatoly Urban, Diana Zagirova, Alexander Veviorskiy, Nina Tihonova, Aleksandr Kalashnikov, Ekaterina Kozlova, Vladimir Naumov, Frank W Pun, Alex Aliper, Feng Ren, Alex Zhavoronkov
Published in: npj Aging (2024)
Synthetic data generation in omics mimics real-world biological data, providing alternatives for training and evaluation of genomic analysis tools, controlling differential expression, and exploring data architecture. We previously developed Precious1GPT, a multimodal transformer trained on transcriptomic and methylation data, along with metadata, for predicting biological age and identifying dual-purpose therapeutic targets potentially implicated in aging and age-associated diseases. In this study, we introduce Precious2GPT, a multimodal architecture that integrates Conditional Diffusion (CDiffusion) and decoder-only Multi-omics Pretrained Transformer (MoPT) models trained on gene expression and DNA methylation data. Precious2GPT excels in synthetic data generation, outperforming Conditional Generative Adversarial Networks (CGANs), CDiffusion, and MoPT. We demonstrate that Precious2GPT is capable of generating representative synthetic data that captures tissue- and age-specific information from real transcriptomics and methylomics data. Notably, Precious2GPT surpasses other models in age prediction accuracy using the generated data, and it can generate data beyond 120 years of age. Furthermore, we showcase the potential of using this model in identifying gene signatures and potential therapeutic targets in a colorectal cancer case study.
Keyphrases
  • electronic health record
  • gene expression
  • DNA methylation
  • big data
  • healthcare
  • machine learning
  • cross sectional
  • resistance training
  • human health