TDC-2: Multimodal Foundation for Therapeutic Science.
Alejandro Velez-Arce, Kexin Huang, Michelle M Li, Xiang Lin, Wenhao Gao, Tianfan Fu, Manolis Kellis, Bradley L Pentelute, Marinka Zitnik
Published in: bioRxiv : the preprint server for biology (2024)
Therapeutics Data Commons (tdcommons.ai) is an open science initiative with unified datasets, AI models, and benchmarks to support research across therapeutic modalities and across the stages of drug discovery and development. The Commons 2.0 (TDC-2) is a comprehensive overhaul of Therapeutics Data Commons intended to catalyze research in multimodal models for drug discovery. It unifies the single-cell biology of diseases, the biochemistry of molecules, and the effects of drugs through multimodal datasets, AI-powered API endpoints, new multimodal tasks and model frameworks, and comprehensive benchmarks. TDC-2 introduces over 1,000 multimodal datasets spanning approximately 85 million cells, pre-calculated embeddings from 5 state-of-the-art single-cell models, and a biomedical knowledge graph. TDC-2 drastically expands the coverage of ML tasks across therapeutic pipelines and adds 10+ new modalities, including but not limited to single-cell gene expression data, clinical trial data, peptide sequence data, peptidomimetic protein-peptide interaction data on newly discovered ligands derived from AS-MS spectroscopy, novel 3D structural data for proteins, and cell-type-specific protein-protein interaction networks at single-cell resolution. TDC-2 introduces multimodal data access under an API-first design using the model-view-controller paradigm. TDC-2 introduces 7 novel ML tasks with fine-grained biological contexts, including contextualized drug-target identification, single-cell chemical/genetic perturbation response prediction, protein-peptide binding affinity prediction, and clinical trial outcome prediction, which introduce antigen-processing-pathway-specific, cell-type-specific, peptide-specific, and patient-specific biological contexts. TDC-2 also releases benchmarks that evaluate 15+ state-of-the-art models across 5+ new learning tasks, covering diverse biological contexts and sampling approaches. Among these, TDC-2 provides the first benchmark for context-specific learning. TDC-2 is also, to our knowledge, the first to introduce a protein-peptide binding interaction benchmark.
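To make the model-view-controller idea mentioned in the abstract concrete, the following is a minimal, generic sketch of how such a separation could look for dataset access. It is not the TDC-2 API: the class names (`DatasetModel`, `TableView`, `DataController`), the dataset name, and the toy records are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetModel:
    """Model: owns the data (hypothetical in-memory store, not TDC-2's backend)."""
    _store: dict = field(default_factory=dict)

    def register(self, name, records):
        # Keep a private copy of the records under the given dataset name.
        self._store[name] = list(records)

    def fetch(self, name):
        if name not in self._store:
            raise KeyError(f"unknown dataset: {name}")
        return self._store[name]


class TableView:
    """View: renders records as tab-separated text with a header row."""

    def render(self, records):
        if not records:
            return ""
        columns = list(records[0])
        lines = ["\t".join(columns)]
        for record in records:
            lines.append("\t".join(str(record[c]) for c in columns))
        return "\n".join(lines)


class DataController:
    """Controller: the single entry point; routes a request to model, then view."""

    def __init__(self, model, view):
        self.model = model
        self.view = view

    def get(self, dataset, **filters):
        records = self.model.fetch(dataset)
        # Keep only records whose fields equal every requested filter value.
        selected = [
            r for r in records
            if all(r.get(key) == value for key, value in filters.items())
        ]
        return self.view.render(selected)


# Toy usage with placeholder values (not real measurements).
model = DatasetModel()
model.register("toy_perturbation", [
    {"cell_type": "T cell", "drug": "drug_A", "response": 0.8},
    {"cell_type": "B cell", "drug": "drug_A", "response": 0.1},
])
controller = DataController(model, TableView())
print(controller.get("toy_perturbation", cell_type="T cell"))
```

In a design like this, callers only talk to the controller, so the storage layer or the output format can be swapped without changing calling code. That is one plausible reading of "API-first" access across modalities; the abstract does not describe how TDC-2 actually implements it.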
Keyphrases
- single cell
- electronic health record
- rna seq
- clinical trial
- protein protein
- drug discovery
- gene expression
- big data
- pain management
- healthcare
- artificial intelligence
- high throughput
- emergency department
- dna methylation
- binding protein
- randomized controlled trial
- public health
- data analysis
- machine learning
- ms ms
- genome wide
- multiple sclerosis
- cell death
- study protocol
- deep learning
- oxidative stress
- endoplasmic reticulum stress
- air pollution
- transcription factor
- phase iii
- upper limb
- double blind
- virtual reality