Supervised learning of high-confidence phenotypic subpopulations from single-cell data.
Tao Ren, Canping Chen, Alexey V Danilov, Susan Liu, Xiangnan Guan, Shunyi Du, Xiwei Wu, Mara H Sherman, Paul T Spellman, Lisa M Coussens, Andrew C Adey, Gordon B Mills, Ling-Yun Wu, Zheng Xia
Published in: bioRxiv : the preprint server for biology (2023)
Accurately identifying phenotype-relevant cell subsets from heterogeneous cell populations is crucial for delineating the mechanisms that drive biological or clinical phenotypes. Here, by deploying a learning-with-rejection strategy, we developed a novel supervised learning framework called PENCIL to identify subpopulations associated with categorical or continuous phenotypes from single-cell data. By embedding a feature selection function into this flexible framework, we were able, for the first time, to select informative features and identify cell subpopulations simultaneously, which enables the accurate identification of phenotypic subpopulations that are otherwise missed by methods incapable of concurrent gene selection. Furthermore, the regression mode of PENCIL offers a novel capability: supervised learning of phenotypic trajectories of subpopulations from single-cell data. We conducted comprehensive simulations to evaluate PENCIL’s versatility in simultaneous gene selection, subpopulation identification and phenotypic trajectory prediction. PENCIL is fast and scalable, analyzing 1 million cells within 1 hour. Using the classification mode, PENCIL detected T-cell subpopulations associated with melanoma immunotherapy outcomes. Moreover, when applied to scRNA-seq data from a mantle cell lymphoma patient sampled at multiple time points during drug treatment, the regression mode of PENCIL revealed a transcriptional treatment response trajectory. Collectively, our work introduces a scalable and flexible infrastructure to accurately identify phenotype-associated subpopulations from single-cell data.
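The learning-with-rejection idea mentioned above can be illustrated with a minimal sketch. The abstract does not describe how PENCIL implements rejection, so the code below is not PENCIL's method: it swaps in the simplest stand-in, a fixed confidence threshold applied to class probabilities that some classifier has already produced. The function name `predict_with_rejection`, the `threshold` parameter, and the probability values are all hypothetical and chosen only for illustration.

```python
from typing import List, Optional, Sequence


def predict_with_rejection(
    probs: Sequence[Sequence[float]], threshold: float = 0.8
) -> List[Optional[int]]:
    """Return the most probable class per cell, or None when the cell is rejected.

    A cell is rejected when its highest class probability is below `threshold`.
    This is a post-hoc confidence cutoff, assumed here for illustration only.
    """
    labels: List[Optional[int]] = []
    for p in probs:
        # Index of the most probable class (first index wins on ties).
        best = max(range(len(p)), key=lambda k: p[k])
        labels.append(best if p[best] >= threshold else None)
    return labels


# Hypothetical per-cell probabilities for two phenotype classes.
cell_probs = [
    [0.95, 0.05],  # confidently class 0
    [0.55, 0.45],  # ambiguous, rejected
    [0.10, 0.90],  # confidently class 1
    [0.50, 0.50],  # ambiguous, rejected
]

labels = predict_with_rejection(cell_probs, threshold=0.8)
# Indices of cells kept as high-confidence, phenotype-associated cells.
kept = [i for i, label in enumerate(labels) if label is not None]

print(labels)
print(kept)
```

In this sketch the rejected cells are simply left unlabeled, so only cells whose phenotype assignment is confident form the reported subpopulation. A lower threshold keeps more cells at the cost of confidence, which is the trade-off any rejection scheme has to balance.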
Keyphrases
- single cell
- RNA-seq
- machine learning
- high throughput
- electronic health record
- big data
- genome wide
- gene expression
- blood pressure
- copy number
- data analysis
- stem cells
- mass spectrometry
- mesenchymal stem cells
- radiation therapy
- DNA methylation
- squamous cell carcinoma
- insulin resistance
- signaling pathway
- bone marrow
- artificial intelligence
- transcription factor
- genome wide identification
- cell proliferation
- molecular dynamics
- high resolution
- smoking cessation
- drug induced
- weight loss
- replacement therapy
- endoplasmic reticulum stress
- combination therapy
- locally advanced