Label-aware distance mitigates temporal and spatial variability for clustering and visualization of single-cell gene expression data.
Shaoheng Liang, Jinzhuang Dou, Ramiz Iqbal, Ken Chen
Published in: Communications Biology (2024)
Clustering and visualization are essential parts of single-cell gene expression data analysis. The Euclidean distance used in most distance-based methods is not optimal. The batch effect, i.e., the variability among samples collected at different times or from different tissues and patients, introduces large between-group distances and obscures the true identities of cells. To solve this problem, we introduce Label-Aware Distance (LAD), a metric that uses the temporal/spatial locality of the batch effect to control for such factors. We validate LAD on simulated data and apply it to a mouse retina development dataset and a lung dataset. We also demonstrate the utility of our approach in understanding the progression of coronavirus disease 2019 (COVID-19). LAD provides better cell embeddings than state-of-the-art batch correction methods on longitudinal datasets. It can be used in distance-based clustering and visualization methods to combine the power of multiple samples and help make biological findings.
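The abstract does not give the LAD formula, so the sketch below is not the published method. It only illustrates the general idea of letting sample labels inform a distance: each batch's mean expression is subtracted before computing Euclidean distances, a simple stand-in technique (per-batch mean centering). The function name `batch_centered_distances` and the toy data are assumptions made for illustration.

```python
import numpy as np


def batch_centered_distances(X, labels):
    """Pairwise Euclidean distances after per-batch mean centering.

    Illustrative stand-in only; NOT the LAD metric from the paper.
    X: (n_cells, n_genes) expression matrix.
    labels: (n_cells,) batch label per cell (e.g. time point or tissue).
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    centered = X.copy()
    for lab in np.unique(labels):
        mask = labels == lab
        # Remove this batch's mean profile so a constant batch offset cancels.
        centered[mask] -= X[mask].mean(axis=0)
    # Broadcast to all pairwise differences, then take the L2 norm.
    diff = centered[:, None, :] - centered[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


# Hypothetical toy data: two cell types, two batches.
# Batch 1 is batch 0 shifted by a constant offset of (10, 10).
X = np.array([[0.0, 0.0], [4.0, 0.0], [10.0, 10.0], [14.0, 10.0]])
labels = np.array([0, 0, 1, 1])

plain = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
aware = batch_centered_distances(X, labels)

print("plain distance, cell 0 vs cell 2:", plain[0, 2])
print("label-informed distance, cell 0 vs cell 2:", aware[0, 2])
```

In this toy case, cells 0 and 2 are the same cell type observed in different batches; plain Euclidean distance places them far apart, while the centered distance does not. Mean centering assumes every batch has a similar cell-type composition, which rarely holds for longitudinal data, so it should be read as a baseline for intuition rather than a substitute for LAD.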
Keyphrases
- single cell
- gene expression
- rna seq
- coronavirus disease
- data analysis
- end stage renal disease
- high throughput
- dna methylation
- chronic kidney disease
- induced apoptosis
- electronic health record
- big data
- peritoneal dialysis
- ejection fraction
- prognostic factors
- newly diagnosed
- machine learning
- oxidative stress
- radiation therapy
- endoplasmic reticulum stress
- diabetic retinopathy
- cross sectional
- respiratory syndrome coronavirus
- cell proliferation
- artificial intelligence
- radiation induced
- bone marrow
- deep learning
- pi3k akt
- cell therapy
- optic nerve