Batch-Corrected Distance Mitigates Temporal and Spatial Variability for Clustering and Visualization of Single-Cell Gene Expression Data.
Ken Chen, Shaoheng Liang, Jinzhuang Dou, Ramiz Iqbal
Published in: Research Square (2023)
Clustering and visualization are essential parts of single-cell gene expression data analysis. The Euclidean distance used in most distance-based methods is not optimal for this purpose. The batch effect, i.e., the variability among samples collected at different times and from different tissues and patients, introduces large between-group distances and obscures the true identities of cells. To solve this problem, we introduce Batch-Corrected Distance (BCD), a metric that uses the temporal/spatial locality of the batch effect to control for such factors. We validate BCD on simulated data and apply it to a mouse retina development dataset and a lung dataset. We also demonstrate the utility of our approach in understanding the progression of Coronavirus Disease 2019 (COVID-19). On longitudinal datasets, BCD achieves more accurate clusters and better visualizations than state-of-the-art batch correction methods. BCD can be directly integrated with most clustering and visualization methods to enable more scientific findings.
Keyphrases
- single cell
- gene expression
- rna seq
- coronavirus disease
- data analysis
- anaerobic digestion
- dna methylation
- end stage renal disease
- high throughput
- sars cov
- ejection fraction
- electronic health record
- chronic kidney disease
- newly diagnosed
- induced apoptosis
- big data
- respiratory syndrome coronavirus
- prognostic factors
- machine learning
- cross sectional
- oxidative stress
- cell cycle arrest