Preprocessing of Single Cell RNA Sequencing Data Using Correlated Clustering and Projection.
Yuta Hozumi, Kiyoto Aramis Tanemura, Guo-Wei Wei
Published in: Journal of Chemical Information and Modeling (2023)
Single-cell RNA sequencing (scRNA-seq) is widely used to reveal heterogeneity in cells, and it has yielded insights into cell-cell communication, cell differentiation, and differential gene expression. However, analyzing scRNA-seq data is challenging because the data are sparse and involve a large number of genes. Dimensionality reduction and feature selection are therefore important for removing spurious signals and improving downstream analysis. We introduce Correlated Clustering and Projection (CCP), a new data-domain dimensionality reduction method. CCP projects each cluster of similar genes into a supergene, defined as the accumulated pairwise nonlinear gene-gene correlations among all cells. Using 14 benchmark data sets, we demonstrate that CCP has significant advantages over classical principal component analysis (PCA) for clustering and/or classification problems with intrinsically high dimensionality. In addition, we introduce the Residue-Similarity index (RSI) as a novel metric for clustering and classification, and the R-S plot as a new visualization tool. We show that the RSI correlates with accuracy without requiring knowledge of the true labels. The R-S plot provides a unique alternative to uniform manifold approximation and projection (UMAP) and t-distributed stochastic neighbor embedding (t-SNE) for data with a large number of cell types.
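The two-step idea behind CCP (group similar genes, then collapse each group into one supergene per cell) can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not specify the clustering algorithm or the nonlinear correlation function, so the sketch assumes a plain k-means on unit-normalized gene profiles and a Gaussian kernel with a data-driven scale. The function names `partition_genes` and `supergenes` and the parameter `scale` are hypothetical.

```python
import numpy as np


def partition_genes(X, n_clusters, n_iter=20, seed=0):
    """Group genes (columns of X) by similarity of their expression profiles.

    Assumption: plain k-means on centered, unit-norm gene profiles, so that
    Euclidean distance between profiles tracks their Pearson correlation.
    """
    rng = np.random.default_rng(seed)
    G = X.T.astype(float)                       # genes x cells
    G = G - G.mean(axis=1, keepdims=True)
    norms = np.linalg.norm(G, axis=1, keepdims=True)
    G = G / np.where(norms > 0, norms, 1.0)     # avoid dividing by zero
    centers = G[rng.choice(len(G), n_clusters, replace=False)].copy()
    for _ in range(n_iter):
        dist = ((G[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for k in range(n_clusters):
            if np.any(labels == k):             # keep old center if cluster is empty
                centers[k] = G[labels == k].mean(axis=0)
    return labels


def supergenes(X, labels, n_clusters, scale=1.0):
    """Collapse each gene cluster into one supergene value per cell.

    Assumption: the 'nonlinear correlation' between two cells is a Gaussian
    kernel of their distance restricted to the cluster's genes; the supergene
    of a cell is the sum of its correlations with all other cells.
    """
    n_cells = X.shape[0]
    Z = np.zeros((n_cells, n_clusters))
    for k in range(n_clusters):
        sub = X[:, labels == k].astype(float)   # cells x genes in cluster k
        if sub.shape[1] == 0:
            continue
        sq = ((sub[:, None, :] - sub[None, :, :]) ** 2).sum(axis=2)
        eta = scale * np.sqrt(sq).mean()        # hypothetical kernel scale
        eta = eta if eta > 0 else 1.0
        C = np.exp(-sq / eta ** 2)
        np.fill_diagonal(C, 0.0)                # exclude self-correlation
        Z[:, k] = C.sum(axis=1)
    return Z


# Toy usage: 30 cells, 12 genes, reduced to 3 supergenes.
rng = np.random.default_rng(1)
X = rng.poisson(1.0, size=(30, 12)).astype(float)
labels = partition_genes(X, n_clusters=3)
Z = supergenes(X, labels, n_clusters=3)
```

Here `Z` plays the role that the leading principal components would play in a PCA pipeline: it is a cells-by-supergenes matrix that could be passed to a downstream clustering or classification step.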
Keyphrases
- single cell
- RNA-seq
- high throughput
- electronic health record
- gene expression
- machine learning
- induced apoptosis
- genome wide
- big data
- deep learning
- healthcare
- cell cycle arrest
- stem cells
- DNA methylation
- mental health
- magnetic resonance imaging
- computed tomography
- genome wide identification
- mesenchymal stem cells
- endoplasmic reticulum stress
- data analysis
- magnetic resonance