Leveraging gene correlations in single cell transcriptomic data.

Kai Silkwood, Emmanuel Dollinger, Josh Gervin, Scott Atwood, Qing Nie, Arthur D Lander

Published in: bioRxiv: the preprint server for biology (2023)

Many approaches have been developed to overcome technical noise in single cell (and single nucleus) RNA-sequencing (scRNAseq). As researchers dig deeper into data (looking for rare cell types, subtleties of cell states, and details of gene regulatory networks), there is a growing need for algorithms with controllable accuracy and a minimum of ad hoc parameters and thresholds. Impeding this goal is the fact that an appropriate null distribution for scRNAseq cannot simply be extracted from data when ground truth about biological variation is unknown (i.e., most of the time). Here we approach this problem analytically, based on the assumption that scRNAseq data reflect only cell heterogeneity (what we seek to characterize), transcriptional noise (temporal fluctuations randomly distributed across cells), and sampling error (i.e., Poisson noise). We then analyze scRNAseq data without normalization (a step that can skew distributions, particularly for sparse data) and calculate p-values associated with key statistics. We develop an improved method for the selection of features for cell clustering and the identification of gene-gene correlations, both positive and negative. Using simulated data, we show that this method, which we call BigSur (Basic Informatics and Gene Statistics from Unnormalized Reads), accurately captures even weak yet significant correlation structures in scRNAseq data. Applying BigSur to data from a clonal human melanoma cell line, we identify tens of thousands of correlations that, when clustered without supervision into gene communities, both align with cellular components and biological processes, and point toward potentially novel cell biological relationships.
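
The idea of testing raw counts against an analytical null can be illustrated with a minimal sketch. It is not the BigSur implementation: it assumes a Poisson-only null (no transcriptional noise term), equal expression shares across genes, and a simple mean squared Pearson residual as the per-gene statistic. The function names `simulate_counts` and `dispersion_statistic`, the depth range, and the threefold expression shift are all hypothetical choices made for illustration.

```python
import numpy as np


def simulate_counts(n_genes, n_cells, rng, variable_gene=None):
    """Simulate raw counts where the only noise source is Poisson sampling.

    Assumption for illustration: every gene takes an equal share of each
    cell's transcripts, and cells differ only in sequencing depth.
    """
    gene_fraction = np.full(n_genes, 1.0 / n_genes)
    depth = rng.integers(2000, 10000, size=n_cells)   # total counts per cell
    expected = np.outer(gene_fraction, depth)         # genes x cells
    if variable_gene is not None:
        # Hypothetical cell heterogeneity: half the cells express this
        # gene three times more strongly than the rest.
        expected[variable_gene, : n_cells // 2] *= 3.0
    return rng.poisson(expected)


def dispersion_statistic(counts):
    """Mean squared Pearson residual per gene, computed on unnormalized counts.

    Expected counts are estimated from the row and column totals. Under a
    pure Poisson null the statistic is close to 1; values well above 1
    suggest variation beyond sampling error. Genes with no counts at all
    yield NaN.
    """
    counts = np.asarray(counts, dtype=float)
    depth = counts.sum(axis=0)
    gene_fraction = counts.sum(axis=1) / counts.sum()
    expected = np.outer(gene_fraction, depth)
    with np.errstate(divide="ignore", invalid="ignore"):
        resid_sq = (counts - expected) ** 2 / expected
    return resid_sq.sum(axis=1) / (counts.shape[1] - 1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    counts = simulate_counts(200, 500, rng, variable_gene=0)
    stat = dispersion_statistic(counts)
    print("heterogeneous gene:", stat[0])
    print("median of other genes:", np.median(stat[1:]))
```

In this sketch the gene with built-in heterogeneity should stand far above the others, which stay near 1, without any normalization of the counts. A method with controllable accuracy would go further and attach p-values to such a statistic, which this example does not attempt.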

Keyphrases