
Direct estimation and inference of higher-level correlations from lower-level measurements with applications in gene-pathway and proteomics studies.

Yue Wang, Haoran Shi
Published in: Biostatistics (Oxford, England) (2024)
This paper tackles the challenge of estimating correlations between higher-level biological variables (e.g. proteins and gene pathways) when only lower-level measurements (e.g. peptides and individual genes) are directly observed. Existing methods typically aggregate lower-level data into higher-level variables and then estimate correlations from the aggregated data. However, different aggregation methods can yield different correlation estimates because they target different higher-level quantities. Our solution is a latent factor model that estimates these higher-level correlations directly from lower-level data, without the need for data aggregation. We further introduce a shrinkage estimator that ensures positive definiteness and improves the accuracy of the estimated correlation matrix. In addition, we establish the asymptotic normality of our estimator, enabling efficient computation of P-values for identifying significant correlations. The effectiveness of our approach is demonstrated through comprehensive simulations and the analysis of proteomics and gene expression datasets. We provide the R package highcor, which implements our method.
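The aggregation problem described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' estimator and does not use the highcor package. It assumes a simple hypothetical setup: two latent higher-level variables with a known correlation `rho`, each observed through three noisy lower-level measurements with illustrative loadings. All names and parameter values (`simulate`, `aggregate`, `loadings`, `noise_sd`) are assumptions made for this example.

```python
import math
import random


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simulate(n_samples=2000, rho=0.6, loadings=(1.0, 0.8, 0.3),
             noise_sd=1.0, seed=0):
    """Draw two correlated latent variables and their noisy lower-level
    measurements (hypothetical one-factor-per-variable setup)."""
    rng = random.Random(seed)
    z1, z2, low1, low2 = [], [], [], []
    for _ in range(n_samples):
        a = rng.gauss(0, 1)
        # b has unit variance and correlation rho with a
        b = rho * a + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)
        z1.append(a)
        z2.append(b)
        low1.append([l * a + rng.gauss(0, noise_sd) for l in loadings])
        low2.append([l * b + rng.gauss(0, noise_sd) for l in loadings])
    return z1, z2, low1, low2


def aggregate(rows, how):
    """Collapse each sample's lower-level measurements into one value."""
    if how == "mean":
        return [sum(r) / len(r) for r in rows]
    if how == "median":
        # middle element; exact median for an odd number of measurements
        return [sorted(r)[len(r) // 2] for r in rows]
    if how == "max":
        return [max(r) for r in rows]
    raise ValueError(f"unknown aggregation: {how}")


z1, z2, low1, low2 = simulate()

# Correlation of the (normally unobserved) latent variables, for reference
results = {"latent": pearson(z1, z2)}

# Correlations obtained after aggregating the lower-level data
for how in ("mean", "median", "max"):
    results[how] = pearson(aggregate(low1, how), aggregate(low2, how))

for name, value in results.items():
    print(f"{name:>6}: {value:.3f}")
```

In this setup each aggregation rule yields its own correlation estimate, and all of them fall below the latent correlation because measurement noise is carried into the aggregated values. This is the kind of discrepancy that motivates estimating the higher-level correlation directly from the lower-level data.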
Keyphrases
  • gene expression
  • electronic health record
  • randomized controlled trial
  • systematic review
  • mass spectrometry
  • genome wide
  • dna methylation
  • copy number
  • machine learning
  • deep learning
  • single cell
  • transcription factor