Fast and accurate out-of-core PCA framework for large-scale biobank data.
Zilong Li, Jonas Meisner, Anders Albrechtsen
Published in: Genome Research (2023)
Principal Component Analysis (PCA) is widely used in statistics, machine learning, and genomics for dimensionality reduction and for uncovering low-dimensional latent structure. As data sets keep growing, fast and memory-efficient PCA methods have gained prominence. In this paper, we propose a novel Randomized Singular Value Decomposition (RSVD) algorithm, implemented in PCAone, featuring a window-based optimization scheme that accelerates convergence while improving accuracy. Additionally, PCAone incorporates out-of-core and multithreaded implementations of the existing Implicitly Restarted Arnoldi Method (IRAM) and RSVD. Through comprehensive evaluations on multiple large-scale real-world datasets from different fields, we demonstrate the advantage of PCAone over existing methods. The new algorithm is significantly faster while maintaining accuracy comparable to that of the slower IRAM method. Notably, our analyses of the UK Biobank, comprising around 0.5 million individuals and 6.1 million common SNPs, demonstrate that PCAone accurately computes the top 40 principal components within 9 hours, using less than 20 GB of memory and 20 CPU threads. This analysis effectively captures population structure, signals of selection, structural variants, and low recombination regions. Furthermore, when applied to single-cell RNA sequencing data featuring 1.3 million cells, PCAone accurately captures the top 40 principal components in 49 minutes. This performance represents a 10-fold improvement over state-of-the-art tools.
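To make the RSVD idea concrete, here is a minimal NumPy sketch of a generic randomized SVD with power iterations. It is not PCAone's window-based algorithm, and it is neither out-of-core nor multithreaded; the function name `randomized_svd` and the parameters `oversample`, `n_power_iter`, and `seed` are illustrative assumptions, not part of PCAone.

```python
import numpy as np

def randomized_svd(A, k, oversample=10, n_power_iter=4, seed=0):
    """Approximate the top-k SVD of A with a generic randomized scheme.

    Illustrative sketch only: this is the textbook random-projection
    approach, not the window-based variant described in the abstract.
    """
    rng = np.random.default_rng(seed)
    n_rows, n_cols = A.shape
    # Sketch width: target rank plus a few extra columns for accuracy.
    width = min(k + oversample, n_rows, n_cols)

    # Project A onto a random low-dimensional subspace.
    omega = rng.standard_normal((n_cols, width))
    Q, _ = np.linalg.qr(A @ omega)

    # Power iterations sharpen the subspace; re-orthonormalizing at
    # each step keeps the computation numerically stable.
    for _ in range(n_power_iter):
        Z, _ = np.linalg.qr(A.T @ Q)
        Q, _ = np.linalg.qr(A @ Z)

    # Solve the small problem exactly, then map back to the full space.
    B = Q.T @ A
    U_small, S, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_small
    return U[:, :k], S[:k], Vt[:k]


def pca_scores(X, k):
    """Top-k principal component scores of X (rows are samples)."""
    X_centered = X - X.mean(axis=0)
    U, S, _ = randomized_svd(X_centered, k)
    return U * S
```

Each pass of the loop reads `A` twice, which is why the number of passes over the data dominates the cost once the matrix no longer fits in memory; reducing that cost is the kind of bottleneck that faster-converging schemes target.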
Keyphrases
- single cell
- machine learning
- big data
- rna seq
- electronic health record
- deep learning
- working memory
- artificial intelligence
- induced apoptosis
- open label
- data analysis
- high resolution
- randomized controlled trial
- gene expression
- dna damage
- genome wide
- dna methylation
- oxidative stress
- cell proliferation
- clinical trial
- endoplasmic reticulum stress