On polygenic risk scores for complex traits prediction.

Published in: Biometrics (2021)

Polygenic risk scores (PRS) have gained substantial attention for complex-trait prediction in genome-wide association studies (GWAS). Motivated by the polygenic model of complex traits, we study the statistical properties of PRS under the high-dimensional but sparsity-free setting where the triplet (n, p, m) → (∞, ∞, ∞), with n, p, and m being the sample size, the number of assayed single-nucleotide polymorphisms (SNPs), and the number of assayed causal SNPs, respectively. First, we derive asymptotic results on the out-of-sample (prediction) R-squared for PRS. These results help explain the widely observed gap between the in-sample heritability estimate (or partial R-squared due to the genetic features) and the out-of-sample R-squared for most complex traits. Next, we investigate how features should be selected (e.g., by a p-value threshold) to construct an optimal PRS. We show that the optimal threshold depends largely on the genetic architecture underlying the complex trait and the sample size of the training GWAS, that is, on the m/n ratio. For highly polygenic traits with a large m/n ratio, it is difficult to separate causal from null SNPs, and stringent feature selection often leads to poor PRS prediction. We illustrate the theoretical results numerically with intensive simulation studies and a real data analysis of 33 complex traits with a wide range of genetic architectures in the UK Biobank database.
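The effect of p-value thresholding on out-of-sample R-squared can be explored with a small simulation. The sketch below is illustrative only and is not the authors' code: it assumes independent standard-normal "genotypes" in place of real SNP data, marginal (one SNP at a time) effect estimates, and a normal approximation for the p-values. The function name `simulate_prs_r2` and all of its parameters are hypothetical.

```python
import math
import numpy as np


def simulate_prs_r2(n=500, n_test=500, p=1000, m=100, h2=0.5,
                    p_thresholds=(1.0, 0.05, 1e-4), seed=0):
    """Return {p-value threshold: out-of-sample R^2} for a thresholded PRS.

    Assumptions (not from the paper): i.i.d. N(0, 1) genotype entries,
    m causal SNPs with Gaussian effects, and total heritability h2.
    """
    rng = np.random.default_rng(seed)

    # Training and testing genotypes (standardized by construction).
    X = rng.standard_normal((n, p))
    X_test = rng.standard_normal((n_test, p))

    # m causal SNPs; effects scaled so the genetic variance is about h2.
    beta = np.zeros(p)
    causal = rng.choice(p, size=m, replace=False)
    beta[causal] = rng.standard_normal(m) * math.sqrt(h2 / m)

    def phenotype(G):
        noise = rng.standard_normal(G.shape[0]) * math.sqrt(1.0 - h2)
        return G @ beta + noise

    y = phenotype(X)
    y_test = phenotype(X_test)
    y = (y - y.mean()) / y.std()

    # Marginal GWAS-style estimates and two-sided normal p-values.
    beta_hat = X.T @ y / n
    z = beta_hat * math.sqrt(n)
    pvals = np.array([math.erfc(abs(v) / math.sqrt(2.0)) for v in z])

    results = {}
    for t in p_thresholds:
        keep = pvals <= t
        if not keep.any():
            # No SNP passes the threshold: the PRS is constant.
            results[t] = 0.0
            continue
        prs = X_test[:, keep] @ beta_hat[keep]
        r = np.corrcoef(prs, y_test)[0, 1]
        results[t] = float(r ** 2)
    return results


if __name__ == "__main__":
    for threshold, r2 in simulate_prs_r2().items():
        print(f"p <= {threshold:g}: out-of-sample R^2 = {r2:.3f}")
```

Varying `m` and `n` while holding `h2` fixed gives a rough feel for how the m/n ratio changes which threshold predicts best, and for how far the out-of-sample R-squared can fall below the simulated heritability.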
