Detecting and Adjusting for Hidden Biases due to Phenotype Misclassification in Genome-Wide Association Studies.

David Burstein, Gabriel E Hoffman, Deepika Mathur, Sanan Venkatesh, Karen Therrien, Ayman H Fanous, Tim B Bigdeli, Philip D Harvey, Panos Roussos, Georgios Voloudakis

Published in: medRxiv : the preprint server for health sciences (2023)

With the advent of healthcare-based genotyped biobanks, genome-wide association studies (GWAS) leverage larger sample sizes, incorporate patients with diverse ancestries and introduce noisier phenotypic definitions. Yet the extent and impact of phenotypic misclassification on large-scale datasets is not currently well understood due to a lack of statistical methods to estimate relevant parameters from empirical data. Here, we develop a statistical method and scalable software, PheMED (Phenotypic Measurement of Effective Dilution), to quantify phenotypic misclassification across GWAS using only summary statistics. We illustrate how the parameters estimated by PheMED relate to the negative and positive predictive value of the labeled phenotype, compared to ground truth, and how misclassification of the phenotype yields diluted effect sizes of variant-phenotype associations. Furthermore, we apply our methodology to detect multiple instances of statistically significant dilution in real-world data. We demonstrate how effective dilution biases downstream GWAS replication and heritability analyses despite utilizing current best practices, and provide a dilution-aware meta-analysis approach that outperforms existing methods. Consequently, we anticipate that PheMED will be a valuable tool for researchers to address phenotypic data quality issues both within and across cohorts.
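The link between predictive values and diluted effect sizes can be illustrated with a small calculation. The sketch below is not PheMED's estimator; it is a toy model that assumes misclassification is independent of genotype, so that labeled cases are a mixture of true cases and true controls (weighted by the positive predictive value) and labeled controls are a mixture weighted by the negative predictive value. The function name and the allele frequencies are hypothetical and chosen only for illustration. Under these assumptions, the observed case-control allele frequency difference shrinks by a factor of PPV + NPV - 1.

```python
def observed_allele_freq_diff(p_case, p_control, ppv, npv):
    """Allele frequency difference between *labeled* cases and controls.

    Assumes labeling errors do not depend on genotype (illustrative only).
    p_case / p_control: allele frequency among true cases / true controls.
    ppv: fraction of labeled cases that are true cases.
    npv: fraction of labeled controls that are true controls.
    """
    # Labeled cases: true cases diluted with mislabeled true controls.
    labeled_case_freq = ppv * p_case + (1 - ppv) * p_control
    # Labeled controls: true controls diluted with mislabeled true cases.
    labeled_control_freq = npv * p_control + (1 - npv) * p_case
    return labeled_case_freq - labeled_control_freq


# Hypothetical variant: frequency 0.32 in true cases, 0.30 in true controls.
p_case, p_control = 0.32, 0.30
ppv, npv = 0.80, 0.95

true_diff = p_case - p_control
observed_diff = observed_allele_freq_diff(p_case, p_control, ppv, npv)

# Ratio of observed to true difference; algebraically equals ppv + npv - 1.
dilution = observed_diff / true_diff
print(f"true difference:     {true_diff:.4f}")
print(f"observed difference: {observed_diff:.4f}")
print(f"dilution factor:     {dilution:.2f}")
```

In this toy setting, a phenotype with a PPV of 0.80 and an NPV of 0.95 retains only three quarters of the true allele frequency difference, which is the kind of attenuation that a dilution-aware analysis would need to account for.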
