Imputation Efficacy Across Global Human Populations.
Jordan L Cahoon, Xinyue Rui, Echo Tang, Christopher Simons, Jalen Langie, Minhui Chen, Ying-Chu Lo, Charleston W K Chiang
Published in: bioRxiv : the preprint server for biology (2023)
Genotype imputation is now fundamental for genome-wide association studies but lacks fairness due to the underrepresentation of populations with non-European ancestries. The state-of-the-art imputation reference panel released by the Trans-Omics for Precision Medicine (TOPMed) initiative contains a substantial number of admixed African-ancestry and Hispanic/Latino samples to impute these populations with nearly the same efficacy as European-ancestry cohorts. However, imputation for populations primarily residing outside of North America may still fall short in performance due to persisting underrepresentation. To illustrate this point, we curated genome-wide array data from 23 publications published between 2008 and 2021. In total, we imputed over 43,000 individuals across 123 populations around the world. We identified a number of populations where imputation accuracy paled in comparison to that of European-ancestry populations. For instance, the mean imputation r-squared (Rsq) for alleles at 1-5% frequency in Saudi Arabians (N=1061), Vietnamese (N=1264), Thai (N=2435), and Papua New Guineans (N=776) was 0.79, 0.78, 0.76, and 0.62, respectively. In contrast, the mean Rsq ranged from 0.90 to 0.93 for comparable European populations matched in sample size and SNP content. Outside of Africa and Latin America, Rsq appeared to decrease as genetic distance to the European reference increased, as predicted. Further analysis using sequencing data as ground truth suggested that imputation software may inflate imputation quality for non-European populations, implying that true imputation accuracy in these populations may be lower than the software-reported estimates. To improve imputation quality, we assessed a strategy using meta-imputation to combine results from TOPMed with smaller population-specific reference panels, using 1,496 whole-genome-sequenced individuals from Taiwan Biobank as an example reference.
While we found that meta-imputation in this design did not improve Rsq genome-wide, Southeast Asian populations such as Filipino and Vietnamese experienced a 0.16 and 0.11 increase in average imputation Rsq, respectively, for alleles extremely rare in Europeans (<0.1% frequency) but more common (>1%) in East Asians. Taken together, our analysis suggests that meta-imputation may complement a large reference panel such as that of TOPMed for underrepresented cohorts. Nevertheless, reference panels must ultimately strive to increase diversity and size to promote equity within genetics research.
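
The two quality measures discussed in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the input layout (pairs of allele frequency and software-reported Rsq), and the use of squared Pearson correlation between imputed dosages and sequenced genotypes as the "ground truth" accuracy are all assumptions made here for illustration.

```python
from statistics import mean


def mean_rsq_in_frequency_bin(variants, lo=0.01, hi=0.05):
    """Average software-reported Rsq over variants in a frequency bin.

    `variants` is a hypothetical list of (allele_frequency, rsq) pairs.
    The default bin mirrors the 1-5% range quoted in the abstract.
    Returns None when no variant falls in the bin.
    """
    values = [rsq for freq, rsq in variants if lo <= freq <= hi]
    return mean(values) if values else None


def empirical_r2(dosages, truth):
    """Squared Pearson correlation between imputed dosages and true genotypes.

    `dosages` are imputed allele dosages (0.0 to 2.0) and `truth` are
    genotypes from sequencing coded 0/1/2, for one variant across samples.
    Returns None when either vector has no variance (r2 is undefined).
    """
    if len(dosages) != len(truth) or len(dosages) < 2:
        raise ValueError("need two equal-length vectors with at least 2 samples")
    mx, my = mean(dosages), mean(truth)
    sxy = sum((x - mx) * (y - my) for x, y in zip(dosages, truth))
    sxx = sum((x - mx) ** 2 for x in dosages)
    syy = sum((y - my) ** 2 for y in truth)
    if sxx == 0 or syy == 0:
        return None
    return (sxy * sxy) / (sxx * syy)
```

Under these assumptions, a variant whose reported Rsq is clearly higher than its `empirical_r2` would be one where the software overstates imputation quality, which is the kind of inflation the abstract describes for non-European populations.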