A deep catalogue of protein-coding variation in 983,578 individuals.
Kathie Y Sun, Xiaodong Bai, Siying Chen, Suying Bao, Chuanyi Zhang, Manav Kapoor, Joshua D Backman, Tyler Joseph, Evan Maxwell, George Mitra, Alexander Gorovits, Adam Mansfield, Boris Boutkov, Sujit Gokhale, Lukas Habegger, Anthony Marcketta, Adam E Locke, Liron Ganel, Alicia Hawes, Michael D Kessler, Deepika Sharma, Jeffrey Staples, Jonas Bovijn, Sahar Gelfman, Alessandro Di Gioia, Veera Manikandan Rajagopal, Alexander Lopez, Jennifer Rico Varela, Jesus Alegre, Jaime Berumen-Campos, Roberto Tapia-Conyer, Pablo Kuri-Morales, Jason M Torres, Jonathan R Emberson, Rory Collins, Michael N Cantor, Timothy A Thornton, Hyun Min Kang, John D Overton, Alan R Shuldiner, M Laura Cremona, Mona Nafde, Aris Baras, Gonçalo R Abecasis, Jonathan Marchini, Jeffrey G Reid, William J Salerno, Suganthi Balasubramanian
Published in: Nature (2024)
Rare coding variants that significantly impact function provide insights into the biology of a gene 1-3. However, ascertaining their frequency requires large sample sizes 4-8. Here, we present a catalogue of human protein-coding variation, derived from exome sequencing of 983,578 individuals across diverse populations. 23% of the Regeneron Genetics Center Million Exome data (RGC-ME) comes from non-European individuals of African, East Asian, Indigenous American, Middle Eastern, and South Asian ancestry. This catalogue includes over 10.4 million missense and 1.1 million predicted loss-of-function (pLOF) variants. We identify individuals with rare biallelic pLOF variants in 4,848 genes, 1,751 of which have not been previously reported. From precise quantitative estimates of selection against heterozygous loss-of-function, we identify 3,988 loss-of-function intolerant genes, including 86 that were previously assessed as tolerant and 1,153 lacking established disease annotation. We also define regions of missense depletion at high resolution. Notably, 1,482 genes have regions depleted of missense variants despite being tolerant to pLOF variants. Finally, we estimate that 3% of individuals have a clinically actionable genetic variant, and that 11,773 variants reported in ClinVar with unknown significance are likely to be deleterious cryptic splice sites. To facilitate variant interpretation and genetics-informed precision medicine, we make this important resource of coding variation from the RGC-ME accessible via a public variant allele frequency browser.
Keyphrases
- copy number
- genome wide
- high resolution
- intellectual disability
- dna methylation
- genome wide identification
- healthcare
- emergency department
- gene expression
- mental health
- autism spectrum disorder
- transcription factor
- single cell
- binding protein
- south africa
- genome wide analysis
- high speed
- tandem mass spectrometry
- induced pluripotent stem cells