GATK-gCNV enables the discovery of rare copy number variants from exome sequencing data.
Mehrtash Babadi, Jack M Fu, Samuel K Lee, Andrey N Smirnov, Laura D Gauthier, Mark A Walker, David I Benjamin, Xuefang Zhao, Konrad J Karczewski, Isaac Wong, Ryan L Collins, Alba Sanchis-Juan, Harrison Brand, Eric Banks, Michael E Talkowski
Published in: Nature Genetics (2023)
Copy number variants (CNVs) are major contributors to genetic diversity and disease. While standardized methods, such as the Genome Analysis Toolkit (GATK), exist for detecting short variants, technical challenges have confounded uniform large-scale CNV analyses from whole-exome sequencing (WES) data. Given the profound impact of rare and de novo coding CNVs on genome organization and human disease, we developed GATK-gCNV, a flexible algorithm to discover rare CNVs from sequencing read-depth information, complete with open-source distribution via GATK. We benchmarked GATK-gCNV in 7,962 exomes from individuals in quartet families with matched genome sequencing and microarray data, finding up to 95% recall of rare coding CNVs at a resolution of more than two exons. We used GATK-gCNV to generate a reference catalog of rare coding CNVs in WES data from 197,306 individuals in the UK Biobank, and observed strong correlations between per-gene CNV rates and measures of mutational constraint, as well as rare CNV associations with multiple traits. In summary, GATK-gCNV is a tunable approach for sensitive and specific CNV discovery in WES data, with broad applications.
Keyphrases
- copy number
- genome wide
- mitochondrial dna
- electronic health record
- dna methylation
- big data
- genetic diversity
- single cell
- machine learning
- small molecule
- endothelial cells
- gene expression
- artificial intelligence
- autism spectrum disorder
- quantum dots
- cross sectional
- intellectual disability
- transcription factor
- optical coherence tomography
- energy transfer