A high-performance computational workflow to accelerate GATK SNP detection across a 25-genome dataset.
Yong Zhou, Nagarajan Kathiresan, Zhichao Yu, Luis F Rivera, Yujian Yang, Manjula Thimma, Keerthana Manickam, Dmytro Chebotarov, Ramil Mauleon, Kapeel Chougule, Sharon Wei, Tingting Gao, Carl D Green, Andrea Zuccolo, Weibo Xie, Doreen Ware, Jianwei Zhang, Kenneth L McNally, Rod A Wing
Published in: BMC Biology (2024)
This study developed HPC-GVCW, an open-source pipeline that runs GATK on high-performance computing (HPC) platforms and significantly increases the speed of SNP calling. The workflow is widely applicable, as demonstrated for four major crop species with genomes ranging in size from 400 Mb to 2.4 Gb. Running HPC-GVCW in production mode on a data set of 25 reference genomes spanning multiple crops yielded over 1.1 billion SNPs, which were publicly released for functional and breeding studies. For rice, many novel SNPs were identified, residing within genes and open chromatin regions that are predicted to have functional consequences. Together, our results demonstrate the value of pairing a high-performance SNP-calling architecture with a subpopulation-aware reference genome panel for rapid SNP discovery and public deployment.