Analysis-ready VCF at Biobank scale using Zarr.

Eric A Czech, Timothy R Millar, Tom White, Ben Jeffery, Alistair Miles, Sam Tallman, Rafal Wojdyla, Shadi Zabad, Jeff Hammerbacher, Jerome Kelleher

Published in: bioRxiv: the preprint server for biology (2024)

Large row-encoded VCF files are a major bottleneck for current research, and storing and processing these files incurs a substantial cost. The VCF Zarr specification, building on widely used, open-source technologies, has the potential to greatly reduce these costs, and may enable a diverse ecosystem of next-generation tools for analysing genetic variation data directly from cloud-based object stores.
