Compression for population genetic data through finite-state entropy.

Winfield Chen, Lloyd T Elliott
Published in: Journal of bioinformatics and computational biology (2021)
We improve the efficiency of population genetic file formats and GWAS computation by leveraging the distribution of samples in population-level genetic data. We identify conditional exchangeability of these data and recommend finite-state entropy algorithms as an arithmetic code naturally suited to compressing them. In computation and decompression tasks, we show speed and size improvements of between [Formula: see text] and [Formula: see text] over modern dictionary compression methods often used for population genetic data, such as Zstd and Zlib. We provide open-source prototype software for multi-phenotype GWAS with finite-state entropy compression, demonstrating significant space savings and speed comparable to the state of the art.
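To illustrate why an entropy coder can suit genotype data, the minimal sketch below computes the Shannon entropy of a single variant's genotypes and compares it with a fixed 2-bit-per-genotype packing. It is not taken from the paper: the Hardy-Weinberg genotype model, the 2-bit baseline, and the function names `genotype_probs`, `entropy_bits`, and `ideal_ratio` are all assumptions made for illustration. The entropy is only a lower bound on the bits per genotype that any lossless coder could reach under this model; it says nothing about the speeds or ratios the authors measured.

```python
import math

# Assumption (not from the paper): genotypes 0/1/2 follow Hardy-Weinberg
# proportions for a given minor allele frequency (MAF).
def genotype_probs(maf):
    """Return P(0), P(1), P(2) copies of the minor allele under Hardy-Weinberg."""
    major = 1.0 - maf
    return [major * major, 2.0 * maf * major, maf * maf]

def entropy_bits(probs):
    """Shannon entropy in bits per symbol; zero-probability symbols contribute 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)

# Assumption: the baseline stores each genotype in a fixed 2-bit field.
FIXED_WIDTH_BITS = 2.0

def ideal_ratio(maf):
    """Best-case size ratio of 2-bit packing to an ideal entropy code."""
    return FIXED_WIDTH_BITS / entropy_bits(genotype_probs(maf))

if __name__ == "__main__":
    for maf in (0.01, 0.05, 0.2, 0.5):
        h = entropy_bits(genotype_probs(maf))
        print(f"MAF={maf:.2f}  entropy={h:.3f} bits/genotype  "
              f"ideal ratio vs 2-bit={ideal_ratio(maf):.1f}x")
```

Under this model, rarer variants have more skewed genotype distributions and therefore lower entropy, which is the regime where a coder that tracks symbol probabilities has the most room to beat fixed-width storage.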
Keyphrases
  • electronic health record
  • genome wide
  • big data
  • machine learning
  • copy number
  • data analysis
  • gene expression
  • dna methylation
  • working memory
  • preterm infants