Balancing efficient analysis and storage of quantitative genomics data with the D4 format and d4tools.

Hao Hou, Brent Pedersen, Aaron R Quinlan
Published in: Nature Computational Science (2021)
Modern DNA sequencing is used as a readout for diverse assays, with the count of aligned sequences (read depth) representing the quantitative signal for each underlying cellular phenomenon. Existing data formats for quantitative genomics assays are, however, limited in the analysis speeds they enable, the disk space they require, or both. We have developed the dense depth data dump (D4) format and tool suite, with the goal of balancing improved analysis speeds with file size. The D4 format is adaptive in that it profiles a random sample of aligned sequence depth from the input sequence file to determine an optimal encoding that enables fast data access. We demonstrate that the D4 format offers substantial speed improvements over existing formats for random access, aggregation and summarization, while also achieving better or comparable file sizes. This performance enables scalable downstream analyses that would otherwise be difficult.
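The adaptive step described in the abstract (profiling a random sample of depths to pick an encoding) can be illustrated with a minimal sketch. This is not the d4tools implementation: the function name `choose_bit_width`, the cost model, and the `overflow_cost_bits` parameter are assumptions made here for illustration. The sketch assumes each position is stored in a fixed number of bits `k`, that depths too large for `k` bits are stored separately at a fixed extra cost, and that the best `k` is the one minimizing the estimated bits per position on the sample.

```python
import random


def choose_bit_width(depths, sample_size=1000, max_bits=16,
                     overflow_cost_bits=64, seed=0):
    """Pick a per-position bit width k from a random sample of depths.

    Hypothetical cost model (an assumption, not the D4 specification):
    every position costs k bits, and every sampled depth that does not
    fit in k bits costs an extra `overflow_cost_bits` for out-of-band
    storage.
    """
    rng = random.Random(seed)
    # Profile only a random sample so the choice stays cheap on large inputs.
    if len(depths) <= sample_size:
        sample = list(depths)
    else:
        sample = rng.sample(depths, sample_size)

    best_k, best_cost = 0, float("inf")
    for k in range(max_bits + 1):
        limit = 1 << k  # depths 0 .. limit-1 fit in k bits
        overflow = sum(1 for d in sample if d >= limit)
        # Estimated average bits per position under this k.
        cost = k + overflow_cost_bits * overflow / len(sample)
        if cost < best_cost:  # strict "<" keeps the smallest k on ties
            best_k, best_cost = k, cost
    return best_k


# Mostly ~30x coverage with a few high-depth outliers.
depths = [30] * 990 + [5000] * 10
print(choose_bit_width(depths))
```

Under this toy cost model, a few extreme depths do not force a wide encoding for every position, which is the kind of trade-off between file size and access speed that the abstract describes.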
Keyphrases
  • electronic health record
  • big data
  • single cell
  • high resolution
  • optical coherence tomography
  • high throughput
  • data analysis
  • artificial intelligence
  • deep learning
  • amino acid
  • circulating tumor