Login / Signup

Efficient minimizer orders for large values of k using minimum decycling sets.

David PellowLianrong PuBarış EkimLior KotlarBonnie BergerRon ShamirYaron Orenstein
Published in: Genome research (2023)
Minimizers are ubiquitously used in data structures and algorithms for efficient searching, mapping, and indexing of high-throughput DNA sequencing data. Minimizer schemes select a minimum k -mer in every L -long subsequence of the target sequence, where minimality is with respect to a predefined k -mer order. Commonly used minimizer orders select more k -mers than necessary and therefore provide limited improvement in runtime and memory usage of downstream analysis tasks. The recently introduced universal k -mer hitting sets produce minimizer orders with fewer selected k -mers. Generating compact universal k -mer hitting sets is currently infeasible for k > 13, and thus, they cannot help in the many applications that require minimizer orders for larger k Here, we close the gap of efficient minimizer orders for large values of k by introducing decycling - set - based minimizer orders : new minimizer orders based on minimum decycling sets. We show that in practice these new minimizer orders select a number of k -mers comparable to that of minimizer orders based on universal k -mer hitting sets and can also scale to a larger k Furthermore, we developed a method that computes the minimizers in a sequence on the fly without keeping the k -mers of a decycling set in memory. This enables the use of these minimizer orders for any value of k We expect the new orders to improve the runtime and memory usage of algorithms and data structures in high-throughput DNA sequencing analysis.
Keyphrases
  • high throughput
  • sars cov
  • working memory
  • machine learning
  • single cell
  • high resolution
  • electronic health record
  • healthcare
  • big data
  • circulating tumor
  • artificial intelligence
  • data analysis
  • low cost