Enhanced Compression of k-Mer Sets with Counters via de Bruijn Graphs.
Enrico Rossignolo, Matteo Comin
Published in: Journal of computational biology: a journal of computational molecular cell biology (2024)
An essential task in computational genomics is transforming input sequences into their constituent k-mers. An efficient representation of k-mer sets is crucial for the scalability of bioinformatic analyses. One widely used method converts the k-mer set into a de Bruijn graph (dBG) and then seeks a compact representation of the graph via the smallest path cover. This study introduces USTAR* (Unitig STitch Advanced constRuction), a tool designed to compress both a set of k-mers and their associated counts. USTAR leverages the connectivity and density of dBGs, enabling a more efficient path selection for constructing the path cover. The efficacy of USTAR is demonstrated through its application in compressing real read data sets. USTAR improves the compression achieved by UST (Unitig STitch), the best algorithm, by 2.3% to 26.4%, depending on the k-mer size, and it is up to 7× faster.
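
The general pipeline described above (k-mer counting, a dBG over the k-mers, and a path cover spelled out as strings with their counts) can be illustrated with a minimal Python sketch. This is not the USTAR or UST implementation: it uses a plain greedy extension rather than the connectivity- and density-based path selection mentioned in the abstract, it ignores reverse complements, and the function names `count_kmers` and `greedy_path_cover` are hypothetical names chosen for illustration.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurring in the reads (forward strand only; an assumption)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def greedy_path_cover(counts, k):
    """Cover the node-centric dBG of the k-mers with vertex-disjoint paths.

    Two k-mers are adjacent when the (k-1)-suffix of one equals the
    (k-1)-prefix of the other. Each path is returned as (spelled string,
    list of counts in path order). Illustrative greedy heuristic only.
    """
    unvisited = set(counts)
    paths = []
    for start in sorted(counts):  # sorted for deterministic output
        if start not in unvisited:
            continue
        unvisited.remove(start)
        path = [start]
        # Extend forward while some unvisited successor exists.
        while True:
            suffix = path[-1][1:]
            nxt = next((suffix + c for c in "ACGT" if suffix + c in unvisited), None)
            if nxt is None:
                break
            unvisited.remove(nxt)
            path.append(nxt)
        # Extend backward while some unvisited predecessor exists.
        while True:
            prefix = path[0][:-1]
            prev = next((c + prefix for c in "ACGT" if c + prefix in unvisited), None)
            if prev is None:
                break
            unvisited.remove(prev)
            path.insert(0, prev)
        # Spell the path: first k-mer plus the last character of each later k-mer.
        seq = path[0] + "".join(kmer[-1] for kmer in path[1:])
        paths.append((seq, [counts[kmer] for kmer in path]))
    return paths


if __name__ == "__main__":
    kmer_counts = count_kmers(["ACGTAC", "CGTACG"], k=3)
    for seq, cnts in greedy_path_cover(kmer_counts, k=3):
        print(seq, cnts)
```

A path of n k-mers is spelled with n + k - 1 characters instead of n * k, so fewer and longer paths give a smaller plain-text representation; this is why the choice of paths matters for compression.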