Extremely fast construction and querying of compacted and colored de Bruijn graphs with GGCAT.
Andrea Cracco, Alexandru I. Tomescu. Published in: Genome Research (2023)
Compacted de Bruijn graphs are one of the most fundamental data structures in computational genomics. Colored compacted de Bruijn graphs are a variant built on a collection of sequences that associates with each k-mer the sequences in which it appears. We present GGCAT, a tool for constructing both types of graphs, based on a new approach that merges the k-mer counting step with the unitig construction step, and on numerous practical optimizations. For compacted de Bruijn graph construction, GGCAT achieves speed-ups of 3-21× compared to the state-of-the-art tool Cuttlefish 2. When constructing the colored variant, GGCAT achieves speed-ups of 5-39× compared to the state-of-the-art tool BiFrost. Additionally, GGCAT is up to 480× faster than BiFrost for batch sequence queries on colored graphs.
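To make the notion of "colors" concrete, the following is a minimal Python sketch of the underlying idea only: each canonical k-mer is mapped to the set of input sequences (colors) that contain it, and a query sequence is matched against that map. It is not GGCAT's algorithm and does not use its API; it performs no unitig compaction, and the function names `canonical`, `color_kmers`, and `query` are hypothetical names chosen for illustration.

```python
from collections import defaultdict

# Translation table for complementing DNA bases (assumes input over A, C, G, T only).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def color_kmers(sequences, k):
    """Map each canonical k-mer to the set of indices (colors) of the sequences containing it."""
    colors = defaultdict(set)
    for color, seq in enumerate(sequences):
        for i in range(len(seq) - k + 1):
            colors[canonical(seq[i:i + k])].add(color)
    return dict(colors)


def query(colors, read, k):
    """Count, for each color, how many k-mers of `read` occur in that color."""
    hits = defaultdict(int)
    for i in range(len(read) - k + 1):
        for color in colors.get(canonical(read[i:i + k]), ()):
            hits[color] += 1
    return dict(hits)


if __name__ == "__main__":
    index = color_kmers(["ACGTT", "CGTTG"], k=3)
    print(index)
    print(query(index, "TTGA", k=3))
```

A naive dictionary like this stores every k-mer separately; compacted graphs instead merge non-branching paths of k-mers into unitigs, which is the construction step the abstract describes GGCAT as merging with k-mer counting.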