Efficient mapping of accurate long reads in minimizer space with mapquik.

Barış Ekim, Kristoffer Sahlin, Paul Medvedev, Bonnie Berger, Rayan Chikhi

Published in: Genome Research (2023)

DNA sequencing data continues to progress towards longer reads with increasingly lower sequencing error rates. We focus on the critical problem of mapping, or aligning, low-divergence sequences from long reads (e.g., PacBio HiFi) to a reference genome, which poses challenges in terms of accuracy and computational resources when using cutting-edge read mapping approaches that are designed for all types of alignments. A natural idea would be to optimize efficiency with longer seeds to reduce the probability of extraneous matches; however, contiguous exact seeds quickly reach a sensitivity limit. We introduce mapquik, a novel strategy that creates accurate longer seeds by anchoring alignments through matches of k consecutively sampled minimizers (k-min-mers) and only indexing k-min-mers that occur once in the reference genome, thereby unlocking ultra-fast mapping while retaining high sensitivity. We demonstrate that mapquik significantly accelerates the seeding and chaining steps (fundamental bottlenecks in read mapping) for both the human and maize genomes with >96% sensitivity and near-perfect specificity. On the human genome, for both real and simulated reads, mapquik achieves a 37× speed-up over the state-of-the-art tool minimap2, and on the maize genome, a 410× speed-up over minimap2, making mapquik the fastest mapper to date. These accelerations are enabled not only by minimizer-space seeding but also by a novel heuristic O(n) pseudo-chaining algorithm, which improves upon the long-standing O(n log n) bound. Minimizer-space computation builds the foundation for achieving real-time analysis of long-read sequencing data.
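
To make the seeding idea concrete, the following is a minimal Python sketch of k-min-mer seeding with a unique-only index. It is an illustration under stated assumptions, not mapquik's implementation: the hash-threshold sampling rule, the parameter names (`k`, `l`, `density`), and all function names are hypothetical choices made for this example, and reverse complements, sequencing errors, and chaining are ignored.

```python
import hashlib
from collections import defaultdict


def lmer_hash(s):
    """Deterministic 64-bit hash of an l-mer (assumed hash function)."""
    digest = hashlib.blake2b(s.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sample_minimizers(seq, l=5, density=0.5):
    """Return (position, l-mer) pairs whose hash falls below a threshold.

    Assumption: an l-mer is sampled when its hash is in the lowest
    `density` fraction of the hash space, so the choice depends only on
    the l-mer itself and is identical in reads and reference.
    """
    threshold = int(density * 2**64)
    return [
        (i, seq[i:i + l])
        for i in range(len(seq) - l + 1)
        if lmer_hash(seq[i:i + l]) < threshold
    ]


def k_min_mers(seq, k=3, l=5, density=0.5):
    """Yield (k-min-mer, start position) for every run of k consecutive
    sampled minimizers in seq."""
    mins = sample_minimizers(seq, l, density)
    for j in range(len(mins) - k + 1):
        window = mins[j:j + k]
        yield tuple(m for _, m in window), window[0][0]


def build_unique_index(reference, k=3, l=5, density=0.5):
    """Index only the k-min-mers that occur exactly once in the reference."""
    counts = defaultdict(int)
    first_pos = {}
    for key, pos in k_min_mers(reference, k, l, density):
        counts[key] += 1
        first_pos.setdefault(key, pos)
    return {key: first_pos[key] for key, c in counts.items() if c == 1}


def seed_hits(read, index, k=3, l=5, density=0.5):
    """Return (read position, reference position) pairs for each read
    k-min-mer found in the unique index."""
    return [
        (read_pos, index[key])
        for key, read_pos in k_min_mers(read, k, l, density)
        if key in index
    ]
```

Because each indexed k-min-mer has exactly one reference position, every hit is an unambiguous anchor, and a k-min-mer spans far more bases than a single exact seed of length l. In this toy version a single sequencing error inside any of the k minimizers removes the hit, which is why such a scheme suits low-divergence reads.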

Keyphrases