MEM-based pangenome indexing for k -mer queries.
Stephen HwangNathaniel K BrownOmar Y AhmedKatharine M JenikeSam KovakaMichael C SchatzBen LangmeadPublished in: bioRxiv : the preprint server for biology (2024)
Pangenomes are growing in number and size, thanks to the prevalence of high-quality long-read assemblies. However, current methods for studying sequence composition and conservation within pangenomes have limitations. Methods based on graph pangenomes require a computationally expensive multiple-alignment step, which can leave out some variation. Indexes based on k -mers and de Bruijn graphs are limited to answering questions at a specific substring length k . We present Maximal Exact Match Ordered (MEMO), a pangenome indexing method based on maximal exact matches (MEMs) between sequences. A single MEMO index can handle arbitrary-length queries over pangenomic windows. MEMO enables both queries that test k -mer presence/absence (membership queries) and that count the number of genomes containing k -mers in a window (conservation queries). MEMO's index for a pangenome of 89 human autosomal haplotypes fits in 2.04 GB, 8.8 × smaller than a comparable KMC3 index and 11.4 × smaller than a PanKmer index. MEMO indexes can be made smaller by sacrificing some counting resolution, with our decile-resolution HPRC index reaching 0.67 GB. MEMO can conduct a conservation query for 31-mers over the human leukocyte antigen locus in 13.89 seconds, 2.5x faster than other approaches. MEMO's small index size, lack of k -mer length dependence, and efficient queries make it a flexible tool for studying and visualizing substring conservation in pangenomes.