NOMAD2 provides ultra-efficient, scalable, and unsupervised discovery on raw sequencing reads.

Marek Kokot, Roozbeh Dehghannasiri, Tavor Z Baharav, Julia Salzman, Sebastian Deorowicz

Published in: bioRxiv: the preprint server for biology (2023)

NOMAD is a new, unsupervised, reference-free, and unifying algorithm that discovers regulated sequence variation through statistical analysis of k-mer composition in DNA or RNA sequencing experiments. It subsumes many application-specific algorithms, from splicing detection to RNA editing to applications in DNA sequencing and beyond. Here, we introduce NOMAD2, a fast, scalable, and user-friendly implementation of NOMAD based on KMC, an efficient k-mer counting approach. The pipeline has minimal installation requirements and can be executed with a single command. NOMAD2 enables efficient analysis of massive RNA-seq datasets, where it reveals novel biology, showcased by rapid analysis of 1,553 human muscle cells, the entire Cancer Cell Line Encyclopedia (671 cell lines, 5.7 TB), and a deep RNA-seq study of Amyotrophic Lateral Sclerosis (ALS), using roughly 2-fold less computational resources and time than state-of-the-art alignment methods. NOMAD2 enables reference-free biological discovery at unmatched scale and speed. Because it bypasses genome alignment, NOMAD2 yields new insights into RNA expression in normal and diseased tissue; we present examples of these insights to introduce NOMAD2 and to enable expansive biological discovery not previously possible.
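
To make the idea of reference-free k-mer counting concrete, the following is a minimal sketch in plain Python. It is not NOMAD2's implementation and does not use KMC; the function names `count_kmers` and `kmer_table`, the sample names, and the toy reads are all hypothetical and chosen only for illustration. It shows how per-sample k-mer counts can be gathered directly from raw reads, without any genome alignment, into a table that a downstream statistical test could examine. The statistical analysis itself is not shown.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring across a list of reads."""
    counts = Counter()
    for read in reads:
        # Reads shorter than k contribute nothing (the range is empty).
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def kmer_table(samples, k):
    """Map each k-mer to its list of per-sample counts, in sample order."""
    per_sample = {name: count_kmers(reads, k) for name, reads in samples.items()}
    all_kmers = sorted(set().union(*per_sample.values()))
    # A Counter returns 0 for k-mers that a sample never contained.
    return {kmer: [per_sample[name][kmer] for name in samples] for kmer in all_kmers}


# Toy input (hypothetical): two samples with a few short reads each.
samples = {
    "sample_a": ["ACGTAC", "CGTACG"],
    "sample_b": ["ACGTTC"],
}
table = kmer_table(samples, k=3)

for kmer, counts in table.items():
    print(kmer, counts)
```

In this toy table, k-mers such as `GTA` appear only in `sample_a` while `TTC` appears only in `sample_b`; sample-dependent differences of this kind are what a statistical analysis of k-mer composition would look for. A dedicated k-mer counter such as KMC serves the same counting role at the scale of terabytes of reads, where an in-memory dictionary like this one would not be practical.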

Keyphrases