Login / Signup

ViralVectors: compact and scalable alignment-free virome feature generation.

Sarwan AliPrakash ChourasiaZahra TayebiBabatunde BelloMurray Patterson
Published in: Medical & biological engineering & computing (2023)
The amount of sequencing data for SARS-CoV-2 is several orders of magnitude larger than any virus. This will continue to grow geometrically for SARS-CoV-2, and other viruses, as many countries heavily finance genomic surveillance efforts. Hence, we need methods for processing large amounts of sequence data to allow for effective yet timely decision-making. Such data will come from heterogeneous sources: aligned, unaligned, or even unassembled raw nucleotide or amino acid sequencing reads pertaining to the whole genome or regions (e.g., spike) of interest. In this work, we propose ViralVectors, a compact feature vector generation from virome sequencing data that allows effective downstream analysis. Such generation is based on minimizers, a type of lightweight "signature" of a sequence, used traditionally in assembly and read mapping - to our knowledge, the first use minimizers in this way. We validate our approach on different types of sequencing data: (a) 2.5M SARS-CoV-2 spike sequences (to show scalability); (b) 3K Coronaviridae spike sequences (to show robustness to more genomic variability); and (c) 4K raw WGS reads sets taken from nasal-swab PCR tests (to show the ability to process unassembled reads). Our results show that ViralVectors outperforms current benchmarks in most classification and clustering tasks. Graphical Abstract showing the all steps of proposed approach. We start by collecting the sequence-based data. Then Data cleaning and preprocessing is applied. After that, we generate the feature embeddings using minimizer based approach. Then Classification and clustering algorithms are applied on the resultant data and predictions are made on the test set.
Keyphrases