Interpretable detection of novel human viruses from genome sequencing data.
Jakub M Bartoszewicz, Anja Seidel, Bernhard Y Renard
Published in: NAR Genomics and Bioinformatics (2021)
Viruses evolve extremely quickly, so reliable methods for viral host prediction are necessary to safeguard both biosecurity and biosafety. Novel human-infecting viruses are difficult to detect with standard bioinformatics workflows. Here, we predict whether a virus can infect humans directly from next-generation sequencing (NGS) reads. We show that deep neural architectures significantly outperform both shallow machine learning and standard, homology-based algorithms, cutting the error rates in half and generalizing to taxonomic units distant from those presented during training. Further, we develop a suite of interpretability tools and show that it can also be applied to other models beyond the host prediction task. We propose a new approach for convolutional filter visualization to disentangle the information content of each nucleotide from its contribution to the final classification decision. Nucleotide-resolution maps of the learned associations between pathogen genomes and the infectious phenotype can be used to detect regions of interest in novel agents, for example, the SARS-CoV-2 coronavirus, unknown before it caused the COVID-19 pandemic in 2020. All methods presented here are implemented as easy-to-install packages, which not only enable analysis of NGS datasets without requiring any deep learning skills but also allow advanced users to easily train and explain new models for genomics.
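To make the idea of nucleotide-level contributions more concrete, the following is a minimal sketch of how a sequencing read might be one-hot encoded, scanned with a single convolutional filter, and broken down into per-nucleotide shares of one window's activation. It is not the authors' implementation: the function names (`one_hot`, `reverse_complement`, `scan_filter`, `nucleotide_contributions`) and the toy filter weights are hypothetical, and the sketch covers only the contribution side, not the information-content side of the visualization described above.

```python
from typing import List

ALPHABET = "ACGT"  # assumed channel order; purely illustrative


def one_hot(read: str) -> List[List[float]]:
    """Encode a read as a len(read) x 4 matrix.

    Ambiguous bases (e.g. N) are encoded as all zeros.
    """
    rows = []
    for base in read.upper():
        row = [0.0] * 4
        idx = ALPHABET.find(base)
        if idx >= 0:
            row[idx] = 1.0
        rows.append(row)
    return rows


def reverse_complement(read: str) -> str:
    """Return the reverse complement; unknown bases map to N."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp.get(b, "N") for b in reversed(read.upper()))


def scan_filter(encoded: List[List[float]],
                weights: List[List[float]],
                bias: float = 0.0) -> List[float]:
    """Slide one k x 4 filter over the read (no padding, stride 1).

    Returns the pre-activation value for every window.
    """
    k = len(weights)
    out = []
    for start in range(len(encoded) - k + 1):
        total = bias
        for j in range(k):
            for c in range(4):
                total += weights[j][c] * encoded[start + j][c]
        out.append(total)
    return out


def nucleotide_contributions(encoded: List[List[float]],
                             weights: List[List[float]],
                             start: int) -> List[float]:
    """Split one window's activation (bias excluded) into per-position terms.

    Each term is the filter weight selected by the base observed there.
    """
    return [
        sum(weights[j][c] * encoded[start + j][c] for c in range(4))
        for j in range(len(weights))
    ]


# Toy filter of width 2 that responds to the dinucleotide "AC" (made-up weights).
toy_filter = [[1.0, 0.0, 0.0, 0.0],
              [0.0, 2.0, 0.0, 0.0]]
encoded_read = one_hot("ACGTN")
activations = scan_filter(encoded_read, toy_filter)
contributions = nucleotide_contributions(encoded_read, toy_filter, start=0)
```

In this toy case only the first window matches the filter, and its activation splits into one term per nucleotide. A trained network would have many filters and further layers, so attributing the final classification decision to individual nucleotides requires propagating contributions through the whole model rather than through a single filter.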
Keyphrases
- machine learning
- sars cov
- deep learning
- endothelial cells
- artificial intelligence
- big data
- single cell
- induced pluripotent stem cells
- pluripotent stem cells
- electronic health record
- convolutional neural network
- genome wide
- lymph node
- dna methylation
- coronavirus disease
- mass spectrometry
- medical students
- cell free
- social media