Reference-Free Plant Disease Detection Using Machine Learning and Long-Read Metagenomic Sequencing.
Marcela A JohnsonBoris A VinatzerSong LiPublished in: Applied and environmental microbiology (2023)
Surveillance for early disease detection is crucial to reduce the threat of plant diseases to food security. Metagenomic sequencing and taxonomic classification have recently been used to detect and identify plant pathogens. However, for an emerging pathogen, its genome may not be similar enough to any public genome to permit reference-based tools to identify infected samples. Also, in the case of point-of care diagnosis in the field, database access may be limited. Therefore, here we explore reference-free detection of plant pathogens using metagenomic sequencing and machine learning (ML). We used long-read metagenomes from healthy and infected plants as our model system and constructed k-mer frequency tables to test eight different ML models. The accuracy in classifying individual reads as coming from a healthy or infected metagenome were compared. Of all models, random forest (RF) had the best combination of short run-time and high accuracy (over 0.90) using tomato metagenomes. We further evaluated the RF model with a different tomato sample infected with the same pathogen or a different pathogen and a grapevine sample infected with a grapevine pathogen and achieved similar performances. ML models can thus learn features to successfully perform reference-free detection of plant diseases whereby a model trained with one pathogen-host system can also be used to detect different pathogens on different hosts. Potential and challenges of applying ML to metagenomics in plant disease detection are discussed. IMPORTANCE Climate change may lead to the emergence of novel plant diseases caused by yet unknown pathogens. Surveillance for emerging plant diseases is crucial to reduce their threat to food security. However, conventional genomic based methods require knowledge of existing plant pathogens and cannot be applied to detecting newly emerged pathogens. In this work, we explored reference-free, meta-genomic sequencing-based disease detection using machine learning. By sequencing the genomes of all microbial species extracted from an infected plant sample, we were able to train machine learning models to accurately classify individual sequencing reads as coming from a healthy or an infected plant sample. This method has the potential to be integrated into a generic pipeline for a meta-genomic based plant disease surveillance approach but also has limitations that still need to be overcome.