SourceFinder: a Machine-Learning-Based Tool for Identification of Chromosomal, Plasmid, and Bacteriophage Sequences from Assemblies.
Derya Aytan-Aktug, Vladislav Grigorjev, Judit Szarvas, Philip Thomas Lanken Conradsen Clausen, Patrick Munk, Marcus Nguyen, James J Davis, Frank Møller Aarestrup, Ole Lund
Published in: Microbiology spectrum (2022)
High-throughput genome sequencing technologies enable the investigation of complex genetic interactions, including the horizontal gene transfer of plasmids and bacteriophages. However, identifying these elements from assembled reads remains challenging due to genome sequence plasticity and the difficulty in assembling complete sequences. In this study, we developed a random forest classifier to identify whether sequences originated from bacterial chromosomes, plasmids, or bacteriophages. The classifier was trained on a diverse collection of 23,211 chromosomal, plasmid, and bacteriophage sequences from hundreds of bacterial species. To adapt the classifier to incomplete sequences, each complete sequence was subsampled into 5,000-nucleotide fragments and further subdivided into k-mers. This three-class classifier succeeded in identifying chromosomes, plasmids, and bacteriophages using k-mer distributions of complete and partial genome sequences, including simulated metagenomic scaffolds, with a minimum area under the receiver operating characteristic curve (AUC) of 0.939. This classifier, implemented as SourceFinder, has been made available as an online web service to help the community with predicting the chromosomal, plasmid, and bacteriophage sources of assembled bacterial sequence data (https://cge.food.dtu.dk/services/SourceFinder/).
IMPORTANCE Extra-chromosomal genes encoding antimicrobial resistance, metal resistance, and virulence provide selective advantages for bacterial survival under stress conditions and pose serious threats to human and animal health. These accessory genes can impact the composition of microbiomes by providing selective advantages to their hosts. Accurately identifying extra-chromosomal elements in genome sequence data is critical for understanding gene dissemination trajectories and taking preventative measures. Therefore, in this study, we developed a random forest classifier for identifying the source of bacterial chromosomal, plasmid, and bacteriophage sequences.
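The feature-extraction step described in the abstract (cutting sequences into 5,000-nucleotide fragments and summarizing each fragment by its k-mer distribution) might look roughly like the following sketch. It is an illustration under stated assumptions, not the SourceFinder implementation: the abstract does not give the value of k, so `K = 4` is a placeholder, and splitting into consecutive non-overlapping windows is one possible reading of "subsampled". The function names `fragment`, `kmer_frequencies`, and `featurize` are hypothetical.

```python
from collections import Counter
from itertools import product

FRAGMENT_LENGTH = 5000  # fragment size stated in the abstract
K = 4                   # assumed value; the abstract does not state k


def fragment(sequence, length=FRAGMENT_LENGTH):
    """Split a sequence into consecutive, non-overlapping fragments.

    Assumption: a trailing piece shorter than `length` is dropped.
    """
    return [sequence[i:i + length]
            for i in range(0, len(sequence) - length + 1, length)]


def kmer_frequencies(fragment_seq, k=K):
    """Return relative frequencies of all 4**k DNA k-mers (lexicographic order).

    k-mers containing characters other than A, C, G, T (e.g. N) are ignored.
    """
    seq = fragment_seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[kmer] for kmer in all_kmers)
    if total == 0:
        return [0.0] * len(all_kmers)
    return [counts[kmer] / total for kmer in all_kmers]


def featurize(sequence, length=FRAGMENT_LENGTH, k=K):
    """Turn one sequence into one k-mer frequency vector per fragment."""
    return [kmer_frequencies(f, k) for f in fragment(sequence, length)]


if __name__ == "__main__":
    toy_sequence = "ACGT" * 3000  # 12,000-nt toy input, not real data
    features = featurize(toy_sequence)
    print(len(features), "fragments,", len(features[0]), "features each")
```

Each resulting vector could then serve as one training row for a three-class random forest (chromosome, plasmid, bacteriophage), for example via a general-purpose library such as scikit-learn; which library and hyperparameters the authors used is not stated in the abstract.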
Keyphrases
- escherichia coli
- copy number
- genome wide
- antimicrobial resistance
- healthcare
- mental health
- dna methylation
- machine learning
- high throughput
- crispr cas
- genome wide identification
- klebsiella pneumoniae
- public health
- endothelial cells
- climate change
- big data
- sars cov
- primary care
- gene expression
- staphylococcus aureus
- biofilm formation
- single cell
- pseudomonas aeruginosa
- risk assessment
- health information
- amino acid
- transcription factor
- drinking water
- human health
- microbial community
- coronavirus disease
- induced pluripotent stem cells
- health insurance
- wastewater treatment
- pluripotent stem cells
- candida albicans
- tissue engineering
- resistance training