Augmenting Bacterial Similarity Measures Using a Graph-Based Genome Representation.
Vivek Ramanan, Indra Neil Sarkar. Published in: bioRxiv : the preprint server for biology (2024)
Relationships between bacterial taxa are traditionally defined using 16S rRNA nucleotide similarity or average nucleotide identity (ANI). Improvements in sequencing technology provide additional pairwise information on genome sequences, which may offer valuable insight into genomic relationships. Mapping orthologous gene locations between genome pairs, known as synteny, is typically used in the discovery of new species but has not been systematically applied to bacterial genomes. Using a dataset of 378 bacterial genomes, we developed and tested a new measure of synteny similarity between a pair of genomes, which was scaled onto 16S rRNA distance using covariance matrices. Depending on the input gene functions used (i.e., core, antibiotic resistance, and virulence), we observed varying topological arrangements of bacterial relationship networks when applying (1) complete linkage hierarchical clustering and (2) KNN graph structures to syntenic-scaled 16S data. Our metric improved clustering quality relative to state-of-the-art ANI metrics while preserving clustering assignments for the highest similarity relationships. Our findings indicate that syntenic relationships provide more granular and interpretable relationships for within-genera taxa than pairwise similarity measures do, particularly in functional contexts.
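The two downstream analyses named in the abstract, complete linkage hierarchical clustering and a KNN graph, both start from a pairwise distance matrix. The following is a minimal sketch of those two steps only, assuming a small symmetric matrix of made-up distances. It does not reproduce the authors' synteny measure or the covariance-based scaling onto 16S rRNA distance, and the names `complete_linkage`, `knn_graph`, and `toy` are hypothetical.

```python
from itertools import combinations


def complete_linkage(dist, n_clusters):
    """Naive agglomerative clustering with complete linkage.

    dist is a symmetric list-of-lists distance matrix (toy values here,
    not data from the paper). Returns clusters as sorted lists of indices.
    """
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_clusters:
        # Complete linkage: the distance between two clusters is the
        # largest pairwise distance between their members.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: max(
                dist[i][j] for i in clusters[p[0]] for j in clusters[p[1]]
            ),
        )
        merged = sorted(clusters[a] + clusters[b])
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return sorted(clusters)


def knn_graph(dist, k):
    """Map each node to its k nearest neighbours in the distance matrix."""
    graph = {}
    for i, row in enumerate(dist):
        # Ties are broken by index so the result is deterministic.
        others = sorted(
            (j for j in range(len(row)) if j != i),
            key=lambda j: (row[j], j),
        )
        graph[i] = others[:k]
    return graph


# Hypothetical distances for four genomes: 0 and 1 are close, 2 and 3 are close.
toy = [
    [0.0, 0.1, 0.8, 0.9],
    [0.1, 0.0, 0.7, 0.85],
    [0.8, 0.7, 0.0, 0.2],
    [0.9, 0.85, 0.2, 0.0],
]

clusters = complete_linkage(toy, n_clusters=2)
neighbours = knn_graph(toy, k=1)
```

With these toy values, the clustering groups genomes 0 and 1 together and genomes 2 and 3 together, and each genome's single nearest neighbour is its cluster partner. For a dataset of hundreds of genomes, an optimized library routine would be a more practical choice than this quadratic-per-merge loop.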
Keyphrases
- genome wide
- single cell
- copy number
- rna seq
- staphylococcus aureus
- escherichia coli
- pseudomonas aeruginosa
- dna methylation
- healthcare
- convolutional neural network
- gene expression
- electronic health record
- high throughput
- social media
- deep learning
- big data
- cystic fibrosis
- quality improvement
- biofilm formation
- hiv infected