Strain Tracking with Uncertainty Quantification.
Younhun Kim, Colin J Worby, Sawal Acharya, Lucas R van Dijk, Daniel Alfonsetti, Zackary Gromko, Philippe Azimzadeh, Karen Dodson, Georg K Gerber, Scott Hultgren, Ashlee M Earl, Bonnie Berger, Travis E Gibson
Published in: bioRxiv: the preprint server for biology (2023)
The ability to detect and quantify microbiota over time has a plethora of clinical, basic science, and public health applications. One of the primary means of tracking microbiota is through sequencing technologies. When the microorganism of interest is well characterized or known a priori, targeted sequencing is often used. In many applications, however, untargeted bulk (shotgun) sequencing is more appropriate; for instance, tracking infection transmission events and nucleotide variants across multiple genomic loci, or studying the role of multiple genes in a particular phenotype. Given these applications, and the observation that pathogens (e.g., Clostridioides difficile, Escherichia coli, Salmonella enterica) and other taxa of interest can reside at low relative abundance in the gastrointestinal tract, there is a critical need for algorithms that accurately track low-abundance taxa with strain-level resolution. Here we present a sequence quality- and time-aware model, ChronoStrain, that introduces uncertainty quantification to gauge low-abundance species and significantly outperforms the current state of the art on both real and synthetic data. ChronoStrain leverages sequences' quality scores and the samples' temporal information to produce a probability distribution over abundance trajectories for each strain tracked in the model. We demonstrate ChronoStrain's improved performance in capturing post-antibiotic E. coli strain blooms among women with recurrent urinary tract infections (UTIs) from the UTI Microbiome (UMB) Project. Other strain tracking models on the same data either show inconsistent temporal colonization or can only track consistently using very coarse groupings. In contrast, our probabilistic outputs can reveal the relationship between low-confidence strains present in the sample that cannot be reliably assigned a single reference label (either due to poor coverage or novelty) while simultaneously calling high-confidence strains that can be unambiguously assigned a label. We also include and analyze newly sequenced cultured samples from the UMB Project.
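As a rough illustration of what "leveraging quality scores" can mean in a probabilistic read model, the sketch below converts Phred quality scores to per-base error probabilities using the standard definition P(error) = 10^(-Q/10) and scores a read against a candidate reference sequence. It is not ChronoStrain's implementation: the function names, the no-indel alignment, the independence of bases, and the uniform substitution model are all simplifying assumptions made here for illustration.

```python
import math
from typing import Sequence


def phred_to_error_prob(q: int) -> float:
    """Standard Phred definition: P(base call is wrong) = 10 ** (-Q / 10)."""
    return 10.0 ** (-q / 10.0)


def read_log_likelihood(read: str, quals: Sequence[int], ref: str) -> float:
    """Log-probability of observing `read` given it was sequenced from `ref`.

    Illustrative assumptions (not taken from the source):
    - read and reference are already aligned, same length, no indels;
    - bases are independent;
    - a miscalled base is equally likely to be any of the 3 other nucleotides.
    """
    if not (len(read) == len(quals) == len(ref)):
        raise ValueError("read, quals and ref must have the same length")
    total = 0.0
    for base, q, ref_base in zip(read, quals, ref):
        # Cap at 0.75 (a uniformly random call) so log(1 - e) stays finite at Q = 0.
        e = min(phred_to_error_prob(q), 0.75)
        if base == ref_base:
            total += math.log(1.0 - e)
        else:
            total += math.log(e / 3.0)
    return total


# A mismatch at a low-quality position is penalized far less than one at a
# high-quality position, which is how quality scores temper the evidence.
high_q = read_log_likelihood("ACGT", [30, 30, 30, 30], "ACGA")
low_q = read_log_likelihood("ACGT", [30, 30, 30, 2], "ACGA")
print(high_q < low_q)
```

Under this kind of model, a read with a low-confidence mismatch remains plausibly compatible with a reference strain, whereas a high-confidence mismatch counts as strong evidence against it.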
Keyphrases
- escherichia coli
- public health
- urinary tract infection
- single cell
- quality improvement
- antibiotic resistance genes
- genome wide
- machine learning
- magnetic resonance
- big data
- copy number
- endothelial cells
- deep learning
- gene expression
- mass spectrometry
- clostridium difficile
- magnetic resonance imaging
- depressive symptoms
- microbial community
- wastewater treatment
- computed tomography
- social media
- dna methylation
- biofilm formation
- high resolution mass spectrometry
- gas chromatography mass spectrometry
- transcription factor
- health insurance
- drug delivery
- gas chromatography