A hepatitis B virus (HBV) sequence variation graph improves sequence alignment and sample-specific consensus sequence construction for genetic analysis of HBV.
Dylan Duchen, Steven Clipman, Candelaria Vergara, Chloe L Thio, David L Thomas, Priya Duggal, Genevieve L Wojcik
Published in: bioRxiv : the preprint server for biology (2023)
Hepatitis B virus (HBV) remains a global public health concern, with over 250 million individuals living with chronic HBV infection (CHB) and no curative therapy currently available. Viral diversity is associated with CHB pathogenesis and immunological control of infection. Improved methods to characterize the viral genome at both the population and intra-host level could aid drug development efforts. Conventionally, HBV sequencing data are aligned to a linear reference genome and only sequences capable of aligning to the reference are captured for analysis. Reference selection has additional consequences, including for sample-specific 'consensus' sequence construction. It remains unclear how to select a reference from available sequences and whether a single reference is sufficient for genetic analyses. Using simulated short-read sequencing data generated from full-length publicly available HBV genome sequences and HBV sequencing data from a longitudinally sampled individual with CHB, we investigate alternative graph-based alignment approaches. We demonstrate that using a phylogenetically representative 'genome graph' for alignment, rather than linear reference sequences, avoids issues of reference ambiguity, improves alignment, and facilitates the construction of sample-specific consensus sequences genetically similar to an individual's infection. Graph-based methods can therefore improve efforts to characterize the genetics of viral pathogens, including HBV, and may have broad implications in host-pathogen research.
Keyphrases
- hepatitis b virus
- liver failure
- public health
- genome wide
- convolutional neural network
- electronic health record
- single cell
- neural network
- clinical practice
- gene expression
- dna methylation
- quality improvement
- amino acid
- candida albicans
- drug induced
- multidrug resistant
- machine learning
- single molecule
- prognostic factors