Minimal positional substring cover is a haplotype threading alternative to Li and Stephens Model.

Ahsan Sanaullah, Degui Zhi, Shaojie Zhang
Published in: Genome Research (2023)
The Li & Stephens (LS) hidden Markov model (HMM) models the process of reconstructing a haplotype as a mosaic copy of haplotypes in a reference panel. For small panels, the probabilistic parameterization of LS enables modeling the uncertainties of such mosaics. However, LS becomes inefficient when the sample size is large, because its time complexity is linear in the sample size. Recently the PBWT, an efficient data structure capturing the local haplotype matching among haplotypes, was proposed to offer fast methods for finding an optimal (Viterbi) solution to the LS HMM. Previously we introduced the Minimal Positional Substring Cover (MPSC) problem as an alternative formulation of LS whose objective is to cover a query haplotype with a minimum number of segments from haplotypes in a reference panel. The MPSC formulation allows a haplotype threading to be generated in time that is constant with respect to sample size (O(N)). This allows haplotype threading on very large biobank-scale panels on which the LS model is infeasible. Here we present new results on the solution space of the MPSC. In addition, we derive a number of optimal algorithms for MPSC, including solution enumerations, the Length Maximal MPSC, and h-MPSC solutions. In doing so, our algorithms reveal the solution space of LS for large panels. We show that our method is informative in terms of revealing the characteristics of biobank-scale data sets and can improve genotype imputation.
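To make the covering objective concrete, here is a minimal sketch of a positional substring cover computed by a naive greedy scan. It is not the authors' PBWT-based algorithm and does not achieve the O(N) bound mentioned above: it compares the query against every panel haplotype, so its cost grows with panel size. The function name `greedy_cover`, the string encoding of haplotypes, and the (start, end, row) segment format are assumptions made for illustration, and the sketch assumes the basic variant with no minimum segment length.

```python
from typing import List, Optional, Tuple

# (start, end_exclusive, panel_row): query[start:end] equals panel[row][start:end]
Segment = Tuple[int, int, int]


def greedy_cover(panel: List[str], query: str) -> Optional[List[Segment]]:
    """Cover `query` with as few positional matches to `panel` as possible.

    Naive illustration only: each haplotype is a string of alleles and all
    strings are assumed to have the same length as `query`.
    Returns None if some site of the query matches no panel haplotype.
    """
    n = len(query)
    cover: List[Segment] = []
    start = 0
    while start < n:
        best_end, best_row = start, -1
        # Find the panel haplotype whose match starting at `start` runs longest.
        for row, hap in enumerate(panel):
            end = start
            while end < n and hap[end] == query[end]:
                end += 1
            if end > best_end:
                best_end, best_row = end, row
        if best_row < 0:
            return None  # no haplotype carries the query allele at this site
        cover.append((start, best_end, best_row))
        start = best_end  # the next segment begins where this one stopped
    return cover


if __name__ == "__main__":
    panel = ["00110", "11011", "00001"]
    print(greedy_cover(panel, "00011"))
```

Taking the longest match at each step yields a minimum-size cover here because any sub-interval of a positional match is itself a positional match, so extending a segment as far as possible never rules out a smaller cover.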