Accelerated Bayesian inference of population size history from recombining sequence data.

Jonathan Terhorst
Published in: bioRxiv : the preprint server for biology (2024)
I present PHLASH, a new Bayesian method for inferring population history from whole genome sequence data. PHLASH stands for population history learning by averaging sampled histories: it works by drawing random, low-dimensional projections of the coalescent intensity function from the posterior distribution of a PSMC-like model, and averaging them together to form an accurate and adaptive size history estimator. On simulated data, PHLASH tends to be faster and have lower error than several competing methods, including SMC++, MSMC2, and FITCOAL. Moreover, it provides a full posterior distribution over population size history, leading to automatic uncertainty quantification of the point estimates, as well as to new Bayesian testing procedures for detecting population structure and ancient bottlenecks. On the technical side, the key advance is a novel algorithm for computing the score function (the gradient of the log-likelihood) of a coalescent hidden Markov model: when there are $M$ hidden states, the algorithm requires $\mathcal{O}(M^2)$ time and $\mathcal{O}(1)$ memory per decoded position, the same cost as evaluating the log-likelihood itself using the naïve forward algorithm. This algorithm is combined with a hand-tuned implementation that fully leverages the power of modern GPU hardware, and the entire method has been released as an easy-to-use Python software package.
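To make the cost comparison concrete, the following is a minimal sketch of the naïve forward algorithm that the abstract uses as its baseline. It is not PHLASH's score-function algorithm or its GPU implementation; the function name `forward_loglik` and the toy two-state parameters are assumptions chosen purely for illustration. Each position costs on the order of M^2 operations, and only the current length-M forward vector is kept, so memory does not grow with the number of positions.

```python
import math


def forward_loglik(init, trans, emit, obs):
    """Log-likelihood of `obs` under a discrete HMM (scaled forward algorithm).

    init[i]     : probability that the first hidden state is i
    trans[i][j] : probability of moving from state i to state j
    emit[i][x]  : probability of observing symbol x in state i
    """
    M = len(init)

    # Forward vector at the first position, rescaled to sum to one so that
    # long sequences do not underflow; the scale factors carry the likelihood.
    alpha = [init[i] * emit[i][obs[0]] for i in range(M)]
    scale = sum(alpha)
    loglik = math.log(scale)
    alpha = [a / scale for a in alpha]

    for x in obs[1:]:
        # O(M^2) work per position: each of the M new entries sums over
        # all M previous states. The old vector is then discarded.
        alpha = [
            emit[j][x] * sum(alpha[i] * trans[i][j] for i in range(M))
            for j in range(M)
        ]
        scale = sum(alpha)
        loglik += math.log(scale)
        alpha = [a / scale for a in alpha]

    return loglik


# Hypothetical two-state model with binary observations (illustration only).
init = [0.6, 0.4]
trans = [[0.9, 0.1], [0.2, 0.8]]
emit = [[0.99, 0.01], [0.9, 0.1]]
obs = [0, 0, 1, 0]
print(forward_loglik(init, trans, emit, obs))
```

A gradient of this quantity obtained by straightforward reverse-mode differentiation would typically need to retain the forward vector at every position, which is the memory cost that the algorithm described in the abstract is stated to avoid.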
Keyphrases
  • machine learning
  • deep learning
  • big data
  • neural network
  • electronic health record
  • artificial intelligence
  • data analysis
  • primary care
  • high resolution
  • amino acid