A method to build extended sequence context models of point mutations and indels.
Jörn BethuneApril Snøfrid KleppeSoren BesenbacherPublished in: Nature communications (2022)
The mutation rate of a specific position in the human genome depends on the sequence context surrounding it. Modeling the mutation rate by estimating a rate for each possible k-mer, however, only works for small values of k since the data becomes too sparse for larger values of k. Here we propose a new method that solves this problem by grouping similar k-mers. We refer to the method as k-mer pattern partition and have implemented it in a software package called kmerPaPa. We use a large set of human de novo mutations to show that this new method leads to improved prediction of mutation rates and makes it possible to create models using wider sequence contexts than previous studies. As the first method of its kind, it does not only predict rates for point mutations but also insertions and deletions. We have additionally created a software package called Genovo that, given a k-mer pattern partition model, predicts the expected number of synonymous, missense, and other functional mutation types for each gene. Using this software, we show that the created mutation rate models increase the statistical power to detect genes containing disease-causing variants and to identify genes under strong selective constraint.