A probabilistic graphical model for estimating selection coefficient of missense variants from human population sequence data.
Yige Zhao, Guojie Zhong, Jake Hagen, Hongbing Pan, Wendy K Chung, Yufeng Shen. Published in: medRxiv : the preprint server for health sciences (2023)
Accurately predicting the effect of missense variants is a central problem in the interpretation of genomic variation. Commonly used computational methods do not capture the quantitative impact on fitness in populations. We developed MisFit to estimate the fitness effect of missense variants using biobank-scale human population genome data. MisFit jointly models the effect at the molecular level (d) and at the population level (selection coefficient, s), assuming that, within the same gene, missense variants with similar d have similar s. MisFit is a probabilistic graphical model that efficiently integrates deep neural network components and population genetics models, with an inductive bias based on the biological causality of variant effects. We trained it by maximizing the probability of observed allele counts in 236,017 European individuals. We show that s is informative in predicting allele frequency across ancestries and is consistent with the fraction of de novo mutations given s. Finally, MisFit outperforms previous methods in prioritizing missense variants in individuals with neurodevelopmental disorders.
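To make the idea of "maximizing probability of observed allele counts" concrete, here is a minimal sketch for a single variant. It is not MisFit's model: it assumes a textbook deterministic mutation-selection balance, in which the expected allele frequency is roughly mu / s, and treats the observed allele count as Poisson around that expectation. The function names, the per-site mutation rate `mu`, and the grid search are all assumptions introduced for illustration; the abstract does not specify MisFit's likelihood, which additionally ties s to the molecular effect d through neural network components.

```python
import math


def poisson_log_likelihood(allele_count, n_chromosomes, mu, s):
    """Log-probability of an observed allele count.

    Assumption (not from the paper): the count is Poisson with mean
    n_chromosomes * mu / s, i.e. deterministic mutation-selection balance.
    """
    lam = n_chromosomes * mu / s
    return allele_count * math.log(lam) - lam - math.lgamma(allele_count + 1)


def fit_s(allele_count, n_chromosomes, mu, grid_size=2000):
    """Maximum-likelihood s by grid search on a log-spaced grid in [1e-6, 1]."""
    best_s, best_ll = None, -math.inf
    for i in range(grid_size + 1):
        s = 10 ** (-6 + 6 * i / grid_size)
        ll = poisson_log_likelihood(allele_count, n_chromosomes, mu, s)
        if ll > best_ll:
            best_s, best_ll = s, ll
    return best_s


# 236,017 diploid individuals carry 2 * 236,017 chromosomes.
N_CHROMOSOMES = 2 * 236_017
MU = 1e-8  # hypothetical per-site, per-generation mutation rate

for count in (0, 3, 30):
    print(count, fit_s(count, N_CHROMOSOMES, MU))
```

Under this simplified likelihood, rarer variants receive larger estimates of s, and a variant that is never observed is pushed to the upper end of the grid. A single allele count carries little information on its own, which is one reason to share information across variants with similar d within a gene, as the abstract describes.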
Keyphrases
- copy number
- intellectual disability
- endothelial cells
- neural network
- genome wide
- physical activity
- body composition
- electronic health record
- autism spectrum disorder
- emergency department
- big data
- dna methylation
- peripheral blood
- computed tomography
- gene expression
- single molecule
- resistance training
- contrast enhanced
- genetic diversity