Login / Signup

Inference of the distribution of fitness effects of mutations is affected by single nucleotide polymorphism filtering methods, sample size and population structure.

Bea Angelica AnderssonWei ZhaoBenjamin C HallerÅke BrännströmXiao-Ru Wang
Published in: Molecular ecology resources (2023)
The distribution of fitness effects (DFE) of new mutations has been of interest to evolutionary biologists since the concept of mutations arose. Modern population genomic data enable us to quantify the DFE empirically, but few studies have examined how data processing, sample size and cryptic population structure might affect the accuracy of DFE inference. We used simulated and empirical data (from Arabidopsis lyrata) to show the effects of missing data filtering, sample size, number of single nucleotide polymorphisms (SNPs) and population structure on the accuracy and variance of DFE estimates. Our analyses focus on three filtering methods-downsampling, imputation and subsampling-with sample sizes of 4-100 individuals. We show that (1) the choice of missing-data treatment directly affects the estimated DFE, with downsampling performing better than imputation and subsampling; (2) the estimated DFE is less reliable in small samples (<8 individuals), and becomes unpredictable with too few SNPs (<5000, the sum of 0- and 4-fold SNPs); and (3) population structure may skew the inferred DFE towards more strongly deleterious mutations. We suggest that future studies should consider downsampling for small data sets, and use samples larger than 4 (ideally larger than 8) individuals, with more than 5000 SNPs in order to improve the robustness of DFE inference and enable comparative analyses.
Keyphrases
  • electronic health record
  • big data
  • genome wide
  • body composition
  • machine learning
  • data analysis
  • single cell
  • gene expression
  • physical activity
  • dna methylation
  • copy number
  • current status