SNP variable selection by generalized graph domination.

Shuzhen Sun, Zhuqi Miao, Blaise Ratcliffe, Polly Campbell, Bret Pasch, Yousry A El-Kassaby, Balabhaskar Balasundaram, Charles Chen
Published in: PloS one (2019)
The k-dominating set, a parameterized generalization of the graph-theoretic dominating set, was used to model SNP (single nucleotide polymorphism) data as a similarity network and to search for representative SNP variables. In particular, each SNP was represented as a vertex in the graph, and (dis)similarity measures such as correlation coefficients or pairwise linkage disequilibrium were estimated to describe the relationship between each pair of SNPs; a pair of vertices is adjacent, i.e., joined by an edge, if the pairwise similarity measure exceeds a user-specified threshold. A minimum k-dominating set in the SNP graph was then obtained as the smallest subset of SNPs such that every SNP excluded from the subset has at least k neighbors among the selected ones. The strength of k-dominating set selection in identifying independent variables, and in culling representative variables that are highly correlated with others, was demonstrated on a simulated dataset. The advantages of k-dominating set variable selection were also illustrated in two applications: pedigree reconstruction using SNP profiles of 1,372 Douglas-fir trees, and species delineation for 226 grasshopper mouse samples. C++ source code that implements SNP-SELECT and uses the Gurobi optimization solver for k-dominating set variable selection is available (https://github.com/transgenomicsosu/SNP-SELECT).
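The graph construction and selection rule described in the abstract can be illustrated with a small Python sketch. This is not the authors' SNP-SELECT code, which is written in C++ and relies on the Gurobi solver; instead it uses brute-force enumeration, which is only practical for a handful of SNPs. The function names, the toy similarity matrix, and the threshold of 0.8 are all hypothetical choices made for illustration.

```python
from itertools import combinations


def build_graph(similarity, threshold):
    """Return adjacency sets; SNPs i and j are adjacent when their
    pairwise similarity exceeds the user-specified threshold."""
    n = len(similarity)
    adj = {i: set() for i in range(n)}
    for i, j in combinations(range(n), 2):
        if similarity[i][j] > threshold:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def is_k_dominating(adj, selected, k):
    """True if every unselected vertex has at least k selected neighbors."""
    selected = set(selected)
    return all(len(adj[v] & selected) >= k for v in adj if v not in selected)


def minimum_k_dominating_set(adj, k):
    """Exhaustive search by increasing subset size (illustrative only;
    exponential time, so unsuitable for real SNP panels)."""
    vertices = sorted(adj)
    for size in range(len(vertices) + 1):
        for subset in combinations(vertices, size):
            if is_k_dominating(adj, subset, k):
                return set(subset)
    return set(vertices)  # unreachable: the full vertex set always qualifies


# Hypothetical toy data: SNPs 0-2 are mutually similar, SNP 3 is similar
# only to SNP 2, and SNP 4 is independent of all the others.
sim = [
    [1.0, 0.9, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.9, 0.2, 0.1],
    [0.9, 0.9, 1.0, 0.85, 0.0],
    [0.1, 0.2, 0.85, 1.0, 0.1],
    [0.0, 0.1, 0.0, 0.1, 1.0],
]
adj = build_graph(sim, threshold=0.8)
print(minimum_k_dominating_set(adj, k=1))  # {2, 4}
```

In this toy graph the independent SNP 4 has no neighbors and so must be selected, while SNP 2 represents the correlated group 0, 1, and 3. Raising k to 2 forces more SNPs into the selection, because each excluded SNP then needs two selected neighbors.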
Keyphrases
  • genome wide
  • high density
  • dna methylation
  • genetic diversity
  • convolutional neural network
  • gene expression
  • cross sectional
  • hepatitis c virus
  • network analysis
  • hiv testing
  • data analysis