A Comparison of High Dimensional Variable Selection Methods with Missing Covariates in a Prostate Cancer Study.

Chi Chen, Jiwei Zhao, Jeffrey Miecznikowski, Marianthi Markatou
Published in: Communications in statistics. Case studies, data analysis and applications (2019)
Prostate cancer is the most common cancer in American men. From a biological perspective, dozens of specific genes have been shown to be correlated with prostate cancer and with whether a case is benign or non-benign. In this paper, we apply a penalized logistic regression model with different penalty functions to select genes that contribute to distinguishing benign from non-benign cases, based on data from a prostate cancer study. The tuning parameter is determined by cross-validation. Some genes that biological research has classified as prostate cancer genes have missing values; to take these genes into account, we adopt multiple imputation to create complete data sets. We analyze the prostate cancer data by comparing the selection results obtained from completely observed data only with those obtained from imputed data. We also conduct a simulation study to validate our proposed method.
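The workflow described in the abstract (multiple imputation, then penalized logistic regression tuned by cross-validation) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes scikit-learn, uses an L1 (lasso) penalty as a stand-in for the "different penalty functions", and summarizes results across imputations by selection frequency, which is an assumption rather than a rule stated in the abstract. The data are synthetic, and the function name `selection_frequency` is hypothetical.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LogisticRegressionCV


def selection_frequency(X, y, n_imputations=5, seed=0):
    """Fraction of imputed data sets in which each column gets a nonzero L1 coefficient."""
    counts = np.zeros(X.shape[1])
    for m in range(n_imputations):
        # sample_posterior=True draws imputed values at random, so each
        # pass through the loop yields a different completed data set.
        imputer = IterativeImputer(
            sample_posterior=True, max_iter=5, random_state=seed + m
        )
        X_complete = imputer.fit_transform(X)

        # L1-penalized logistic regression; the penalty strength C is
        # chosen by 5-fold cross-validation over a grid of 10 values.
        model = LogisticRegressionCV(
            Cs=10, cv=5, penalty="l1", solver="liblinear", random_state=seed
        )
        model.fit(X_complete, y)
        counts += model.coef_.ravel() != 0
    return counts / n_imputations


# Synthetic stand-in for gene-level covariates (assumption: not the study data).
rng = np.random.default_rng(0)
n, p = 200, 20
X_full = rng.normal(size=(n, p))
logits = 2.0 * X_full[:, 0] - 2.0 * X_full[:, 1]
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logits))).astype(int)

# Make column 0 (a truly relevant covariate) missing for about 20% of rows.
X_missing = X_full.copy()
X_missing[rng.random(n) < 0.2, 0] = np.nan

freq = selection_frequency(X_missing, y)
print(np.round(freq, 2))
```

A complete-case analysis would instead drop every row with a missing value before fitting. Comparing its selected columns with the frequencies above mirrors, in spirit, the comparison between completely observed and imputed data described in the abstract.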
Keyphrases
  • prostate cancer
  • radical prostatectomy
  • electronic health record
  • big data
  • genome wide
  • papillary thyroid
  • gene expression
  • genome wide identification
  • squamous cell