Login / Signup

A SEMIPARAMETRIC MULTIPLE IMPUTATION APPROACH TO FULLY SYNTHETIC DATA FOR COMPLEX SURVEYS.

Mandi YuYulei HeTrivellore E Raghunathan
Published in: Journal of survey statistics and methodology (2022)
Data synthesis is an effective statistical approach for reducing data disclosure risk. Generating fully synthetic data might minimize such risk, but its modeling and application can be difficult for data from large, complex surveys. This article extended the two-stage imputation to simultaneously impute item missing values and generate fully synthetic data. A new combining rule for making inferences using data generated in this manner was developed. Two semiparametric missing data imputation models were adapted to generate fully synthetic data for skewed continuous variable and sparse binary variable, respectively. The proposed approach was evaluated using simulated data and real longitudinal data from the Health and Retirement Study. The proposed approach was also compared with two existing synthesis approaches: (1) parametric regressions models as implemented in IVEware ; and (2) nonparametric Classification and Regression Trees as implemented in synthpop package for R using real data. The results show that high data utility is maintained for a wide variety of descriptive and model-based statistics using the proposed strategy. The proposed strategy also performs better than existing methods for sophisticated analyses such as factor analysis.
Keyphrases
  • electronic health record
  • big data
  • public health
  • healthcare
  • machine learning
  • risk assessment
  • cross sectional
  • deep learning
  • social media
  • climate change
  • breast cancer risk