Does Differentially Private Synthetic Data Lead to Synthetic Discoveries?

Ileana Montoya Perez, Parisa Movahedi, Valtteri Nieminen, Antti Airola, Tapio Pahikkala
Published in: Methods of Information in Medicine (2024)
A large portion of the evaluation results exhibited dramatically inflated Type I error, especially at privacy budgets of ϵ ≤ 1. This result calls for caution when releasing and analyzing DP-synthetic data: low p-values may arise in statistical tests simply as a byproduct of the noise added to protect privacy. A DP smoothed histogram-based synthetic data generation method was shown to produce valid Type I error at all privacy levels tested, but it required a large original dataset and a modest privacy budget (ϵ ≥ 5) to reach reasonable Type II error levels.
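To make the role of the privacy noise concrete, the following is a minimal sketch of histogram-based DP synthetic data generation. It uses a Laplace-perturbed histogram, which is a simpler related mechanism and not necessarily the smoothed-histogram method evaluated in the paper. The function names, the fixed public range `[lo, hi]`, and the add/remove-one-record neighboring definition (count sensitivity 1) are all assumptions made for illustration.

```python
import random


def laplace_noise(scale, rng):
    # A Laplace(0, scale) draw is the difference of two i.i.d.
    # exponential draws with rate 1/scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_histogram_synthetic(values, lo, hi, n_bins, epsilon, n_synth, rng):
    """Hypothetical helper: sample synthetic values from a noisy histogram.

    Assumes values lie in the public range [lo, hi] and that neighboring
    datasets differ by adding or removing one record, so each bin count
    has sensitivity 1 and Laplace noise of scale 1/epsilon is used.
    """
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        idx = int((v - lo) / width)
        counts[min(max(idx, 0), n_bins - 1)] += 1

    # Perturb every bin count, then clip negatives to zero (post-processing).
    scale = 1.0 / epsilon
    noisy = [max(c + laplace_noise(scale, rng), 0.0) for c in counts]
    if sum(noisy) == 0.0:
        noisy = [1.0] * n_bins  # fall back to uniform if everything was clipped

    # Sample bins in proportion to the noisy counts, then draw uniformly
    # within each selected bin.
    bins = rng.choices(range(n_bins), weights=noisy, k=n_synth)
    return [lo + (b + rng.random()) * width for b in bins]


rng = random.Random(0)
original = [rng.gauss(50.0, 10.0) for _ in range(200)]
original = [min(max(x, 0.0), 100.0) for x in original]  # clamp to public range
synthetic = dp_histogram_synthetic(original, 0.0, 100.0, 10, 1.0, 200, rng)
```

At small ϵ the noise scale 1/ϵ becomes comparable to the bin counts themselves, so the synthetic sample can drift away from the original distribution. Differences between groups in such data may then reflect the noise rather than the underlying population.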
Keyphrases
  • big data
  • electronic health record
  • health information
  • artificial intelligence
  • machine learning
  • healthcare
  • emergency department
  • magnetic resonance imaging
  • air pollution
  • patient safety
  • health insurance