
EHR-Safe: generating high-fidelity and privacy-preserving synthetic electronic health records.

Jinsung Yoon, Michel Mizrahi, Nahid Farhady Ghalaty, Thomas Jarvinen, Ashwin S Ravi, Peter Brune, Fanyu Kong, Dave Anderson, George Lee, Arie Meir, Farhana Bandukwala, Elli Kanal, Sercan Ö Arık, Tomas Pfister
Published in: NPJ digital medicine (2023)
Privacy concerns are often the key bottleneck for sharing data between data holders and consumers, particularly for sensitive data such as Electronic Health Records (EHR). This impedes the application of data analytics and ML-based innovations that have tremendous potential. One promising way to address such privacy concerns is to use synthetic data instead. We propose a generative modeling framework, EHR-Safe, for generating highly realistic and privacy-preserving synthetic EHR data. EHR-Safe is based on a two-stage model that consists of sequential encoder-decoder networks and generative adversarial networks. Our innovations focus on the key challenging aspects of real-world EHR data: heterogeneity, sparsity, coexistence of numerical and categorical features with distinct characteristics, and time-varying features with highly varying sequence lengths. Across numerous evaluations, we demonstrate that the fidelity of EHR-Safe data is almost identical to that of real data (<3% accuracy difference between models trained on each) while yielding almost ideal performance on practical privacy metrics.
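The fidelity figure quoted above (an accuracy difference between models trained on real and on synthetic data) can be illustrated with a small sketch. The abstract does not spell out the evaluation protocol, so the setup below is an assumption: train one model on real data and one on synthetic data, score both on the same held-out real data, and report the gap. The toy data, the nearest-centroid classifier, and the names `make_data`, `fit_centroids`, `accuracy`, and `fidelity_gap` are all hypothetical and not part of EHR-Safe; the "synthetic" set is just a second draw from the same toy distribution, standing in for a generator's output.

```python
import random

def make_data(n, seed, noise=1.0):
    """Toy two-class data (hypothetical): one numeric feature, class means at -1 and +1."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        x = rng.gauss(2.0 * label - 1.0, noise)
        rows.append((x, label))
    return rows

def fit_centroids(rows):
    """Nearest-centroid classifier: store the mean feature value per class."""
    means = {}
    for label in (0, 1):
        xs = [x for x, y in rows if y == label]
        means[label] = sum(xs) / len(xs)
    return means

def accuracy(means, rows):
    """Fraction of rows whose nearest class centroid matches the true label."""
    correct = 0
    for x, y in rows:
        pred = min(means, key=lambda c: abs(x - means[c]))
        correct += pred == y
    return correct / len(rows)

def fidelity_gap(real_train, synthetic_train, real_test):
    """Train on real and on synthetic data, evaluate both on held-out real data."""
    acc_real = accuracy(fit_centroids(real_train), real_test)
    acc_syn = accuracy(fit_centroids(synthetic_train), real_test)
    return acc_real, acc_syn, abs(acc_real - acc_syn)

real_train = make_data(2000, seed=0)
synthetic_train = make_data(2000, seed=1)  # stand-in for a generator's output (assumption)
real_test = make_data(1000, seed=2)

acc_real, acc_syn, gap = fidelity_gap(real_train, synthetic_train, real_test)
print(f"real={acc_real:.3f} synthetic={acc_syn:.3f} gap={gap:.3f}")
```

Under this reading, a small gap means a downstream model loses little by being trained on synthetic rather than real records. A real evaluation would replace the toy classifier and data with the actual predictive models and EHR features.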
Keyphrases
  • electronic health record
  • clinical decision support
  • big data
  • adverse drug
  • health information
  • machine learning
  • artificial intelligence
  • healthcare
  • single cell
  • deep learning
  • body composition
  • resistance training