Optimal sampling for positive only electronic health record data.
Seong-Ho Lee, Yanyuan Ma, Ying Wei, Jinbo Chen
Published in: Biometrics (2023)
Identifying a patient's disease/health status from electronic medical records is a frequently encountered task in EHR-related research, and estimating a classification model often requires benchmark training data with patients' known phenotype statuses. However, assessing a patient's phenotype is costly and labor-intensive, so a careful selection of EHR records for the training set is desirable. We propose a procedure that tailors the best training subsample of limited size for a classification model, minimizing its mean squared phenotyping/classification error (MSE). Our approach incorporates "positive only" information, an approximation of the true disease status without false alarms, when it is available. In addition, our sampling procedure is applicable to training a chosen classification model that may be misspecified. We provide theoretical justification of its optimality in terms of MSE. The performance gain from our method is illustrated through simulation and a real data example, and is often found to be satisfactory under criteria beyond mean squared error.
Keyphrases
- electronic health record
- machine learning
- deep learning
- virtual reality
- clinical decision support
- adverse drug
- end stage renal disease
- case report
- chronic kidney disease
- ejection fraction
- big data
- minimally invasive
- high throughput
- healthcare
- prognostic factors
- patient reported outcomes
- health information
- data analysis
- social media
- patient reported