Variation in model performance by data cleanliness and classification methods in the prediction of 30-day ICU mortality, a US nationwide retrospective cohort and simulation study.

Theodore J Iwashyna, Cheng Ma, Xiao Qing Wang, Sarah Seelye, Ji Zhu, Akbar K Waljee

Published in: BMJ Open (2020)

Variation in discrimination was seen as a function of data cleanliness, with logistic regression suffering the most loss of discrimination in the least clean data. Losses in discrimination were not present in random forest and neural networks even in naively extracted data. Data from a large nationwide health system revealed interactions between missing data imputation techniques, data cleanliness and classification methods for predicting 30-day mortality.
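
As a rough illustration of what a loss of discrimination means, the sketch below assumes discrimination is summarized by the area under the ROC curve (the c-statistic), which the summary above does not state explicitly. The function name, the labels and the risk scores are all invented for illustration and are not taken from the study. It compares the same toy predictions before and after a single record is corrupted.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney U).

    labels: 0/1 outcomes (1 = died within 30 days, hypothetical).
    scores: predicted risks, where higher means more likely to die.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # death ranked above survivor: concordant pair
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


# Toy data, invented for illustration only.
labels = [0, 0, 0, 1, 1, 1]
clean_scores = [0.1, 0.2, 0.3, 0.6, 0.7, 0.9]
# Same predictions after one survivor's score is inflated,
# for example by an implausible input value in uncleaned data.
dirty_scores = [0.1, 0.2, 0.95, 0.6, 0.7, 0.9]

auc_clean = auc(labels, clean_scores)
auc_dirty = auc(labels, dirty_scores)
print(f"clean AUC = {auc_clean:.3f}, dirty AUC = {auc_dirty:.3f}")
```

In this toy case one corrupted record is enough to lower the AUC. It says nothing about which classifier is more robust; that comparison is the subject of the study itself.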

Keyphrases

electronic health record
big data
neural network
cardiovascular disease
cross sectional
data analysis