Errors in multiple variables in human immunodeficiency virus (HIV) cohort and electronic health record data: statistical challenges and opportunities.

Published in: Statistical communications in infectious diseases (2020)

Objectives: Observational data derived from patient electronic health records (EHR) data are increasingly used for human immunodeficiency virus/acquired immunodeficiency syndrome (HIV/AIDS) research. There are challenges to using these data, in particular with regards to data quality; some are recognized, some unrecognized, and some recognized but ignored. There are great opportunities for the statistical community to improve inference by incorporating validation subsampling into analyses of EHR data. Methods: Methods to address measurement error, misclassification, and missing data are relevant, as are sampling designs such as two-phase sampling. However, many of the existing statistical methods for measurement error, for example, only address relatively simple settings, whereas the errors seen in these datasets span multiple variables (both predictors and outcomes), are correlated, and even affect who is included in the study. Results/Conclusion: We will discuss some preliminary methods in this area with a particular focus on time-to-event outcomes and outline areas of future research.

Keyphrases