Scalable Feature Engineering from Electronic Free Text Notes to Supplement Confounding Adjustment of Claims-Based Pharmacoepidemiologic Studies.
Richard WyssJoseph M PlasekLi ZhouLily G BessetteSebastian G SchneeweissJeremy A RassenTheodore TsacogianisKueiyu Joshua LinPublished in: Clinical pharmacology and therapeutics (2023)
Natural language processing (NLP) tools turn free-text notes (FTNs) from electronic health records (EHRs) into data features that can supplement confounding adjustment in pharmacoepidemiologic studies. However, current applications are difficult to scale. We used unsupervised NLP to generate high-dimensional feature spaces from FTNs to improve prediction of drug exposure and outcomes compared with claims-based analyses. We linked Medicare claims with EHR data to generate three cohort studies comparing different classes of medications on the risk of various clinical outcomes. We used "bag-of-words" to generate features for the top 20,000 most prevalent terms from FTNs. We compared machine learning (ML) prediction algorithms using different sets of candidate predictors: Set1 (39 researcher-specified variables), Set2 (Set1 + ML-selected claims codes), and Set3 (Set1 + ML-selected NLP-generated features), vs. Set4 (Set1 + 2 + 3). When modeling treatment choice, we observed a consistent pattern across the examples: ML models utilizing Set4 performed best followed by Set2, Set3, then Set1. When modeling the outcome risk, there was little to no improvement beyond models based on Set1. Supplementing claims data with NLP-generated features from free text notes improved prediction of prescribing choices but had little or no improvement on clinical risk prediction. These findings have implications for strategies to improve confounding using EHR data in pharmacoepidemiologic studies.