Classifying Refugee Status Using Common Features in EMR.
Malia Morrison, Vanessa Nobles, Crista E Johnson-Agbakwu, Celeste Bailey, Li Liu. Published in: Chemistry & biodiversity (2022)
Automated and accurate identification of refugees in healthcare databases is a critical first step toward investigating the healthcare needs of this vulnerable population and reducing health disparities. In this study, we developed a machine-learning method, named the refugee identification system (RIS), to address this need. We curated a data set consisting of 103 refugees and 930 non-refugees in Arizona. We compiled de-identified individual-level information (age, primary language, and noise-masked home address), together with state-level refugee resettlement statistics and world language statistics. We then performed feature engineering to convert language and masked address into quantitative features. Finally, we built a random forest model to classify refugees and non-refugees. RIS achieved high classification accuracy (overall accuracy = 0.97, specificity = 0.99, sensitivity = 0.85, positive predictive value = 0.88, negative predictive value = 0.98, and area under the receiver operating characteristic curve = 0.98). RIS is customizable for refugee identification outside Arizona. Its application enables large-scale investigation of refugee healthcare needs and supports efforts to reduce health disparities.
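
The performance measures reported above follow the standard confusion-matrix definitions. As a minimal sketch of how they relate, the Python snippet below computes them from raw counts. The function name `classification_metrics` and the counts passed to it are hypothetical and chosen only for illustration; they are not the study's confusion matrix, which the abstract does not report.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute standard binary-classification metrics from confusion-matrix counts.

    tp: refugees correctly flagged          fn: refugees missed
    tn: non-refugees correctly not flagged  fp: non-refugees wrongly flagged
    """
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,      # fraction of all records classified correctly
        "sensitivity": tp / (tp + fn),      # fraction of true refugees that were identified
        "specificity": tn / (tn + fp),      # fraction of true non-refugees that were not flagged
        "ppv": tp / (tp + fp),              # positive predictive value
        "npv": tn / (tn + fn),              # negative predictive value
    }


# Hypothetical counts for illustration only (NOT taken from the study).
metrics = classification_metrics(tp=8, fp=5, tn=85, fn=2)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

With a data set as imbalanced as the one described (about one refugee for every nine non-refugees), overall accuracy is dominated by the majority class, so sensitivity and positive predictive value are the more informative figures for the minority class.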