An Unsupervised Error Detection Methodology for Detecting Mislabels in Healthcare Analytics.
Pei-Yuan Zhou, Faith Lum, Tony Jiecao Wang, Anubhav Bhatti, Surajsinh Parmar, Chen Dan, Andrew K C Wong
Published in: Bioengineering (Basel, Switzerland) (2024)
Medical datasets may be imbalanced and may contain errors caused by subjective test results and clinical variability. Poor data quality reduces the accuracy and reliability of classification, so detecting abnormal samples in a dataset can help clinicians make better decisions. In this study, we propose an unsupervised error detection method that uses patterns discovered by the Pattern Discovery and Disentanglement (PDD) model developed in our earlier work. Applied to a large dataset, the eICU Collaborative Research Database, for sepsis risk assessment, the proposed algorithm discovers statistically significant association patterns, generates an interpretable knowledge base, clusters samples without supervision, and detects abnormal samples in the dataset. In our experiments, the method outperformed K-Means in unsupervised clustering by 38% on the full dataset and 47% on the reduced dataset. After abnormal samples were removed by the proposed error detection approach, the accuracy of multiple supervised classifiers improved by an average of 4%. The proposed algorithm therefore provides a robust and practical solution for unsupervised clustering and error detection in healthcare data.
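The general idea of cluster-based mislabel detection can be illustrated with a minimal sketch. This is not the PDD method, whose details the abstract does not give; it swaps in plain K-Means (the baseline named above) and flags samples whose recorded label disagrees with the majority label of their cluster. The function names, the toy data, and the deterministic centroid initialisation are all assumptions made for illustration.

```python
from collections import Counter


def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means (illustrative stand-in, not PDD).

    Centroids are initialised deterministically from evenly spaced samples.
    """
    step = max(1, len(points) // k)
    centroids = [points[i * step] for i in range(k)]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        assign = [
            min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assign


def flag_suspected_mislabels(points, labels, k):
    """Return indices whose label differs from their cluster's majority label."""
    assign = kmeans(points, k)
    majority = {
        c: Counter(l for l, a in zip(labels, assign) if a == c).most_common(1)[0][0]
        for c in set(assign)
    }
    return [i for i, (l, a) in enumerate(zip(labels, assign)) if l != majority[a]]


# Hypothetical toy data: two well-separated groups, one odd label in each.
points = [
    (0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (0.3, 0.2),
    (5.0, 5.1), (5.2, 4.9), (4.8, 5.0), (5.1, 5.3),
]
labels = ["low", "low", "high", "low", "high", "high", "high", "low"]

suspects = flag_suspected_mislabels(points, labels, k=2)
print(suspects)  # indices of samples to review before training a classifier
```

In a workflow like the one the abstract describes, the flagged samples would be reviewed or removed before supervised classifiers are trained on the remaining data.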
Keyphrases
- machine learning
- big data
- healthcare
- loop mediated isothermal amplification
- artificial intelligence
- risk assessment
- deep learning
- real time pcr
- electronic health record
- label free
- small molecule
- intensive care unit
- acute kidney injury
- patient safety
- adverse drug
- emergency department
- human health
- palliative care
- physical activity
- climate change
- sensitive detection
- health information