Large Language Models Improve the Identification of Emergency Department Visits for Symptomatic Kidney Stones.

Cosmin A Bejan, Amy M Reed, Matthew Mikula, Siwei Zhang, Yaomin Xu, Daniel Fabbri, Peter J Embi, Ryan S Hsi

Published in: medRxiv : the preprint server for health sciences (2024)

The best results were achieved by GPT-4 (macro-F1=0.833, 95% confidence interval [CI]=0.826-0.841) and GPT-3.5 (macro-F1=0.796, 95% CI=0.796-0.796), both statistically significantly better than the ICD-based baseline (macro-F1=0.71). Ablation studies revealed that the initial pre-trained GPT-3.5 model benefits from fine-tuning when the same parameter configuration is used. Adding demographic information and prior disease history to the prompts allows LLMs to make more accurate decisions. The bias evaluation found that GPT-4 exhibited no racial or gender disparities, in contrast to GPT-3.5, which failed to effectively model racial diversity. The analysis of explanations provided by GPT-4 demonstrates this model's advanced capabilities in understanding clinical text and reasoning with medical knowledge.
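For readers unfamiliar with the metric reported above, the following is a minimal sketch of how a macro-F1 score is computed: the F1 score is calculated separately for each class and the per-class scores are averaged with equal weight. The function names and the toy labels are illustrative assumptions, not the study's data or code, and the sketch does not reproduce how the authors derived their confidence intervals.

```python
from typing import Hashable, Sequence


def f1_for_class(y_true: Sequence[Hashable], y_pred: Sequence[Hashable], label: Hashable) -> float:
    """F1 score treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    denom = 2 * tp + fp + fn
    # Convention assumed here: F1 is 0.0 when the class never appears
    # in either the true or the predicted labels.
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in labels) / len(labels)


# Toy labels (hypothetical): 1 = symptomatic kidney stone visit, 0 = other visit.
toy_true = [1, 1, 1, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 0, 0, 0, 0, 0, 1]
toy_score = macro_f1(toy_true, toy_pred)
print(f"macro-F1 = {toy_score:.3f}")
```

Because each class contributes equally to the average, macro-F1 penalizes a classifier that performs well only on the majority class, which matters when the target visits are a minority of all emergency department visits.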

Keyphrases