Quality of Answers of Generative Large Language Models Versus Peer Users for Interpreting Laboratory Test Results for Lay Patients: Evaluation Study.

Jiang Bian, Balu Bhasuran, Qiao Jin, Shubo Tian, Karim Hanna, Cindy Shavor, Lisbeth Garcia Arguello, Patrick Murray, Zhiyong Lu
Published in: Journal of Medical Internet Research (2024)
By evaluating LLMs' responses to patients' questions about laboratory test results, we found that GPT-4's answers were more accurate, helpful, relevant, and safe than those of 4 other LLMs and human answers from a Q&A website. There were, however, cases in which GPT-4's responses were inaccurate or not individualized. We identified several ways to improve the quality of LLM responses, including prompt engineering, prompt augmentation, retrieval-augmented generation, and response evaluation.