
Generative Large Language Models for Detection of Speech Recognition Errors in Radiology Reports.

Reuben A Schmidt, Jarrel C Y Seah, Ke Cao, Lincoln J Lim, Wei Lim, Justin Yeung
Published in: Radiology. Artificial intelligence (2024)
"Just Accepted" papers have undergone full peer review and have been accepted for publication in Radiology: Artificial Intelligence . This article will undergo copyediting, layout, and proof review before it is published in its final version. Please note that during production of the final copyedited article, errors may be discovered which could affect the content. This study evaluated the ability of generative large language models (LLMs) to detect speech recognition errors in radiology reports. A dataset of 3,233 CT and MRI reports was assessed by radiologists for speech recognition errors. Errors were categorized as clinically significant or not clinically significant. Performances of five generative LLMs-GPT-3.5-turbo, GPT-4, text-davinci-003, Llama-v2-70B-chat, and Bard-were compared in detecting these errors, using manual error detection as the reference standard. Prompt engineering was used to optimize model performance. GPT-4 demonstrated high accuracy in detecting clinically significant errors (precision 76.9%, recall 100%, F1 86.9%) and not clinically significant errors (93.9% precision, 94.7% recall, 94.3% F1). Text-davinci-003 achieved F1 scores of 72% and 46.6% for clinically significant and not clinically significant errors, respectively. GPT-3.5-turbo obtained 59.1% and 32.2% F1 scores, while Llama-v2-70B-chat scored 72.8% and 47.7%. Bard showed the lowest accuracy, with F1 scores of 47.5% and 20.9%. GPT-4 effectively identified challenging errors of nonsense phrases and internally inconsistent statements. Longer reports, resident dictation, and overnight shifts were associated with higher error rates. In conclusion, advanced generative LLMs show potential for automatic detection of speech recognition errors in radiology reports. ©RSNA, 2024.