Comparative Evaluation of LLMs in Clinical Oncology.

Nicholas R Rydzewski, Deepak Dinakaran, Shuang G Zhao, Eytan Ruppin, Ismail Baris Turkbey, Deborah E Citrin, Krishnan R Patel

Published in: NEJM AI (2024)

Of the models tested on a standardized set of oncology questions, GPT-4 was observed to have the highest performance. Although this performance is impressive, all LLMs continue to have clinically significant error rates, including examples of overconfidence and consistent inaccuracies. Given the enthusiasm to integrate these new implementations of AI into clinical practice, continued standardized evaluations of the strengths and limitations of these products will be critical to guide both patients and medical professionals. (Funded by the National Institutes of Health Clinical Center for Research and the Intramural Research Program of the National Institutes of Health; Z99 CA999999.)

Keyphrases

healthcare
quality improvement
public health
clinical practice
palliative care
mental health
end stage renal disease
ejection fraction
newly diagnosed
health information
prognostic factors
patient reported outcomes
risk assessment
deep learning
patient reported