Evaluating Capabilities of Large Language Models: Performance of GPT4 on Surgical Knowledge Assessments.

Brendin R Beaulieu-Jones, Sahaj Shah, Margaret T Berrigan, Jayson S Marwaha, Shuo-Lun Lai, Gabriel A Brat

Published in: medRxiv : the preprint server for health sciences (2023)

Consistent with prior findings, we demonstrate robust performance of ChatGPT within the surgical domain, near or above human level. Unique to this study, we demonstrate substantial inconsistency in ChatGPT responses when the same query is repeated. This finding warrants further consideration and presents an opportunity to further train these models to provide safe and consistent responses. Without mental and/or conceptual models, it is unclear whether language models such as ChatGPT would be able to safely assist clinicians in providing care.

Keyphrases

healthcare
palliative care
endothelial cells
autism spectrum disorder
mental health
quality improvement
mass spectrometry
chronic pain
high speed
affordable care act