Evaluating Capabilities of Large Language Models: Performance of GPT4 on Surgical Knowledge Assessments.

Brendin R Beaulieu-Jones, Sahaj Shah, Margaret T Berrigan, Jayson S Marwaha, Shuo-Lun Lai, Gabriel A Brat
Published in: medRxiv : the preprint server for health sciences (2023)
Consistent with prior findings, we demonstrate robust performance of ChatGPT within the surgical domain, near or above human level. Unique to this study, we demonstrate substantial inconsistency in ChatGPT's responses when the same query is repeated. This finding warrants future consideration and presents an opportunity to further train these models to provide safe and consistent responses. Without mental and/or conceptual models, it is unclear whether language models such as ChatGPT would be able to safely assist clinicians in providing care.
Keyphrases
  • healthcare
  • palliative care
  • endothelial cells
  • autism spectrum disorder
  • mental health
  • quality improvement
  • mass spectrometry
  • chronic pain
  • high speed
  • affordable care act