Data Set and Benchmark (MedGPTEval) to Evaluate Responses From Large Language Models in Medicine: Evaluation Development and Validation.

Jie Xu, Lu Lu, Xinwei Peng, Jiali Pang, Jinru Ding, Lingrui Yang, Huan Song, Kang Li, Xin Sun, Shaoting Zhang
Published in: JMIR Medical Informatics (2024)
MedGPTEval provides comprehensive criteria for evaluating LLM-based chatbots in the medical domain, together with open-source data sets and benchmarks assessing 3 LLMs. Experimental results show that Dr PJ outperforms ChatGPT and ERNIE Bot in both social and professional contexts. Such an assessment system can therefore be readily adopted by researchers in this community to augment an open-source data set.
Keyphrases
  • electronic health record
  • healthcare
  • big data
  • mental health
  • autism spectrum disorder
  • machine learning
  • editorial comment