Data Set and Benchmark (MedGPTEval) to Evaluate Responses From Large Language Models in Medicine: Evaluation Development and Validation.

Jie Xu, Lu Lu, Xinwei Peng, Jiali Pang, Jinru Ding, Lingrui Yang, Huan Song, Kang Li, Xin Sun, Shaoting Zhang

Published in: JMIR Medical Informatics (2024)

MedGPTEval provides comprehensive criteria for evaluating LLM-based chatbots in the medical domain, together with open-source data sets and benchmarks that assess 3 LLMs. Experimental results show that Dr PJ outperforms ChatGPT and ERNIE Bot in both social and professional contexts. The assessment system can therefore be readily adopted by researchers in this community to augment an open-source data set.

Keyphrases

electronic health record
healthcare
big data
mental health
autism spectrum disorder
machine learning
editorial comment