Our benchmark serves as a valuable resource for evaluating the ability of Large Language Models to interpret medical codes and to distinguish between medical concepts. We demonstrate that most current state-of-the-art clinical Large Language Models perform no better than random guessing, whereas GPT-3.5, GPT-4, and Llama3-70B outperform these clinical models, despite not being pre-trained primarily on the medical domain. Our benchmark is available at https://huggingface.co/datasets/ofir408/MedConceptsQA.
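For readers who wish to evaluate their own models, the benchmark can be loaded with the Hugging Face `datasets` library. The sketch below is illustrative only: the configuration name and split are assumptions and should be checked against the dataset card at the URL above.

```python
from datasets import load_dataset

# Minimal sketch of loading the benchmark. The configuration name ("all")
# and split ("test") are assumptions for illustration; consult the dataset
# card for the exact configurations and splits that are available.
dataset = load_dataset("ofir408/MedConceptsQA", name="all", split="test")

# Inspect a sample multiple-choice question.
print(dataset[0])
```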