Performance of three artificial intelligence (AI)-based large language models in standardized testing; implications for AI-assisted dental education.

Hamoun Sabri, Muhammad H A Saleh, Parham Hazrati, Keith Merchant, Jonathan Misch, Purnima S Kumar, Hom-Lay Wang, Shayan Barootchi

Published in: Journal of periodontal research (2024)

Within the confines of this analysis, ChatGPT-4 exhibited a robust capability in answering AAP in-service exam questions in terms of accuracy and reliability, while Gemini and ChatGPT-3.5 performed less well. These findings underscore the potential of deploying LLMs as an educational tool in the periodontics and oral implantology domains. However, the current limitations of these models should be considered: an inability to effectively process image-based inquiries, a propensity for generating inconsistent responses to the same prompts, and accuracy rates that are high (80% for GPT-4) but not absolute. An objective comparison of their capability versus their capacity is required to further develop this field of study.
