Reply to Hu et al.: Applying different evaluation standards to humans vs. Large Language Models overestimates AI performance.
Evelina Leivada, Fritz Günther, Vittoria Dentella
Published in: Proceedings of the National Academy of Sciences of the United States of America (2024)