Generalizability and Diagnostic Performance of AI Models for Thyroid US.

WenWen Xu, XiaoHong Jia, ZiHan Mei, XiaoLin Gu, Yang Lu, Chi-Cheng Fu, RuiFang Zhang, Ying Gu, Xia Chen, XiaoMao Luo, Ning Li, BaoYan Bai, QiaoYing Li, JiPing Yan, Hong Zhai, Ling Guan, Bing Gong, KeYang Zhao, Qu Fang, Chuan He, Wei Wei Zhan, Ting Luo, HuiTing Zhang, YiJie Dong, Jian Qiao Zhou

Published in: Radiology (2023)

Background Artificial intelligence (AI) models have improved US assessment of thyroid nodules; however, the lack of generalizability limits the application of these models.
Purpose To develop AI models for segmentation and classification of thyroid nodules in US using diverse data sets from nationwide hospitals and multiple vendors, and to measure the impact of the AI models on diagnostic performance.
Materials and Methods This retrospective study included consecutive patients with pathologically confirmed thyroid nodules who underwent US using equipment from 12 vendors at 208 hospitals across China from November 2017 to January 2019. The detection, segmentation, and classification models were developed based on the subset or complete set of images. Model performance was evaluated by precision and recall, Dice coefficient, and area under the receiver operating characteristic curve (AUC) analyses. Three scenarios (diagnosis without AI assistance, with freestyle AI assistance, and with rule-based AI assistance) were compared with three senior and three junior radiologists to optimize incorporation of AI into clinical practice.
Results A total of 10 023 patients (median age, 46 years [IQR 37-55 years]; 7669 female) were included. The detection, segmentation, and classification models had an average precision, Dice coefficient, and AUC of 0.98 (95% CI: 0.96, 0.99), 0.86 (95% CI: 0.86, 0.87), and 0.90 (95% CI: 0.88, 0.92), respectively. The segmentation model trained on the nationwide data and classification model trained on the mixed vendor data exhibited the best performance, with a Dice coefficient of 0.91 (95% CI: 0.90, 0.91) and AUC of 0.98 (95% CI: 0.97, 1.00), respectively. The AI model outperformed all senior and junior radiologists (P < .05 for all comparisons), and the diagnostic accuracies of all radiologists were improved (P < .05 for all comparisons) with rule-based AI assistance.
Conclusion Thyroid US AI models developed from diverse data sets had high diagnostic performance among the Chinese population. Rule-based AI assistance improved the performance of radiologists in thyroid cancer diagnosis. © RSNA, 2023 Supplemental material is available for this article.
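
The abstract reports model performance in terms of precision and recall, Dice coefficient, and AUC. As a minimal sketch of how these metrics are commonly defined (this is not the authors' code; the function names and toy inputs are hypothetical, and the "average precision" reported for the detection model is a summary over the precision-recall curve rather than the single-threshold precision shown here), one might write:

```python
def dice_coefficient(pred_mask, true_mask):
    """Dice = 2 * |A intersect B| / (|A| + |B|) for two flat binary (0/1) masks."""
    intersection = sum(p * t for p, t in zip(pred_mask, true_mask))
    total = sum(pred_mask) + sum(true_mask)
    # Convention (an assumption here): two empty masks agree perfectly.
    return 1.0 if total == 0 else 2.0 * intersection / total


def precision_recall(pred_labels, true_labels):
    """Precision and recall at one fixed decision threshold (labels are 0/1)."""
    pairs = list(zip(pred_labels, true_labels))
    tp = sum(1 for p, t in pairs if p == 1 and t == 1)
    fp = sum(1 for p, t in pairs if p == 1 and t == 0)
    fn = sum(1 for p, t in pairs if p == 0 and t == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def auc(scores, labels):
    """AUC as the probability that a random positive case scores higher
    than a random negative case (ties count as one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy inputs for illustration only; they are not data from the study.
    print(dice_coefficient([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))
    print(precision_recall([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))
    print(auc([0.9, 0.8, 0.3, 0.4], [1, 0, 0, 1]))
```

The pairwise AUC above is quadratic in the number of cases, which is fine for illustration; a rank-based formulation is the usual choice for data sets of the size described in the abstract.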
