Large Language Models for Automated Synoptic Reports and Resectability Categorization in Pancreatic Cancer.
Rajesh Bhayana, Bipin Nanda, Taher Dehkharghanian, Yangqing Deng, Nishaant Bhambra, Gavin J. B. Elias, Daksh Datta, Avinash R. Kambadakone, Chaya G. Shwaartz, Carol-Anne Moulton, David Henault, Steven Gallinger, Satheesh Krishna
Published in: Radiology (2024)
Background Structured radiology reports for pancreatic ductal adenocarcinoma (PDAC) improve surgical decision-making compared with free-text reports, but radiologist adoption is variable and resectability criteria are applied inconsistently.

Purpose To evaluate the performance of large language models (LLMs) in automatically creating PDAC synoptic reports from original reports and to explore their performance in categorizing tumor resectability.

Materials and Methods In this institutional review board-approved retrospective study, 180 consecutive PDAC staging CT reports on patients referred to the authors' European Society for Medical Oncology-designated cancer center from January to December 2018 were included. Two radiologists reviewed the reports to establish the reference standard for 14 key findings and the National Comprehensive Cancer Network (NCCN) resectability category. GPT-3.5 and GPT-4 (accessed September 18-29, 2023) were prompted to create synoptic reports containing the same 14 features from the original reports, and their performance was evaluated (recall, precision, F1 score). For resectability categorization, three prompting strategies (default knowledge, in-context knowledge, chain-of-thought) were used with both LLMs. Hepatopancreaticobiliary surgeons reviewed the original and artificial intelligence (AI)-generated reports to determine resectability, and their accuracy and review times were compared. The McNemar test, t test, Wilcoxon signed-rank test, and mixed-effects logistic regression models were used where appropriate.

Results GPT-4 outperformed GPT-3.5 in creating synoptic reports (F1 score, 0.997 vs 0.967) and achieved equal or higher F1 scores for all 14 extracted features. GPT-4 had higher precision than GPT-3.5 for extracting superior mesenteric artery involvement (100% vs 88.8%). For categorizing resectability, GPT-4 outperformed GPT-3.5 with each prompting strategy. For GPT-4, chain-of-thought prompting was most accurate, outperforming in-context knowledge prompting (92% vs 83%; P = .002), which in turn outperformed the default knowledge strategy (83% vs 67%; P < .001). Surgeons categorized resectability more accurately using AI-generated reports than original reports (83% vs 76%; P = .03) while spending 58% as much time per report (ratio, 0.58; 95% CI: 0.53, 0.62).

Conclusion GPT-4 created near-perfect PDAC synoptic reports from original reports, and GPT-4 with chain-of-thought prompting achieved high accuracy in categorizing resectability. Surgeons were more accurate and efficient when using the AI-generated reports.

© RSNA, 2024. Supplemental material is available for this article. See also the editorial by Chang in this issue.
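The abstract does not give the prompts, feature list, or interface the authors used. The following is a minimal sketch, assuming access to the OpenAI chat completions API, of how a free-text staging CT report might be restructured into a synoptic report and then categorized with a chain-of-thought prompt that supplies NCCN criteria in context. The feature names, prompt wording, and model identifiers are illustrative assumptions, not the study protocol.

```python
# Hypothetical sketch: synoptic report generation and chain-of-thought
# resectability categorization with an LLM. Prompts, feature names, and
# model identifiers are assumptions, not the authors' method.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Example subset of key findings a PDAC synoptic report might capture;
# the study extracted 14 such features (the full list is not in the abstract).
FEATURES = [
    "tumor location",
    "tumor size",
    "superior mesenteric artery involvement",
    "superior mesenteric vein involvement",
    "celiac axis involvement",
    "common hepatic artery involvement",
    "distant metastases",
]

NCCN_CRITERIA = """(In-context knowledge: paste the NCCN resectability
definitions for resectable, borderline resectable, and unresectable PDAC
here so the model reasons against explicit criteria.)"""


def synoptic_report(free_text_report: str, model: str = "gpt-4") -> str:
    """Ask the model to restructure a free-text report into a synoptic report."""
    prompt = (
        "Convert the following pancreatic cancer staging CT report into a "
        "synoptic report. Report each of these findings on its own line, "
        "using only information stated in the report:\n- "
        + "\n- ".join(FEATURES)
        + f"\n\nReport:\n{free_text_report}"
    )
    resp = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic output for extraction-style tasks
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content


def categorize_resectability(synoptic: str, model: str = "gpt-4") -> str:
    """Chain-of-thought style prompt: reason through each criterion, then answer."""
    prompt = (
        f"{NCCN_CRITERIA}\n\nSynoptic report:\n{synoptic}\n\n"
        "Think step by step: check vascular involvement and metastases against "
        "each criterion above, then state the final category as exactly one of "
        "'resectable', 'borderline resectable', or 'unresectable'."
    )
    resp = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```

In this sketch, the "default knowledge" strategy would omit NCCN_CRITERIA and the step-by-step instruction, and the "in-context knowledge" strategy would include the criteria but not the reasoning instruction, mirroring the three prompting conditions the abstract names.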
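The abstract reports recall, precision, and F1 score per extracted feature. Below is a small sketch of how such per-feature scores could be computed from reference-standard and model-extracted findings; the dictionary representation and exact-match criterion are assumptions for illustration, not the study's scoring rules.

```python
# Hypothetical per-feature evaluation: each report is a dict mapping
# feature name -> extracted value (None if the feature was not reported).
# A prediction counts as a true positive when it matches the reference exactly.
def per_feature_scores(reference: list[dict], predicted: list[dict], feature: str):
    tp = fp = fn = 0
    for ref, pred in zip(reference, predicted):
        ref_val, pred_val = ref.get(feature), pred.get(feature)
        if pred_val is not None and pred_val == ref_val:
            tp += 1
        elif pred_val is not None:          # extracted something incorrect
            fp += 1
        elif ref_val is not None:           # present in reference but missed
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```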