BI-RADS Category Assignments by GPT-3.5, GPT-4, and Google Bard: A Multilanguage Study
Radiology (IF 19.7) Pub Date: 2024-04-30, DOI: 10.1148/radiol.232133
Andrea Cozzi, Katja Pinker, Andri Hidber, Tianyu Zhang, Luca Bonomo, Roberto Lo Gullo, Blake Christianson, Marco Curti, Stefania Rizzo, Filippo Del Grande, Ritse M. Mann, Simone Schiaffino, Ariane Panzer

Background

The performance of publicly available large language models (LLMs) remains unclear for complex clinical tasks.

Purpose

To evaluate the agreement between human readers and LLMs for Breast Imaging Reporting and Data System (BI-RADS) categories assigned based on breast imaging reports written in three languages and to assess the impact of discordant category assignments on clinical management.

Materials and Methods

This retrospective study included reports for women who underwent MRI, mammography, and/or US for breast cancer screening or diagnostic purposes at three referral centers. Reports with findings categorized as BI-RADS 1–5 and written in Italian, English, or Dutch were collected between January 2000 and October 2023. Board-certified breast radiologists and the LLMs GPT-3.5 and GPT-4 (OpenAI) and Bard, now called Gemini (Google), assigned BI-RADS categories using only the findings described by the original radiologists. Agreement between human readers and LLMs for BI-RADS categories was assessed using the Gwet agreement coefficient (AC1 value). Frequencies were calculated for changes in BI-RADS category assignments that would affect clinical management (ie, BI-RADS 0 vs BI-RADS 1 or 2 vs BI-RADS 3 vs BI-RADS 4 or 5) and compared using the McNemar test.
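As background on the statistics named above: the sketch below illustrates how an unweighted Gwet AC1 and a paired McNemar comparison could be computed for two raters scoring the same set of reports. The function names, data layout, and simulated data are assumptions made for illustration only, not the authors' analysis code. Note that AC1 is often preferred over Cohen's kappa when category prevalences are unbalanced, as kappa can be paradoxically low even when observed agreement is high.

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

def gwet_ac1(ratings_a, ratings_b, categories):
    """Unweighted Gwet AC1 for two raters scoring the same N subjects.

    AC1 = (p_a - p_e) / (1 - p_e), where p_a is the observed proportion
    of agreement and p_e = sum_q pi_q * (1 - pi_q) / (Q - 1), with pi_q
    the pooled proportion of ratings falling in category q.
    """
    a, b = np.asarray(ratings_a), np.asarray(ratings_b)
    n, q = len(a), len(categories)
    p_a = np.mean(a == b)                      # observed agreement
    pi = np.array([(np.sum(a == c) + np.sum(b == c)) / (2 * n)
                   for c in categories])       # pooled category prevalence
    p_e = np.sum(pi * (1 - pi)) / (q - 1)      # chance agreement (Gwet model)
    return (p_a - p_e) / (1 - p_e)

def mcnemar_discordance(flags_x, flags_y):
    """Paired McNemar test on two binary discordance indicators
    (1 = a BI-RADS change large enough to alter clinical management)."""
    x, y = np.asarray(flags_x, bool), np.asarray(flags_y, bool)
    table = [[np.sum(x & y), np.sum(x & ~y)],
             [np.sum(~x & y), np.sum(~x & ~y)]]
    return mcnemar(table, exact=False, correction=True)

# Hypothetical usage: BI-RADS categories 0-5 as integer labels.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    original = rng.integers(0, 6, size=2400)   # original radiologists
    model = np.where(rng.random(2400) < 0.6,   # a reader agreeing ~60% of the time
                     original, rng.integers(0, 6, size=2400))
    print(f"AC1 = {gwet_ac1(original, model, range(6)):.2f}")
```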

Results

Across 2400 reports, agreement between the original and reviewing radiologists was almost perfect (AC1 = 0.91), while agreement between the original radiologists and GPT-4, GPT-3.5, and Bard was moderate (AC1 = 0.52, 0.48, and 0.42, respectively). Across human readers and LLMs, differences were observed in the frequency of BI-RADS category upgrades or downgrades that would result in changed clinical management (118 of 2400 [4.9%] for human readers, 611 of 2400 [25.5%] for Bard, 573 of 2400 [23.9%] for GPT-3.5, and 435 of 2400 [18.1%] for GPT-4; P < .001) and that would negatively impact clinical management (37 of 2400 [1.5%] for human readers, 435 of 2400 [18.1%] for Bard, 344 of 2400 [14.3%] for GPT-3.5, and 255 of 2400 [10.6%] for GPT-4; P < .001).

Conclusion

LLMs achieved moderate agreement with human reader–assigned BI-RADS categories across reports written in three languages but also yielded a high percentage of discordant BI-RADS categories that would negatively impact clinical management.

© RSNA, 2024

Supplemental material is available for this article.




Updated: 2024-05-03