Dr. Google to Dr. ChatGPT: assessing the content and quality of artificial intelligence-generated medical information on appendicitis
Surgical Endoscopy (IF 3.1), Pub Date: 2024-03-05, DOI: 10.1007/s00464-024-10739-5
Yazid K. Ghanem, Armaun D. Rouhi, Ammr Al-Houssan, Zena Saleh, Matthew C. Moccia, Hansa Joshi, Kristoffel R. Dumon, Young Hong, Francis Spitz, Amit R. Joshi, Michael Kwiatt

Introduction

Generative artificial intelligence (AI) chatbots have recently been posited as potential sources of online medical information for patients making medical decisions. Existing online patient-oriented medical information has repeatedly been shown to be of variable quality and difficult readability. Therefore, we sought to evaluate the content and quality of AI-generated medical information on acute appendicitis.

Methods

A modified DISCERN assessment tool, comprising 16 distinct criteria each scored on a 5-point Likert scale (total score range 16–80), was used to assess AI-generated content. Readability was determined using the Flesch Reading Ease (FRE) and Flesch-Kincaid Grade Level (FKGL) scores. Four popular chatbots (ChatGPT-3.5, ChatGPT-4, Bard, and Claude-2) were prompted to generate medical information about appendicitis. Three investigators, blinded to the identity of the AI platforms, independently scored the generated texts.
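For readers unfamiliar with the readability metrics, the standard formulas are FRE = 206.835 − 1.015 × (words/sentence) − 84.6 × (syllables/word) and FKGL = 0.39 × (words/sentence) + 11.8 × (syllables/word) − 15.59. The abstract does not state which implementation the authors used; the minimal sketch below assumes the third-party Python package textstat, and the sample response text is hypothetical.

# Readability sketch (assumes the "textstat" package; not the authors' stated tooling)
import textstat

response = (
    "Appendicitis is an inflammation of the appendix that usually "
    "requires prompt surgical evaluation."
)  # hypothetical chatbot output

fre = textstat.flesch_reading_ease(response)    # 0-100 scale, higher = easier to read
fkgl = textstat.flesch_kincaid_grade(response)  # approximate US school grade level

print(f"FRE = {fre:.1f}, FKGL = {fkgl:.1f}")
# FKGL values of roughly 11-15, as reported in the Results, correspond to college-level text.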

Results

ChatGPT-3.5, ChatGPT-4, Bard, and Claude-2 had overall mean (SD) quality scores of 60.7 (1.2), 62.0 (1.0), 62.3 (1.2), and 51.3 (2.3), respectively, on a scale of 16–80. Inter-rater reliability was 0.81, 0.75, 0.81, and 0.72, respectively, indicating substantial agreement. Claude-2 demonstrated a significantly lower mean quality score than ChatGPT-4 (p = 0.001), ChatGPT-3.5 (p = 0.005), and Bard (p = 0.001). Bard was the only AI platform that listed verifiable sources, while Claude-2 provided fabricated sources. All chatbots except Claude-2 advised readers to consult a physician if experiencing symptoms. Regarding readability, the FKGL and FRE scores of ChatGPT-3.5, ChatGPT-4, Bard, and Claude-2 were 14.6 and 23.8, 11.9 and 33.9, 8.6 and 52.8, and 11.0 and 36.6, respectively, indicating difficult readability at a college reading level.
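The abstract reports inter-rater reliability without naming the statistic; the "substantial agreement" wording follows the conventional Landis and Koch benchmarks. A minimal sketch of one reasonable approach, an intraclass correlation across the three blinded raters computed with the third-party pingouin package, is shown below; the per-criterion scores are hypothetical, not the study data.

# Inter-rater reliability sketch (assumes an intraclass correlation; the paper
# does not state which reliability statistic was used). Scores are hypothetical.
import pandas as pd
import pingouin as pg

scores = pd.DataFrame({
    "item":  list(range(1, 17)) * 3,                  # 16 modified-DISCERN criteria
    "rater": ["A"] * 16 + ["B"] * 16 + ["C"] * 16,    # three blinded investigators
    "score": [4,4,3,5,4,3,4,4,5,3,4,4,3,4,4,4,        # rater A (1-5 Likert)
              4,3,3,5,4,4,4,4,5,3,4,3,3,4,4,4,        # rater B
              4,4,3,5,3,3,4,4,5,3,4,4,3,4,5,4],       # rater C
})

icc = pg.intraclass_corr(data=scores, targets="item", raters="rater", ratings="score")
print(icc[["Type", "ICC", "CI95%"]])   # the ICC2/ICC2k rows reflect agreement across the rater panel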

Conclusion

AI-generated medical information on appendicitis scored favorably on quality assessment, but most platforms either fabricated sources or provided none at all. Additionally, overall readability far exceeded levels recommended for the general public. Generative AI platforms demonstrate measured potential for patient education and engagement about appendicitis.




Updated: 2024-03-06