Comparison of ChatGPT vs. Bard to Anesthesia-related Queries
medRxiv - Anesthesia. Pub Date: 2023-06-30, DOI: 10.1101/2023.06.29.23292057
Sourav S. Patnaik, Ulrike Hoffmann

We investigated the ability of large language models (LLMs) to answer anesthesia-related queries prior to surgery from a patient's point of view. In the study, we introduced textual data evaluation metrics, investigated the "hallucination" phenomenon, and evaluated the feasibility of using LLMs at the patient-clinician interface. ChatGPT's responses were found to be lengthier, more intellectual, and more effective than Bard's. Upon clinical evaluation, no "hallucination" errors were reported for ChatGPT, whereas we observed a 30.3% error rate in responses from Bard. ChatGPT responses were difficult to read (college-level difficulty), while Bard responses were more conversational, at about an 8th-grade level according to readability calculations. The linguistic quality of ChatGPT was found to be 19.7% greater than that of Bard (66.16 ± 13.42 vs. 55.27 ± 11.76; p=0.0037) and was independent of response length. Computational sentiment analysis revealed that the polarity scores of Bard were significantly greater than those of ChatGPT (mean 0.16 vs. 0.11 on a scale of -1 (negative) to 1 (positive); p=0.0323) and can be classified as "positive", whereas subjectivity scores were similar across LLMs (mean 0.54 vs. 0.50 on a scale of 0 (objective) to 1 (subjective); p=0.3030). Even though the majority of the LLM responses were appropriate, at this stage these chatbots should be considered a versatile clinical resource to assist communication between clinicians and patients, not a replacement for the essential pre-anesthesia consultation. Further efforts are needed to incorporate health literacy, which will improve patient-clinician communication and ultimately, post-operative patient outcomes.
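The grade-level readability scores and the polarity/subjectivity scales described above match the outputs of common off-the-shelf NLP tools. Below is a minimal sketch of how such metrics could be computed, assuming textstat for readability and TextBlob for sentiment; the abstract does not name the libraries used, and the two sample responses are hypothetical stand-ins for real chatbot output.

from textblob import TextBlob   # pip install textblob
import textstat                 # pip install textstat

# Hypothetical example responses, not quotes from the study
responses = {
    "ChatGPT": ("General anesthesia is induced with intravenous agents and "
                "maintained throughout the procedure under continuous monitoring."),
    "Bard": ("You'll get medicine through an IV that makes you sleep, and "
             "the team will watch you the whole time."),
}

for model, text in responses.items():
    sentiment = TextBlob(text).sentiment
    print(f"{model}:")
    # Flesch-Kincaid grade approximates the U.S. school grade needed to read the text
    print(f"  grade level : {textstat.flesch_kincaid_grade(text):.1f}")
    # TextBlob polarity lies in [-1, 1] (negative to positive)
    print(f"  polarity    : {sentiment.polarity:.2f}")
    # TextBlob subjectivity lies in [0, 1] (objective to subjective)
    print(f"  subjectivity: {sentiment.subjectivity:.2f}")

With one such score per query for each model, the group differences reported above (linguistic quality, polarity, subjectivity) could then be compared with a standard two-sample test such as scipy.stats.ttest_ind; the abstract does not state which test produced its p-values.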

Updated: 2023-07-04