Evaluating diagnostic content of AI-generated radiology reports of chest X-rays
Artificial Intelligence in Medicine ( IF 7.5 ) Pub Date : 2021-04-15 , DOI: 10.1016/j.artmed.2021.102075
Zaheer Babar 1 , Twan van Laarhoven 1 , Fabio Massimo Zanzotto 2 , Elena Marchiori 1

Radiology reports are of core importance for the communication between the radiologist and the clinician. A computer-aided radiology report system can assist radiologists in this task and reduce variation between reports, thus facilitating communication with the medical doctor or clinician. Producing a well-structured, clear, and clinically well-focused radiology report is essential for high-quality patient diagnosis and care. Despite recent advances in deep learning for image caption generation, this task remains highly challenging in a medical setting. Research has mainly focused on the design of tailored machine learning methods for this task, while little attention has been devoted to the development of evaluation metrics that assess the quality of AI-generated documents. Conventional quality metrics for natural language processing methods, such as the popular BLEU score, provide little information about the quality of the diagnostic content of AI-generated radiology reports. In particular, because radiology reports often use standardized sentences, BLEU scores of generated reports can be high even when the reports lack diagnostically important information. We investigate this problem and propose a new measure that quantifies the diagnostic content of AI-generated radiology reports. In addition, we exploit the standardization of reports by generating a sequence of sentences. That is, instead of using a dictionary of words, as current image captioning methods do, we use a dictionary of sentences. The assumption underlying this choice is that radiologists use a well-focused vocabulary of ‘standard’ sentences, which should suffice for composing most reports. As a by-product, a significant training speed-up is achieved compared to models trained on a dictionary of words. Overall, the results of our investigation indicate that standard validation metrics for AI-generated documents are weakly correlated with the diagnostic content of the reports.
Therefore, these measures should not be used as the only validation metrics; measures that evaluate diagnostic content should be preferred in such a medical context.
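The abstract's central point — that BLEU can score a report highly even when it omits the key finding — can be illustrated with a minimal sketch. The snippet below implements a simplified BLEU (clipped unigram/bigram precision with a brevity penalty, no smoothing) in pure Python; the example reports are invented for illustration and are not from the paper's data. A candidate report that reuses the standardized sentences but drops the pneumothorax finding still scores well above 0.6:

```python
from collections import Counter
import math

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=2):
    """Simplified BLEU: geometric mean of clipped n-gram precisions
    (n = 1..max_n) times a brevity penalty. No smoothing."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        c_counts = Counter(ngrams(cand, n))
        r_counts = Counter(ngrams(ref, n))
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, r_counts[g]) for g, c in c_counts.items())
        total = max(sum(c_counts.values()), 1)
        log_precisions.append(math.log(max(clipped, 1e-9) / total))
    # Brevity penalty discourages overly short candidates.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(log_precisions) / max_n)

# Hypothetical ground-truth report: mostly standardized sentences
# plus one diagnostically critical finding (the pneumothorax).
reference = ("the lungs are clear bilaterally no focal consolidation "
             "heart size is normal small right apical pneumothorax is present")

# Generated report: copies the standard sentences, misses the finding.
candidate = ("the lungs are clear bilaterally no focal consolidation "
             "heart size is normal no acute abnormality")

score = bleu(candidate, reference)
print(round(score, 3))  # high overlap score despite the missing finding
```

Because the boilerplate sentences dominate the n-gram overlap, the score stays high even though the single clinically decisive token (`pneumothorax`) never appears in the candidate — exactly the failure mode that motivates a diagnostic-content measure.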




Updated: 2021-04-26