VQAMix: Conditional Triplet Mixup for Medical Visual Question Answering
IEEE Transactions on Medical Imaging (IF 8.9) Pub Date: 2022-06-21, DOI: 10.1109/tmi.2022.3185008
Haifan Gong 1, Guanqi Chen 1, Mingzhi Mao 1, Zhen Li 2, Guanbin Li 1

Medical visual question answering (VQA) aims to correctly answer a clinical question related to a given medical image. However, owing to the expensive manual annotation of medical data, the lack of labeled data limits the development of medical VQA. In this paper, we propose a simple yet effective data augmentation method, VQAMix, to mitigate the data limitation problem. Specifically, VQAMix generates more labeled training samples by linearly combining pairs of VQA samples, and it can be easily embedded into any visual-language model to boost performance. However, mixing two VQA samples constructs new connections between images and questions from different samples, which causes the answers for those newly fabricated image-question pairs to be missing or meaningless. To solve the missing-answer problem, we first develop the Learning with Missing Labels (LML) strategy, which simply excludes the missing answers. To alleviate the meaningless-answer issue, we design the Learning with Conditional-mixed Labels (LCL) strategy, which further utilizes the language-type prior to force the mixed pairs to have reasonable answers that belong to the same category. Experimental results on the VQA-RAD and PathVQA benchmarks show that our proposed method significantly improves the performance of the baseline by about 7% and 5%, respectively, averaged over two backbones. More importantly, VQAMix improves confidence calibration and model interpretability, which is significant for medical VQA models in practical applications. All code and models are available at https://github.com/haifangong/VQAMix.
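A minimal sketch of the triplet-mixing idea in PyTorch (the function name, tensor shapes, and label handling below are assumptions made for illustration, not the official implementation from the repository):

```python
# Illustrative VQAMix-style triplet mixup (not the authors' official code).
# Assumed shapes: images [B, C, H, W], question embeddings [B, T, D],
# answers as multi-hot label vectors [B, num_answers].
import torch


def vqamix_batch(images, questions, answers, alpha=1.0):
    """Mix a batch of VQA samples with a randomly permuted copy of itself.

    The mixed label keeps only the two source answers, weighted by the
    mixing coefficient; all other entries stay zero, so the fabricated
    cross-sample image-question pairs receive no (missing) supervision,
    in the spirit of the Learning with Missing Labels strategy.
    """
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(images.size(0))

    mixed_images = lam * images + (1.0 - lam) * images[perm]
    mixed_questions = lam * questions + (1.0 - lam) * questions[perm]
    mixed_answers = lam * answers + (1.0 - lam) * answers[perm]
    return mixed_images, mixed_questions, mixed_answers
```

Under the LCL strategy described in the abstract, the pairing would additionally be restricted to samples whose questions share the same answer category, so the mixed label remains semantically reasonable; the sketch above only shows the basic mixing and missing-label masking.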

Last updated: 2024-08-26