Cross-Modal Self-Attention with Multi-Task Pre-Training for Medical Visual Question Answering
arXiv - CS - Multimedia. Pub Date: 2021-05-01. DOI: arxiv-2105.00136
Haifan Gong, Guanqi Chen, Sishuo Liu, Yizhou Yu, Guanbin Li

Due to the severe lack of labeled data, existing methods for medical visual question answering usually rely on transfer learning to obtain effective image feature representations and use cross-modal fusion of visual and linguistic features to predict question-related answers. These two phases are performed independently, without considering the compatibility and applicability of the pre-trained features for cross-modal fusion. Thus, we reformulate image feature pre-training as a multi-task learning paradigm, forcing the features to account for their applicability to the specific image comprehension task, and observe its marked superiority. Furthermore, we introduce a cross-modal self-attention (CMSA) module to selectively capture long-range contextual relevance for more effective fusion of visual and linguistic features. Experimental results demonstrate that the proposed method outperforms existing state-of-the-art methods. Our code and models are available at https://github.com/haifangong/CMSA-MTPT-4-MedicalVQA.
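
The abstract describes the CMSA module only at a high level. The following is a minimal PyTorch sketch of one plausible realisation, assuming flattened CNN grid features for the image and word-level embeddings for the question; the feature dimensions, the single attention head, and the residual connection are illustrative assumptions, not the authors' exact configuration.

import torch
import torch.nn as nn

class CrossModalSelfAttention(nn.Module):
    """Joint self-attention over concatenated visual and linguistic tokens."""
    def __init__(self, img_dim=2048, txt_dim=1024, hidden_dim=512):
        super().__init__()
        # Project both modalities into a shared space before fusion.
        self.img_proj = nn.Linear(img_dim, hidden_dim)
        self.txt_proj = nn.Linear(txt_dim, hidden_dim)
        self.query = nn.Linear(hidden_dim, hidden_dim)
        self.key = nn.Linear(hidden_dim, hidden_dim)
        self.value = nn.Linear(hidden_dim, hidden_dim)
        self.scale = hidden_dim ** -0.5

    def forward(self, img_feats, txt_feats):
        # img_feats: (B, N_img, img_dim), a flattened spatial grid of CNN features
        # txt_feats: (B, N_txt, txt_dim), word-level question features
        joint = torch.cat([self.img_proj(img_feats),
                           self.txt_proj(txt_feats)], dim=1)  # (B, N_img+N_txt, H)
        q, k, v = self.query(joint), self.key(joint), self.value(joint)
        # Every image region and every word attends to every other position,
        # capturing long-range intra- and cross-modal dependencies in one pass.
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v + joint  # residual connection preserves the inputs

# Example usage with hypothetical shapes: a 7x7 CNN grid and a 12-token question.
block = CrossModalSelfAttention()
img = torch.randn(2, 49, 2048)
txt = torch.randn(2, 12, 1024)
fused = block(img, txt)  # (2, 61, 512) fused cross-modal representation

Concatenating the two token sequences before attention is what distinguishes this from attending within each modality separately: relevance between an image region and a question word is computed in the same softmax as intra-modal relevance.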

Updated: 2021-05-04