Dual self-attention with co-attention networks for visual question answering
Pattern Recognition ( IF 7.5 ) Pub Date : 2021-04-09 , DOI: 10.1016/j.patcog.2021.107956
Yun Liu , Xiaoming Zhang , Qianyun Zhang , Chaozhuo Li , Feiran Huang , Xianghong Tang , Zhoujun Li

Visual Question Answering (VQA), an important task at the intersection of vision and language understanding, has attracted wide interest. Previous VQA methods generally use Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) to extract visual and textual features respectively, and then explore the correlation between the two to infer the answer. However, CNNs mainly focus on extracting local spatial information, while RNNs concentrate on exploiting sequential structure and long-range dependencies; neither easily integrates local features with their global dependencies to learn more effective representations of the image and question. To address this problem, we propose a novel model, Dual Self-Attention with Co-Attention networks (DSACA), for VQA. It models the internal dependencies of the spatial and sequential structures separately using a newly proposed self-attention mechanism. Specifically, DSACA contains three main submodules. The visual self-attention module selectively aggregates the visual features at each region as a weighted sum of the features at all positions. The textual self-attention module automatically emphasizes interdependent word features by integrating associated features among the words of a sentence. In addition, the visual-textual co-attention module explores the close correlation between the visual and textual features learned by the self-attention modules. The three modules are integrated into an end-to-end framework to infer the answer. Extensive experiments on three widely used VQA datasets confirm the favorable performance of DSACA compared with state-of-the-art methods.
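The "weighted sum of the features at all positions" that the abstract describes is the standard self-attention aggregation. The following is a minimal illustrative sketch of that idea over a grid of image-region features, not the paper's actual implementation; the projection matrices, region count, and feature dimension are all assumed for the example.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Aggregate the feature at each position as a weighted sum of the
    features at all positions (illustrative sketch, not the paper's code)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v          # project to queries/keys/values
    scores = q @ k.T / np.sqrt(k.shape[-1])      # pairwise affinities between positions
    # softmax over all positions -> attention weights per region
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                           # weighted sum over all positions

rng = np.random.default_rng(0)
regions = rng.normal(size=(49, 64))              # e.g. a 7x7 grid of 64-d region features (assumed sizes)
w_q, w_k, w_v = (rng.normal(size=(64, 64)) * 0.1 for _ in range(3))
out = self_attention(regions, w_q, w_k, w_v)
print(out.shape)  # (49, 64): each region now mixes information from every position
```

The textual self-attention module would apply the same operation to the word features of the question, so that each word representation absorbs context from the whole sentence.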




Updated: 2021-04-16