Visual Question Answering via Combining Inferential Attention and Semantic Space Mapping
Knowledge-Based Systems (IF 7.2) Pub Date: 2020-08-08, DOI: 10.1016/j.knosys.2020.106339
Yun Liu, Xiaoming Zhang, Feiran Huang, Zhibo Zhou, Zhonghua Zhao, Zhoujun Li

Visual Question Answering (VQA) has emerged and attracted widespread interest in recent years. Its purpose is to explore the close correlations between the image and the question for answer inference. We make two observations about the VQA task: (1) the number of newly defined answers is ever-growing, which means that predicting answers only over a pre-defined set of labeled answers may lead to errors, as an unlabeled answer may be the correct choice for a question–image pair; (2) in the process of answering visual questions, the gradual shift of human attention plays an important guiding role in exploring the correlations between images and questions. Based on these observations, we propose a novel model for VQA that combines Inferential Attention and Semantic Space Mapping (IASSM). Specifically, our model has two salient aspects: (1) a semantic space shared by both labeled and unlabeled answers is constructed to learn new answers, where the joint embedding of a question and the corresponding image is mapped and clustered around the answer exemplar; (2) a novel inferential attention model is designed to simulate the learning process of human attention and explore the correlations between the image and the question, focusing on the question words and image regions most relevant to answering. Both the inferential attention and the semantic space mapping modules are integrated into an end-to-end framework to infer the answer. Experiments performed on two public VQA datasets and our newly constructed dataset show the superiority of IASSM over existing methods.
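The two modules described in the abstract can be illustrated with a minimal sketch: attention weights pool question-word and image-region features into a joint embedding, which is then matched against per-answer exemplars in a shared semantic space, so a new answer only requires a new exemplar rather than a new classifier output. All names, dimensions, and the additive fusion below are illustrative assumptions, not the paper's actual architecture or learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 8          # shared embedding dimension (assumed)
N_WORDS = 5    # question words
N_REGIONS = 4  # image regions
ANSWERS = ["red", "blue", "two"]  # toy answer vocabulary

# Stand-ins for learned features (random here, learned in the real model)
word_feats = rng.normal(size=(N_WORDS, D))
region_feats = rng.normal(size=(N_REGIONS, D))
answer_exemplars = rng.normal(size=(len(ANSWERS), D))  # one exemplar per answer

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(feats, query):
    """Score each feature against a query, then pool by the weights."""
    weights = softmax(feats @ query)
    return weights @ feats, weights

# (1) Attention: a query vector guides pooling over words, and the
# attended question vector in turn guides attention over image regions.
query = rng.normal(size=D)                   # stands in for a learned guidance signal
q_vec, q_attn = attend(word_feats, query)    # question-word attention
v_vec, v_attn = attend(region_feats, q_vec)  # question-guided visual attention

# (2) Semantic space mapping: fuse the attended embeddings and pick the
# nearest answer exemplar by cosine similarity.
joint = q_vec + v_vec  # simple additive fusion, assumed for illustration

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

sims = np.array([cosine(joint, ex) for ex in answer_exemplars])
print("predicted answer:", ANSWERS[int(sims.argmax())])
```

Because prediction is a nearest-exemplar lookup rather than a fixed softmax over classes, adding a previously unseen answer amounts to appending one row to `answer_exemplars`.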




Updated: 2020-08-21