Learning to Recognize Visual Concepts for Visual Question Answering with Structural Label Space
IEEE Journal of Selected Topics in Signal Processing (IF 7.5), Pub Date: 2020-03-01, DOI: 10.1109/jstsp.2020.2989701
Difei Gao, Ruiping Wang, Shiguang Shan, Xilin Chen

Solving the visual question answering (VQA) task requires recognizing many diverse visual concepts as answers. These visual concepts carry rich structural semantic meaning: some concepts in VQA are highly related (e.g., red and blue), while others are less relevant (e.g., red and standing). Humans naturally exploit these semantic relations to learn concepts efficiently, concentrating on distinguishing relevant concepts and ignoring the interference of irrelevant ones. However, previous works usually use a simple MLP to output the visual concept as the answer in a flat label space that treats all labels equally, which limits their ability to represent and use the semantic meanings of labels. To address this issue, we propose a novel visual recognition module named Dynamic Concept Recognizer (DCR), which can easily be plugged into an attention-based VQA model to exploit label semantics during answer prediction. Concretely, we introduce two key features in DCR: 1) a novel structural label space that depicts the semantic differences between concepts, where labels are assigned to different groups according to their meanings. This semantic information helps decompose the visual recognizer in VQA into multiple specialized sub-recognizers, improving the recognizer's capacity and efficiency. 2) A feature attention mechanism that captures the similarity between relevant groups of concepts; e.g., the human-related group "chef, waiter" is more related to the activity group "swimming, running, etc." than to the scene-related group "sunny, rainy, etc." This semantic information lets the sub-recognizers of related groups adaptively share parts of their modules and exchange knowledge, facilitating the learning procedure.
Extensive experiments on several datasets show that the proposed structural label space and DCR module can efficiently learn visual concept recognition and improve the performance of the VQA model.
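The core idea of the structural label space — partitioning a flat answer vocabulary into semantic groups, each handled by its own sub-recognizer — can be sketched as below. This is an illustrative toy, not the paper's architecture: the label grouping, the linear per-group scorers, and all function names are hypothetical assumptions for demonstration.

```python
import math

# Hypothetical flat answer vocabulary partitioned into semantic groups
# (the grouping here is illustrative, not the paper's actual partition).
LABEL_GROUPS = {
    "color":    ["red", "blue", "green"],
    "activity": ["standing", "swimming", "running"],
    "scene":    ["sunny", "rainy"],
}

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def predict(fused_feature, group_weights):
    """Route a fused question-image feature to per-group sub-recognizers.

    fused_feature: list of floats (toy joint embedding)
    group_weights: {group: {label: weight vector}} -- one tiny linear
                   sub-recognizer per semantic group
    Returns the globally best label and per-label probabilities
    (normalized within each group, mirroring the specialized
    sub-recognizer idea).
    """
    all_probs = {}
    for group, label_w in group_weights.items():
        labels = list(label_w)
        scores = [sum(f * w for f, w in zip(fused_feature, label_w[l]))
                  for l in labels]
        all_probs.update(dict(zip(labels, softmax(scores))))
    best = max(all_probs, key=all_probs.get)
    return best, all_probs
```

Because each group has its own scorer, a color question only competes among color labels, which is the capacity/efficiency benefit the abstract attributes to decomposing the recognizer.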

Updated: 2020-03-01