Learning Dual Encoding Model for Adaptive Visual Understanding in Visual Dialogue
IEEE Transactions on Image Processing (IF 10.8), Pub Date: 2020-11-03, DOI: 10.1109/tip.2020.3034494
Jing Yu, Xiaoze Jiang, Zengchang Qin, Weifeng Zhang, Yue Hu, Qi Wu

Unlike the Visual Question Answering task, which requires answering only a single question about an image, the Visual Dialogue task involves multiple rounds of dialogue covering a broad range of visual content that may relate to any objects, relationships, or high-level semantics. One of the key challenges in Visual Dialogue is therefore to learn a more comprehensive, semantically rich image representation that can adaptively attend to the visual content referred to by varying questions. In this paper, we first propose a novel scheme to depict an image from both visual and semantic views. Specifically, the visual view captures appearance-level information in an image, including objects and their visual relationships, while the semantic view enables the agent to understand high-level visual semantics ranging from the whole image to local regions. Building on these dual-view image representations, we propose a Dual Encoding Visual Dialogue (DualVD) module that adaptively selects question-relevant information from the visual and semantic views in a hierarchical manner. To demonstrate the effectiveness of DualVD, we propose two novel visual dialogue models by applying it to the Late Fusion framework and the Memory Network framework. The proposed models achieve state-of-the-art results on three benchmark datasets. A key advantage of the DualVD module is its interpretability: by explicitly visualizing the gate values, we can analyze which modality (visual or semantic) contributes more to answering the current question. This offers insight into the information-selection mechanism of the Visual Dialogue task. The code is available at https://github.com/JXZe/Learning_DualVD .
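To make the gating idea in the abstract concrete, the sketch below shows one plausible way a question could adaptively weight a visual-view feature against a semantic-view feature, with the gate values exposed for inspection. This is a minimal illustration under assumed names and dimensions (GatedDualViewFusion, feat_dim, ques_dim are hypothetical), not the authors' implementation; the actual DualVD module is in the linked repository.

import torch
import torch.nn as nn

class GatedDualViewFusion(nn.Module):
    """Question-guided gated fusion of a visual-view feature and a
    semantic-view feature (illustrative sketch, not the official DualVD code)."""

    def __init__(self, feat_dim: int, ques_dim: int):
        super().__init__()
        # One scalar gate per view, conditioned on the question and both views.
        self.gate = nn.Sequential(
            nn.Linear(2 * feat_dim + ques_dim, 2),
            nn.Sigmoid(),
        )

    def forward(self, visual_feat, semantic_feat, question_feat):
        # visual_feat, semantic_feat: (batch, feat_dim); question_feat: (batch, ques_dim)
        gates = self.gate(torch.cat([visual_feat, semantic_feat, question_feat], dim=-1))
        g_vis, g_sem = gates[:, 0:1], gates[:, 1:2]
        fused = g_vis * visual_feat + g_sem * semantic_feat
        # Returning the gate values lets one inspect how much each view
        # contributed to the answer, mirroring the interpretability analysis
        # described in the abstract.
        return fused, (g_vis, g_sem)

if __name__ == "__main__":
    fusion = GatedDualViewFusion(feat_dim=512, ques_dim=512)
    v = torch.randn(4, 512)   # appearance-level (visual-view) image feature
    s = torch.randn(4, 512)   # high-level (semantic-view) image feature
    q = torch.randn(4, 512)   # encoded question for the current dialogue round
    fused, (g_vis, g_sem) = fusion(v, s, q)
    print(fused.shape, g_vis.mean().item(), g_sem.mean().item())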

Updated: 2020-11-21