Multi-Modal fusion with multi-level attention for Visual Dialog
Information Processing & Management ( IF 8.6 ) Pub Date : 2019-11-11 , DOI: 10.1016/j.ipm.2019.102152
Jingping Zhang , Qiang Wang , Yahong Han

Given an input image, the Visual Dialog task requires answering a sequence of questions posed in the form of a dialog. To generate accurate answers, all available information must be considered: the dialog history, the current question, and the image. However, existing methods typically use only the high-level semantic representation of each whole sentence in the dialog history and the question, ignoring the low-level detailed information carried by individual words. Similarly, low-level region details of the image also need to be considered for question answering. We therefore propose a novel visual dialog method that exploits both high-level and low-level information from the dialog history, the question, and the image. In our approach, we introduce three low-level attention modules that enhance the representation of words in the dialog history and the question via word-to-word connections, and enrich the region features of the image via region-to-region relations. In addition, we design three high-level attention modules that select important words in the dialog history and the question, supplementing the detailed information for semantic understanding, and select relevant image regions to provide targeted visual information for question answering. We evaluate the proposed approach on two datasets, VisDial v0.9 and VisDial v1.0. The experimental results demonstrate that exploiting both low-level and high-level information enhances the representation of the inputs.
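The two attention levels described above can be sketched in a few lines. The snippet below is an illustrative approximation, not the paper's exact formulation: the scaled dot-product scoring, function names, and feature dimensions are all assumptions. The low-level module re-expresses each word (or region) as an affinity-weighted mixture of its peers; the high-level module scores elements against a summary query vector and pools the most relevant ones.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def low_level_self_attention(feats):
    """Element-to-element attention (word-to-word or region-to-region).

    Each of the n elements is replaced by an affinity-weighted sum of
    all elements, enriching it with pairwise relation information.
    feats: (n, d) array of word or region features.
    """
    d = feats.shape[-1]
    affinity = feats @ feats.T / np.sqrt(d)   # (n, n) pairwise scores
    weights = softmax(affinity, axis=-1)      # each row sums to 1
    return weights @ feats                    # enhanced features, (n, d)

def high_level_guided_attention(query, feats):
    """Query-guided selection attention.

    Scores each word/region against a summary query vector (e.g. the
    question embedding) and pools the relevant elements into a single
    context vector for answer generation.
    query: (d,) summary vector; feats: (n, d) element features.
    """
    d = feats.shape[-1]
    scores = feats @ query / np.sqrt(d)       # (n,) relevance scores
    weights = softmax(scores)
    return weights @ feats                    # attended summary, (d,)

# Hypothetical usage: 36 image regions attended by a question vector.
rng = np.random.default_rng(0)
regions = rng.normal(size=(36, 64))
question = rng.normal(size=64)
enhanced = low_level_self_attention(regions)          # (36, 64)
context = high_level_guided_attention(question, enhanced)  # (64,)
```

In the full model, analogous low-level/high-level pairs would operate on the dialog-history sentences and the question as well, with the resulting representations fused before answer decoding.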



Updated: 2020-04-21