Textual-Visual Reference-Aware Attention Network for Visual Dialog
IEEE Transactions on Image Processing (IF 10.6), Pub Date: 2020-05-18, DOI: 10.1109/tip.2020.2992888
Dan Guo, Hui Wang, Shuhui Wang, Meng Wang

Visual dialog is a challenging task in multimedia understanding, which requires the dialog agent to answer a series of questions based on an input image. The critical issue in producing an exact answer is how to model the mutual semantic interaction among the feature representations of the image, the question-answer history, and the current question. In this study, we propose a textual-visual Reference-Aware Attention Network (RAA-Net), which aims to effectively fuse $Q$ (question), $H$ (history), $V_{l}$ (local vision), and $V_{g}$ (global vision) to infer the exact answer. In the multimodal feature flows, RAA-Net first learns the textual context through multi-head attention between $Q$ and $H$, and then guides the textual reference semantics to the image to capture visual reference semantics by self- and cross-reference-aware attention within and between $V_{l}$ and $V_{g}$. In the proposed RAA-Net, we exploit a two-stage (intra- and inter-) visual reasoning mechanism on $V_{l}$ and $V_{g}$. Extensive experiments on the VisDial v0.9 and v1.0 datasets show that RAA-Net achieves state-of-the-art performance. Visualization results on both visual and textual attention maps further validate the strong interpretability of our solution.
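
To make the described attention flow concrete, below is a minimal, hypothetical PyTorch sketch of the pipeline outlined in the abstract: the question attends over the history to form a textual reference, which then drives intra- (self) and inter- (cross) attention over local and global visual features. This is not the authors' released code; the module names, hidden size, pooling, and fusion step are illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of the RAA-Net-style attention flow.
# All names (RAANetSketch, fuse, etc.) and dimensions are hypothetical.
import torch
import torch.nn as nn


class RAANetSketch(nn.Module):
    """Q attends over H to form a textual reference, which then guides
    intra- and inter-attention over local (V_l) and global (V_g) vision."""

    def __init__(self, d: int = 512, heads: int = 8):
        super().__init__()
        self.q_over_h = nn.MultiheadAttention(d, heads, batch_first=True)  # textual context
        self.intra_l = nn.MultiheadAttention(d, heads, batch_first=True)   # intra attention on V_l
        self.intra_g = nn.MultiheadAttention(d, heads, batch_first=True)   # intra attention on V_g
        self.cross_lg = nn.MultiheadAttention(d, heads, batch_first=True)  # inter attention V_l -> V_g
        self.fuse = nn.Linear(3 * d, d)                                    # fuse textual + visual cues

    def forward(self, Q, H, V_l, V_g):
        # 1) Textual reference: the current question attends over the dialog history.
        t_ref, _ = self.q_over_h(Q, H, H)            # (B, Lq, d)
        t_vec = t_ref.mean(dim=1, keepdim=True)      # pooled textual reference (B, 1, d)

        # 2) Intra-visual reasoning, using the textual reference as the query.
        v_l, _ = self.intra_l(t_vec, V_l, V_l)       # (B, 1, d)
        v_g, _ = self.intra_g(t_vec, V_g, V_g)       # (B, 1, d)

        # 3) Inter-visual reasoning: local cues reference the global view.
        v_lg, _ = self.cross_lg(v_l, V_g, V_g)       # (B, 1, d)

        # 4) Fuse textual and visual reference semantics for answer decoding.
        fused = self.fuse(torch.cat([t_vec, v_l + v_g, v_lg], dim=-1))
        return fused.squeeze(1)                      # (B, d)


if __name__ == "__main__":
    B, d = 2, 512
    Q = torch.randn(B, 16, d)    # question tokens
    H = torch.randn(B, 64, d)    # history tokens
    V_l = torch.randn(B, 36, d)  # local (region) features
    V_g = torch.randn(B, 49, d)  # global (grid) features
    out = RAANetSketch(d)(Q, H, V_l, V_g)
    print(out.shape)             # torch.Size([2, 512])
```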

Updated: 2020-07-03