Recurrent Attention Network with Reinforced Generator for Visual Dialog
ACM Transactions on Multimedia Computing, Communications, and Applications (IF 5.1), Pub Date: 2020-07-06, DOI: 10.1145/3390891
Hehe Fan, Linchao Zhu, Yi Yang, Fei Wu

In Visual Dialog, an agent has to parse temporal context in the dialog history and spatial context in the image to hold a meaningful dialog with humans. For example, to answer “what is the man on her left wearing?” the agent needs to (1) analyze the temporal context in the dialog history to infer who is being referred to as “her,” (2) parse the image to attend to “her,” and (3) uncover the spatial context to shift the attention to “her left” and check the apparel of the man. In this article, we use a dialog network to memorize the temporal context and an attention processor to parse the spatial context. Because the question and the image are usually very complex, the question is difficult to ground with a single glimpse, so the attention processor attends to the image multiple times to better collect visual information. In the Visual Dialog task, the generative decoder (G) is trained under the word-by-word paradigm and therefore lacks sentence-level training. To ameliorate this problem, we propose to reinforce G at the sentence level using the discriminative model (D), which aims to select the right answer from a few candidates. Experimental results on the VisDial dataset demonstrate the effectiveness of our approach.
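
The two ideas in the abstract, attending to the image several times and rewarding the generator for whole sentences, can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration rather than the authors' implementation: the abstract does not give the scoring function, the state update, or the reward definition, so the dot-product scoring, the `tanh` update, the function names, and the scalar `reward` (standing in for a score that D might assign to a sampled answer) are all hypothetical.

```python
import numpy as np


def softmax(x):
    # Numerically stable softmax over the last axis.
    z = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def multi_glimpse_attention(regions, query, num_glimpses=3):
    """Attend to image regions several times, refining the query each time.

    regions: (R, d) array of image region features (hypothetical layout).
    query:   (d,) encoding of the question and dialog history.
    Returns the final query state and the attention weights of each glimpse.
    """
    state = query.copy()
    weights_per_glimpse = []
    for _ in range(num_glimpses):
        # Scaled dot-product scores between the current state and each region.
        scores = regions @ state / np.sqrt(regions.shape[1])
        weights = softmax(scores)
        # Weighted sum of region features: what this glimpse "saw".
        glimpse = weights @ regions
        # Assumed update rule: fold the glimpse back into the state.
        state = np.tanh(state + glimpse)
        weights_per_glimpse.append(weights)
    return state, weights_per_glimpse


def sentence_level_loss(token_log_probs, reward, baseline=0.0):
    """REINFORCE-style surrogate loss for one sampled answer.

    token_log_probs: log-probabilities G assigned to each sampled token.
    reward: scalar sentence-level score (assumed here to come from D).
    """
    return -(reward - baseline) * float(np.sum(token_log_probs))


rng = np.random.default_rng(0)
regions = rng.normal(size=(5, 4))   # 5 regions, 4-dimensional features
query = rng.normal(size=4)
state, attn = multi_glimpse_attention(regions, query, num_glimpses=3)
loss = sentence_level_loss(np.log([0.5, 0.5]), reward=1.0)
```

Each glimpse can place its weight on different regions because the query state changes between iterations, which is the intuition behind shifting attention from “her” to “her left.” The loss scales the whole sentence's log-probability by a single reward, in contrast to word-by-word training, where each token is supervised separately.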

Updated: 2020-07-06
