GuessWhich? Visual dialog with attentive memory network, Pattern Recognition


Pattern Recognition (IF 8) Pub Date: 2021-01-14, DOI: 10.1016/j.patcog.2021.107823
Lei Zhao, Xinyu Lyu, Jingkuan Song, Lianli Gao

Visual dialog is a task in which two agents, Question-BOT (Q-BOT) and Answer-BOT (A-BOT), communicate in natural language under information asymmetry. Q-BOT generates questions based on an image caption and the dialog history, and A-BOT answers the questions grounded in the image. The two agents also play a cooperative ‘image guessing’ game, in which Q-BOT must select an unseen image from a set of images. However, because the useful information in the image caption and the dialog history fades as the interaction proceeds, existing methods tend to generate irrelevant and homogeneous questions, which contribute nothing to the visual dialog system. To tackle this issue, we propose an Attentive Memory Network (AMN) that fully exploits the image caption and the dialog history. Specifically, the AMN mainly consists of a memory network and a fusion module. The memory network holds the long-term dialog history and assigns each round of the dialog a different weight. In addition to the dialog history, the fusion module uses the image caption in Q-BOT and the image feature in A-BOT. The caption information helps Q-BOT attend to relevant content when generating questions, and the image feature helps A-BOT produce precise answers. With the AMN, the generated questions are diverse and focused, and the corresponding answers are accurate. Experimental results on VisDial v1.0 show the effectiveness of the proposed model, which outperforms state-of-the-art methods.
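
The idea of giving each dialog round a different weight and then fusing the result with caption or image information can be illustrated with a minimal sketch. This is not the authors' implementation: the dot-product scoring, the softmax normalization, the concatenation-based fusion, and every name below (`attend_over_rounds`, `fuse`, the toy vectors) are assumptions chosen only to make the mechanism concrete.

```python
import math


def softmax(scores):
    """Normalize raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attend_over_rounds(query, round_memories):
    """Weight each stored dialog round by its similarity to the query.

    Assumption: similarity is a plain dot product. The abstract does not
    say how the AMN scores rounds.
    Returns (weights, summary), where summary is the weighted sum of the
    round vectors.
    """
    weights = softmax([dot(query, memory) for memory in round_memories])
    dim = len(round_memories[0])
    summary = [
        sum(w * memory[i] for w, memory in zip(weights, round_memories))
        for i in range(dim)
    ]
    return weights, summary


def fuse(history_summary, side_features):
    """Hypothetical fusion step: concatenate the history summary with the
    caption features (Q-BOT) or the image features (A-BOT)."""
    return history_summary + side_features


# Toy data: three dialog rounds encoded as 2-d vectors (made-up values).
history = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]

weights, summary = attend_over_rounds(query, history)
fused = fuse(summary, [0.5, 0.5, 0.5])  # stand-in caption or image features
```

In this toy run the first and third rounds align with the query and receive equal, larger weights, while the second round is down-weighted rather than discarded.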


Updated: 2021-02-16
