Video Dialog via Multi-Grained Convolutional Self-Attention Context Multi-Modal Networks
IEEE Transactions on Circuits and Systems for Video Technology (IF 8.3), Pub Date: 2020-12-01, DOI: 10.1109/tcsvt.2019.2957309
Mao Gu, Zhou Zhao, Weike Jin, Deng Cai, Fei Wu

Video dialog is a new and challenging task that requires an AI agent to hold a meaningful natural-language conversation with humans about video content. Specifically, given a video, a dialog history, and a new question about the video, the agent must combine the video information with the dialog history to infer the answer. However, existing methods for image dialog and video question answering cannot be applied directly to video dialog, since they fail to handle the complexity of video information and to establish the logical dependencies among history contexts. In this paper, we propose a novel approach for video dialog, called the multi-grained convolutional self-attention context network, which combines video information with dialog history. Instead of using an RNN to encode sequence information, we design a multi-grained convolutional self-attention mechanism that captures both element-level and segment-level interactions, thereby encoding multi-grained sequence information. Moreover, a hierarchical dialog history encoder is designed to learn a context-aware question representation. Finally, we build two decoders, in multiple-choice and open-ended forms respectively, which use different strategies to obtain the multi-modal context-aware video representation and to generate human-like answers. We evaluate our method on two large-scale datasets. Owing to the flexibility and parallelism of the new attention mechanism, our method achieves higher time efficiency, and extensive experiments also demonstrate its effectiveness.
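To make the two granularities concrete, below is a minimal PyTorch sketch of one convolutional self-attention layer in this spirit: element-level queries and keys come from pointwise (kernel-size-1) convolutions, segment-level ones from a wider 1-D convolution over neighbouring elements, and the two are fused before standard scaled dot-product attention. The class name, kernel size, and fusion-by-addition are illustrative assumptions, not the authors' published implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiGrainedConvSelfAttention(nn.Module):
    """Hypothetical sketch: self-attention whose queries/keys are built at
    two granularities -- per-element (kernel size 1) and per-segment
    (kernel size `seg_kernel`) -- then fused by addition."""

    def __init__(self, d_model: int, seg_kernel: int = 3):
        super().__init__()
        # Element-level projections: a pointwise conv is a per-token linear map.
        self.q_elem = nn.Conv1d(d_model, d_model, kernel_size=1)
        self.k_elem = nn.Conv1d(d_model, d_model, kernel_size=1)
        # Segment-level projections: a wider 1-D conv pools a local window
        # of neighbouring elements into each query/key.
        pad = seg_kernel // 2
        self.q_seg = nn.Conv1d(d_model, d_model, kernel_size=seg_kernel, padding=pad)
        self.k_seg = nn.Conv1d(d_model, d_model, kernel_size=seg_kernel, padding=pad)
        self.v = nn.Conv1d(d_model, d_model, kernel_size=1)
        self.scale = d_model ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); Conv1d expects (batch, d_model, seq_len).
        h = x.transpose(1, 2)
        q = (self.q_elem(h) + self.q_seg(h)).transpose(1, 2)  # fuse granularities
        k = (self.k_elem(h) + self.k_seg(h)).transpose(1, 2)
        v = self.v(h).transpose(1, 2)
        attn = F.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)
        return attn @ v  # context-aware sequence representation

# Usage: encode a batch of 20 video-frame features of width 256.
frames = torch.randn(2, 20, 256)
layer = MultiGrainedConvSelfAttention(d_model=256, seg_kernel=3)
out = layer(frames)  # -> torch.Size([2, 20, 256])
```

Unlike an RNN, every position here is computed in one parallel pass over the sequence, which is the source of the time-efficiency claim in the abstract.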
