Adversarial Multimodal Network for Movie Story Question Answering
IEEE Transactions on Multimedia (IF 8.4), Pub Date: 2020-06-15, DOI: 10.1109/tmm.2020.3002667
Zhaoquan Yuan , Siyuan Sun , Lixin Duan , Changsheng Li , Xiao Wu , Changsheng Xu

Visual question answering using information from multiple modalities has attracted increasing attention in recent years. However, it is a very challenging task, as visual content and natural language have quite different statistical properties. In this work, we present a method called Adversarial Multimodal Network (AMN) to better understand video stories for question answering. In AMN, we propose to learn multimodal feature representations by finding a more coherent subspace for video clips and the corresponding texts (e.g., subtitles and questions) based on generative adversarial networks. Moreover, a self-attention mechanism is developed to enforce our newly introduced consistency constraint, which preserves the self-correlation among the visual cues of the original video clips in the learned multimodal representations. Extensive experiments on the benchmark MovieQA and TVQA datasets demonstrate the effectiveness of the proposed AMN over other published state-of-the-art methods.
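
The abstract describes two components: a GAN-style alignment that pulls video-clip and text features into a coherent shared subspace, and a self-attention-based consistency constraint that preserves the self-correlation of the original clips. Below is a minimal PyTorch sketch of that idea, not the authors' implementation; all module names, dimensions, the loss weighting, and the exact form of the consistency term are assumptions for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Projector(nn.Module):
    """Maps modality-specific features into a shared subspace."""
    def __init__(self, in_dim, shared_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, shared_dim), nn.ReLU(),
            nn.Linear(shared_dim, shared_dim))
    def forward(self, x):
        return self.net(x)

class ModalityDiscriminator(nn.Module):
    """Predicts whether a shared-space vector came from video (1) or text (0)."""
    def __init__(self, shared_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(shared_dim, shared_dim // 2), nn.ReLU(),
            nn.Linear(shared_dim // 2, 1))
    def forward(self, z):
        return self.net(z)  # raw logits

def attention_map(x):
    """Dot-product self-attention weights over a clip sequence (B, T, D) -> (B, T, T)."""
    return F.softmax(x @ x.transpose(1, 2) / x.size(-1) ** 0.5, dim=-1)

def consistency_loss(video_raw, video_shared):
    """Keep the clips' self-correlation in the shared space close to that of the raw clips."""
    return F.mse_loss(attention_map(video_shared), attention_map(video_raw))

# Toy forward pass with random features standing in for pre-extracted clip/text embeddings.
B, T, Dv, Dt, Ds = 2, 5, 512, 300, 256
video = torch.randn(B, T, Dv)   # per-clip visual features
text = torch.randn(B, T, Dt)    # subtitle/question embeddings

vid_proj, txt_proj = Projector(Dv, Ds), Projector(Dt, Ds)
disc = ModalityDiscriminator(Ds)
zv, zt = vid_proj(video), txt_proj(text)

# Adversarial term: the discriminator separates modalities; the projectors are trained
# (via the usual min-max game or a gradient-reversal layer) to fool it, pulling video
# and text into a more coherent shared subspace.
logit_v, logit_t = disc(zv), disc(zt)
adv_loss = F.binary_cross_entropy_with_logits(
    torch.cat([logit_v, logit_t], dim=0),
    torch.cat([torch.ones_like(logit_v), torch.zeros_like(logit_t)], dim=0))

# Consistency term: preserve the original clips' self-attention structure.
cons_loss = consistency_loss(video, zv)
total_loss = adv_loss + 0.1 * cons_loss  # weighting is illustrative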

Updated: 2020-06-15