Complementary spatiotemporal network for video question answering
Multimedia Systems ( IF 3.9 ) Pub Date : 2021-06-02 , DOI: 10.1007/s00530-021-00805-6
Xinrui Li , Aming Wu , Yahong Han

Video question answering (VideoQA) is challenging because it requires models to capture motion and spatial semantics and to associate them with the linguistic context. Recent methods usually treat space and time symmetrically. However, since spatial structures and temporal events in a video often change at different speeds, such methods struggle to distinguish fine spatial details from motion relationships at different scales. To this end, we propose a complementary spatiotemporal network (CST) that focuses on multi-scale motion relationships and essential spatial semantics. Our model comprises three modules. First, a multi-scale relation unit (MR) captures temporal information by modeling relations between motions at different temporal distances. Second, a mask similarity (MS) operation captures discriminative spatial semantics in a less redundant manner. Third, a cross-modality attention (CMA) module boosts the interaction between the visual and linguistic modalities. We evaluate our method on three benchmark datasets and conduct extensive ablation studies. The resulting performance improvements demonstrate the effectiveness of our approach.
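The abstract does not specify how the CMA module is formulated, but cross-modality attention between question tokens and video frame features is commonly realized as scaled dot-product attention. The sketch below is a minimal NumPy illustration of that generic pattern, not the paper's exact implementation; all shapes and names are illustrative assumptions.

```python
import numpy as np

def cross_modal_attention(video_feats, question_feats):
    """One plausible reading of a cross-modality attention (CMA) step:
    each question token attends over video frame features via scaled
    dot-product attention. Illustrative sketch only; the paper's exact
    formulation is not given in the abstract."""
    d = question_feats.shape[-1]
    # Similarity between every question token and every frame feature.
    scores = question_feats @ video_feats.T / np.sqrt(d)   # (Lq, T)
    # Numerically stable softmax over the video (frame) axis.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Question tokens enriched with attended visual context.
    return weights @ video_feats                           # (Lq, d)

rng = np.random.default_rng(0)
video = rng.standard_normal((16, 64))    # T=16 frames, feature dim 64
question = rng.standard_normal((8, 64))  # Lq=8 question tokens
attended = cross_modal_attention(video, question)
print(attended.shape)  # (8, 64)
```

In practice such a module would use learned query/key/value projections and run in both directions (question-to-video and video-to-question); the fixed random features here only demonstrate the attention mechanics.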




Updated: 2021-06-02