Exploring global diverse attention via pairwise temporal relation for video summarization
arXiv - CS - Computer Vision and Pattern Recognition Pub Date : 2020-09-23 , DOI: arxiv-2009.10942 Ping Li, Qinghao Ye, Luming Zhang, Li Yuan, Xianghua Xu, Ling Shao
Video summarization is an effective way to facilitate video searching and
browsing. Most existing systems employ encoder-decoder based recurrent
neural networks, which fail to explicitly diversify the system-generated
summary frames while requiring intensive computation. In this paper, we
propose an efficient convolutional neural network architecture for video
SUMmarization via Global Diverse Attention, called SUM-GDA, which adapts the
attention mechanism to a global perspective so as to consider pairwise
temporal relations among video frames. In particular, the GDA module has two
advantages: 1) it models the relations within each pair of frames as well as
the relations among all pairs, thus capturing global attention across all
frames of a video; 2) it reflects the importance of each frame to the whole
video, leading to diverse attention over the frames. SUM-GDA is therefore
well suited to selecting diverse frames that form a satisfactory video
summary. Extensive experiments on three data sets, i.e., SumMe, TVSum, and
VTW, demonstrate that SUM-GDA and its extension outperform competing
state-of-the-art methods by notable margins. In addition, the proposed
models can be run in parallel with significantly lower computational cost,
which eases deployment in highly demanding applications.
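The abstract does not give the exact formulation of the GDA module, but the idea it describes — scoring all pairwise temporal relations between frames, normalizing globally over every pair, and reading off a per-frame importance — can be illustrated with a minimal sketch. The function name, the dot-product affinity, and the global softmax are assumptions for illustration, not the authors' implementation:

```python
import numpy as np

def global_diverse_attention(features):
    """Hypothetical sketch of global pairwise attention over video frames.

    features: (T, d) array of per-frame feature vectors.
    Returns a length-T vector of frame importance scores summing to 1.
    """
    T, d = features.shape
    # Pairwise affinities between all T*T frame pairs, scaled as in
    # standard dot-product attention.
    scores = features @ features.T / np.sqrt(d)
    # Softmax over ALL pairs jointly (the "global" perspective): each
    # weight reflects a pair's strength relative to every other pair
    # in the video, not just pairs sharing a query frame.
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # A frame's importance to the whole video: the total attention it
    # receives across all pairs it participates in. Frames with
    # distinctive relations accumulate different mass, yielding the
    # diverse attention the abstract describes.
    return weights.sum(axis=0)

# Example: 8 frames with 4-dimensional features.
frame_features = np.random.RandomState(0).randn(8, 4)
importance = global_diverse_attention(frame_features)
```

A summary could then be formed by picking the top-scoring frames, subject to a length budget.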
Updated: 2020-09-24