Video summarization with a dual-path attentive network


Neurocomputing (IF 6), Pub Date: 2021-09-13, DOI: 10.1016/j.neucom.2021.09.015
Guoqiang Liang, Yanbing Lv, Shucheng Li, Xiahong Wang, Yanning Zhang


With the explosive growth of videos captured every day, efficiently extracting useful information from videos has become an increasingly important problem. Video summarization, which aims to extract the most important frames or shots, is one of the most effective approaches and has recently attracted growing interest. Many current methods employ a recurrent structure, but its step-by-step nature makes these models difficult to parallelize. To address this problem, we propose a dual-path attentive video summarization framework consisting of a temporal-spatial encoder, a score-aware encoder, and a decoder, all of which are mainly based on multi-head self-attention and the convolutional block attention module. The temporal-spatial encoder captures temporal and spatial information, while the score-aware encoder combines the appearance features with previously predicted frame-level importance scores. By combining the scores and appearance features, our model can better capture long-range global dependencies and continuously update the importance scores of previous frames. Moreover, because it is based entirely on attention mechanisms, our model can be trained fully in parallel, which reduces training time. To validate the method, we evaluate it on two popular datasets, SumMe and TVSum. The experimental results show the effectiveness of the proposed method.
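The parallelization argument in the abstract can be illustrated with a minimal sketch. It is not the authors' architecture: it assumes a single attention head with identity projections (no learned weights, no convolutional block attention module), and the helper `append_scores` is a hypothetical stand-in for the idea of feeding previously predicted scores alongside appearance features. The point it shows is that each output frame depends only on the input sequence, not on earlier outputs, so all frames can be processed at once.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(features):
    """Single-head scaled dot-product self-attention (identity projections).

    features: list of T frame vectors, each of length d.
    Returns T context vectors. Each output row is computed from the input
    sequence alone, so the loop below could run in parallel, unlike a
    recurrent model where step t needs the hidden state from step t-1.
    """
    d = len(features[0])
    scale = math.sqrt(d)
    out = []
    for q in features:
        # Similarity of this frame to every frame in the video.
        logits = [sum(a * b for a, b in zip(q, k)) / scale for k in features]
        weights = softmax(logits)
        # Weighted average of all frame vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, features))
                    for j in range(d)])
    return out


def append_scores(features, scores):
    """Hypothetical score-aware input: append each frame's previously
    predicted importance score to its appearance feature."""
    return [f + [s] for f, s in zip(features, scores)]


# Toy data: three frames with 2-D appearance features (made-up values).
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
prev_scores = [0.2, 0.9, 0.5]

context = self_attention(frames)
score_aware = self_attention(append_scores(frames, prev_scores))
```

Because attention without positional information treats the input as a set, reversing the frame order simply reverses the output order; a practical model would add positional encodings to recover temporal order.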


Updated: 2021-10-09
