Supervised Video Summarization via Multiple Feature Sets with Parallel Attention



arXiv - CS - Multimedia, Pub Date: 2021-04-23, DOI: arxiv-2104.11530
Junaid Ahmed Ghauri, Sherzod Hakimov, Ralph Ewerth

Assigning importance scores to individual frames or (short) segments of a video is crucial for summarization, but it is also a difficult task. Previous work utilizes only one source of visual features. In this paper, we propose a novel model architecture that combines three feature sets for visual content and motion to predict importance scores. The proposed architecture applies an attention mechanism before fusing motion features with features representing the (static) visual content, i.e., features derived from an image classification model. Comprehensive experimental evaluations are reported for two well-known datasets, SumMe and TVSum. In this context, we identify methodological issues in how previous work used these benchmark datasets, and present a fair evaluation scheme with appropriate data splits that can be used in future work. When using static and motion features with a parallel attention mechanism, we improve state-of-the-art results for SumMe, while being on par with the state of the art for TVSum.
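
The idea of attending to each feature set separately and fusing afterwards can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the attention variant, the fusion operator, or any dimensions, so the code below assumes plain scaled dot-product self-attention without learned projections, fusion by concatenation, a single untrained linear layer with a sigmoid, and made-up feature sizes. All function and variable names are hypothetical.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(frames):
    """Scaled dot-product self-attention over one feature sequence.

    Queries, keys and values are the raw features (no learned projections),
    a simplification assumed for this sketch.
    """
    dim = len(frames[0])
    scale = math.sqrt(dim)
    attended = []
    for q in frames:
        weights = softmax(
            [sum(a * b for a, b in zip(q, k)) / scale for k in frames]
        )
        attended.append(
            [sum(w * v[j] for w, v in zip(weights, frames)) for j in range(dim)]
        )
    return attended


def predict_importance(feature_sets, weights, bias=0.0):
    """Attend to each feature set independently, fuse, and score each frame."""
    # Parallel attention: one independent attention pass per feature set.
    attended_sets = [self_attention(fs) for fs in feature_sets]
    scores = []
    for per_frame in zip(*attended_sets):
        # Fusion by concatenation (an assumption; other operators are possible).
        fused = [x for feats in per_frame for x in feats]
        logit = sum(w * x for w, x in zip(weights, fused)) + bias
        scores.append(1.0 / (1.0 + math.exp(-logit)))  # importance in (0, 1)
    return scores


random.seed(0)
num_frames = 5
dims = (4, 4, 3)  # hypothetical sizes for three feature sets
feature_sets = [
    [[random.uniform(-1, 1) for _ in range(d)] for _ in range(num_frames)]
    for d in dims
]
weights = [random.uniform(-1, 1) for _ in range(sum(dims))]  # untrained placeholder
scores = predict_importance(feature_sets, weights)
print([round(s, 3) for s in scores])
```

Running the script prints one score per frame. Because the weights are random placeholders, the values carry no meaning; in a supervised setting they would be trained against annotated importance scores such as those provided with SumMe and TVSum.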


Updated: 2021-04-26
