GPT2MVS: Generative Pre-trained Transformer-2 for Multi-modal Video Summarization
arXiv - CS - Multimedia. Pub Date: 2021-04-26. arXiv:2104.12465
Jia-Hong Huang, Luka Murn, Marta Mrak, Marcel Worring

Traditional video summarization methods generate a fixed video representation regardless of user interest, and such methods therefore fall short of users' expectations in content search and exploration scenarios. Multi-modal video summarization is one approach to this problem. When multi-modal video summarization is used to support video exploration, a user-defined text-based query serves as one of the main drivers of summary generation. Effectively encoding both the text-based query and the video is therefore important for this task. In this work, a new method is proposed that uses a specialized attention network and contextualized word representations to tackle it. The proposed model consists of a contextualized video summary controller, multi-modal attention mechanisms, an interactive attention network, and a video summary generator. On the existing multi-modal video summarization benchmark, experimental results show that the proposed model is effective, improving accuracy by +5.88% and F1-score by +4.06% compared with the state-of-the-art method.
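To make the described pipeline concrete, below is a minimal sketch of a query-conditioned summarizer in this spirit: a GPT-2 encoder for contextualized query representations, a cross-attention block standing in for the interactive attention network, and a per-frame scorer as the summary generator. This is not the authors' code; the module names, dimensions, mean pooling, and concatenation-based scoring are illustrative assumptions, not the paper's exact architecture.

```python
# Hypothetical GPT2MVS-style pipeline sketch (not the paper's implementation).
import torch
import torch.nn as nn
from transformers import GPT2Model, GPT2Tokenizer

class QueryEncoder(nn.Module):
    """Contextualized word representations for the text query via GPT-2."""
    def __init__(self):
        super().__init__()
        self.tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
        self.gpt2 = GPT2Model.from_pretrained("gpt2")

    def forward(self, query: str) -> torch.Tensor:
        tokens = self.tokenizer(query, return_tensors="pt")
        hidden = self.gpt2(**tokens).last_hidden_state  # (1, T, 768)
        return hidden.mean(dim=1)  # (1, 768): pooled query vector (assumed pooling)

class InteractiveAttention(nn.Module):
    """Cross-attention: the query attends over per-frame video features."""
    def __init__(self, dim: int = 768):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, query_vec, frame_feats):
        # query_vec: (1, 768); frame_feats: (1, N_frames, 768)
        fused, weights = self.attn(query_vec.unsqueeze(1), frame_feats, frame_feats)
        return fused.squeeze(1), weights  # query-aware context and attention map

class SummaryGenerator(nn.Module):
    """Scores each frame for inclusion in the query-conditioned summary."""
    def __init__(self, dim: int = 768):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, context, frame_feats):
        # Pair the query-aware context with every frame feature, then score.
        n = frame_feats.size(1)
        ctx = context.unsqueeze(1).expand(-1, n, -1)              # (1, N, 768)
        scores = self.scorer(torch.cat([ctx, frame_feats], -1))   # (1, N, 1)
        return torch.sigmoid(scores).squeeze(-1)  # per-frame keep probability
```

A summary would then be formed by thresholding or top-k selecting frames by these probabilities; how the paper actually combines its controller and attention modules may differ.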

Updated: 2021-04-27