Multi-modal Summarization for Video-containing Documents
arXiv - CS - Information Retrieval Pub Date : 2020-09-17 , DOI: arxiv-2009.08018
Xiyan Fu and Jun Wang and Zhenglu Yang

Summarization of multimedia data is increasingly important because it underpins many real-world applications, such as question answering and Web search. However, most existing multi-modal summarization work uses visual features extracted from images rather than videos, thereby discarding abundant information. We therefore propose a novel multi-modal summarization task that summarizes a document together with its associated video. We also build a general baseline model with two effective strategies: bi-hop attention and an improved late-fusion mechanism to bridge the gap between modalities, and a bi-stream summarization strategy that performs text and video summarization simultaneously. Comprehensive experiments show that the proposed model benefits multi-modal summarization and outperforms existing methods. Moreover, we collect a novel dataset of paired documents and videos, providing a new resource for future study.
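The abstract names the strategies (bi-hop attention, late fusion) but does not specify their exact formulation. As a rough, non-authoritative illustration of the general pattern, the following minimal NumPy sketch shows one plausible reading: a first attention hop lets text embeddings attend to video-frame embeddings, a second hop attends back to the text, and the two modality contexts are combined only at the end (late fusion). All function names, dimensions, and the mixing weight are hypothetical, not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def bi_hop_attention(text, video):
    """Two-hop cross-modal attention (illustrative sketch).

    text:  (n_sentences, d) sentence embeddings
    video: (n_frames, d)    frame embeddings
    """
    # hop 1: each sentence attends over the video frames
    scores = text @ video.T                       # (n_sentences, n_frames)
    video_ctx = softmax(scores, axis=1) @ video   # text-conditioned video context
    # hop 2: the video context attends back over the sentences
    back = video_ctx @ text.T                     # (n_sentences, n_sentences)
    text_ctx = softmax(back, axis=1) @ text       # video-conditioned text context
    return text_ctx, video_ctx

def late_fuse(text_ctx, video_ctx, alpha=0.5):
    # late fusion: modality-specific representations are combined
    # only after each modality has been processed separately
    return alpha * text_ctx + (1 - alpha) * video_ctx

rng = np.random.default_rng(0)
text = rng.standard_normal((4, 8))    # 4 sentences, embedding dim 8
video = rng.standard_normal((6, 8))   # 6 frames, embedding dim 8
t_ctx, v_ctx = bi_hop_attention(text, video)
fused = late_fuse(t_ctx, v_ctx)
print(fused.shape)  # (4, 8): one fused representation per sentence
```

In a bi-stream setup, a fused representation like this could feed both a text-summary decoder and a video key-frame selector; the sketch only covers the fusion step.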

Updated: 2020-09-18