See, hear, read: Leveraging multimodality with guided attention for abstractive text summarization
Knowledge-Based Systems (IF 7.2), Pub Date: 2021-05-26, DOI: 10.1016/j.knosys.2021.107152
Yash Kumar Atri, Shraman Pramanick, Vikram Goyal, Tanmoy Chakraborty

In recent years, abstractive text summarization with multimodal inputs has started drawing attention due to its ability to accumulate information from different source modalities and generate a fluent textual summary. However, existing methods use short videos as the visual modality and short summaries as the ground truth; as a result, they perform poorly on lengthy videos and long ground-truth summaries. Additionally, no benchmark dataset exists to generalize this task to videos of varying lengths.

In this paper, we introduce AVIATE, the first large-scale dataset for abstractive text summarization with videos of diverse duration, compiled from presentations at well-known academic conferences such as NDSS, ICML, and NeurIPS. We use the abstracts of the corresponding research papers as the reference summaries, which ensures adequate quality and uniformity of the ground truth. We then propose FLORAL, a factorized multimodal Transformer-based decoder-only language model that inherently captures the intra-modal and inter-modal dynamics within the various input modalities for the text summarization task. FLORAL utilizes an increasing number of self-attentions to capture multimodality and performs significantly better than traditional encoder–decoder based networks. Extensive experiments show that FLORAL achieves significant improvements over the baselines in both qualitative and quantitative evaluations on the existing How2 dataset of short videos and the newly introduced AVIATE dataset of videos with diverse duration, beating the best baseline on the two datasets by 1.39 and 2.74 ROUGE-L points, respectively.
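To make the "intra-modal and inter-modal dynamics" idea concrete, the sketch below shows one plausible way a factorized multimodal self-attention block can be organized: each modality's token sequence first attends within itself, and the fused sequence then attends across modalities. This is a minimal illustrative sketch in PyTorch, assuming generic module and parameter names; it is not the authors' FLORAL implementation, and the actual model uses many more such layers in a decoder-only stack.

```python
import torch
import torch.nn as nn


class FactorizedMultimodalBlock(nn.Module):
    """Illustrative block: intra-modal attention per modality, then inter-modal attention."""

    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        # One intra-modal self-attention layer per modality (text, audio, video).
        self.intra = nn.ModuleDict({
            m: nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            for m in ("text", "audio", "video")
        })
        # A shared inter-modal self-attention layer over the fused sequence.
        self.inter = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, feats):
        # feats: dict mapping modality name -> (batch, seq_len, d_model) tensor
        fused = []
        for m, x in feats.items():
            h, _ = self.intra[m](x, x, x)     # intra-modal dynamics
            fused.append(self.norm(x + h))    # residual + layer norm
        z = torch.cat(fused, dim=1)           # concatenate modalities along time
        h, _ = self.inter(z, z, z)            # inter-modal dynamics
        return self.norm(z + h)


# Toy usage: 2 samples with 512-dim features per modality.
block = FactorizedMultimodalBlock()
out = block({
    "text": torch.randn(2, 50, 512),
    "audio": torch.randn(2, 80, 512),
    "video": torch.randn(2, 30, 512),
})
print(out.shape)  # torch.Size([2, 160, 512])
```

Factorizing attention this way keeps each intra-modal attention quadratic only in that modality's own length, which matters when the video and audio streams are much longer than the transcript, as in AVIATE's full-length conference talks.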




Updated: 2021-06-13