Graph-based Multimodal Ranking Models for Multimodal Summarization
ACM Transactions on Asian and Low-Resource Language Information Processing (IF 2), Pub Date: 2021-05-26, DOI: 10.1145/3445794
Junnan Zhu, Lu Xiang, Yu Zhou, Jiajun Zhang, Chengqing Zong

Multimodal summarization aims to extract the most important information from multimedia input, and it has attracted growing interest due to the rapid growth of multimedia data in recent years. Various studies have addressed different multimodal summarization tasks; however, each existing method can generate only single-modal output or only multimodal output, and most require large amounts of annotated data for training, which makes them difficult to generalize to other tasks or domains. Motivated by this, we propose a unified framework for multimodal summarization that covers both single-modal-output and multimodal-output summarization. Within this framework, we consider three different scenarios and propose a corresponding unsupervised graph-based multimodal summarization model for each, none of which requires manually annotated document-summary pairs for training: (1) generic multimodal ranking, (2) modal-dominated multimodal ranking, and (3) non-redundant text-image multimodal ranking. Furthermore, an image-text similarity estimation model is introduced to measure the semantic similarity between images and text. Experiments show that our proposed models outperform single-modal summarization methods on both automatic and human evaluation metrics, and that they can also improve single-modal summarization with the guidance of multimedia information. This study can serve as a benchmark for further research on the multimodal summarization task.
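The abstract does not spell out the models' formal details, but the general idea of unsupervised graph-based multimodal ranking can be illustrated with a short sketch. The Python code below is a minimal, hypothetical illustration, assuming sentence and image embeddings already live in a shared semantic space (e.g., as produced by an image-text similarity model): a PageRank-style power iteration scores all nodes of one joint similarity graph, and a greedy redundancy-penalized selection stands in for the non-redundant text-image scenario. All function names, the cosine similarity, and parameters such as damping and redundancy are illustrative assumptions, not the authors' exact formulation.

```python
# Minimal sketch, assuming sentence and image embeddings in a shared space.
# Names and parameters are illustrative, not the paper's formulation.
import numpy as np


def _normalize(rows: np.ndarray) -> np.ndarray:
    """L2-normalize each row so dot products become cosine similarities."""
    return rows / np.linalg.norm(rows, axis=1, keepdims=True)


def pagerank(sim: np.ndarray, damping: float = 0.85,
             iters: int = 100, tol: float = 1e-6) -> np.ndarray:
    """Score graph nodes by power iteration over a row-normalized
    similarity matrix (the classic PageRank update)."""
    n = sim.shape[0]
    np.fill_diagonal(sim, 0.0)                         # no self-loops
    row_sums = sim.sum(axis=1, keepdims=True)
    trans = np.divide(sim, row_sums,                   # rows with no edges
                      out=np.full_like(sim, 1.0 / n),  # fall back to uniform
                      where=row_sums > 0)
    scores = np.full(n, 1.0 / n)
    for _ in range(iters):
        new = (1 - damping) / n + damping * (trans.T @ scores)
        if np.abs(new - scores).sum() < tol:
            break
        scores = new
    return scores


def rank_multimodal(text_emb: np.ndarray, img_emb: np.ndarray):
    """Generic multimodal ranking: sentences and images form one graph,
    edges are (clipped) cosine similarities, and all nodes are ranked
    jointly. Returns (scores, normalized node matrix)."""
    nodes = _normalize(np.vstack([text_emb, img_emb]))
    sim = np.clip(nodes @ nodes.T, 0.0, None)          # drop negative edges
    return pagerank(sim), nodes


def select_summary(scores: np.ndarray, nodes: np.ndarray,
                   k: int = 3, redundancy: float = 0.6) -> list:
    """Greedy selection with a redundancy penalty, a simple stand-in for
    the non-redundant text-image ranking scenario."""
    chosen, remaining = [], list(range(len(scores)))
    while remaining and len(chosen) < k:
        def penalty(i):
            return max((float(nodes[i] @ nodes[j]) for j in chosen),
                       default=0.0)
        best = max(remaining,
                   key=lambda i: scores[i] - redundancy * penalty(i))
        chosen.append(best)
        remaining.remove(best)
    return chosen


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sents = rng.normal(size=(10, 64))   # 10 sentence embeddings (toy data)
    imgs = rng.normal(size=(3, 64))     # 3 image embeddings, shared space
    scores, nodes = rank_multimodal(sents, imgs)
    print(select_summary(scores, nodes, k=4))
```

A modal-dominated variant could, for instance, bias the teleport distribution or the edge weights toward the dominant modality; for the exact formulations and the learned image-text similarity model, consult the paper itself.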

Updated: 2021-05-26