An Ensemble of Generation- and Retrieval-Based Image Captioning With Dual Generator Generative Adversarial Network
IEEE Transactions on Image Processing (IF 10.8), Pub Date: 2020-10-15, DOI: 10.1109/tip.2020.3028651
Min Yang, Junhao Liu, Ying Shen, Zhou Zhao, Xiaojun Chen, Qingyao Wu, Chengming Li

Image captioning, which aims to generate a sentence describing the key content of a query image, is an important but challenging task. Existing image captioning approaches fall into two categories: generation-based methods and retrieval-based methods. Retrieval-based methods describe images by retrieving pre-existing captions from a repository, while generation-based methods synthesize a new sentence that verbalizes the query image. Each approach has its own advantages and drawbacks. In this paper, we propose EnsCaption, a novel model that ensembles retrieval-based and generation-based image captioning through a dual generator generative adversarial network. Specifically, EnsCaption is composed of a caption generation model that synthesizes tailored captions for the query image, a caption re-ranking model that retrieves the best-matching caption from a candidate pool consisting of generated and pre-retrieved captions, and a discriminator that learns the multi-level difference between the generated/retrieved captions and the ground-truth captions. During adversarial training, the caption generation model and the caption re-ranking model learn to provide synthetic and retrieved candidate captions that earn high ranking scores from the discriminator, while the discriminator, based on multi-level ranking, is trained to assign low ranking scores to generated and retrieved captions. Our model thus absorbs the merits of both generation-based and retrieval-based approaches. We conduct comprehensive experiments to evaluate EnsCaption on two benchmark datasets, MSCOCO and Flickr-30K. Experimental results show that EnsCaption achieves impressive performance compared with strong baseline methods.
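To make the three-player training dynamics concrete, the sketch below is a minimal, self-contained PyTorch illustration of the adversarial setup the abstract describes: a generator, a re-ranker that scores a candidate pool, and a discriminator trained to separate ground-truth captions from generated/retrieved ones. All names (CaptionScorer, generator, reranker, discriminator), the toy embedding dimensions, and the specific loss formulations are assumptions made for illustration; the actual EnsCaption model operates on word sequences with a multi-level ranking discriminator, whereas this sketch works on fixed-size embeddings.

    # Minimal sketch of the dual-generator adversarial loop (illustrative only).
    import torch
    import torch.nn as nn

    IMG_DIM, CAP_DIM, BATCH = 512, 256, 8  # toy feature sizes (assumed)

    class CaptionScorer(nn.Module):
        """Scores an (image, caption) embedding pair with a single logit."""
        def __init__(self):
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(IMG_DIM + CAP_DIM, 128), nn.ReLU(), nn.Linear(128, 1))
        def forward(self, img, cap):
            return self.mlp(torch.cat([img, cap], dim=-1)).squeeze(-1)

    generator = nn.Linear(IMG_DIM, CAP_DIM)  # stand-in for the caption generation model
    reranker = CaptionScorer()               # scores candidates from the caption pool
    discriminator = CaptionScorer()          # assigns ranking scores to captions

    g_opt = torch.optim.Adam(generator.parameters(), lr=1e-4)
    r_opt = torch.optim.Adam(reranker.parameters(), lr=1e-4)
    d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-4)
    bce = nn.BCEWithLogitsLoss()

    for step in range(100):
        img = torch.randn(BATCH, IMG_DIM)   # toy image features
        gt = torch.randn(BATCH, CAP_DIM)    # toy ground-truth caption embeddings
        pool = torch.randn(BATCH, CAP_DIM)  # toy pre-retrieved caption embeddings

        # Discriminator: high scores for ground truth, low for generated/retrieved.
        fake = generator(img).detach()
        d_loss = (bce(discriminator(img, gt), torch.ones(BATCH))
                  + bce(discriminator(img, fake), torch.zeros(BATCH))
                  + bce(discriminator(img, pool), torch.zeros(BATCH)))
        d_opt.zero_grad(); d_loss.backward(); d_opt.step()

        # Generator: synthesize captions that earn high discriminator scores.
        g_loss = bce(discriminator(img, generator(img)), torch.ones(BATCH))
        g_opt.zero_grad(); g_loss.backward(); g_opt.step()

        # Re-ranker: softly select, from the candidate pool (generated +
        # retrieved), the caption the discriminator currently rates highest.
        cand = torch.stack([generator(img).detach(), pool])          # (2, B, CAP_DIM)
        r_scores = torch.stack([reranker(img, c) for c in cand])     # (2, B)
        weights = torch.softmax(r_scores, dim=0)                     # soft selection
        with torch.no_grad():
            d_scores = torch.stack([discriminator(img, c) for c in cand])
        r_loss = -(weights * d_scores).sum(dim=0).mean()  # maximize expected score
        r_opt.zero_grad(); r_loss.backward(); r_opt.step()

The soft-selection loss for the re-ranker is one plausible proxy for the paper's ranking objective: it pushes the re-ranker to assign high scores to whichever candidates the discriminator currently rates best, so both "generators" are driven by the same adversarial signal.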

Updated: 2020-10-26