Macroscopic Control of Text Generation for Image Captioning
arXiv - CS - Computer Vision and Pattern Recognition. Pub Date: 2021-01-20, DOI: arxiv-2101.08000
Zhangzi Zhu, Tianlei Wang, Hong Qu

Although image captioning models can generate impressive descriptions for a given image, challenges remain: (1) the controllability and diversity of existing models are still far from satisfactory; (2) models sometimes produce extremely poor-quality captions. In this paper, two novel methods are introduced to address these problems respectively. Specifically, for the former problem, we introduce a control signal that governs macroscopic sentence attributes, such as sentence quality, sentence length, sentence tense, and number of nouns. With such a control signal, the controllability and diversity of existing captioning models are enhanced. For the latter problem, we propose a strategy in which an image-text matching model is trained to measure the quality of sentences generated in both the forward and backward directions, and the better one is chosen. As a result, this strategy effectively reduces the proportion of poor-quality sentences. Our proposed methods can easily be applied to most image captioning models to improve their overall performance. Based on the Up-Down model, experimental results show that our methods achieve BLEU-4/CIDEr/SPICE scores of 37.5/120.3/21.5 on the MSCOCO Karpathy test split with cross-entropy training, surpassing other state-of-the-art methods trained with cross-entropy loss.
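The abstract outlines two mechanisms: a control vector fed to the decoder alongside each word embedding to steer macroscopic sentence attributes, and an image-text matching model that scores a forward-generated and a backward-generated caption and keeps the better one. The sketch below illustrates both ideas in PyTorch; it is a minimal illustration assuming an LSTM decoder, and every name and dimension here (ControlledDecoder, pick_better_caption, the 16-dimensional control vector) is hypothetical rather than taken from the paper.

```python
# Hypothetical sketch, not the authors' released code.
import torch
import torch.nn as nn

class ControlledDecoder(nn.Module):
    """LSTM decoder whose per-step input is the word embedding
    concatenated with a control vector encoding macroscopic
    attributes (e.g. target length, tense, noun count).
    Layer sizes are illustrative."""
    def __init__(self, vocab_size, embed_dim=512, ctrl_dim=16, hidden_dim=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTMCell(embed_dim + ctrl_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def step(self, word_ids, ctrl, state=None):
        # word_ids: (batch,) token ids; ctrl: (batch, ctrl_dim)
        x = torch.cat([self.embed(word_ids), ctrl], dim=-1)
        h, c = self.lstm(x, state)
        return self.out(h), (h, c)  # next-word logits, new state

def pick_better_caption(matching_model, image_feats, fwd_caption, bwd_caption):
    """Score both candidates with an image-text matching model and
    keep the higher-scoring one, reducing the chance of emitting a
    very poor caption."""
    s_fwd = matching_model(image_feats, fwd_caption)
    s_bwd = matching_model(image_feats, bwd_caption)
    return fwd_caption if s_fwd >= s_bwd else bwd_caption
```

In this sketch the control vector would encode the desired attributes (for instance, a normalized target length and a tense flag), and matching_model is assumed to return a scalar compatibility score for an image-caption pair.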

Updated: 2021-01-21