

ReFormer: The Relational Transformer for Image Captioning
arXiv - CS - Computer Vision and Pattern Recognition. Pub Date: 2021-07-29, DOI: arxiv-2107.14178
Xuewen Yang, Yingru Liu, Xin Wang

Image captioning has been shown to achieve better performance when scene graphs are used to represent the relations between objects in the image. Current captioning encoders generally use a Graph Convolutional Net (GCN) to represent the relation information and merge it with the object region features via concatenation or convolution to get the final input for sentence decoding. However, the GCN-based encoders in existing methods are less effective for captioning for two reasons. First, using image captioning as the objective (i.e., Maximum Likelihood Estimation) rather than a relation-centric loss cannot fully explore the potential of the encoder. Second, using a pre-trained model instead of the encoder itself to extract the relationships is not flexible and cannot contribute to the explainability of the model. To improve the quality of image captioning, we propose a novel architecture, ReFormer (a RElational transFORMER), that generates features with relation information embedded and explicitly expresses the pair-wise relationships between objects in the image. ReFormer incorporates the objective of scene graph generation with that of image captioning using one modified Transformer model. This design allows ReFormer to generate not only better image captions, with the benefit of extracting strong relational image features, but also scene graphs that explicitly describe the pair-wise relationships. Experiments on publicly available datasets show that our model significantly outperforms state-of-the-art methods on image captioning and scene graph generation.
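
The abstract contrasts a captioning-only objective (Maximum Likelihood Estimation) with a relation-centric loss, and says ReFormer incorporates both objectives in one model. As a rough illustration of what two such objectives look like side by side, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not say how the two objectives are combined or scheduled, so the weighted sum, the `relation_weight` parameter, and all function names and data layouts below are assumptions made for illustration only.

```python
import math


def cross_entropy(logits, target):
    """Negative log-likelihood of index `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def caption_loss(token_logits, token_ids):
    """Captioning objective (MLE): mean cross-entropy over caption tokens."""
    losses = [cross_entropy(l, t) for l, t in zip(token_logits, token_ids)]
    return sum(losses) / len(losses)


def relation_loss(pair_logits, pair_labels):
    """Relation-centric objective: mean cross-entropy over object pairs.

    Both arguments are dicts keyed by (subject_index, object_index);
    this layout is hypothetical, not taken from the paper.
    """
    losses = [cross_entropy(pair_logits[p], label) for p, label in pair_labels.items()]
    return sum(losses) / len(losses)


def joint_loss(token_logits, token_ids, pair_logits, pair_labels, relation_weight=1.0):
    """Assumed combination: captioning loss plus a weighted relation loss."""
    return (caption_loss(token_logits, token_ids)
            + relation_weight * relation_loss(pair_logits, pair_labels))


# Toy example: 2 caption tokens over a 4-word vocabulary,
# 2 object pairs over 2 relation classes, all logits uniform.
token_logits = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
token_ids = [2, 0]
pair_logits = {(0, 1): [0.0, 0.0], (1, 0): [0.0, 0.0]}
pair_labels = {(0, 1): 1, (1, 0): 0}
total = joint_loss(token_logits, token_ids, pair_logits, pair_labels, relation_weight=0.5)
```

With uniform logits each term reduces to the log of the number of classes, so the toy `total` equals log 4 + 0.5 · log 2. In a real model the logits would come from the decoder and from a pairwise relation head over the encoder's object features.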


Updated: 2021-07-30
