Exploring Explicit and Implicit Visual Relationships for Image Captioning
arXiv - CS - Computer Vision and Pattern Recognition. Pub Date: 2021-05-06, DOI: arxiv-2105.02391
Zeliang Song, Xiaofei Zhou

Image captioning is one of the most challenging tasks in AI; it aims to automatically generate textual sentences for an image. Recent methods for image captioning follow the encoder-decoder framework, which transforms the sequence of salient regions in an image into a natural language description. However, these models usually lack a comprehensive understanding of the contextual interactions reflected in the various visual relationships between objects. In this paper, we explore explicit and implicit visual relationships to enrich region-level representations for image captioning. Explicitly, we build a semantic graph over object pairs and exploit gated graph convolutional networks (Gated GCN) to selectively aggregate information from local neighbors. Implicitly, we capture global interactions among the detected objects through region-based bidirectional encoder representations from transformers (Region BERT), without extra relational annotations. To evaluate the effectiveness and superiority of the proposed method, we conduct extensive experiments on the Microsoft COCO benchmark and achieve remarkable improvements over strong baselines.
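The abstract does not spell out the gating formulation, so the following is only a minimal PyTorch sketch of one common "gated GCN" pattern consistent with the description: each region aggregates its neighbors' features over the semantic graph, with a sigmoid gate deciding how much of each neighbor's message passes through. The class name, layer sizes, and exact gate form are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class GatedGCNLayer(nn.Module):
    """Sketch of a gated graph-convolution layer over detected regions.

    NOTE: a generic illustration of the 'Gated GCN' idea from the abstract;
    the paper's actual formulation may differ.
    """

    def __init__(self, dim):
        super().__init__()
        self.msg = nn.Linear(dim, dim)        # transforms neighbor features into messages
        self.gate = nn.Linear(2 * dim, dim)   # gate computed from a (node, neighbor) pair
        self.out = nn.Linear(dim, dim)

    def forward(self, x, adj):
        # x:   (num_regions, dim) region features from an object detector
        # adj: (num_regions, num_regions) 0/1 adjacency of the semantic graph
        n, d = x.shape
        # Pairwise gates: g[i, j] controls the message flowing from region j to region i.
        pairs = torch.cat(
            [x.unsqueeze(1).expand(n, n, d), x.unsqueeze(0).expand(n, n, d)], dim=-1
        )
        g = torch.sigmoid(self.gate(pairs)) * adj.unsqueeze(-1)  # zero out non-edges
        messages = (g * self.msg(x).unsqueeze(0)).sum(dim=1)     # gated aggregation
        # Normalize by neighborhood size to keep feature scales comparable.
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        return torch.relu(self.out(x + messages / deg))


# Toy usage: 5 detected regions with 2048-d features and a random semantic graph.
if __name__ == "__main__":
    regions = torch.randn(5, 2048)
    adj = (torch.rand(5, 5) > 0.5).float()
    layer = GatedGCNLayer(2048)
    print(layer(regions, adj).shape)  # torch.Size([5, 2048])
```

The implicit branch, as described, amounts to transformer self-attention over region features, so every detected object can attend to every other one with no relational annotations. A minimal sketch under the same caveat (sizes and number of layers are assumptions):

```python
import torch
import torch.nn as nn

# Plain transformer encoder over region features as a stand-in for Region BERT.
encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
region_bert = nn.TransformerEncoder(encoder_layer, num_layers=4)

regions = torch.randn(1, 36, 512)               # 36 detected regions, 512-d projected features
globally_contextualized = region_bert(regions)  # (1, 36, 512), each region attends to all others
```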

Last updated: 2021-05-07