Exploring Explicit and Implicit Visual Relationships for Image Captioning
arXiv - CS - Computer Vision and Pattern Recognition Pub Date : 2021-05-06 , DOI: arxiv-2105.02391 Zeliang Song, Xiaofei Zhou
Image captioning is one of the most challenging tasks in AI: it aims to
automatically generate textual sentences describing an image. Recent image
captioning methods follow an encoder-decoder framework that transforms the
sequence of salient regions in an image into a natural language description.
However, these models usually lack a comprehensive understanding of the
contextual interactions reflected in the various visual relationships between
objects. In this paper, we explore explicit and implicit visual relationships
to enrich region-level representations for image captioning. Explicitly, we
build a semantic graph over object pairs and exploit gated graph convolutional
networks (Gated GCN) to selectively aggregate information from local
neighbors. Implicitly, we capture global interactions among the detected
objects through region-based bidirectional encoder representations from
transformers (Region BERT), without extra relational annotations. To evaluate
the effectiveness and superiority of the proposed method, we conduct extensive
experiments on the Microsoft COCO benchmark and achieve remarkable
improvements over strong baselines.
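The two relationship modules in the abstract can be sketched in miniature: a gated graph convolution that aggregates only over edges of the semantic graph, with a learned sigmoid gate per edge, and a single self-attention head in which every region attends to every other region (the core operation inside Region BERT). All weight matrices, the toy adjacency matrix, and the feature dimensions below are illustrative placeholders, not the paper's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_gcn_layer(H, A, W, U, Wg):
    """One gated graph-convolution layer over region features.

    H: (N, d) region features; A: (N, N) adjacency of the semantic graph.
    Each neighbor's message is scaled by a scalar sigmoid gate computed
    from the node pair, so the layer aggregates neighbors selectively.
    """
    N, d = H.shape
    out = np.zeros_like(H)
    for i in range(N):
        agg = np.zeros(d)
        for j in range(N):
            if A[i, j]:
                gate = sigmoid(np.concatenate([H[i], H[j]]) @ Wg)  # scalar gate
                agg += gate * (H[j] @ U)
        out[i] = np.tanh(H[i] @ W + agg)
    return out

def self_attention(H, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over regions:
    every region attends to every other, capturing implicit global
    interactions without any relation annotations."""
    Q, K, V = H @ Wq, H @ Wk, H @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])
    attn = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)  # row-wise softmax
    return attn @ V

# Toy example: 4 detected regions with 8-d features (placeholder sizes)
N, d = 4, 8
H = rng.standard_normal((N, d))
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]])  # hypothetical semantic graph over object pairs
W, U = rng.standard_normal((d, d)), rng.standard_normal((d, d))
Wg = rng.standard_normal(2 * d)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

explicit = gated_gcn_layer(H, A, W, U, Wg)   # local, graph-based relations
implicit = self_attention(H, Wq, Wk, Wv)     # global, attention-based relations
print(explicit.shape, implicit.shape)        # prints (4, 8) (4, 8)
```

In the paper both enriched representations feed the caption decoder; here the sketch only shows that the explicit path is restricted to graph neighbors while the implicit path is fully connected.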
Updated: 2021-05-07