Exploring region relationships implicitly: Image captioning with visual relationship attention
Image and Vision Computing (IF 4.7) | Pub Date: 2021-03-05 | DOI: 10.1016/j.imavis.2021.104146
Zongjian Zhang, Qiang Wu, Yang Wang, Fang Chen

Visual attention mechanisms have been widely used in image captioning models to dynamically attend to related visual regions based on the given language information. This capability allows a trained model to carry out fine-grained image understanding and reasoning. However, existing visual attention models focus only on individual visual regions in the image and on the alignment between the language representation and those individual regions. They do not fully explore the relationships/interactions between visual regions. Furthermore, they do not analyze or explore the alignment for the related words/phrases (e.g. verbs or phrasal verbs) that may best describe the relationships/interactions between these visual regions. As a result, current image captioning models can produce inaccurate or inappropriate descriptions. Instead of the visual region attention commonly addressed by existing visual attention mechanisms, this paper proposes a novel visual relationship attention built on contextualized embeddings of individual regions. It can dynamically explore a related visual relationship existing between multiple regions when generating interaction words. This relationship-exploring process is constrained by spatial relationships and driven by the linguistic context of the language decoder. In this work, the new visual relationship attention is designed as a parallel attention mechanism under a learned spatial constraint, so as to map visual relationship information more precisely to the semantic description of that relationship in language. Different from existing methods for exploring visual relationships, it is trained implicitly through an unsupervised approach without using any explicit visual relationship annotations. By integrating the newly proposed visual relationship attention with existing visual region attention, our image captioning model can generate high-quality captions. Extensive experiments on the MSCOCO dataset demonstrate that the proposed visual relationship attention can effectively boost captioning performance by capturing related visual relationships and generating accurate interaction descriptions.
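The abstract describes the mechanism only at a high level, so the following is a minimal, hypothetical PyTorch sketch of the general idea: standard attention over individual region features running in parallel with an attention over region pairs, where the pairwise scores are gated by a learned spatial constraint computed from box geometry. All module names, dimensions, and the spatial-gate formulation are assumptions made for illustration, not the authors' actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RegionAttention(nn.Module):
    """Standard additive attention over individual region features (existing region attention)."""
    def __init__(self, feat_dim, hid_dim, att_dim):
        super().__init__()
        self.w_v = nn.Linear(feat_dim, att_dim)
        self.w_h = nn.Linear(hid_dim, att_dim)
        self.w_a = nn.Linear(att_dim, 1)

    def forward(self, regions, h):
        # regions: (B, N, feat_dim); h: (B, hid_dim) language-decoder hidden state
        scores = self.w_a(torch.tanh(self.w_v(regions) + self.w_h(h).unsqueeze(1)))
        alpha = F.softmax(scores.squeeze(-1), dim=-1)            # (B, N)
        return (alpha.unsqueeze(-1) * regions).sum(dim=1)        # region context (B, feat_dim)


class RelationshipAttention(nn.Module):
    """Attention over region pairs, gated by a learned spatial (geometry) constraint."""
    def __init__(self, feat_dim, hid_dim, att_dim, geo_dim=4):
        super().__init__()
        self.w_pair = nn.Linear(2 * feat_dim, att_dim)
        self.w_h = nn.Linear(hid_dim, att_dim)
        self.w_a = nn.Linear(att_dim, 1)
        self.w_geo = nn.Linear(2 * geo_dim, 1)                   # spatial gate from box geometry
        self.proj = nn.Linear(2 * feat_dim, feat_dim)

    def forward(self, regions, boxes, h):
        # regions: (B, N, feat_dim); boxes: (B, N, 4) normalized; h: (B, hid_dim)
        B, N, D = regions.shape
        ri = regions.unsqueeze(2).expand(B, N, N, D)             # region i of each pair
        rj = regions.unsqueeze(1).expand(B, N, N, D)             # region j of each pair
        pair = torch.cat([ri, rj], dim=-1)                       # (B, N, N, 2D) pair embedding
        bi = boxes.unsqueeze(2).expand(B, N, N, 4)
        bj = boxes.unsqueeze(1).expand(B, N, N, 4)
        gate = torch.sigmoid(self.w_geo(torch.cat([bi, bj], dim=-1)))   # learned spatial constraint
        scores = self.w_a(torch.tanh(self.w_pair(pair) + self.w_h(h)[:, None, None, :]))
        scores = scores * gate                                   # down-weight implausible pairs
        beta = F.softmax(scores.reshape(B, N * N), dim=-1).reshape(B, N, N, 1)
        rel_ctx = (beta * pair).sum(dim=(1, 2))                  # (B, 2D) relationship context
        return self.proj(rel_ctx)                                # (B, feat_dim)


# Toy usage: fuse both contexts before the decoder predicts the next word.
regions = torch.randn(2, 36, 512)        # e.g. Faster R-CNN region features
boxes = torch.rand(2, 36, 4)             # normalized bounding boxes
h = torch.randn(2, 512)                  # language-decoder hidden state
ctx = RegionAttention(512, 512, 256)(regions, h) \
    + RelationshipAttention(512, 512, 256)(regions, boxes, h)
```

In this sketch the relationship branch needs no relationship labels: its pair-attention weights are learned only from the captioning loss, which mirrors the abstract's claim that the visual relationships are explored implicitly, without explicit relationship annotations.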




Updated: 2021-03-16