Referring expression grounding by multi-context reasoning
Pattern Recognition Letters (IF 3.9), Pub Date: 2022-05-25, DOI: 10.1016/j.patrec.2022.05.024
Xing Wang , De Xie , Yuanshi Zheng

Referring expression grounding plays a fundamental role in vision-language understanding; it aims to locate the target region in an image described by a natural language expression. The task requires understanding high-level semantic correlations between objects in the image according to the referring expression, and thus inherently requires reasoning over context information, i.e., appearance context and relationship context. However, most existing approaches either ignore the appearance details of the target region or rely on a manually designed reasoning structure that treats the context information of each neighboring object equivalently, which is inflexible for scenarios where referring expressions are complicated. In this paper, we propose the Multi-context Reasoning Network (MCRN) for the referring expression grounding task, which performs appearance context reasoning and relationship context reasoning simultaneously. For appearance context reasoning, we propose a local node attention that obtains a local representation of the target object and places more focus on its appearance details. For relationship context reasoning, we cast the task as a language-guided multi-step reasoning problem and design a multi-step graph reasoning module that iteratively captures the intra-context and inter-context between the target region and its intra-class and inter-class neighboring objects, making the reasoning process more reliable and interpretable. Extensive experiments on three popular benchmark datasets demonstrate the superiority of our method.
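To make the two reasoning components described above more concrete, the sketch below illustrates, in PyTorch, one plausible form of (a) a language-guided local node attention over sub-region features and (b) a multi-step graph reasoning module that iteratively propagates context among candidate objects. This is a minimal sketch assuming simple concatenation-based scoring and a GRU-style node update; the module names, feature dimensions, and update rules are illustrative assumptions, not the authors' MCRN implementation.

```python
# Illustrative sketch only: NOT the authors' MCRN code. Names, dimensions,
# and update rules are assumptions chosen to mirror the abstract's description.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LocalNodeAttention(nn.Module):
    """Attend over sub-region (local) features of each candidate object,
    guided by the expression embedding, to emphasise appearance details."""

    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)

    def forward(self, local_feats, lang):
        # local_feats: (N, K, D) sub-region features per candidate object
        # lang:        (D,) sentence-level expression embedding
        lang_exp = lang.view(1, 1, -1).expand_as(local_feats)
        att = self.score(torch.cat([local_feats, lang_exp], dim=-1))  # (N, K, 1)
        att = F.softmax(att, dim=1)
        return (att * local_feats).sum(dim=1)                         # (N, D)


class MultiStepGraphReasoning(nn.Module):
    """Iteratively propagate context among candidate objects on a fully
    connected graph, with edge weights conditioned on the expression."""

    def __init__(self, dim, steps=3):
        super().__init__()
        self.steps = steps
        self.edge = nn.Linear(3 * dim, 1)   # scores a (node_i, node_j, lang) triple
        self.update = nn.GRUCell(dim, dim)  # fuses aggregated context into each node

    def forward(self, nodes, lang):
        # nodes: (N, D) candidate object features, lang: (D,)
        N, D = nodes.shape
        for _ in range(self.steps):
            hi = nodes.unsqueeze(1).expand(N, N, D)
            hj = nodes.unsqueeze(0).expand(N, N, D)
            lg = lang.view(1, 1, D).expand(N, N, D)
            w = self.edge(torch.cat([hi, hj, lg], dim=-1)).squeeze(-1)  # (N, N)
            w = F.softmax(w, dim=-1)                 # language-conditioned edge weights
            context = w @ nodes                      # aggregate neighbour context (N, D)
            nodes = self.update(context, nodes)      # gated node update per step
        return nodes
```

In such a setup, the refined node features after the final step would typically be matched against the expression embedding to score each candidate region; running several steps lets context flow beyond immediate neighbours, which is the intuition behind casting the problem as multi-step reasoning.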




Updated: 2022-05-25