Visual Grounding Via Accumulated Attention, IEEE Transactions on Pattern Analysis and Machine Intelligence


Visual Grounding Via Accumulated Attention
IEEE Transactions on Pattern Analysis and Machine Intelligence ( IF 23.6 ) Pub Date : 2020-09-21 , DOI: 10.1109/tpami.2020.3023438
Chaorui Deng ₁,₂ , Qi Wu ₃ , Qingyao Wu ₁ , Fuyuan Hu ₄ , Fan Lyu ₅ , Mingkui Tan ₁


Visual grounding (VG) aims to locate the most relevant object or region in an image, based on a natural language query. Generally, it requires the machine to first understand the query, identify the key concepts in the image, and then locate the target object by specifying its bounding box. However, in many real-world visual grounding applications, we have to face ambiguous queries and images with complicated scene structures. Identifying the target from highly redundant and correlated information can be very challenging and often leads to unsatisfactory performance. To tackle this, we use a separate attention module for each kind of information to reduce its internal redundancy. We then propose an accumulated attention (A-ATT) mechanism to reason among all the attention modules jointly. In this way, the relations among the different kinds of information can be explicitly captured. Moreover, to improve the performance and robustness of our VG models, we additionally introduce noise into the training procedure to bridge the distribution gap between the human-labeled training data and poor-quality real-world data. With this “noised” training strategy, we can further learn a bounding box regressor, which can be used to refine the bounding box of the target object. We evaluate the proposed methods on four popular datasets (namely ReferCOCO, ReferCOCO+, ReferCOCOg, and GuessWhat?!). The experimental results show that our methods significantly outperform all previous works on every dataset in terms of accuracy.
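The idea of reasoning jointly among several attention modules can be illustrated with a small sketch. It is not the authors' implementation. It assumes three kinds of information (query words, image regions, candidate objects), which the abstract does not enumerate, and it assumes dot-product scoring, summed guidance vectors, and a fixed number of rounds. The names `attend` and `accumulated_attention` are hypothetical. In each round, every module re-attends over its own features, guided by the current summaries of the other two modules.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def add(u, v):
    return [a + b for a, b in zip(u, v)]


def mean(features):
    """Element-wise mean of a list of equal-length vectors."""
    dim = len(features[0])
    return [sum(f[i] for f in features) / len(features) for i in range(dim)]


def attend(features, guide):
    """Score each feature by its dot product with `guide` (an assumed scoring
    rule), then return the attention weights and the weighted-sum summary."""
    weights = softmax([dot(f, guide) for f in features])
    dim = len(features[0])
    summary = [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]
    return weights, summary


def accumulated_attention(query, image, objects, rounds=3):
    """Toy accumulated attention over three feature sets of the same dimension.

    Returns the final attention weights over the candidate objects.
    """
    # Start from unweighted (mean) summaries of each kind of information.
    q_sum, i_sum, o_sum = mean(query), mean(image), mean(objects)
    o_weights = [1.0 / len(objects)] * len(objects)
    for _ in range(rounds):
        # Each module is guided by the latest summaries of the other two.
        _, q_sum = attend(query, add(i_sum, o_sum))
        _, i_sum = attend(image, add(q_sum, o_sum))
        o_weights, o_sum = attend(objects, add(q_sum, i_sum))
    return o_weights


if __name__ == "__main__":
    # Hand-made 2-D features: axis 0 is the concept the query mentions.
    query = [[2.0, 0.0], [0.0, 0.1]]
    image = [[1.0, 0.0], [0.0, 1.0]]
    objects = [[3.0, 0.0], [0.0, 3.0]]
    print(accumulated_attention(query, image, objects))
```

In this toy setup the weight on the first candidate object grows across rounds, because each module's summary sharpens the guidance given to the others. A real model would learn the scoring functions and use learned query, image, and object features.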


Updated: 2020-09-21
