Toward Region-Aware Attention Learning for Scene Graph Generation
IEEE Transactions on Neural Networks and Learning Systems ( IF 10.2 ) Pub Date : 2021-06-21 , DOI: 10.1109/tnnls.2021.3086066
An-An Liu 1 , Hongshuo Tian 1 , Ning Xu 1 , Weizhi Nie 1 , Yongdong Zhang 2 , Mohan Kankanhalli 3
Scene graph generation (SGGen) is a challenging task due to the complex visual context of an image. Intuitively, the human visual system can volitionally focus on attended regions cued by salient stimuli. For example, to infer the relationship between a man and a horse, the interaction between the human leg and the horseback provides strong visual evidence for predicting the predicate ride. Likewise, the attended region face can help determine the object man. To date, most existing works have studied SGGen by extracting coarse-grained bounding-box features, while understanding fine-grained visual regions has received limited attention. To mitigate this drawback, this article proposes a region-aware attention learning method. The key idea is to explicitly construct the attention space to explore salient regions for object and predicate inference. First, we extract a set of regions in an image with the standard detection pipeline; each region regresses to an object. Second, we propose the object-wise attention graph neural network (GNN), which incorporates attention modules into the graph structure to discover attended regions for object inference. Third, we build the predicate-wise co-attention GNN to jointly highlight the subject's and object's attended regions for predicate inference. In particular, each subject-object pair is connected with one of the latent predicates to construct one triplet. The proposed intra-triplet and inter-triplet learning mechanism helps discover the pair-wise attended regions needed to infer predicates. Extensive experiments on two popular benchmarks demonstrate the superiority of the proposed method, and additional ablation studies and visualizations further validate its effectiveness.
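The core mechanism the abstract describes, attending over fine-grained regions for object inference, and co-attending over a subject's and object's regions for predicate inference, can be sketched as follows. This is a minimal numpy illustration, not the authors' implementation: the function names, the dot-product scoring, and mean-pooled queries are all simplifying assumptions, and the GNN message passing around these attention modules is omitted.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def attend_regions(query, regions, W):
    """Score each region feature against a query vector and return the
    attention-weighted sum of region features plus the attention weights.
    regions: (num_regions, d); query: (d,); W: (d, d) bilinear scoring matrix."""
    scores = regions @ (W @ query)          # one salience score per region
    alpha = softmax(scores)                 # attention distribution over regions
    return alpha @ regions, alpha           # attended context vector, weights

def co_attend(subj_regions, obj_regions, W):
    """Jointly highlight the subject's and object's attended regions for
    predicate inference: each side's pooled feature queries the other side's
    regions (a simplified stand-in for the paper's co-attention GNN)."""
    subj_pool = subj_regions.mean(axis=0)   # crude query: mean-pooled features
    obj_pool = obj_regions.mean(axis=0)
    subj_ctx, a_subj = attend_regions(obj_pool, subj_regions, W)
    obj_ctx, a_obj = attend_regions(subj_pool, obj_regions, W)
    # Concatenated pair feature would feed a predicate classifier.
    return np.concatenate([subj_ctx, obj_ctx]), a_subj, a_obj
```

In the man-rides-horse example, a_subj would ideally peak on the leg region of the man and a_obj on the horseback region, so the concatenated pair feature carries exactly the visual evidence relevant to the predicate.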

Updated: 2021-06-21