Contextual Translation Embedding for Visual Relationship Detection and Scene Graph Generation
IEEE Transactions on Pattern Analysis and Machine Intelligence (IF 20.8), Pub Date: 2020-05-05, DOI: 10.1109/tpami.2020.2992222
Zih-Siou Hung, Arun Mallya, Svetlana Lazebnik

Relations amongst entities play a central role in image understanding. Due to the complexity of modeling (subject, predicate, object) relation triplets, it is crucial to develop a method that can not only recognize seen relations, but also generalize to unseen cases. Inspired by a previously proposed visual translation embedding model, or VTransE [1], we propose a context-augmented translation embedding model that can capture both common and rare relations. The previous VTransE model maps entities and predicates into a low-dimensional embedding vector space where the predicate is interpreted as a translation vector between the embedded features of the bounding box regions of the subject and the object. Our model additionally incorporates the contextual information captured by the bounding box of the union of the subject and the object, and learns the embeddings guided by the constraint predicate ≈ union(subject, object) − subject − object. In a comprehensive evaluation on multiple challenging benchmarks, our approach outperforms previous translation-based models and comes close to or exceeds the state of the art across a range of settings, from small-scale to large-scale datasets, from common to previously unseen relations. It also achieves promising results for the recently introduced task of scene graph generation.
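The translation constraint above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the projection matrices, embedding dimension, and feature vectors below are all hypothetical stand-ins for the learned mappings the paper describes, and the distance-based score is one common way translation-embedding models rank candidate predicates.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # illustrative embedding dimension (a stand-in; the paper's is larger)

# Hypothetical learned projections from region features into the shared
# embedding space, one each for subject, object, and the union (context) box.
W_s = rng.normal(size=(d, d))
W_o = rng.normal(size=(d, d))
W_u = rng.normal(size=(d, d))

def predicate_embedding(f_subj, f_obj, f_union):
    """Context-augmented translation:
    predicate ≈ union(subject, object) − subject − object."""
    return W_u @ f_union - W_s @ f_subj - W_o @ f_obj

# Toy appearance features for the subject box, object box, and union box.
f_s, f_o, f_u = rng.normal(size=(3, d))
p = predicate_embedding(f_s, f_o, f_u)

# At inference time a candidate predicate vector can be scored by its
# (negated) distance to the translation vector p: closer means more likely.
candidate = rng.normal(size=d)
score = -np.linalg.norm(p - candidate)
```

The difference from the original VTransE constraint (predicate ≈ object − subject) is the extra union-box term, which injects the joint context of the two entities into the translation.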

Updated: 2024-08-22