Multi Spatial Relation Detection in Images
Spatial Cognition & Computation ( IF 1.6 ) Pub Date : 2021-08-04 , DOI: 10.1080/13875868.2021.1957897
Brandon Birmingham 1 , Adrian Muscat 1

ABSTRACT

Detecting spatial relationships between objects depicted in an image is an important sub-task in vision and language understanding. Its practical use lies in visual discourse when referring to objects by their relationship in context of others and finds application in higher level tasks such as visual question answering and image description generation. Presumably, the selection of spatial prepositions grounded in an image is straightforward. However, in general, human beings either do not always agree or are not consistent when choosing spatial prepositions. This could be due to various reasons, such as near synonyms, overlapping terms and different frames of reference. For these reasons, the automatic detection of spatial relations is a non-trivial multi-label problem. This paper addresses the automatic multi-selection of prepositions. The study is based on the development of a number of machine learning models, namely Nearest Neighbor (NN), k-Means Clustering (kM-C), Agglomerative Hierarchical Clustering (A-HC) and Multi-label Neural Network (ML-NN). The model performances are compared quantitatively using multi-label metrics as well as human evaluations that are independent of the ground truth labels. Additionally, the classification results are used as a basis to carry out an error and qualitative analysis that sheds light on the relative merits of how each model deals with synonymous and overlapping relations, and groups common errors to inform future directions. Furthermore, to gain insight into the merits of multi-label models, a single-label Random Forest (RF) classifier is developed and its results are included in the analysis. Of all multi-label models, the ML-NN exhibits the best overall performance when evaluated on both the dataset ground truth and the independent human evaluations. It, however, suffers from under-generating prepositions, while the rest of the models often generate more prepositions at the expense of precision. 
The clustering-based methods are also not quite consistent, although they do better than the other models in less frequent spatial configurations that other models struggle with. The results from the single-label RF classifier highlight the usefulness of having a multi-label model. Finally, the error analysis indicates that the majority of errors is due to lack of features that give cues on object position and orientation (object pose), the fixed frame of reference, and the failure to resolve depth in perspective view.
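The trade-off described above, where one model under-generates prepositions and others over-generate at the expense of precision, can be illustrated with example-based multi-label precision and recall. The following is a minimal sketch only: the abstract does not state which multi-label metrics the paper uses, and the function name `example_based_scores`, the preposition labels and the toy predictions are all hypothetical.

```python
from typing import Iterable, Set, Tuple


def example_based_scores(
    truth: Iterable[Set[str]], predicted: Iterable[Set[str]]
) -> Tuple[float, float]:
    """Return (precision, recall), each averaged over the examples."""
    precisions, recalls = [], []
    for gold, pred in zip(truth, predicted):
        hits = len(gold & pred)  # prepositions that are both predicted and correct
        precisions.append(hits / len(pred) if pred else 0.0)
        recalls.append(hits / len(gold) if gold else 0.0)
    n = len(precisions)
    if n == 0:
        return 0.0, 0.0
    return sum(precisions) / n, sum(recalls) / n


# Hypothetical gold labels for two object pairs (not from the paper's dataset).
gold = [{"on", "above"}, {"near", "next to"}]

# A model that under-generates: every predicted label is right, but some are missing.
under = [{"on"}, {"near"}]

# A model that over-generates: all gold labels are found, plus one extra per pair.
over = [{"on", "above", "near"}, {"near", "next to", "behind"}]

under_scores = example_based_scores(gold, under)
over_scores = example_based_scores(gold, over)
print("under-generating (precision, recall):", under_scores)
print("over-generating  (precision, recall):", over_scores)
```

In this toy setting the under-generating predictions keep precision high while recall drops, and the over-generating predictions do the reverse, which is the pattern the abstract attributes to the ML-NN and to the remaining models respectively.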




Updated: 2021-08-04