Polysemy Deciphering Network for Robust Human–Object Interaction Detection
International Journal of Computer Vision (IF 19.5), Pub Date: 2021-04-19, DOI: 10.1007/s11263-021-01458-8
Xubin Zhong, Changxing Ding, Xian Qu, Dacheng Tao

Human–Object Interaction (HOI) detection is important for human-centric scene understanding tasks. Existing works tend to assume that the same verb has similar visual characteristics across different HOI categories, an assumption that ignores the verb's diverse semantic meanings. To address this issue, we propose a novel Polysemy Deciphering Network (PD-Net) that decodes the visual polysemy of verbs for HOI detection in three distinct ways. First, we make the features used for HOI detection polysemy-aware through two novel modules: Language Prior-guided Channel Attention (LPCA) and Language Prior-based Feature Augmentation (LPFA). LPCA highlights the elements of the human and object appearance features that are important for each HOI category to be identified; LPFA augments the human pose and spatial features with language priors, enabling the verb classifiers to receive language hints that reduce intra-class variation for the same verb. Second, we introduce a novel Polysemy-Aware Modal Fusion module, which guides PD-Net to base its decisions on the feature types deemed more important according to the language priors. Third, we relieve the verb polysemy problem by sharing verb classifiers among semantically similar HOI categories. Furthermore, to expedite research on the verb polysemy problem, we build a new benchmark dataset named HOI-VerbPolysemy (HOI-VP), which includes common verbs (predicates) that carry diverse semantic meanings in the real world. Finally, by deciphering the visual polysemy of verbs, our approach is shown to outperform state-of-the-art methods by significant margins on the HICO-DET, V-COCO, and HOI-VP databases. Code and data are available at https://github.com/MuchHair/PD-Net.
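To make the LPCA idea concrete, the following is a minimal PyTorch sketch of language prior-guided channel attention: a word embedding of the candidate HOI category predicts a sigmoid gate per channel of the appearance features. All names, dimensions, and the exact gating form here are illustrative assumptions, not the released PD-Net implementation.

```python
import torch
import torch.nn as nn

class LanguageGuidedChannelAttention(nn.Module):
    """Re-weights appearance-feature channels with a language prior.

    The language prior is a word embedding of the candidate HOI
    category (e.g. a verb-object phrase); dimensions are illustrative.
    """
    def __init__(self, feat_dim: int = 1024, lang_dim: int = 300):
        super().__init__()
        # Map the language prior to one gate value per feature channel.
        self.gate = nn.Sequential(
            nn.Linear(lang_dim, feat_dim),
            nn.Sigmoid(),  # soft channel selection in (0, 1)
        )

    def forward(self, appearance: torch.Tensor, lang_prior: torch.Tensor) -> torch.Tensor:
        # appearance: (B, feat_dim) human/object appearance features
        # lang_prior: (B, lang_dim) embedding of the HOI category
        weights = self.gate(lang_prior)   # (B, feat_dim)
        return appearance * weights       # polysemy-aware features

# Toy usage: the same verb receives different channel gates when paired
# with different objects (e.g. "ride horse" vs. "ride bicycle").
lpca = LanguageGuidedChannelAttention()
feats = torch.randn(2, 1024)
priors = torch.randn(2, 300)
print(lpca(feats, priors).shape)  # torch.Size([2, 1024])
```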
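In the same hedged spirit, a sketch of the polysemy-aware modal fusion idea: the language prior predicts an importance weight per feature type (e.g. appearance, pose, spatial), and per-modality verb scores are fused as a weighted sum. The module name, modality count, and dimensions are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class PolysemyAwareModalFusion(nn.Module):
    def __init__(self, lang_dim: int = 300, num_modalities: int = 3):
        super().__init__()
        # One importance weight per modality, predicted from the language prior.
        self.importance = nn.Sequential(
            nn.Linear(lang_dim, num_modalities),
            nn.Softmax(dim=-1),  # weights form a convex combination
        )

    def forward(self, modality_scores: torch.Tensor, lang_prior: torch.Tensor) -> torch.Tensor:
        # modality_scores: (B, M, C) per-modality verb logits
        # lang_prior:      (B, lang_dim) embedding of the HOI category
        w = self.importance(lang_prior)                    # (B, M)
        return (modality_scores * w.unsqueeze(-1)).sum(1)  # (B, C) fused logits
```

The softmax keeps the fused score a convex combination of the modality scores, so a feature type that the language prior deems uninformative for a given HOI category is down-weighted rather than discarded.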


