Intention-Related Natural Language Grounding via Object Affordance Detection and Intention Semantic Extraction.
Frontiers in Neurorobotics (IF 3.1), Pub Date: 2020-04-09, DOI: 10.3389/fnbot.2020.00026
Jinpeng Mi, Hongzhuo Liang, Nikolaos Katsakis, Song Tang, Qingdu Li, Changshui Zhang, Jianwei Zhang

Like specific natural language instructions, intention-related natural language queries play an essential role in everyday communication. Inspired by the psychological term "affordance" and its applications in human-robot interaction, we propose an object affordance-based natural language visual grounding architecture for grounding intention-related natural language queries. Formally, we first present an attention-based multi-visual-feature fusion network to detect object affordances from RGB images. When fusing deep visual features extracted by a pre-trained CNN with deep texture features encoded by a deep texture encoding network, the proposed affordance detection network accounts for the interaction between the multiple visual features and preserves their complementary nature by integrating attention weights learned from sparse representations of those features. We train and validate the attention-based object affordance recognition network on a self-built dataset whose images largely originate from MSCOCO and ImageNet. Moreover, we introduce an intention semantic extraction module to extract intention semantics from intention-related natural language queries. Finally, we ground intention-related natural language queries by integrating the detected object affordances with the extracted intention semantics. Extensive experiments validate the performance of both the object affordance detection network and the intention-related natural language grounding architecture.
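The attention-weighted fusion step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the feature vectors, the single learned projection `w_att`, and the softmax weighting are all simplified stand-ins (the paper learns attention weights from sparse representations of the multi-visual features).

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def fuse_features(visual_feat, texture_feat, w_att):
    """Attention-weighted fusion of two feature streams (illustrative only).

    Each stream gets one scalar score from a (hypothetical) learned
    projection w_att; softmax over the scores yields attention weights
    that blend the streams while preserving their complementarity.
    """
    stacked = np.stack([visual_feat, texture_feat])  # shape (2, d)
    scores = stacked @ w_att                         # one score per stream
    alpha = softmax(scores)                          # attention weights, sum to 1
    fused = alpha[0] * visual_feat + alpha[1] * texture_feat
    return fused, alpha

rng = np.random.default_rng(0)
d = 8
visual = rng.normal(size=d)   # stand-in for CNN visual features
texture = rng.normal(size=d)  # stand-in for deep texture features
w = rng.normal(size=d)        # stand-in for a learned attention projection
fused, alpha = fuse_features(visual, texture, w)
```

The softmax guarantees the two stream weights sum to one, so the fused vector stays on the same scale as its inputs.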



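The final grounding step, matching detected object affordances against the intention semantics extracted from a query, can be illustrated with a toy sketch. The verb-to-affordance lexicon, the token-matching "extraction," and the affordance labels below are all hypothetical simplifications of the paper's modules.

```python
# Hypothetical verb -> affordance lexicon (illustrative only).
AFFORDANCE_VERBS = {
    "drink": "contain-liquid",
    "cut": "cut",
    "sit": "sittable",
}

def extract_intention(query):
    """Toy intention-semantics extraction: return the affordance implied
    by the first known verb in the query, or None if no verb matches."""
    for token in query.lower().split():
        if token in AFFORDANCE_VERBS:
            return AFFORDANCE_VERBS[token]
    return None

def ground(query, detections):
    """Ground the query: keep objects whose detected affordance matches
    the intention semantics extracted from the query."""
    target = extract_intention(query)
    return [obj for obj, affordance in detections if affordance == target]

# Stand-ins for the affordance detector's per-object output.
detections = [("mug", "contain-liquid"), ("knife", "cut"), ("chair", "sittable")]
grounded = ground("I want to drink some water", detections)
```

In the paper, both sides of this match come from learned models (the affordance detection network and the intention semantic extraction module) rather than a fixed lexicon.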

Updated: 2020-04-09