Transformer-based vision-language alignment for robot navigation and question answering
Information Fusion (IF 18.6), Pub Date: 2024-03-11, DOI: 10.1016/j.inffus.2024.102351
Haonan Luo, Ziyu Guo, Zhenyu Wu, Fei Teng, Tianrui Li

Robot navigation and question answering, also known as Embodied Question Answering (EQA), requires an agent to actively explore its environment and answer user questions. Given its broad range of potential applications, particularly in home robots and personal assistants, EQA has attracted growing attention from researchers. Owing to the difficulty of bridging the semantic gap between inputs from different modalities and of capturing long-range dependencies between distant pixels, most existing methods struggle to achieve adequate navigation and answering accuracy. To address these challenges, we present a transformer-based framework that aligns visual and linguistic information for robot navigation and question answering. First, an information fusion model uses object tags as reference points to align the vision and language modalities in a shared semantic space. Second, a dedicated transformer block captures long-range dependencies within visual scenes, enabling the generation of more contextually appropriate actions. Finally, the two transformer-based components are integrated into a single framework that handles the complete Embodied Question Answering task. Experimental results show that our approach substantially improves the performance of every module in the system, yielding a 4.1% gain in overall accuracy.
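
To make the described architecture concrete, below is a minimal PyTorch sketch (not the authors' released code) of how object tags can serve as reference points for cross-modal alignment: question tokens, detected object-tag tokens, and visual region features are embedded into one sequence so that self-attention in a single transformer encoder can model long-range, cross-modal dependencies. The class name TagAnchoredAligner, all layer sizes, and the four-way action head are illustrative assumptions, not details taken from the paper.

    # Minimal sketch of tag-anchored vision-language alignment,
    # assuming pre-extracted region features and detected object tags.
    import torch
    import torch.nn as nn

    class TagAnchoredAligner(nn.Module):
        """Fuses question tokens, object-tag tokens, and region features in
        one transformer so the tags act as shared reference points between
        the language and vision modalities (hypothetical layer sizes)."""
        def __init__(self, vocab_size=30522, d_model=256, n_heads=8,
                     n_layers=4, region_dim=2048, n_actions=4):
            super().__init__()
            self.token_emb = nn.Embedding(vocab_size, d_model)
            # project visual region features into the shared semantic space
            self.region_proj = nn.Linear(region_dim, d_model)
            # segment embedding: 0 = question, 1 = object tag, 2 = region
            self.type_emb = nn.Embedding(3, d_model)
            layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, n_layers)
            # e.g. forward / turn-left / turn-right / stop (assumed action set)
            self.action_head = nn.Linear(d_model, n_actions)

        def forward(self, question_ids, tag_ids, region_feats):
            q = self.token_emb(question_ids) + self.type_emb(torch.zeros_like(question_ids))
            t = self.token_emb(tag_ids) + self.type_emb(torch.ones_like(tag_ids))
            r = self.region_proj(region_feats) + self.type_emb(
                torch.full(region_feats.shape[:2], 2,
                           dtype=torch.long, device=region_feats.device))
            # one joint sequence: self-attention captures long-range,
            # cross-modal dependencies anchored on the shared tag tokens
            x = torch.cat([q, t, r], dim=1)
            h = self.encoder(x)
            # predict the next navigation action from the first token's state
            return self.action_head(h[:, 0])

    # toy usage with random inputs
    model = TagAnchoredAligner()
    logits = model(torch.randint(0, 30522, (1, 12)),  # question tokens
                   torch.randint(0, 30522, (1, 5)),   # object-tag tokens
                   torch.randn(1, 36, 2048))          # 36 region features
    print(logits.shape)  # torch.Size([1, 4])

Because the tags are drawn from the same vocabulary as the question, they occupy the same embedding space as the language tokens while co-attending with the region features, which is one plausible reading of the tag-as-reference-point alignment the abstract describes.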
