A modular vision language navigation and manipulation framework for long horizon compositional tasks in indoor environment
arXiv - CS - Robotics Pub Date: 2021-01-19, DOI: arXiv-2101.07891
Homagni Saha, Fateme Fotouhi, Qisai Liu, Soumik Sarkar

In this paper we propose a new framework, MoViLan (Modular Vision and Language), for executing visually grounded natural language instructions for day-to-day indoor household tasks. While several data-driven, end-to-end learning frameworks have been proposed for targeted navigation tasks based on the vision and language modalities, performance on recent benchmark datasets has revealed a gap in comprehensive techniques for long-horizon, compositional tasks (involving both manipulation and navigation) with diverse object categories, realistic instructions, and visual scenarios with non-reversible state changes. We propose a modular approach to the combined navigation and object-interaction problem that does not require strictly aligned vision and language training data (e.g., in the form of expert-demonstrated trajectories). This approach is a significant departure from the traditional end-to-end techniques in this space and allows for a more tractable training process with separate vision and language datasets. Specifically, we propose a novel geometry-aware mapping technique for cluttered indoor environments and a language understanding model generalized for household instruction following. We demonstrate a significant increase in success rates on long-horizon, compositional tasks over the baseline on the recently released benchmark dataset ALFRED.
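The core idea of the modular decomposition — a language module and a vision/mapping module that are trained separately and only composed at execution time — can be illustrated with a minimal sketch. All class and method names below are hypothetical illustrations of the general pattern, not APIs from the paper; the keyword-based parser and the lookup-table map stand in for the learned models.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SubGoal:
    action: str   # e.g. "navigate" or "pickup"
    target: str   # object or location referenced in the instruction

class LanguageModule:
    """Parses a household instruction into an ordered list of sub-goals.

    A toy keyword splitter stands in for a learned instruction parser;
    it needs only language-side training data.
    """
    def parse(self, instruction: str) -> list:
        steps = []
        for clause in instruction.lower().split(" then "):
            verb, _, obj = clause.partition(" ")
            steps.append(SubGoal(action=verb.strip(), target=obj.strip()))
        return steps

class MappingModule:
    """Stand-in for geometry-aware mapping: resolves an object name to a
    grid cell. A real module would build this map from vision alone."""
    def __init__(self, object_locations: dict):
        self.object_locations = object_locations

    def locate(self, target: str) -> Optional[tuple]:
        return self.object_locations.get(target)

class Controller:
    """Composes the two independently trained modules into a plan,
    without ever needing jointly aligned vision-language trajectories."""
    def __init__(self, language: LanguageModule, mapping: MappingModule):
        self.language = language
        self.mapping = mapping

    def plan(self, instruction: str) -> list:
        plan = []
        for goal in self.language.parse(instruction):
            cell = self.mapping.locate(goal.target)
            if cell is not None:
                plan.append((goal.action, cell))
        return plan
```

For example, `Controller(LanguageModule(), MappingModule({"mug": (3, 1), "sink": (0, 4)})).plan("pickup mug then navigate sink")` yields `[("pickup", (3, 1)), ("navigate", (0, 4))]`. The point of the structure is that either module can be retrained or swapped without touching the other.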

Updated: 2021-01-21