Multimodal attention networks for low-level vision-and-language navigation
Computer Vision and Image Understanding (IF 4.5), Pub Date: 2021-07-29, DOI: 10.1016/j.cviu.2021.103255
Federico Landi, Lorenzo Baraldi, Marcella Cornia, Massimiliano Corsini, Rita Cucchiara

Vision-and-Language Navigation (VLN) is a challenging task in which an agent needs to follow a language-specified path to reach a target destination. The goal gets even harder as the actions available to the agent get simpler and move towards low-level, atomic interactions with the environment. This setting is known as low-level VLN. In this paper, we strive to create an agent able to tackle three key issues: multi-modality, long-term dependencies, and adaptability towards different locomotive settings. To that end, we devise "Perceive, Transform, and Act" (PTA): a fully-attentive VLN architecture that leaves the recurrent approach behind and is the first Transformer-like architecture to incorporate three different modalities (natural language, images, and low-level actions) for agent control. In particular, we adopt an early fusion strategy to merge linguistic and visual information efficiently in our encoder. We then refine the decoding phase with a late fusion extension between the agent's history of actions and the perceptual modalities. We experimentally validate our model on two datasets: PTA achieves promising results in low-level VLN on R2R and good performance on the recently proposed R4R benchmark.
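The two fusion points described in the abstract map naturally onto standard attention blocks. The PyTorch sketch below illustrates the general ideas under stated assumptions: early fusion (visual tokens attending to instruction tokens in the encoder) and late fusion (concatenating the action-history stream with the perceptual stream just before predicting the next atomic action). All module names, dimensions, and the six-action vocabulary are illustrative assumptions, not the authors' released PTA code.

import torch
import torch.nn as nn

class EarlyFusionEncoder(nn.Module):
    # Early fusion: visual tokens attend to the instruction tokens,
    # so language and vision are merged already at the encoder stage.
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                nn.Linear(4 * d_model, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, vis, lang):
        fused, _ = self.cross_attn(vis, lang, lang)   # query=vision, key/value=language
        x = self.norm1(vis + fused)
        return self.norm2(x + self.ff(x))

class LateFusionDecoder(nn.Module):
    # Late fusion: the action-history stream and the perceptual stream are
    # kept separate and only combined right before action prediction.
    def __init__(self, d_model=512, n_heads=8, n_actions=6):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.fuse = nn.Linear(2 * d_model, d_model)
        self.head = nn.Linear(d_model, n_actions)     # logits over atomic actions

    def forward(self, hist, percept):
        h, _ = self.self_attn(hist, hist, hist)       # attend over past actions
        h = self.norm1(hist + h)
        p, _ = self.cross_attn(h, percept, percept)   # query perception with history
        p = self.norm2(h + p)
        fused = self.fuse(torch.cat([h, p], dim=-1))  # late fusion of both streams
        return self.head(fused[:, -1])                # next low-level action

A minimal usage example with hypothetical shapes:

enc, dec = EarlyFusionEncoder(), LateFusionDecoder()
lang = torch.randn(1, 20, 512)      # embedded instruction tokens
vis = torch.randn(1, 36, 512)       # image features for the current viewpoint
hist = torch.randn(1, 5, 512)       # embedded history of low-level actions
logits = dec(hist, enc(vis, lang))  # shape (1, 6): scores over atomic moves

In a real agent the decoder would run autoregressively with a causal mask over the action history; the sketch omits masking for brevity.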




Updated: 2021-08-09