All-in-One Image-Grounded Conversational Agents
arXiv - CS - Computation and Language. Pub Date: 2019-12-28, DOI: arXiv:1912.12394
Da Ju, Kurt Shuster, Y-Lan Boureau, Jason Weston

As single-task accuracy on individual language and image tasks has improved substantially in the last few years, the long-term goal of a generally skilled agent that can both see and talk becomes more feasible to explore. In this work, we focus on leveraging individual language and image tasks, along with resources that incorporate both vision and language, towards that objective. We design an architecture that combines state-of-the-art Transformer and ResNeXt modules fed into a novel attentive multimodal module to produce a combined model trained on many tasks. We provide a thorough analysis of the model's components and of transfer performance when training on one, some, or all of the tasks. Our final models provide a single system that obtains good results on all vision and language tasks considered, and improves the state-of-the-art in image-grounded conversational applications.
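The abstract describes fusing Transformer text features and ResNeXt image features through an attentive multimodal module. The following is a minimal sketch of that general idea using single-head dot-product attention in NumPy: a pooled text vector queries a set of spatial image features, and the attended image summary is concatenated with the text representation. All names, dimensions, and the single-head design are illustrative assumptions, not the paper's actual module.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attentive_multimodal_combine(text_vec, image_feats, W_q, W_k, W_v):
    """Fuse a text representation with spatial image features via attention.

    text_vec:    (d_t,)   pooled Transformer output for the dialogue context
    image_feats: (n, d_i) spatial ResNeXt features over n image regions
    W_q, W_k, W_v: projection matrices into a shared attention space (hypothetical)
    """
    q = text_vec @ W_q                    # (d,)  query from the text side
    k = image_feats @ W_k                 # (n, d) keys from image regions
    v = image_feats @ W_v                 # (n, d) values from image regions
    scores = k @ q / np.sqrt(q.shape[0])  # (n,)  scaled dot-product scores
    weights = softmax(scores)             # attention over image regions
    attended = weights @ v                # (d,)  attended image summary
    return np.concatenate([q, attended])  # fused text+image representation

rng = np.random.default_rng(0)
d_t, d_i, d, n = 8, 16, 8, 4
fused = attentive_multimodal_combine(
    rng.standard_normal(d_t),
    rng.standard_normal((n, d_i)),
    rng.standard_normal((d_t, d)),
    rng.standard_normal((d_i, d)),
    rng.standard_normal((d_i, d)),
)
print(fused.shape)
```

In the full model this fused representation would feed a task-specific head (dialogue retrieval, captioning, VQA, etc.), with the same backbone shared across tasks during multi-task training.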

Updated: 2020-01-17