X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers
arXiv - CS - Computer Vision and Pattern Recognition. Pub Date: 2020-09-23, DOI: arxiv-2009.11278
Jaemin Cho, Jiasen Lu, Dustin Schwenk, Hannaneh Hajishirzi, Aniruddha Kembhavi

Mirroring the success of masked language models, vision-and-language counterparts like ViLBERT, LXMERT and UNITER have achieved state-of-the-art performance on a variety of multimodal discriminative tasks like visual question answering and visual grounding. Recent work has also successfully adapted such models towards the generative task of image captioning. This begs the question: Can these models go the other way and generate images from pieces of text? Our analysis of a popular representative from this model family, LXMERT, finds that it is unable to generate rich and semantically meaningful imagery with its current training setup. We introduce X-LXMERT, an extension to LXMERT with training refinements including: discretizing visual representations, using uniform masking with a large range of masking ratios, and aligning the right pre-training datasets to the right objectives, which together enable it to paint. X-LXMERT's image generation capabilities rival state-of-the-art generative models, while its question answering and captioning abilities remain comparable to LXMERT. Finally, we demonstrate the generality of these training refinements by adding image generation capabilities into UNITER to produce X-UNITER.
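To make the two data-side refinements named in the abstract more concrete, below is a minimal PyTorch sketch of (a) discretizing visual representations by nearest-neighbour lookup against a codebook, so masked visual positions have classification targets, and (b) sampling the masking ratio uniformly from a wide range rather than using a single fixed ratio. The function names, ratio bounds, feature dimension, and codebook size are illustrative assumptions for this sketch, not the paper's actual implementation.

```python
import torch

def discretize_visual_features(features, codebook):
    """Map continuous visual features to discrete codebook indices via
    nearest-neighbour lookup (one common way to discretize visual
    representations; the paper's exact clustering setup may differ)."""
    # features: (num_regions, dim), codebook: (vocab_size, dim)
    dists = torch.cdist(features, codebook)   # (num_regions, vocab_size)
    return dists.argmin(dim=-1)               # (num_regions,) cluster ids

def uniform_mask(num_tokens, min_ratio=0.1, max_ratio=1.0):
    """Sample a masking ratio uniformly from a wide range, then mask that
    fraction of visual positions (ratio bounds here are hypothetical)."""
    ratio = torch.empty(1).uniform_(min_ratio, max_ratio).item()
    num_masked = max(1, int(round(ratio * num_tokens)))
    mask = torch.zeros(num_tokens, dtype=torch.bool)
    mask[torch.randperm(num_tokens)[:num_masked]] = True
    return mask

# Toy usage: 36 grid features of dim 2048 and a 1000-entry visual codebook.
features = torch.randn(36, 2048)
codebook = torch.randn(1000, 2048)
visual_tokens = discretize_visual_features(features, codebook)
mask = uniform_mask(visual_tokens.numel())
targets = visual_tokens[mask]  # cluster ids predicted at masked positions
```

Sampling the ratio up to very high values forces the model to reconstruct most or all of the visual tokens from text alone, which is essentially the text-to-image generation setting the paper targets.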

Updated: 2020-09-24