ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph
arXiv - CS - Computation and Language, Pub Date: 2020-06-30, DOI: arxiv-2006.16934
Fei Yu, Jiji Tang, Weichong Yin, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang

We propose ERNIE-ViL, a knowledge-enhanced approach to learning joint representations of vision and language. ERNIE-ViL constructs detailed semantic connections (objects, attributes of objects, and relationships between objects in visual scenes) across vision and language, which are essential to vision-language cross-modal tasks. Incorporating knowledge from scene graphs, ERNIE-ViL constructs Scene Graph Prediction tasks, i.e., Object Prediction, Attribute Prediction, and Relationship Prediction, in the pre-training phase. More specifically, these prediction tasks are implemented by predicting nodes of different types in the scene graph parsed from the sentence. Thus, ERNIE-ViL can model a joint representation characterizing the alignment of detailed semantics across vision and language. Pre-trained on two large image-text alignment datasets (Conceptual Captions and SBU), ERNIE-ViL learns better and more robust joint representations. After fine-tuning, it achieves state-of-the-art performance on 5 vision-language downstream tasks. Furthermore, it ranked first on the VCR leaderboard with an absolute improvement of 3.7%.
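To make the Scene Graph Prediction objective concrete, the following is a minimal Python sketch of the underlying idea: node words from a scene graph parsed out of the caption are masked in the text, and the model is trained to predict them. The scene graph here is hardcoded for illustration, and the helper `make_sgp_targets`, the token-level masking scheme, and the mask symbol are assumptions made for this sketch, not the authors' implementation.

```python
# Illustrative sketch of Scene Graph Prediction targets.
# The parser output is hardcoded and the masking scheme is an assumption,
# not ERNIE-ViL's actual implementation.

# Toy scene graph parsed from "a brown dog chasing a white cat",
# with the three node types named in the abstract.
scene_graph = {
    "objects": ["dog", "cat"],
    "attributes": [("dog", "brown"), ("cat", "white")],
    "relationships": [("dog", "chasing", "cat")],
}

def make_sgp_targets(tokens, scene_graph, mask_token="[MASK]"):
    """Mask tokens that correspond to scene-graph nodes so the model must
    predict them during pre-training (hypothetical helper)."""
    object_words = set(scene_graph["objects"])
    attribute_words = {attr for _, attr in scene_graph["attributes"]}
    relation_words = {rel for _, rel, _ in scene_graph["relationships"]}

    masked, targets = [], []
    for i, tok in enumerate(tokens):
        if tok in object_words:
            kind = "object"
        elif tok in attribute_words:
            kind = "attribute"
        elif tok in relation_words:
            kind = "relationship"
        else:
            masked.append(tok)  # non-node tokens are left visible
            continue
        masked.append(mask_token)      # hide the scene-graph node
        targets.append((i, tok, kind)) # the model must recover it
    return masked, targets

tokens = "a brown dog chasing a white cat".split()
masked, targets = make_sgp_targets(tokens, scene_graph)
print(masked)   # ['a', '[MASK]', '[MASK]', '[MASK]', 'a', '[MASK]', '[MASK]']
print(targets)  # [(1, 'brown', 'attribute'), (2, 'dog', 'object'), ...]
```

In the setup described in the abstract, these masked nodes would be predicted by the cross-modal model conditioned on both the image and the remaining text, which is what encourages alignment between textual scene-graph nodes and the corresponding visual semantics.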

Updated: 2020-08-03