Video-Grounded Dialogues with Pretrained Generation Language Models
arXiv - CS - Computation and Language. Pub Date: 2020-06-27, DOI: arxiv-2006.15319
Hung Le, Steven C.H. Hoi

Pre-trained language models have shown remarkable success in improving various downstream NLP tasks due to their ability to capture dependencies in textual data and generate natural responses. In this paper, we leverage the power of pre-trained language models to improve video-grounded dialogue, a very challenging task involving complex features of different dynamics: (1) video features, which can extend across both spatial and temporal dimensions; and (2) dialogue features, which involve semantic dependencies over multiple dialogue turns. We propose a framework that extends GPT-2 models to tackle these challenges by formulating video-grounded dialogue as a sequence-to-sequence task, combining both visual and textual representations into a structured sequence, and fine-tuning a large pre-trained GPT-2 network. Our framework allows fine-tuning language models to capture dependencies across multiple modalities over different levels of information: the spatio-temporal level in video and the token-sentence level in dialogue context. We achieve promising improvements on the Audio-Visual Scene-Aware Dialogues (AVSD) benchmark from DSTC7, which supports a potential direction in this line of research.

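The core idea described in the abstract, projecting video features into the language model's embedding space and concatenating them with dialogue token embeddings into one structured sequence for GPT-2, can be sketched briefly. The snippet below is a minimal illustrative sketch, not the authors' released implementation: the class name VideoDialogueGPT2, the 2048-dimensional pre-extracted video features, and the single linear projection are assumptions for illustration, using the HuggingFace transformers library.

```python
# Minimal sketch (assumptions, not the paper's code): fuse pre-extracted
# video features with dialogue tokens into one input sequence for GPT-2.
import torch
import torch.nn as nn
from transformers import GPT2Tokenizer, GPT2LMHeadModel


class VideoDialogueGPT2(nn.Module):
    def __init__(self, video_feat_dim=2048):  # 2048-d features are an assumption
        super().__init__()
        self.gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
        hidden = self.gpt2.config.n_embd
        # Project video features into GPT-2's embedding space (one simple choice;
        # the paper's feature encoders may differ).
        self.video_proj = nn.Linear(video_feat_dim, hidden)

    def forward(self, video_feats, dialogue_ids, labels=None):
        # video_feats: (batch, num_frames, video_feat_dim)
        # dialogue_ids: (batch, seq_len) token ids for dialogue history + response
        video_embeds = self.video_proj(video_feats)
        text_embeds = self.gpt2.transformer.wte(dialogue_ids)
        # Concatenate along the sequence dimension so GPT-2 attends jointly over
        # spatio-temporal video positions and dialogue tokens.
        inputs_embeds = torch.cat([video_embeds, text_embeds], dim=1)
        if labels is not None:
            # Mask the video positions out of the LM loss (-100 is ignored).
            video_pad = torch.full(video_embeds.shape[:2], -100,
                                   dtype=labels.dtype, device=labels.device)
            labels = torch.cat([video_pad, labels], dim=1)
        return self.gpt2(inputs_embeds=inputs_embeds, labels=labels)


# Toy usage with random video features and a short dialogue turn.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = VideoDialogueGPT2()
feats = torch.randn(1, 8, 2048)  # 8 frames of 2048-d features
ids = tokenizer("Q: what is the person doing? A: cooking.",
                return_tensors="pt").input_ids
out = model(feats, ids, labels=ids)
print(out.loss)
```

The sketch only shows the sequence-concatenation idea; the framework in the paper additionally handles multiple dialogue turns and both spatial and temporal video features within the structured input sequence.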
Updated: 2020-06-30