Multimodal Pretraining for Dense Video Captioning
arXiv - CS - Computation and Language
Pub Date: 2020-11-10, DOI: arxiv-2011.11760
Gabriel Huang, Bo Pang, Zhenhai Zhu, Clara Rivera, Radu Soricut
Learning specific hands-on skills such as cooking, car maintenance, and home
repairs increasingly happens via instructional videos. The user experience with
such videos is known to be improved by meta-information such as time-stamped
annotations for the main steps involved. Generating such annotations
automatically is challenging, and we describe here two relevant contributions.
First, we construct and release a new dense video captioning dataset, Video
Timeline Tags (ViTT), featuring a variety of instructional videos together with
time-stamped annotations. Second, we explore several multimodal
sequence-to-sequence pretraining strategies that leverage large unsupervised
datasets of videos and caption-like texts. We pretrain and subsequently
finetune dense video captioning models using both YouCook2 and ViTT. We show
that such models generalize well and are robust over a wide variety of
instructional videos.
Updated: 2020-11-25