Delving Deeper into the Decoder for Video Captioning,arXiv - CS - Computation and Language

Delving Deeper into the Decoder for Video Captioning
arXiv - CS - Computation and Language. Pub Date: 2020-01-16, DOI: arxiv-2001.05614
Haoran Chen, Jianmin Li and Xiaolin Hu

Video captioning is an advanced multi-modal task that aims to describe a video clip with a natural language sentence. The encoder-decoder framework has been the most popular paradigm for this task in recent years. However, the decoder of a video captioning model still has some problems. We thoroughly investigate the decoder and adopt three techniques to improve model performance. First, a combination of variational dropout and layer normalization is embedded into a recurrent unit to alleviate overfitting. Second, a new online method is proposed to evaluate a model's performance on a validation set so that the best checkpoint can be selected for testing. Finally, a new training strategy called professional learning is proposed, which uses the strengths of a captioning model and bypasses its weaknesses. Experiments on the Microsoft Research Video Description Corpus (MSVD) and MSR-Video to Text (MSR-VTT) datasets demonstrate that our model achieves the best results as evaluated by the BLEU, CIDEr, METEOR and ROUGE-L metrics, with significant gains of up to 18% on MSVD and 3.5% on MSR-VTT compared with the previous state-of-the-art models.
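The first technique named in the abstract, variational dropout combined with layer normalization inside a recurrent unit, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the plain tanh cell, the helper names (`layer_norm`, `sample_mask`, `run_rnn`), the placement of the normalization on the pre-activation, and the absence of learned gain and bias terms are all assumptions made for illustration. The one property taken from the general definition of variational dropout is that each dropout mask is sampled once per sequence and reused at every time step.

```python
import math
import random


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def sample_mask(size, drop_prob, rng):
    """Inverted-dropout mask: kept units are scaled by 1 / (1 - drop_prob)."""
    keep = 1.0 - drop_prob
    return [(1.0 / keep) if rng.random() < keep else 0.0 for _ in range(size)]


def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in w]


def run_rnn(inputs, w_x, w_h, drop_prob=0.0, rng=None):
    """Run a hypothetical tanh recurrent cell over a sequence of input vectors.

    Assumption: the input mask and the hidden-state mask are each sampled
    once here, before the loop, and reused at every step (variational
    dropout). Standard dropout would resample them inside the loop.
    """
    rng = rng or random.Random(0)
    hidden_size = len(w_h)
    in_mask = sample_mask(len(inputs[0]), drop_prob, rng)
    h_mask = sample_mask(hidden_size, drop_prob, rng)

    h = [0.0] * hidden_size
    states = []
    for x in inputs:
        x_dropped = [a * m for a, m in zip(x, in_mask)]
        h_dropped = [a * m for a, m in zip(h, h_mask)]
        pre = [a + b for a, b in zip(matvec(w_x, x_dropped), matvec(w_h, h_dropped))]
        # Layer normalization is applied to the pre-activation (an assumption).
        h = [math.tanh(v) for v in layer_norm(pre)]
        states.append(h)
    return states
```

At evaluation time, calling `run_rnn` with `drop_prob=0.0` makes both masks all ones, so the cell runs without dropout.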

Updated: 2020-02-18
