Video Captioning with Text-based Dynamic Attention and Step-by-Step Learning
Pattern Recognition Letters (IF 5.1) Pub Date: 2020-03-03, DOI: 10.1016/j.patrec.2020.03.001
Huanhou Xiao, Jinglun Shi

Automatically describing video content with natural language has attracted much attention in the computer vision and natural language processing communities. Most existing methods predict one word at a time, feeding the last generated word back as input at the next time step, while the other generated words are not fully exploited. Furthermore, traditional methods optimize the model on all training samples in every epoch without considering how well each sample has already been learned, which leads to much unnecessary training and cannot target the difficult samples. To address these issues, we propose a text-based dynamic attention model named TDAM, which imposes a dynamic attention mechanism on all the generated words, with the motivation of enriching the contextual semantic information and strengthening control over the whole sentence. Moreover, the text-based dynamic attention mechanism and the visual attention mechanism are linked so that they jointly focus on the important words; they can benefit from each other during training. In addition, the model is trained in two steps: "starting from scratch" and "checking for gaps". The former uses all the samples to optimize the model, while the latter trains only on the samples that are still poorly learned. Experimental results on the popular MSVD and MSR-VTT datasets demonstrate that our non-ensemble model outperforms state-of-the-art video captioning benchmarks.
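For readers who want a concrete picture of the text-based dynamic attention, the sketch below shows one plausible realization in PyTorch: additive attention over all previously generated word embeddings, whose context vector is then fused with a visually attended feature. The module names, dimensions, scoring function, and gating fusion are illustrative assumptions, not the authors' released implementation.

```python
# Illustrative sketch only: the additive (Bahdanau-style) scoring, the module
# names, and the gating fusion are assumptions, not the TDAM implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextDynamicAttention(nn.Module):
    """Attends over ALL previously generated words, so the decoder conditions
    on the whole partial sentence instead of only the last emitted word."""
    def __init__(self, word_dim: int, hidden_dim: int, attn_dim: int):
        super().__init__()
        self.w_words = nn.Linear(word_dim, attn_dim)
        self.w_state = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, prev_words: torch.Tensor, state: torch.Tensor):
        # prev_words: (batch, t, word_dim) embeddings of words generated so far
        # state:      (batch, hidden_dim)  current decoder hidden state
        energy = torch.tanh(self.w_words(prev_words)
                            + self.w_state(state).unsqueeze(1))
        alpha = F.softmax(self.score(energy).squeeze(-1), dim=-1)  # (batch, t)
        text_ctx = (alpha.unsqueeze(-1) * prev_words).sum(dim=1)
        return text_ctx, alpha

# Linking text and visual attention: one simple choice (again an assumption)
# is to let the text context gate the visually attended feature, so the two
# attention streams influence each other during training.
class LinkedAttentionFusion(nn.Module):
    def __init__(self, word_dim: int, vis_dim: int):
        super().__init__()
        self.gate = nn.Linear(word_dim, vis_dim)

    def forward(self, text_ctx: torch.Tensor, vis_ctx: torch.Tensor):
        # text_ctx: (batch, word_dim); vis_ctx: (batch, vis_dim)
        return torch.sigmoid(self.gate(text_ctx)) * vis_ctx
```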
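The two-step schedule can be sketched in the same hedged spirit: "starting from scratch" optimizes on every sample, then "checking for gaps" re-trains only on samples the model still fits poorly. Selecting hard samples by a per-sample loss quantile is a guess at the selection rule; the paper defines its own criterion.

```python
# Hedged sketch of the two-step schedule. The loss-quantile selection of
# poorly learned samples is an assumption, not the paper's exact rule.
import torch

def two_step_training(model, dataset, optimizer, loss_fn,
                      scratch_epochs=20, gap_epochs=10, gap_quantile=0.7):
    # Step 1 ("starting from scratch"): every sample, every epoch.
    for _ in range(scratch_epochs):
        for video, caption in dataset:
            optimizer.zero_grad()
            loss_fn(model(video), caption).backward()
            optimizer.step()

    # Step 2 ("checking for gaps"): keep only samples whose residual loss
    # is still high, and train on those alone.
    with torch.no_grad():
        losses = torch.tensor([loss_fn(model(v), c).item()
                               for v, c in dataset])
    cutoff = losses.quantile(gap_quantile).item()
    hard = [pair for pair, l in zip(dataset, losses) if l > cutoff]
    for _ in range(gap_epochs):
        for video, caption in hard:
            optimizer.zero_grad()
            loss_fn(model(video), caption).backward()
            optimizer.step()
```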



Updated: 2020-03-07