Deep Reinforcement Polishing Network for Video Captioning
IEEE Transactions on Multimedia (IF 8.4), Pub Date: 2020-06-15, DOI: 10.1109/tmm.2020.3002669
Wanru Xu, Jian Yu, Zhenjiang Miao, Lili Wan, Yi Tian, Qiang Ji

The video captioning task aims to describe video content using several natural-language sentences. Although one-step encoder-decoder models have made promising progress, the generated captions still contain many errors, mainly caused by the large semantic gap between the visual and language domains and by the difficulty of long-sequence generation. The underlying challenge of video captioning, i.e., sequence-to-sequence mapping across different domains, is still not well handled. Inspired by the human proofreading procedure, a generated caption can be gradually polished to improve its quality. In this paper, we propose a deep reinforcement polishing network (DRPN) to refine caption candidates, consisting of a word-denoising network (WDN) that revises word errors and a grammar-checking network (GCN) that revises grammar errors. On the one hand, the long-term reward in deep reinforcement learning benefits long-sequence generation, since it takes the global quality of the caption sentence into account. On the other hand, the caption candidate can be viewed as a bridge between the visual and language domains, and the semantic gap is gradually reduced as repeated revisions produce better candidates. In experiments, we present extensive evaluations showing that the proposed DRPN achieves performance comparable to, and in some cases better than, state-of-the-art methods. Furthermore, the DRPN is model-agnostic and can be integrated into any video captioning model to refine its generated captions.
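To make the polishing idea concrete, the following is a minimal, purely illustrative sketch of the refine-and-re-score loop described in the abstract. It is not the authors' implementation: revise_words, revise_grammar, and sentence_reward are hypothetical stand-ins for the trained WDN, the GCN, and the reinforcement-learning reward, and the toy functions in the usage example exist only for demonstration.

# Conceptual sketch only. In the actual DRPN, the word-denoising network (WDN)
# and grammar-checking network (GCN) are learned with deep reinforcement
# learning; here they are passed in as plain callables (assumed interfaces).
from typing import Callable, List

def polish_caption(
    draft: List[str],
    revise_words: Callable[[List[str]], List[str]],    # WDN-like step (assumed)
    revise_grammar: Callable[[List[str]], List[str]],  # GCN-like step (assumed)
    sentence_reward: Callable[[List[str]], float],     # sentence-level reward (assumed)
    max_rounds: int = 3,
) -> List[str]:
    """Iteratively refine a draft caption, keeping a revision only if it
    improves the sentence-level (long-term) reward."""
    best, best_r = draft, sentence_reward(draft)
    for _ in range(max_rounds):
        candidate = revise_grammar(revise_words(best))
        r = sentence_reward(candidate)
        if r <= best_r:            # no further improvement: stop polishing
            break
        best, best_r = candidate, r
    return best

# Toy usage with trivial stand-in revisers and a unigram-overlap reward.
if __name__ == "__main__":
    reference = "a man is playing a guitar on stage".split()

    def toy_word_fix(tokens):      # pretend WDN: correct one known word error
        return ["guitar" if t == "piano" else t for t in tokens]

    def toy_grammar_fix(tokens):   # pretend GCN: insert a missing article
        return tokens if "a" in tokens else ["a"] + tokens

    def toy_reward(tokens):        # pretend reward: unigram overlap with reference
        return len(set(tokens) & set(reference)) / max(len(tokens), 1)

    draft = "man is playing piano on stage".split()
    print(" ".join(polish_caption(draft, toy_word_fix, toy_grammar_fix, toy_reward)))

In this sketch the loop plays the role of the "polishing" procedure: each round applies a word-level revision followed by a grammar-level revision, and the revision is only accepted when the global sentence reward increases, mirroring the long-term-reward criterion described above.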

Updated: 2020-06-15