Bridging Video and Text: A Two-Step Polishing Transformer for Video Captioning
IEEE Transactions on Circuits and Systems for Video Technology (IF 8.3), Pub Date: 4-8-2022, DOI: 10.1109/tcsvt.2022.3165934
Wanru Xu, Zhenjiang Miao, Jian Yu, Yi Tian, Lili Wan, Qiang Ji

Video captioning is a joint task of computer vision and natural language processing that aims to describe video content with natural language sentences. Most existing methods cast this task as a mapping problem: they learn a mapping from visual features to natural language and generate captions directly from videos. However, the underlying challenge of video captioning, i.e., sequence-to-sequence mapping across different domains, is still not well handled. To address this problem, we introduce a polishing mechanism that mimics the human polishing process and propose a generate-and-polish framework for video captioning. In this paper, we propose a two-step transformer-based polishing network (TSTPN) consisting of two sub-modules: a generation module that produces a caption candidate and a polishing module that gradually refines it. Specifically, the candidate provides global information about the visual contents in a semantically meaningful order. It first serves as a semantic intermediary that bridges the semantic gap between text and video, with a cross-modal attention mechanism for better cross-modal modeling; it then provides a global planning ability that maintains the semantic consistency and fluency of the whole sentence for better sequence mapping. In experiments, extensive evaluations show that the proposed TSTPN achieves performance comparable to, and in some cases better than, state-of-the-art methods on benchmark datasets.
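To make the generate-and-polish idea concrete, the following is a minimal PyTorch sketch of a two-step captioner: a first transformer drafts a caption from video features, and a second decoder refines the draft while attending to a joint memory of video features and the encoded draft, so the draft acts as a cross-modal bridge. All module names, feature dimensions, and the concatenated-memory design are illustrative assumptions, not the authors' released TSTPN implementation.

```python
# Hypothetical sketch of a generate-and-polish video captioner (not the
# authors' code). Step 1 drafts a caption; step 2 polishes it with
# cross-modal attention over both the video and the draft.
import torch
import torch.nn as nn

class GeneratePolishCaptioner(nn.Module):
    def __init__(self, vocab_size=10000, d_model=512, nhead=8, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.video_proj = nn.Linear(2048, d_model)  # e.g. CNN frame features
        # Step 1: a standard transformer maps video features to a draft caption.
        self.generator = nn.Transformer(
            d_model, nhead, num_layers, num_layers, batch_first=True
        )
        # Step 2: the polisher attends to BOTH the video and the encoded draft,
        # using the draft as a semantic bridge between the two modalities.
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.polisher = nn.TransformerDecoder(layer, num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, video_feats, draft_tokens):
        v = self.video_proj(video_feats)                     # (B, T_v, d)
        draft = self.generator(v, self.embed(draft_tokens))  # (B, T_c, d)
        # Cross-modal memory: video features concatenated with the draft, so
        # the polishing step refines each word with global sentence context.
        memory = torch.cat([v, draft], dim=1)
        h = self.polisher(self.embed(draft_tokens), memory)
        return self.out(h)                                   # refined logits

# Toy usage: 8 frames of 2048-d features, a 12-token draft caption to refine.
model = GeneratePolishCaptioner()
video = torch.randn(2, 8, 2048)
draft = torch.randint(0, 10000, (2, 12))
logits = model(video, draft)
print(logits.shape)  # torch.Size([2, 12, 10000])
```

In this reading, "gradually refine" would correspond to applying the polishing step iteratively, feeding each refined caption back in as the next draft; the single forward pass above shows one refinement round.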

Updated: 2024-08-28