BiTransformer: augmenting semantic context in video captioning via bidirectional decoder
Machine Vision and Applications (IF 2.4) Pub Date: 2022-08-12, DOI: 10.1007/s00138-022-01329-3
Maosheng Zhong, Hao Zhang, Yong Wang, Hao Xiong

Video captioning, the task of generating natural-language descriptions of a video's content, is an important problem in many applications. Most existing methods for video captioning are based on deep encoder–decoder models, particularly attention-based models such as the Transformer. However, existing Transformer-based models may not fully exploit the semantic context: they use only the left-to-right context and ignore its right-to-left counterpart. In this paper, we introduce a bidirectional (forward-backward) decoder that exploits both the left-to-right and right-to-left context in a Transformer-based video captioning model; we therefore call our model the bidirectional Transformer (dubbed BiTransformer). Specifically, alongside the encoder and forward decoder (which captures the left-to-right context) used in existing Transformer-based models, we plug in a backward decoder to capture the right-to-left context. Equipped with such a bidirectional decoder, the semantic context of videos is exploited more fully, resulting in better video captions. The effectiveness of our model is demonstrated on two benchmark datasets, MSVD and MSR-VTT, via comparison with state-of-the-art methods. In particular, in terms of the important evaluation metric CIDEr, the proposed model outperforms the state-of-the-art models with an improvement of 1.2% on both datasets.
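The idea of pairing a forward (left-to-right) decoder with a backward (right-to-left) decoder over shared encoded video features can be sketched as follows. This is a minimal illustration in PyTorch, not the authors' implementation: all module names, dimensions, and the simple concatenation used to fuse the two decoders' states are assumptions for clarity.

```python
# Hypothetical sketch of a bidirectional (forward-backward) Transformer decoder
# for video captioning. Module names and the fusion step are illustrative.
import torch
import torch.nn as nn


class BiDecoderSketch(nn.Module):
    def __init__(self, d_model=128, nhead=4, num_layers=2, vocab_size=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # Forward decoder: captures left-to-right context over the caption.
        self.fwd = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True), num_layers)
        # Backward decoder: reads the reversed caption, i.e. right-to-left context.
        self.bwd = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True), num_layers)
        self.out = nn.Linear(2 * d_model, vocab_size)

    def forward(self, video_feats, tokens):
        # video_feats: (B, T, d_model) encoded video frames (the encoder output)
        # tokens:      (B, L) caption token ids
        L = tokens.size(1)
        causal = nn.Transformer.generate_square_subsequent_mask(L)
        h_fwd = self.fwd(self.embed(tokens), video_feats, tgt_mask=causal)
        # Reverse token order for the backward pass, then flip the resulting
        # states back so positions align with the forward decoder's output.
        h_bwd = self.bwd(self.embed(tokens.flip(1)), video_feats,
                         tgt_mask=causal).flip(1)
        # Fuse both context directions before predicting the vocabulary.
        return self.out(torch.cat([h_fwd, h_bwd], dim=-1))  # (B, L, vocab_size)
```

At each caption position this exposes both the prefix (via the forward decoder) and the suffix (via the backward decoder) to the output layer, which is the sense in which the semantic context is exploited more fully than with a single left-to-right decoder.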




Updated: 2022-08-13