Diverse video captioning through latent variable expansion
Pattern Recognition Letters ( IF 5.1 ) Pub Date : 2022-05-22 , DOI: 10.1016/j.patrec.2022.05.021
Huanhou Xiao , Jinglun Shi

Automatically describing video content in natural language is a challenging but important task that has attracted considerable attention in the computer vision community. Previous works mainly strive for the accuracy of the generated sentences while ignoring their diversity, which is inconsistent with human behavior. In this paper, we aim to caption each video with multiple descriptions and propose a novel framework to do so. Concretely, for a given video, the intermediate latent variables of the conventional encode-decode process are used as input to a conditional generative adversarial network (CGAN) in order to generate diverse sentences. We adopt different convolutional neural networks (CNNs) as our generator, which produces descriptions conditioned on the latent variables, and as our discriminator, which assesses the quality of the generated sentences. In addition, a novel DCE metric is designed to assess the diversity of the captions. We evaluate our method on benchmark datasets, where it demonstrates the ability to generate diverse descriptions and achieves superior results against other state-of-the-art methods.
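The pipeline the abstract describes can be sketched as follows. This is a minimal illustrative stand-in, not the authors' implementation: the video latent variable is taken from an encode-decode pass, concatenated with random noise, and mapped by a conditional generator to caption logits, while a discriminator scores the result given the same latent. All layer shapes, dimensions, and names below are assumptions for illustration; the paper's actual generator and discriminator are CNNs.

```python
import numpy as np

# Hypothetical sketch of the latent-variable CGAN captioning idea.
# Dimensions and dense layers are illustrative assumptions only.
rng = np.random.default_rng(0)

LATENT_DIM, NOISE_DIM, VOCAB, MAX_LEN = 512, 64, 1000, 20

def generator(latent, noise, w):
    """Map [latent ; noise] to per-step word logits (stand-in for the CNN generator)."""
    h = np.tanh(np.concatenate([latent, noise]) @ w["g1"])   # hidden representation
    return (h @ w["g2"]).reshape(MAX_LEN, VOCAB)             # (time steps, vocab) logits

def discriminator(latent, caption_logits, w):
    """Score caption quality conditioned on the video latent (stand-in for the CNN discriminator)."""
    feats = np.concatenate([latent, caption_logits.mean(axis=0)])
    return 1.0 / (1.0 + np.exp(-(feats @ w["d"])))           # probability in (0, 1)

weights = {
    "g1": rng.normal(0.0, 0.01, (LATENT_DIM + NOISE_DIM, 256)),
    "g2": rng.normal(0.0, 0.01, (256, MAX_LEN * VOCAB)),
    "d":  rng.normal(0.0, 0.01, (LATENT_DIM + VOCAB,)),
}

video_latent = rng.normal(size=LATENT_DIM)  # intermediate latent from the encode-decode pass

# Diversity comes from sampling a different noise vector for the same latent:
captions = [generator(video_latent, rng.normal(size=NOISE_DIM), weights)
            for _ in range(3)]
scores = [discriminator(video_latent, c, weights) for c in captions]
```

Because only the noise vector changes between samples, each draw yields a different caption for the same video, which is the mechanism the framework exploits to produce multiple descriptions.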




Updated: 2022-05-22