Multi-Sentence Video Captioning using Content-oriented Beam Searching and Multi-stage Refining Algorithm
Information Processing & Management ( IF 8.6 ) Pub Date : 2020-06-16 , DOI: 10.1016/j.ipm.2020.102302
Masoomeh Nabati , Alireza Behrad

With the increasing growth of video data, especially in cyberspace, video captioning — the representation of video data in natural language — has been receiving increasing interest for applications such as video retrieval, action recognition, and video understanding, to name a few. In recent years, deep neural networks have been successfully applied to the task of video captioning. However, most existing methods describe a video clip using only one sentence, which may not correctly cover the semantic content of the clip. In this paper, a new multi-sentence video captioning algorithm is proposed using a content-oriented beam search approach and a multi-stage refining method. We use a new content-oriented beam search algorithm to update the probabilities of words generated by the trained deep networks. The proposed beam search algorithm leverages the high-level semantic information of an input video using an object detector and a structural dictionary of sentences. We also use a multi-stage refining approach to remove structurally wrong sentences as well as sentences that are less related to the semantic content of the video. To this end, a new two-branch deep neural network is proposed to measure the relevance score between a sentence and a video. We evaluated the performance of the proposed method on two popular video captioning databases and compared the results with those of several state-of-the-art approaches. The experiments showed the superior performance of the proposed algorithm. For instance, on the MSVD database, the proposed method shows an improvement of 6% for the best-1 sentences in comparison to the best state-of-the-art alternative.
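To make the content-oriented re-weighting idea concrete, the following is a minimal, hypothetical sketch — not the authors' implementation. It runs a standard beam search over a toy next-word model and adds a log-probability bonus to candidate words that name objects found by a detector; the `boost` constant, the toy vocabulary, and the `next_word_logprobs` interface are all illustrative assumptions.

```python
import math

def content_oriented_beam_search(next_word_logprobs, detected_objects,
                                 beam_width=2, boost=1.0, max_len=5, eos="<eos>"):
    """Beam search in which candidate words naming a detected object receive
    an additive log-probability bonus (the content-oriented re-weighting)."""
    beams = [((), 0.0)]          # (partial sentence, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for word, lp in next_word_logprobs(seq).items():
                bonus = boost if word in detected_objects else 0.0
                candidates.append((seq + (word,), score + lp + bonus))
        # keep the beam_width highest-scoring hypotheses
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_width]:
            if seq[-1] == eos:
                finished.append((seq, score))
            else:
                beams.append((seq, score))
        if not beams:
            break
    finished.extend(beams)
    return list(max(finished, key=lambda c: c[1])[0])

def toy_model(prefix):
    """Hypothetical language model over a three-word sentence pattern."""
    if not prefix:
        return {"a": math.log(0.5), "the": math.log(0.5)}
    if len(prefix) == 1:
        return {"dog": math.log(0.4), "cat": math.log(0.6)}
    return {"<eos>": 0.0}
```

Without the bonus the model prefers "cat"; when the detector reports a dog, the boosted search flips the choice, which is the intended effect of steering decoding toward the detected content.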




Updated: 2020-06-16