Multi-modal Dense Video Captioning
arXiv - CS - Machine Learning, Pub Date: 2020-03-17, DOI: arxiv-2003.07758
Vladimir Iashin and Esa Rahtu

Dense video captioning is the task of localizing interesting events in an untrimmed video and producing a textual description (caption) for each localized event. Most previous work in dense video captioning is based solely on visual information and completely ignores the audio track. However, audio, and speech in particular, are vital cues for a human observer in understanding an environment. In this paper, we present a new dense video captioning approach that is able to utilize any number of modalities for event description. Specifically, we show how audio and speech modalities may improve a dense video captioning model. We apply an automatic speech recognition (ASR) system to obtain a temporally aligned textual description of the speech (similar to subtitles) and treat it as a separate input alongside the video frames and the corresponding audio track. We formulate the captioning task as a machine translation problem and utilize the recently proposed Transformer architecture to convert multi-modal input data into textual descriptions. We demonstrate the performance of our model on the ActivityNet Captions dataset. The ablation studies indicate a considerable contribution from the audio and speech components, suggesting that these modalities contain substantial information complementary to the video frames. Furthermore, we provide an in-depth analysis of the ActivityNet Captions results by leveraging the category tags obtained from the original YouTube videos. Code is publicly available: github.com/v-iashin/MDVC
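To make the described setup concrete, the sketch below shows one way to encode several pre-extracted modality streams (video frames, audio, ASR text) with separate Transformer encoders and decode a caption over their concatenated memories. It is a minimal illustrative example, not the authors' MDVC implementation; all class names, feature dimensions, and the concatenation-based fusion are assumptions made for the sketch.

import torch
import torch.nn as nn

class MultiModalCaptioner(nn.Module):
    def __init__(self, d_model=512, vocab_size=10000, n_heads=8, n_layers=2):
        super().__init__()
        # One Transformer encoder per modality (video frames, audio, ASR text).
        make_enc = lambda: nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), n_layers)
        self.video_enc, self.audio_enc, self.speech_enc = make_enc(), make_enc(), make_enc()
        # A single decoder attends over the concatenated modality memories.
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True), n_layers)
        self.embed = nn.Embedding(vocab_size, d_model)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, video_feats, audio_feats, speech_feats, caption_tokens):
        # Encode each modality independently, then fuse by concatenating along time.
        memory = torch.cat([self.video_enc(video_feats),
                            self.audio_enc(audio_feats),
                            self.speech_enc(speech_feats)], dim=1)
        tgt = self.embed(caption_tokens)
        return self.out(self.decoder(tgt, memory))  # (batch, caption_len, vocab_size)

# Toy usage with random features standing in for one localized event proposal.
model = MultiModalCaptioner()
logits = model(torch.randn(1, 30, 512),              # 30 video-frame features
               torch.randn(1, 30, 512),              # 30 audio features
               torch.randn(1, 12, 512),              # 12 ASR token embeddings
               torch.randint(0, 10000, (1, 15)))     # partial caption tokens
print(logits.shape)  # torch.Size([1, 15, 10000])

The concatenation-based fusion is only one of several possible design choices; the point of the sketch is that each modality keeps its own encoder while the caption decoder can attend to all of them jointly.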

Updated: 2020-05-07