Video Moment Localization using Object Evidence and Reverse Captioning
arXiv - CS - Multimedia | Pub Date: 2020-06-18, DOI: arxiv-2006.10260
Madhawa Vidanapathirana, Supriya Pandhre, Sonia Raychaudhuri, Anjali Khurana

We address the problem of language-based temporal localization of moments in untrimmed videos. Compared to temporal localization with fixed categories, this problem is more challenging because language-based queries have no predefined activity classes and may contain complex descriptions. The current state-of-the-art model, MAC, addresses this by mining activity concepts from both the video and language modalities: it encodes semantic activity concepts from the verb/object pair in a language query and leverages visual activity concepts from video activity classification prediction scores. We propose the "Multi-faceted Video Moment Localizer" (MML), an extension of the MAC model that introduces visual object evidence via object segmentation masks and video understanding features via video captioning. Furthermore, we improve the language modelling in the sentence embedding. We experimented on the Charades-STA dataset and found that MML outperforms the MAC baseline by 4.93% and 1.70% on the R@1 and R@5 metrics, respectively. Our code and pre-trained model are publicly available at https://github.com/madhawav/MML.
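
To make the fusion idea concrete, the sketch below shows one plausible way to combine the three evidence streams the abstract names (object-segmentation evidence, video-captioning features, and a sentence embedding of the query) into a relevance score for a candidate moment. It is an illustrative assumption, not the authors' released code; the module name, feature dimensions, and concatenation-based fusion are placeholders, and the actual MML architecture is in the linked repository.

```python
# Minimal sketch of multi-modal fusion for moment-query scoring.
# All names and dimensions below are assumptions for illustration only.
import torch
import torch.nn as nn

class MomentQueryScorer(nn.Module):
    def __init__(self, obj_dim=256, cap_dim=512, sent_dim=768, hidden_dim=512):
        super().__init__()
        # Project each evidence stream into a shared space.
        self.obj_proj = nn.Linear(obj_dim, hidden_dim)    # object-segmentation evidence
        self.cap_proj = nn.Linear(cap_dim, hidden_dim)    # video-captioning features
        self.sent_proj = nn.Linear(sent_dim, hidden_dim)  # sentence embedding of the query
        self.scorer = nn.Sequential(
            nn.Linear(hidden_dim * 3, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, obj_feat, cap_feat, sent_feat):
        # Concatenate the projected features and predict a relevance score for
        # each (candidate moment, query) pair; higher means a better match.
        fused = torch.cat(
            [self.obj_proj(obj_feat), self.cap_proj(cap_feat), self.sent_proj(sent_feat)],
            dim=-1,
        )
        return self.scorer(fused).squeeze(-1)

# Usage: score 4 candidate moments against one query.
scorer = MomentQueryScorer()
scores = scorer(torch.randn(4, 256), torch.randn(4, 512), torch.randn(4, 768))
print(scores.shape)  # torch.Size([4])
```

At retrieval time, candidate moments would be ranked by this score and evaluated with R@1/R@5 (the fraction of queries whose correct moment appears in the top 1 or top 5 predictions).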

Updated: 2020-06-19