Frame-wise Cross-modal Match for Video Moment Retrieval
arXiv - CS - Multimedia. Pub Date: 2020-09-22, DOI: arxiv-2009.10434
Haoyu Tang, Jihua Zhu, Meng Liu, Member, IEEE, Zan Gao, and Zhiyong Cheng

Video moment retrieval aims to retrieve the golden moment in a video for a given natural language query. The main challenges of this task are 1) accurately localizing the relevant moment (i.e., its start and end times) in an untrimmed video stream, and 2) bridging the semantic gap between the textual query and the video contents. To tackle these problems, one mainstream approach is to generate a multimodal feature vector from the target query and video frames (e.g., by concatenation) and then apply a regression approach to the multimodal feature vector for boundary detection. Although this approach has achieved some progress, we argue that these methods have not well captured the cross-modal interactions between the query and video frames. In this paper, we propose an Attentive Cross-modal Relevance Matching (ACRM) model, which predicts the temporal boundaries based on interaction modeling between the two modalities. In addition, an attention module is introduced to automatically assign higher weights to query words with richer semantic cues, which are considered more important for finding relevant video contents. Another contribution is an additional predictor that utilizes internal frames during model training to improve localization accuracy. Extensive experiments on two public datasets demonstrate the superiority of our method over several state-of-the-art methods.
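
To make the described pipeline concrete, below is a minimal PyTorch sketch of frame-wise cross-modal matching with a word-level attention module, loosely following the ACRM idea from the abstract. All dimensions, layer choices, and the three-way per-frame output (start / end / internal scores) are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class AttentiveCrossModalMatch(nn.Module):
    """Sketch of attentive frame-wise cross-modal matching.

    Assumptions (not from the paper): feature sizes, the tanh-based
    word attention, and a 3-way per-frame predictor.
    """

    def __init__(self, frame_dim=500, word_dim=300, hidden_dim=256):
        super().__init__()
        self.frame_proj = nn.Linear(frame_dim, hidden_dim)
        self.word_proj = nn.Linear(word_dim, hidden_dim)
        # Attention scorer: assigns higher weights to query words
        # carrying richer semantic cues.
        self.attn = nn.Linear(hidden_dim, 1)
        # Fused frame-query features feed a per-frame predictor that
        # scores each frame as start, end, or internal to the moment
        # (the extra "internal" head mirrors the additional predictor
        # over internal frames mentioned in the abstract).
        self.predictor = nn.Sequential(
            nn.Linear(hidden_dim * 2, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 3),
        )

    def forward(self, frames, words):
        # frames: (B, T, frame_dim) untrimmed video features
        # words:  (B, L, word_dim)  query word embeddings
        f = self.frame_proj(frames)                           # (B, T, H)
        w = self.word_proj(words)                             # (B, L, H)
        # Word-level attention -> a single attended query vector.
        a = torch.softmax(self.attn(torch.tanh(w)), dim=1)    # (B, L, 1)
        q = (a * w).sum(dim=1, keepdim=True)                  # (B, 1, H)
        # Frame-wise interaction between each frame and the query.
        fused = torch.cat([f, q.expand_as(f)], dim=-1)        # (B, T, 2H)
        return self.predictor(fused)                          # (B, T, 3)

# Usage on random tensors: 64 frames, a 12-word query.
model = AttentiveCrossModalMatch()
scores = model(torch.randn(2, 64, 500), torch.randn(2, 12, 300))
print(scores.shape)  # torch.Size([2, 64, 3])
```

In this sketch the boundaries would be read off as the argmax of the start and end scores over the frame axis; the paper's actual interaction modeling and training losses are not reproduced here.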

Updated: 2020-09-23