Single-shot Semantic Matching Network for Moment Localization in Videos, ACM Transactions on Multimedia Computing, Communications, and Applications



ACM Transactions on Multimedia Computing, Communications, and Applications (IF 5.2), Pub Date: 2021-07-22, DOI: 10.1145/3441577
Xinfang Liu, Xiushan Nie, Junya Teng, Li Lian, Yilong Yin


Moment localization in videos using natural language refers to finding the most relevant segment of a video given a natural language query. Most existing methods require video segment candidates to be generated and then matched against the query, which incurs extra computational cost, and they may fail to locate relevant moments of arbitrary length. To address these issues, we present a lightweight single-shot semantic matching network (SSMN) that avoids the complex computations required to match the query against segment candidates and can, in theory, locate moments of any length. In the proposed SSMN, video features are first uniformly sampled to a fixed number, while the query sentence features are generated and enhanced by GloVe, long short-term memory (LSTM), and soft-attention modules. Subsequently, the video features and sentence features are fed to an enhanced cross-modal attention model to mine the semantic relationships between vision and language. Finally, a score predictor and a location predictor are designed to locate the start and stop indexes of the query moment. We evaluate the proposed method on two benchmark datasets, and the experimental results demonstrate that SSMN outperforms state-of-the-art methods in both precision and efficiency.
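
As a rough illustration of two steps named in the abstract (uniform sampling of video features to a fixed number, and attention between video and sentence features), the following is a minimal sketch under stated assumptions. The abstract does not specify how sampling indexes are chosen or how the enhanced cross-modal attention model is built, so the function names `uniform_sample` and `cross_modal_attention`, the evenly spaced index rule, and the plain dot-product attention are all hypothetical stand-ins, not the authors' method. The learned score and location predictors are not modeled here.

```python
import math


def uniform_sample(features, num_samples):
    """Pick num_samples feature vectors at evenly spaced positions.

    Assumption: sample i is taken at index floor(i * n / num_samples).
    Short videos are upsampled by repeating features.
    """
    n = len(features)
    if n == 0 or num_samples <= 0:
        raise ValueError("need at least one feature and one sample")
    indexes = [(i * n) // num_samples for i in range(num_samples)]
    return [features[j] for j in indexes]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_modal_attention(clips, words):
    """For each clip, return an attention-weighted sum of word features.

    Assumption: plain dot-product attention, used only as a stand-in
    for the enhanced cross-modal attention model in the abstract.
    Clip and word vectors must share the same dimension.
    """
    dim = len(words[0])
    attended = []
    for clip in clips:
        weights = softmax([dot(clip, word) for word in words])
        context = [
            sum(wt * word[d] for wt, word in zip(weights, words))
            for d in range(dim)
        ]
        attended.append(context)
    return attended


if __name__ == "__main__":
    # Toy data: 10 two-dimensional clip features, 3 word features.
    video = [[float(i), 1.0] for i in range(10)]
    query = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
    sampled = uniform_sample(video, 5)
    for clip, context in zip(sampled, cross_modal_attention(sampled, query)):
        print(clip, context)
```

Because the number of sampled clips is fixed regardless of video length, the downstream attention and prediction cost in such a design does not grow with the number of candidate segments, which is consistent with the efficiency argument made in the abstract.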


Last updated: 2021-07-22
