Improving Video Retrieval by Adaptive Margin,arXiv - CS - Computer Vision and Pattern Recognition



Improving Video Retrieval by Adaptive Margin
arXiv - CS - Computer Vision and Pattern Recognition. Pub Date: 2023-03-09, DOI: arxiv-2303.05093
Feng He, Qi Wang, Zhifan Feng, Wenbin Jiang, Yajuan Lv, Yong Zhu, Xiao Tan

Video retrieval is becoming increasingly important owing to the rapid emergence of videos on the Internet. The dominant paradigm for video retrieval learns video-text representations by requiring the similarity of positive pairs to exceed that of negative pairs by a fixed margin. However, the negative pairs used for training are sampled randomly, which means that a negative pair may in fact be semantically related or even equivalent, yet most methods still force the representations apart to decrease their similarity. This leads to inaccurate supervision and poor performance in learning video-text representations. While most video retrieval methods overlook this phenomenon, we propose an adaptive margin that changes with the distance between positive and negative pairs to address it. First, we design the calculation framework of the adaptive margin, including the method of distance measurement and the function that maps the distance to the margin. Then, we explore a novel implementation called "Cross-Modal Generalized Self-Distillation" (CMGSD), which can be built on top of most video retrieval models with few modifications. Notably, CMGSD adds little computational overhead at training time and none at test time. Experimental results on three widely used datasets demonstrate that the proposed method yields significantly better performance than the corresponding backbone model and outperforms state-of-the-art methods by a large margin.
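To make the contrast between a fixed and an adaptive margin concrete, the following is a minimal sketch in plain Python, not the authors' implementation. The abstract does not specify the distance measure or the distance-to-margin function, so several parts are assumptions made purely for illustration: the names `fixed_margin_loss`, `adaptive_margins`, and `adaptive_margin_loss`, the reference similarity matrix `ref_sim` (some external estimate in [0, 1] of how semantically close two samples are), and the linear mapping `margin = max_margin * (1 - ref_sim)`.

```python
from typing import List

Matrix = List[List[float]]


def fixed_margin_loss(sim: Matrix, margin: float = 0.2) -> float:
    """Bidirectional max-margin ranking loss with one margin for all negatives.

    sim[i][j] is the model similarity between video i and text j;
    diagonal entries are the positive pairs.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Video i against a wrong text j.
            total += max(0.0, margin + sim[i][j] - sim[i][i])
            # Text i against a wrong video j.
            total += max(0.0, margin + sim[j][i] - sim[i][i])
    return total / (2 * n * (n - 1))


def adaptive_margins(ref_sim: Matrix, max_margin: float = 0.2) -> Matrix:
    """ASSUMPTION: a linear map from distance (1 - ref_sim) to a margin.

    Negatives that the reference signal considers close to the positive
    receive a small margin; unrelated negatives receive up to max_margin.
    """
    return [
        [max_margin * min(1.0, max(0.0, 1.0 - s)) for s in row]
        for row in ref_sim
    ]


def adaptive_margin_loss(sim: Matrix, margins: Matrix) -> float:
    """Same ranking loss, but each negative pair (i, j) uses its own margin."""
    n = len(sim)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            total += max(0.0, margins[i][j] + sim[i][j] - sim[i][i])
            total += max(0.0, margins[j][i] + sim[j][i] - sim[i][i])
    return total / (2 * n * (n - 1))


# Toy batch of two video-text pairs (made-up numbers).
sim = [[0.90, 0.85],
       [0.30, 0.80]]
# Hypothetical reference signal saying the two samples are near-duplicates.
ref_sim = [[1.0, 0.9],
           [0.9, 1.0]]

loss_fixed = fixed_margin_loss(sim, margin=0.2)
loss_adaptive = adaptive_margin_loss(sim, adaptive_margins(ref_sim, 0.2))
```

In this toy batch the adaptive variant penalises the near-duplicate negatives less than the fixed-margin loss does, which is the behaviour the abstract argues for. When `ref_sim` marks every negative as completely unrelated, the two losses coincide.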


Updated: 2023-03-11
