Deep Video Code for Efficient Face Video Retrieval, Pattern Recognition


Deep Video Code for Efficient Face Video Retrieval
Pattern Recognition (IF 8), Pub Date: 2020-11-01, DOI: 10.1016/j.patcog.2020.107754
Shishi Qiao, Ruiping Wang, Shiguang Shan, Xilin Chen

Abstract: In this paper, we address a specific video retrieval problem concerning human faces. Given a query in the form of either a single frame or a sequence of frames of a person, we search the database and return the most relevant face videos, i.e., those that have the same class label as the query. This problem is very challenging because of large intra-class variations and the high demands on the efficiency of video representations in terms of both time and space. To handle these challenges, this paper proposes a novel Deep Video Code (DVC) method that encodes face videos into compact binary codes. Specifically, we devise an end-to-end convolutional neural network (CNN) framework that takes face videos as training inputs, models each of them as a unified representation through a temporal feature pooling operation, and finally projects the high-dimensional representations of both frames and videos into Hamming space to generate binary codes. In this Hamming space, the distance between dissimilar pairs is larger than that between similar pairs by a margin. To this end, we design a novel bounded triplet hashing loss that takes all dissimilar pairs into consideration for each anchor point in a mini-batch and whose optimization is smoother and more stable. Extensive experiments on challenging video face databases and general image/video datasets, with comparison to state-of-the-art methods, verify the effectiveness of our method in different kinds of retrieval scenarios.
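The retrieval pipeline described in the abstract (pool frame features over time, binarize, compare in Hamming space, require a margin between similar and dissimilar pairs) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the function names, the mean-pooling choice, the zero threshold, and the toy feature values are hypothetical, and the loss shown is a generic triplet hinge over all negatives of one anchor, not the paper's bounded triplet hashing loss, whose exact form the abstract does not give.

```python
def temporal_mean_pool(frame_features):
    """Average per-frame feature vectors into one video-level vector.
    Mean pooling is an assumed choice; the abstract only says 'temporal feature pooling'."""
    n = len(frame_features)
    dim = len(frame_features[0])
    return [sum(frame[d] for frame in frame_features) / n for d in range(dim)]


def binarize(vector):
    """Map a real-valued vector to a binary code by thresholding at zero (assumed rule)."""
    return [1 if v > 0 else 0 for v in vector]


def hamming_distance(code_a, code_b):
    """Count the bit positions where two equal-length codes differ."""
    return sum(a != b for a, b in zip(code_a, code_b))


def triplet_hinge(anchor, positive, negatives, margin):
    """Generic margin-based triplet hinge, summed over all negatives of one anchor.
    This is a stand-in for illustration, not the bounded loss proposed in the paper."""
    d_pos = hamming_distance(anchor, positive)
    return sum(max(0, margin + d_pos - hamming_distance(anchor, neg))
               for neg in negatives)


# Toy 4-dimensional frame features (hypothetical values, not from the paper).
video_a = [[0.9, -0.2, 0.4, -0.7], [1.1, -0.4, 0.2, -0.5]]   # person 1
video_b = [[0.8, -0.1, -0.2, -0.9], [0.6, -0.3, 0.0, -0.3]]  # person 1 again
video_c = [[-0.5, 0.6, -0.3, 0.8]]                           # person 2

code_a = binarize(temporal_mean_pool(video_a))
code_b = binarize(temporal_mean_pool(video_b))
code_c = binarize(temporal_mean_pool(video_c))

dist_similar = hamming_distance(code_a, code_b)
dist_dissimilar = hamming_distance(code_a, code_c)
loss = triplet_hinge(code_a, code_b, [code_c], margin=2)
```

In this toy case the similar pair differs in fewer bits than the dissimilar pair by at least the margin, so the hinge term is zero. A trainable version would need a differentiable relaxation of the thresholding step, which this sketch leaves out.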

Updated: 2020-11-01
