Multi-scale Temporal Cues Learning for Video Person Re-Identification.
IEEE Transactions on Image Processing (IF 10.6). Pub Date: 2020-02-14. DOI: 10.1109/tip.2020.2972108
Jianing Li , Shiliang Zhang , Tiejun Huang

Temporal cues embedded in videos provide important clues for person Re-Identification (ReID). To exploit temporal cues efficiently with a compact neural network, this work proposes a novel 3D convolution layer called the Multi-scale 3D (M3D) convolution layer. The M3D layer is easy to implement and can be inserted into traditional 2D convolution networks to learn multi-scale temporal cues through end-to-end training. Depending on where it is inserted, the M3D layer has two variants: a local M3D layer and a global M3D layer. The local M3D layer is inserted between 2D convolution layers to learn spatial-temporal cues among adjacent 2D feature maps. The global M3D layer is computed on adjacent frame feature vectors to learn their global temporal relations. The local and global M3D layers therefore learn complementary temporal cues. Their combination adds only a small fraction of parameters to a traditional 2D CNN, yet yields strong multi-scale temporal feature learning capability. The learned temporal feature is fused with a spatial feature to compose the final spatial-temporal representation for video person ReID. Evaluations on four widely used video person ReID datasets, i.e., MARS, DukeMTMC-VideoReID, PRID2011, and iLIDS-VID, demonstrate the substantial advantages of our method over the state of the art. For example, it achieves a Rank-1 accuracy of 88.63% on MARS without re-ranking. Our method also achieves a reasonable trade-off between ReID accuracy and model size, e.g., it saves about 40% of the parameters of the I3D CNN.
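To make the idea of multi-scale temporal learning concrete, the following is a minimal, hypothetical sketch of aggregating per-frame feature vectors with temporal kernels at several dilation rates, in the spirit of the global M3D layer described above. The kernel weights, dilation rates, and function names here are illustrative assumptions, not the paper's actual parameterization.

```python
# Hypothetical sketch: multi-scale temporal aggregation over per-frame
# feature vectors. Dilation rates and kernel weights are illustrative
# assumptions, not taken from the M3D paper.

def temporal_conv(features, weights, dilation):
    """1D temporal convolution (kernel size 3) with the given dilation.
    `features` is a list of T feature vectors (lists of floats);
    temporal edges are zero-padded so the output keeps T steps."""
    T = len(features)
    dim = len(features[0])
    zero = [0.0] * dim
    out = []
    for t in range(T):
        acc = [0.0] * dim
        # Combine the frame with neighbours `dilation` steps away.
        for k, w in zip((-dilation, 0, dilation), weights):
            src = features[t + k] if 0 <= t + k < T else zero
            for d in range(dim):
                acc[d] += w * src[d]
        out.append(acc)
    return out

def multi_scale_temporal(features, dilations=(1, 2, 3)):
    """Average temporal responses at several dilation rates, so each
    output step mixes short- and long-range temporal neighbours
    (the multi-scale temporal cues the abstract refers to)."""
    weights = [0.25, 0.5, 0.25]  # illustrative shared kernel
    scales = [temporal_conv(features, weights, d) for d in dilations]
    T, dim = len(features), len(features[0])
    return [[sum(s[t][d] for s in scales) / len(dilations)
             for d in range(dim)] for t in range(T)]

frames = [[float(t)] * 4 for t in range(8)]  # 8 frames, 4-dim features
fused = multi_scale_temporal(frames)
print(len(fused), len(fused[0]))  # 8 4
```

The local M3D variant would apply the same multi-scale temporal mixing to 2D feature maps between convolution layers rather than to pooled frame vectors; in a real network both would be trained end-to-end rather than using fixed weights.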

Updated: 2020-04-22