Image-to-video person re-identification with cross-modal embeddings, Pattern Recognition Letters


Image-to-video person re-identification with cross-modal embeddings
Pattern Recognition Letters (IF 5.1), Pub Date: 2019-03-07, DOI: 10.1016/j.patrec.2019.03.003
Zhongwei Xie, Lin Li, Xian Zhong, Luo Zhong, Jianwen Xiang

Despite considerable progress, image-to-video person re-identification remains challenging in the cross-modal setting. State-of-the-art approaches concentrate mainly on task-specific data and neglect the extra information available from different but related tasks. In this paper, we propose an end-to-end neural network framework for image-to-video person re-identification that uses cross-modal embeddings learned from such extra information. Specifically, cross-modal embedding layers from image captioning and video captioning models are incorporated to learn common latent embeddings for multiple modalities. Because textual descriptions generally attend closely to a person's explicit characteristics, the learned multimodal embeddings are expected to focus on a person's prominent distinguishing features. In addition, the framework uses CNNs and LSTMs to extract visual and spatiotemporal features, and combines the strengths of identification and verification models to improve the discriminative ability of the learned features. Experimental results demonstrate that the framework narrows the gap between heterogeneous data and yields an observable improvement on the image-to-video person re-identification task.
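The abstract does not specify how the identification and verification objectives are combined, so the following is only a minimal sketch of one common way to do it, not the authors' formulation. It assumes a softmax cross-entropy identification loss over identity logits and a contrastive verification loss over an image embedding and a video embedding. The function names, the `margin` and `weight` parameters, and the choice of Euclidean distance are all illustrative assumptions.

```python
import math


def softmax_cross_entropy(logits, label):
    """Identification loss: negative log-probability of the true identity."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]


def euclidean(a, b):
    """Euclidean distance between two equal-length embeddings."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(img_emb, vid_emb, same_person, margin=1.0):
    """Verification loss (assumed contrastive form).

    Matching pairs are pulled together; non-matching pairs are pushed
    apart until their distance exceeds the margin.
    """
    d = euclidean(img_emb, vid_emb)
    if same_person:
        return d ** 2
    return max(0.0, margin - d) ** 2


def joint_loss(img_logits, vid_logits, img_emb, vid_emb,
               img_id, vid_id, weight=0.5):
    """Sum of both identification losses plus a weighted verification loss."""
    ident = (softmax_cross_entropy(img_logits, img_id)
             + softmax_cross_entropy(vid_logits, vid_id))
    verif = contrastive_loss(img_emb, vid_emb, img_id == vid_id)
    return ident + weight * verif
```

In this sketch the identification terms encourage each branch to separate identities, while the verification term acts directly on the image-video pair, which is where the gap between the two modalities shows up.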


Updated: 2020-03-07
