Image-to-video person re-identification with cross-modal embeddings
Introduction
Person re-identification (re-id) is the task of recognizing people across images and videos captured from non-overlapping camera views. With the widespread use of surveillance cameras and growing awareness of public security, person re-id has attracted particular attention from the computer vision and pattern recognition communities [25], [28].
In general, there are two major types of deep learning models for person re-identification: verification models and identification models. Verification models take a pair of inputs and determine whether they belong to the same person; they leverage only weak re-id labels and can be regarded as performing binary classification or similarity regression [33]. Identification models, in contrast, aim at feature learning by treating person re-identification as a multi-class classification task [30], but they lack a direct similarity measurement between an input pair. Owing to these complementary strengths and limitations, the two models have been combined to improve performance in the homogeneous re-id task [11], [38]. To the best of our knowledge, however, this combination has not yet been applied to image-to-video re-id.
Image-to-video person re-identification is inherently a cross-modal problem, and its main challenge is matching across different modalities, i.e., image and video. Directly using the information provided by the target task cannot fully bridge the “media gap”, i.e., the inconsistency between representations of different modalities.
To address these limitations, we propose a novel end-to-end framework for image-to-video person re-identification that leverages cross-modal embeddings from different but related tasks. Concretely, the proposed framework is divided into a feature representation sub-network and a verification-identification sub-network. The former consists of CNNs and LSTMs for extracting image features and video spatiotemporal features, while the latter combines the strengths of the identification and verification models to improve the discriminative ability of the learned feature representations.
Furthermore, cross-modal embeddings, i.e., image-to-text and video-to-text embedding layers, are integrated into the feature representation sub-network to learn common latent embeddings for the video-text and image-text modalities. Textual descriptions mainly focus on a person’s prominent distinctions and neglect background details, which are useless for person re-id. Hence, the learned multimodal embeddings attend to similar details, which helps image-to-video person re-id narrow the “media gap” and improves performance in subsequent matching. For example, in a traditional person re-id implementation, an image of a woman with a brown hat is easily mismatched to a video of a woman with a black hat; this can be avoided with the help of the textual descriptive information produced by the corresponding cross-modal embedding layers.
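As a minimal sketch of this idea (all dimensionalities and projection matrices below are hypothetical, chosen for illustration rather than taken from the actual framework), a visual feature and a text feature can each be linearly projected into a shared latent space, where training would pull matched pairs together:

```python
import numpy as np

rng = np.random.default_rng(0)

def project(x, W):
    """Linearly project a feature into the shared latent space and L2-normalize."""
    z = W @ x
    return z / np.linalg.norm(z)

# Hypothetical dimensionalities: 512-d visual features, 300-d text features,
# and a 128-d common latent space.
W_img = rng.standard_normal((128, 512))  # image-to-text embedding layer (assumed linear)
W_txt = rng.standard_normal((128, 300))  # text-side projection (assumed linear)

img_feat = rng.standard_normal(512)  # CNN image feature
txt_feat = rng.standard_normal(300)  # feature of the textual description

z_img = project(img_feat, W_img)
z_txt = project(txt_feat, W_txt)

# Alignment of an image-text pair in the common space: training would
# maximize this cosine similarity for matched pairs and minimize it for
# mismatched ones (e.g., with a ranking loss).
similarity = float(z_img @ z_txt)
```

The video-to-text branch would follow the same pattern, with frame features aggregated over time before projection.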
Our main contributions can be summarized as follows.
- We propose an end-to-end cross-modal framework for image-to-video person re-id, which leverages CNNs and LSTMs to extract visual features and spatiotemporal motion features in the feature representation sub-network. Additionally, identification loss and verification loss are combined in the verification-identification sub-network so that the framework simultaneously learns discriminative feature representations and a similarity metric.
- The feature representation sub-network integrates cross-modal embeddings from different but related tasks to learn common latent embeddings for the video-text and image-text modalities. In this way, text information containing a person’s explicit characteristics can be incorporated, and the learned common latent embeddings pay more attention to the person’s prominent distinctions in the image or video.
- We have conducted several experiments on two publicly available person image-sequence datasets. The person re-identification results on the PRID-2011 and iLIDS-VID benchmarks verify the effectiveness of the proposed framework in comparison with several existing state-of-the-art approaches.
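The combination of identification and verification losses described above can be illustrated with a toy NumPy sketch; the features, logits, and margin below are made up for illustration, standing in for the quantities produced by the actual sub-networks:

```python
import numpy as np

def identification_loss(logits, label):
    """Softmax cross-entropy over person identities (multi-class classification)."""
    shifted = logits - logits.max()           # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[label]

def verification_loss(f1, f2, same, margin=1.0):
    """Contrastive loss on an embedding pair: pull matched pairs together,
    push mismatched pairs at least `margin` apart."""
    d = np.linalg.norm(f1 - f2)
    return 0.5 * d**2 if same else 0.5 * max(0.0, margin - d)**2

# Toy features for an (image, video) pair of the same person.
f_img = np.array([0.9, 0.1, 0.0])
f_vid = np.array([1.0, 0.0, 0.1])

logits = np.array([2.0, 0.1, -1.0])  # identity classifier output for f_img
total = identification_loss(logits, label=0) + verification_loss(f_img, f_vid, same=True)
```

Minimizing the sum jointly encourages correct identity classification (discriminative features) and small distances between matched cross-modal pairs (a similarity metric). The specific contrastive form above is one common choice; the paper's exact verification loss may differ.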
The remainder of this paper is organized as follows. Section 2 briefly reviews related work on person re-identification. Section 3 elaborates on the proposed framework for image-to-video person re-id and presents each of its parts in detail. Section 4 reports the experiments, and Section 5 concludes with future work.
Related work
In this section, we briefly introduce some previous works related to person re-identification and the framework proposed in this paper.
Framework overview
The image-to-video person re-identification problem can be formulated as follows:

Input: a probe image of a specific pedestrian, and a gallery of videos captured by non-overlapping cameras.
Output: the video in the gallery that depicts the same pedestrian as the probe image.
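This formulation amounts to ranking gallery videos by their similarity to the probe image. A minimal sketch, assuming pre-computed features and using temporal mean pooling as a stand-in for the LSTM aggregator:

```python
import numpy as np

def video_embedding(frame_feats):
    """Aggregate per-frame features into one clip-level embedding
    (temporal mean pooling stands in for the LSTM here)."""
    v = frame_feats.mean(axis=0)
    return v / np.linalg.norm(v)

def rank_gallery(probe_feat, gallery):
    """Return gallery indices sorted by cosine similarity to the probe image."""
    p = probe_feat / np.linalg.norm(probe_feat)
    sims = np.array([float(p @ video_embedding(v)) for v in gallery])
    return np.argsort(-sims)

probe = np.array([1.0, 0.0, 0.0])                  # probe image feature
gallery = [
    np.array([[0.0, 1.0, 0.0], [0.1, 0.9, 0.0]]),  # video 0: different person
    np.array([[0.9, 0.1, 0.0], [1.0, 0.0, 0.1]]),  # video 1: same person
]
ranking = rank_gallery(probe, gallery)             # video 1 ranks first
```

The open question, and the subject of the proposed framework, is how to learn the feature extractors so that such a ranking is reliable across modalities.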
When dealing with the image-to-video person re-identification task, two major challenges are inevitably encountered: how to represent the features of images and videos, and how to train the network.
Experiment
In this section, we evaluate the effectiveness of our proposed framework on the image-to-video person re-identification task. The datasets are first introduced, followed by the evaluation protocol and a comparison with several state-of-the-art approaches for image-to-video person re-identification.
Conclusion and future work
In this paper, we present a novel end-to-end framework for image-to-video person re-identification, which extracts visual and spatiotemporal information from images and videos with CNNs and LSTMs, and takes advantage of the strengths of the identification and verification models to improve the discriminative ability of the learned representations. Moreover, cross-modal embedding layers from different but related tasks are integrated into our framework to encourage the learned representations to focus on a person’s prominent distinctions.
Declaration of Competing Interest
None.
Acknowledgments
This work is supported by the National Social Science Foundation of China (Grant No. 15BGL048), the Hubei Province Science and Technology Support Project (Grant No. 2015BAA072), the Hubei Provincial Natural Science Foundation of China (Grant No. 2017CFA012), and the Fundamental Research Funds for the Central Universities (WUT: 2017II39GX).
References (38)
- et al., Deep feature learning with relative distance comparison for person re-identification, Pattern Recognit. (2015)
- et al., Fully-automated person re-identification in multi-camera surveillance system with a robust kernel descriptor and effective shadow removal method, Image Vision Comput. (2017)
- et al., Beyond low-rank representations: orthogonal clustering basis reconstruction with optimized graph structure for multi-view spectral clustering, Neural Netw. (2018)
- et al., What-and-where to match: deep spatially multiplicative integration networks for person re-identification, Pattern Recognit. (2018)
- et al., An enhanced deep feature representation for person re-identification, Applications of Computer Vision (2016)
- et al., An improved deep learning architecture for person re-identification, Computer Vision and Pattern Recognition (2015)
- et al., Person re-identification by symmetry-driven accumulation of local features, Computer Vision and Pattern Recognition (2010)
- et al., Speech recognition with deep recurrent neural networks, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vancouver, BC, Canada, May 26–31, 2013
- et al., Person re-identification by descriptive and discriminative classification, Scandinavian Conference on Image Analysis (2011)
- et al., A spatio-temporal descriptor based on 3d-gradients, Proceedings of the British Machine Vision Conference, Leeds, UK, September 2008
- Deepreid: deep filter pairing neural network for person re-identification, Computer Vision and Pattern Recognition
- Learning locally-adaptive decision functions for person verification, IEEE Conference on Computer Vision and Pattern Recognition
- Person re-identification by local maximal occurrence representation and metric learning, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, June 7–12, 2015
- Semi-supervised coupled dictionary learning for person re-identification, IEEE Conference on Computer Vision and Pattern Recognition
- Recurrent convolutional network for video-based person re-identification, Computer Vision and Pattern Recognition
- A reliable image-to-video person re-identification based on feature fusion, Intelligent Information and Database Systems, 10th Asian Conference (ACIIDS), Dong Hoi City, Vietnam, March 19–21, 2018, Proceedings, Part I
- Local fisher discriminant analysis for pedestrian re-identification, Computer Vision and Pattern Recognition
- Deep learning face representation by joint identification-verification, Advances in Neural Information Processing Systems 27 (NIPS 2014), Montreal, Quebec, Canada, December 8–13, 2014
- Rethinking the inception architecture for computer vision, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, June 27–30, 2016
Editor: Prof. G. Sanniti di Baja.