Image-to-video person re-identification with cross-modal embeddings

https://doi.org/10.1016/j.patrec.2019.03.003

Highlights

  • We propose an end-to-end cross-modal framework for image-to-video re-id.

  • Cross-modal embeddings for related tasks are integrated to learn features.

  • Experimental results verify the effectiveness of the proposed framework.

Abstract

Despite the great progress achieved, image-to-video person re-identification remains challenging due to its cross-modal nature. Current state-of-the-art approaches mainly concentrate on task-specific data, neglecting extra information from different but related tasks. In this paper, we propose an end-to-end neural network framework for image-to-video person re-identification with cross-modal embeddings learned from such extra information. Concretely, cross-modal embedding layers from image captioning and video captioning models are incorporated to learn common latent embeddings for multiple modalities. The learned multimodal embeddings are expected to focus on a person’s prominent distinctions, since textual descriptions generally pay close attention to a person’s explicit characteristics. In addition, the proposed framework employs CNNs and LSTMs to extract visual and spatiotemporal features, and combines the strengths of identification and verification models to improve the discriminative ability of the learned features. Experimental results demonstrate the effectiveness of our framework in narrowing the gap between heterogeneous data and achieving observable improvements in the image-to-video person re-identification task.

Introduction

Person re-identification (re-id) is the task of recognizing people across images and videos captured by non-overlapping camera views. With the widespread use of surveillance cameras and heightened awareness of public security, person re-id has attracted particular attention from the computer vision and pattern recognition communities [25], [28].

In general, there are two major types of deep learning models for person re-identification: verification models and identification models. Verification models take a pair of inputs and determine whether they belong to the same person; they leverage only weak re-id labels and can be regarded as performing binary classification or similarity regression [33]. Identification models, in contrast, aim at feature learning by treating person re-identification as a multi-class classification task [30], but lack a direct similarity measurement between input pairs. Because their advantages and limitations are complementary, the two models have been combined to improve performance in the homogeneous re-id task [11], [38]. To the best of our knowledge, however, this combination has not yet been applied to image-to-video re-id.
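To make the distinction concrete, the following is a minimal PyTorch sketch of a head that combines the two objectives: a shared identity classifier (identification) and a binary classifier on the squared difference of a feature pair (verification). The layer sizes and the squared-difference fusion are illustrative assumptions, not the paper's exact design.

```python
# Sketch: combining identification and verification objectives (assumed design).
import torch
import torch.nn as nn

class IdentVerifHead(nn.Module):
    def __init__(self, feat_dim=1024, num_ids=150):
        super().__init__()
        # Identification: multi-class classification over person identities.
        self.id_classifier = nn.Linear(feat_dim, num_ids)
        # Verification: binary same/different decision on a feature pair.
        self.verif_classifier = nn.Linear(feat_dim, 2)

    def forward(self, feat_a, feat_b):
        id_logits_a = self.id_classifier(feat_a)
        id_logits_b = self.id_classifier(feat_b)
        # Element-wise squared difference is one common way to fuse a pair.
        verif_logits = self.verif_classifier((feat_a - feat_b) ** 2)
        return id_logits_a, id_logits_b, verif_logits

def combined_loss(id_logits_a, id_logits_b, verif_logits, ids_a, ids_b, same):
    ce = nn.CrossEntropyLoss()
    # Identification loss on each branch plus verification loss on the pair.
    return ce(id_logits_a, ids_a) + ce(id_logits_b, ids_b) + ce(verif_logits, same)
```

In this way the network receives both strong identity supervision and direct pairwise similarity supervision from the same features.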

The image-to-video person re-identification task is inherently a cross-modal problem, and its main challenge is how to match across different modalities, i.e. image and video. Directly using the information provided by the target task cannot fully bridge the “media gap”, i.e. the inconsistency between the representations of different modalities.

To address these limitations, we propose a novel end-to-end framework for image-to-video person re-identification that leverages cross-modal embeddings from different but related tasks. Concretely, the proposed framework can be divided into a feature representation sub-network and a verification-identification sub-network. The former consists of CNNs and LSTMs for extracting image features and video spatiotemporal features, and the latter combines the strengths of the identification and verification models to improve the discriminative ability of the learned feature representations.
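A minimal sketch of the two-branch feature representation idea follows: a CNN encodes the probe image, while per-frame CNN features of a gallery video are aggregated by an LSTM. The ResNet-18 backbone and the dimensions are assumptions for illustration, not the paper's exact architecture.

```python
# Sketch: CNN image branch + CNN-LSTM video branch (assumed backbone/dims).
import torch
import torch.nn as nn
import torchvision.models as models

class FeatureRepresentation(nn.Module):
    def __init__(self, feat_dim=1024):
        super().__init__()
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Identity()          # expose the 512-d pooled features
        self.cnn = backbone
        self.lstm = nn.LSTM(512, feat_dim, batch_first=True)
        self.img_proj = nn.Linear(512, feat_dim)

    def encode_image(self, img):             # img: (B, 3, H, W)
        return self.img_proj(self.cnn(img))  # (B, feat_dim)

    def encode_video(self, clip):            # clip: (B, T, 3, H, W)
        b, t = clip.shape[:2]
        frames = self.cnn(clip.flatten(0, 1)).view(b, t, -1)
        _, (h, _) = self.lstm(frames)        # final hidden state summarizes the clip
        return h[-1]                          # (B, feat_dim)
```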

Furthermore, cross-modal embeddings, i.e. image-to-text and video-to-text embedding layers, are integrated into the feature representation sub-network to learn common latent embeddings for the video-text and image-text modalities. Textual descriptions mainly focus on a person’s prominent distinctions and neglect background details, which are useless for person re-id. Hence, the learned multimodal embeddings attend to similar details, which helps image-to-video person re-id narrow the “media gap” and improves performance in the subsequent matching. For example, in a traditional person re-id pipeline, an image of a woman wearing a brown hat is easily mismatched to a video of a woman wearing a black hat; such errors can be avoided with the help of the textual descriptive information produced by the corresponding cross-modal embedding layers.
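One way such captioning-derived layers can be realized is as linear projections of visual and textual features into a shared, normalized space, trained so that matched pairs are closer than mismatched ones. The hinge-based ranking loss below is an illustrative assumption; the paper's exact objective may differ.

```python
# Sketch: a common latent embedding shared across modalities (assumed loss).
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalEmbedding(nn.Module):
    def __init__(self, visual_dim=1024, text_dim=300, embed_dim=512):
        super().__init__()
        self.visual_fc = nn.Linear(visual_dim, embed_dim)  # image/video branch
        self.text_fc = nn.Linear(text_dim, embed_dim)      # caption branch

    def forward(self, visual_feat, text_feat):
        v = F.normalize(self.visual_fc(visual_feat), dim=-1)
        t = F.normalize(self.text_fc(text_feat), dim=-1)
        return v, t

def ranking_loss(v, t, margin=0.2):
    # Pull matched visual-text pairs together, push mismatched pairs apart.
    sim = v @ t.t()                      # cosine similarities, shape (B, B)
    pos = sim.diag().unsqueeze(1)        # matched pairs lie on the diagonal
    cost = (margin + sim - pos).clamp(min=0)
    cost.fill_diagonal_(0)               # ignore the positive pairs themselves
    return cost.mean()
```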

Our main contributions can be summarized as follows.

  • We propose an end-to-end cross-modal framework for image-to-video person re-id, which leverages CNNs and LSTMs to extract the visual features and spatiotemporal motion features in the feature representation sub-network. Additionally, identification loss and verification loss are combined in the verification-identification sub-network so that the framework can simultaneously learn discriminative feature representations and a similarity metric.

  • The feature representation sub-network in our proposed framework integrates cross-modal embeddings for different but related tasks, to learn the common latent embeddings for video-text and image-text modalities. In this way, text information containing person’s explicit characteristics can be incorporated and the learned common latent embeddings would pay more attention to person’s prominent distinctions in the image or video.

  • We conduct extensive experiments on two publicly available person image-sequence datasets. The person re-identification results on the PRID-2011 and iLIDS-VID benchmarks verify the effectiveness of our proposed framework in comparison with several existing state-of-the-art approaches.

The remainder of this paper is organized as follows. Section 2 briefly reviews recent related work on person re-identification. Section 3 elaborates on the proposed framework for image-to-video person re-id and presents each of its components in detail. Experiments are reported in Section 4, and conclusions and future work are given in Section 5.

Section snippets

Related work

In this section, we briefly review previous work related to person re-identification and to the framework proposed in this paper.

Framework overview

The image-to-video person re-identification problem can be formulated as follows:

Problem formulation for image-to-video person re-id
Input: a probe image of a specific pedestrian; a gallery of videos captured by non-overlapping cameras.
Output: the gallery video showing the same pedestrian as the probe image.
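The retrieval step implied by this formulation can be sketched as ranking gallery videos by their distance to the probe image in the learned space. The `model` below is assumed to expose the `encode_image`/`encode_video` interface sketched earlier; the names are illustrative, not the paper's API.

```python
# Sketch: rank gallery videos against a probe image (assumed interface).
import torch

@torch.no_grad()
def rank_gallery(model, probe_img, gallery_clips):
    q = model.encode_image(probe_img.unsqueeze(0))                    # (1, D)
    g = torch.cat([model.encode_video(c.unsqueeze(0)) for c in gallery_clips])
    dists = torch.cdist(q, g).squeeze(0)                              # (N,) Euclidean
    return torch.argsort(dists)          # gallery indices, best match first
```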

When dealing with the image-to-video person re-identification task, two major challenges are inevitably encountered: how to represent the features of images and videos, and how to train the network to match the heterogeneous data.

Experiment

In this section, we evaluate the effectiveness of our proposed framework on the image-to-video person re-identification task. We first introduce the datasets, then describe the evaluation, and finally compare against several state-of-the-art approaches for image-to-video person re-identification.
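Re-id benchmarks such as PRID-2011 and iLIDS-VID are conventionally evaluated with Cumulative Matching Characteristic (CMC) curves. The following is a minimal sketch of the standard CMC rank-k computation, assuming a single correct gallery match per probe; the paper's exact split and protocol may differ.

```python
# Sketch: standard CMC rank-k accuracy (single-match protocol assumed).
import numpy as np

def cmc_scores(rankings, gallery_ids, probe_ids, max_rank=20):
    # rankings: one best-first array of gallery indices per probe;
    # gallery_ids / probe_ids: numpy arrays of identity labels.
    hits = np.zeros(max_rank)
    for ranking, pid in zip(rankings, probe_ids):
        # Position of the first gallery entry sharing the probe's identity.
        match_rank = np.where(gallery_ids[ranking] == pid)[0][0]
        if match_rank < max_rank:
            hits[match_rank:] += 1        # a hit at rank r counts for all k >= r
    return hits / len(probe_ids)          # CMC curve: rank-1..rank-max accuracy
```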

Conclusion and future work

In this paper, we presented a novel end-to-end framework for image-to-video person re-identification, which extracts visual and spatiotemporal information from images and videos with CNNs and LSTMs, and takes advantage of the strengths of the identification and verification models to improve the discriminative ability of the learned representations. Moreover, cross-modal embedding layers from different but related tasks are integrated into our framework to encourage the learned representations to focus on a person’s prominent distinctions.

Declaration of Competing Interest

None.

Acknowledgments

This work is supported by the National Social Science Foundation of China (Grant No. 15BGL048), the Hubei Province Science and Technology Support Project (Grant No. 2015BAA072), the Hubei Provincial Natural Science Foundation of China (Grant No. 2017CFA012), and the Fundamental Research Funds for the Central Universities (WUT: 2017II39GX).

References (38)

  • W. Li et al., DeepReID: deep filter pairing neural network for person re-identification, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.

  • Z. Li et al., Learning locally-adaptive decision functions for person verification, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.

  • S. Liao et al., Person re-identification by local maximal occurrence representation and metric learning, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

  • X. Liu et al., Semi-supervised coupled dictionary learning for person re-identification, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.

  • N. McLaughlin et al., Recurrent convolutional network for video-based person re-identification, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

  • B. Nguyen et al., A reliable image-to-video person re-identification based on feature fusion, 10th Asian Conference on Intelligent Information and Database Systems (ACIIDS), 2018.

  • S. Pedagadi et al., Local Fisher discriminant analysis for pedestrian re-identification, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.

  • Y. Sun et al., Deep learning face representation by joint identification-verification, Advances in Neural Information Processing Systems (NIPS), 2014.

  • C. Szegedy et al., Rethinking the inception architecture for computer vision, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

Editor: Prof. G. Sanniti di Baja.
