Image-to-video person re-identification with cross-modal embeddings

https://doi.org/10.1016/j.patrec.2019.03.003

Highlights

  • We propose an end-to-end cross-modal framework for image-to-video re-id.

  • Cross-modal embeddings for related tasks are integrated to learn features.

  • Experimental results verify the effectiveness of the proposed framework.

Abstract

Despite the great progress achieved, image-to-video person re-identification remains challenging due to its cross-modal nature. Current state-of-the-art approaches mainly concentrate on task-specific data, neglecting extra information from different but related tasks. In this paper, we propose an end-to-end neural network framework for image-to-video person re-identification with cross-modal embeddings learned from such extra information. Concretely, cross-modal embedding layers from image captioning and video captioning models are incorporated to learn common latent embeddings for multiple modalities. The learned multimodal embeddings are expected to focus on a person’s prominent distinctions, since textual descriptions generally pay close attention to a person’s explicit characteristics. In addition, the proposed framework employs CNNs and LSTMs to extract visual and spatiotemporal features, and combines the strengths of identification and verification models to improve the discriminative ability of the learned features. Experimental results demonstrate the effectiveness of our framework in narrowing the gap between heterogeneous data and achieving observable improvements in the image-to-video person re-identification task.

Introduction

Person re-identification (re-id) is the task of recognizing people across images and videos captured by non-overlapping camera views. With the widespread use of surveillance cameras and heightened awareness of public security, person re-id has attracted particular attention from the computer vision and pattern recognition communities [25], [28].

In general, there are two major types of deep learning models for person re-identification: verification models and identification models. Verification models take a pair of inputs and determine whether they belong to the same person; they leverage only weak re-id labels and can be regarded as performing binary classification or similarity regression [33]. Identification models, in contrast, aim at feature learning by treating person re-identification as a multi-class classification task [30], but lack a direct similarity measurement between input pairs. Because their advantages and limitations are complementary, the two models have been combined to improve performance in the homogeneous re-id task [11], [38]. To the best of our knowledge, however, this combination has not yet been applied to image-to-video re-id.
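To make the distinction concrete, the following is a minimal PyTorch sketch of a head that combines the two objectives: a shared identity classifier (identification) and a binary classifier on the squared difference of a feature pair (verification). The layer sizes and the squared-difference fusion are illustrative assumptions, not the paper's exact design.

```python
# Sketch: combining identification and verification objectives (assumed design).
import torch
import torch.nn as nn

class IdentVerifHead(nn.Module):
    def __init__(self, feat_dim=1024, num_ids=150):
        super().__init__()
        # Identification: multi-class classification over person identities.
        self.id_classifier = nn.Linear(feat_dim, num_ids)
        # Verification: binary same/different decision on a feature pair.
        self.verif_classifier = nn.Linear(feat_dim, 2)

    def forward(self, feat_a, feat_b):
        id_logits_a = self.id_classifier(feat_a)
        id_logits_b = self.id_classifier(feat_b)
        # Element-wise squared difference is one common way to fuse a pair.
        verif_logits = self.verif_classifier((feat_a - feat_b) ** 2)
        return id_logits_a, id_logits_b, verif_logits

def combined_loss(id_logits_a, id_logits_b, verif_logits, ids_a, ids_b, same):
    ce = nn.CrossEntropyLoss()
    # Identification loss on each branch plus verification loss on the pair.
    return ce(id_logits_a, ids_a) + ce(id_logits_b, ids_b) + ce(verif_logits, same)
```

In this way the network receives both strong identity supervision and direct pairwise similarity supervision from the same features.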

The image-to-video person re-identification task is inherently a cross-modal problem, and its main challenge is how to match across different modalities, i.e. image and video. Directly using the information provided by the target task cannot fully bridge the “media gap”, i.e. the inconsistency between the representations of different modalities.

To address these limitations, we propose a novel end-to-end framework for image-to-video person re-identification that leverages cross-modal embeddings from different but related tasks. Concretely, the proposed framework can be divided into a feature representation sub-network and a verification-identification sub-network. The former consists of CNNs and LSTMs for extracting image features and video spatiotemporal features, and the latter combines the strengths of the identification and verification models to improve the discriminative ability of the learned feature representations.
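A minimal sketch of the two-branch feature representation idea follows: a CNN encodes the probe image, while per-frame CNN features of a gallery video are aggregated by an LSTM. The ResNet-18 backbone and the dimensions are assumptions for illustration, not the paper's exact architecture.

```python
# Sketch: CNN image branch + CNN-LSTM video branch (assumed backbone/dims).
import torch
import torch.nn as nn
import torchvision.models as models

class FeatureRepresentation(nn.Module):
    def __init__(self, feat_dim=1024):
        super().__init__()
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Identity()          # expose the 512-d pooled features
        self.cnn = backbone
        self.lstm = nn.LSTM(512, feat_dim, batch_first=True)
        self.img_proj = nn.Linear(512, feat_dim)

    def encode_image(self, img):             # img: (B, 3, H, W)
        return self.img_proj(self.cnn(img))  # (B, feat_dim)

    def encode_video(self, clip):            # clip: (B, T, 3, H, W)
        b, t = clip.shape[:2]
        frames = self.cnn(clip.flatten(0, 1)).view(b, t, -1)
        _, (h, _) = self.lstm(frames)        # final hidden state summarizes the clip
        return h[-1]                          # (B, feat_dim)
```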

Furthermore, cross-modal embeddings, i.e. image-to-text and video-to-text embedding layers, are integrated into the feature representation sub-network to learn common latent embeddings for the video-text and image-text modalities. Textual descriptions mainly focus on a person’s prominent distinctions and neglect background details, which are useless for person re-id. Hence, the learned multimodal embeddings attend to similar details, which helps image-to-video person re-id narrow the “media gap” and improves performance in the subsequent matching. For example, in a traditional person re-id pipeline, an image of a woman wearing a brown hat is easily mismatched to a video of a woman wearing a black hat; such errors can be avoided with the help of the textual descriptive information produced by the corresponding cross-modal embedding layers.
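One way such captioning-derived layers can be realized is as linear projections of visual and textual features into a shared, normalized space, trained so that matched pairs are closer than mismatched ones. The hinge-based ranking loss below is an illustrative assumption; the paper's exact objective may differ.

```python
# Sketch: a common latent embedding shared across modalities (assumed loss).
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalEmbedding(nn.Module):
    def __init__(self, visual_dim=1024, text_dim=300, embed_dim=512):
        super().__init__()
        self.visual_fc = nn.Linear(visual_dim, embed_dim)  # image/video branch
        self.text_fc = nn.Linear(text_dim, embed_dim)      # caption branch

    def forward(self, visual_feat, text_feat):
        v = F.normalize(self.visual_fc(visual_feat), dim=-1)
        t = F.normalize(self.text_fc(text_feat), dim=-1)
        return v, t

def ranking_loss(v, t, margin=0.2):
    # Pull matched visual-text pairs together, push mismatched pairs apart.
    sim = v @ t.t()                      # cosine similarities, shape (B, B)
    pos = sim.diag().unsqueeze(1)        # matched pairs lie on the diagonal
    cost = (margin + sim - pos).clamp(min=0)
    cost.fill_diagonal_(0)               # ignore the positive pairs themselves
    return cost.mean()
```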

Our main contributions can be summarized as follows.

  • We propose an end-to-end cross-modal framework for image-to-video person re-id, which leverages CNNs and LSTMs to extract the visual features and spatiotemporal motion features in the feature representation sub-network. Additionally, identification loss and verification loss are combined in the verification-identification sub-network so that the framework can simultaneously learn discriminative feature representations and a similarity metric.

  • The feature representation sub-network in our proposed framework integrates cross-modal embeddings for different but related tasks, to learn the common latent embeddings for video-text and image-text modalities. In this way, text information containing person’s explicit characteristics can be incorporated and the learned common latent embeddings would pay more attention to person’s prominent distinctions in the image or video.

  • We conduct extensive experiments on two publicly available person image-sequence datasets. The person re-identification results on the PRID-2011 and iLIDS-VID benchmarks verify the effectiveness of our proposed framework in comparison with several existing state-of-the-art approaches.

The remainder of this paper is organized as follows. Section 2 briefly reviews recent related work on person re-identification. Section 3 elaborates on the proposed framework for image-to-video person re-id and presents each of its components in detail. Experiments are reported in Section 4, and conclusions and future work are given in Section 5.

Section snippets

Related work

In this section, we briefly review previous work related to person re-identification and to the framework proposed in this paper.

Framework overview

The image-to-video person re-identification problem can be formulated as follows:

Problem formulation for image-to-video person re-id
Input: a probe image of a specific pedestrian; a gallery of videos captured by non-overlapping cameras.
Output: the gallery video showing the same pedestrian as the probe image.
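The retrieval step implied by this formulation can be sketched as ranking gallery videos by their distance to the probe image in the learned space. The `model` below is assumed to expose the `encode_image`/`encode_video` interface sketched earlier; the names are illustrative, not the paper's API.

```python
# Sketch: rank gallery videos against a probe image (assumed interface).
import torch

@torch.no_grad()
def rank_gallery(model, probe_img, gallery_clips):
    q = model.encode_image(probe_img.unsqueeze(0))                    # (1, D)
    g = torch.cat([model.encode_video(c.unsqueeze(0)) for c in gallery_clips])
    dists = torch.cdist(q, g).squeeze(0)                              # (N,) Euclidean
    return torch.argsort(dists)          # gallery indices, best match first
```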

When dealing with the image-to-video person re-identification task, two major challenges are inevitably encountered: how to represent the features of images and videos, and how to train the network to match the heterogeneous data.

Experiment

In this section, we evaluate the effectiveness of our proposed framework on the image-to-video person re-identification task. We first introduce the datasets, then describe the evaluation, and finally compare against several state-of-the-art approaches for image-to-video person re-identification.
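Re-id benchmarks such as PRID-2011 and iLIDS-VID are conventionally evaluated with Cumulative Matching Characteristic (CMC) curves. The following is a minimal sketch of the standard CMC rank-k computation, assuming a single correct gallery match per probe; the paper's exact split and protocol may differ.

```python
# Sketch: standard CMC rank-k accuracy (single-match protocol assumed).
import numpy as np

def cmc_scores(rankings, gallery_ids, probe_ids, max_rank=20):
    # rankings: one best-first array of gallery indices per probe;
    # gallery_ids / probe_ids: numpy arrays of identity labels.
    hits = np.zeros(max_rank)
    for ranking, pid in zip(rankings, probe_ids):
        # Position of the first gallery entry sharing the probe's identity.
        match_rank = np.where(gallery_ids[ranking] == pid)[0][0]
        if match_rank < max_rank:
            hits[match_rank:] += 1        # a hit at rank r counts for all k >= r
    return hits / len(probe_ids)          # CMC curve: rank-1..rank-max accuracy
```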

Conclusion and future work

In this paper, we presented a novel end-to-end framework for image-to-video person re-identification, which extracts visual and spatiotemporal information from images and videos with CNNs and LSTMs, and takes advantage of the strengths of the identification and verification models to improve the discriminative ability of the learned representations. Moreover, cross-modal embedding layers from different but related tasks are integrated into our framework to encourage the learned representations to focus on a person’s prominent distinctions.

Declaration of Competing Interest

None.

Acknowledgments

This work is supported by the National Social Science Foundation of China (Grant No. 15BGL048), the Hubei Province Science and Technology Support Project (Grant No. 2015BAA072), the Hubei Provincial Natural Science Foundation of China (Grant No. 2017CFA012), and the Fundamental Research Funds for the Central Universities (WUT: 2017II39GX).

References (38)

  • W. Li et al., DeepReID: deep filter pairing neural network for person re-identification, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.

  • Z. Li et al., Learning locally-adaptive decision functions for person verification, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.

  • S. Liao et al., Person re-identification by local maximal occurrence representation and metric learning, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

  • X. Liu et al., Semi-supervised coupled dictionary learning for person re-identification, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.

  • N. McLaughlin et al., Recurrent convolutional network for video-based person re-identification, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

  • B. Nguyen et al., A reliable image-to-video person re-identification based on feature fusion, 10th Asian Conference on Intelligent Information and Database Systems (ACIIDS), 2018.

  • S. Pedagadi et al., Local Fisher discriminant analysis for pedestrian re-identification, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.

  • Y. Sun et al., Deep learning face representation by joint identification-verification, Advances in Neural Information Processing Systems (NIPS), 2014.

  • C. Szegedy et al., Rethinking the inception architecture for computer vision, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

Editor: Prof. G. Sanniti di Baja.
