RGB-IR cross-modality person ReID based on teacher-student GAN model
Introduction
Person ReID, also referred to as pedestrian ReID ([29]), is designed to match specific pedestrians in images or video sequences. The main challenge of ReID is that intra-class variations (the same person in different situations) are usually significant due to changes in camera viewing conditions, such as viewpoint or situation differences, which makes it challenging to identify the same person. Meanwhile, inter-class variations (different people in the same situation) also influence ReID performance.
In recent years, most existing works in person ReID learn discriminative features of person identity with a specifically designed backbone model ([1], [21], [22]). There are also works focusing on the problems of occlusion ([11], [37]), different poses ([17]), illumination changes ([31]), lack of labels ([23], [24]) and resolution changes ([12]). These works all assume the RGB-RGB camera setting, where both the query images and the gallery images are in RGB mode.
However, RGB-RGB camera ReID is greatly restricted when lighting is weak or unavailable. A person may appear in one camera during the day and reemerge in another camera at night. In such a case, the RGB images captured at night carry little information that is useful for ReID because of the darkness. As shown in Fig 1, even human eyes can hardly extract any person identity information from these dark, noisy images.
An infrared camera forms a grey-scale (single-channel) image from infrared radiation, which provides in-the-dark visibility without requiring a visible light source. Thus, using both RGB and IR images allows the two modalities to complement each other and enhances person ReID performance. As shown in Fig 2, the query images are all in the IR modality and provide much more information than those in Fig 1.
However, few researchers have studied such RGB-IR cross-modality person ReID. The main challenge is that different pedestrians can appear to be very similar in the same modality, while the same pedestrian under different modalities can look quite different. Another challenge is that IR images only have grey-scale pixels, which provide much less information compared to RGB images, making it more challenging to extract effective features for the task of ReID.
In this paper, we propose a Teacher-Student GAN based cross-modality person ReID model (TS-GAN). The critical insight of our approach is a novel network in which the Student ReID module is guided by a pretrained Teacher module to encourage closeness between RGB and IR ReID features. This tremendously improves the quality of the features used for ReID classification and reduces the gap between the two modalities. To improve the model's effectiveness, we introduce several innovations, summarised as follows:
- 1.
The IR ReID Teacher module is pretrained on the Real IR images in the training set and obtains very high accuracy. We then use it as the teacher to guide feature learning in the Student ReID module.
- 2.
We use a joint cycle-consistency GAN with a joint discriminator to generate the corresponding Fake IR person images from the input Real RGB person images, thus obtaining pairwise person images under different modalities. The (Real RGB, Fake IR) image pairs are then used to train the Student ReID module with an MSE reconstruction loss so that the cross-modality gap can be reduced.
- 3.
To enhance the feature extraction ability of the Student module, we also use two further MSE losses: one between the Real IR image features from the Teacher module and those from the Student module, and one between the Fake IR image features from both modules (see the sketch after this list).
- 4.
Unlike other GAN-based cross-modality ReID methods, our model requires the GAN module only at the training stage. During testing, images are simply fed forward through the ReID backbone module without the involvement of the GAN, which is more resource-efficient.
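A minimal sketch of how these feature-level objectives could be combined, assuming hypothetical module names (teacher_enc is the frozen pretrained IR Teacher, student_enc is the Student ReID encoder, g_rgb2ir is the RGB-to-IR generator) that stand in for the paper's actual components; identification losses, loss weights, and the GAN training itself are omitted:

```python
import torch
import torch.nn.functional as F

def ts_feature_losses(real_rgb, real_ir, teacher_enc, student_enc, g_rgb2ir):
    # Generator and Teacher are treated as fixed here; only the Student receives
    # gradients from these losses in this simplified sketch.
    with torch.no_grad():
        fake_ir = g_rgb2ir(real_rgb)              # paired Fake IR for each Real RGB
        t_real_ir = teacher_enc(real_ir)          # Teacher features of Real IR
        t_fake_ir = teacher_enc(fake_ir)          # Teacher features of Fake IR

    s_rgb = student_enc(real_rgb)                 # Student features of Real RGB
    s_real_ir = student_enc(real_ir)              # Student features of Real IR
    s_fake_ir = student_enc(fake_ir)              # Student features of Fake IR

    loss_pair = F.mse_loss(s_rgb, s_fake_ir)      # (Real RGB, Fake IR) reconstruction loss
    loss_real = F.mse_loss(s_real_ir, t_real_ir)  # Teacher guidance on Real IR features
    loss_fake = F.mse_loss(s_fake_ir, t_fake_ir)  # Teacher guidance on Fake IR features
    return loss_pair + loss_real + loss_fake
```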
Section snippets
RGB-RGB person ReID
Most researchers have focused on traditional RGB-RGB person ReID. One primary approach is metric learning, which formalizes the problem as supervised metric learning in which a projection matrix is sought ([5], [28], [30]). Another primary approach is to learn appropriate features associated with the same identity using feature distance information ([9]) on a backbone module ([1], [18]), such as ResNet50 ([8]). All of these works focus on RGB-RGB person ReID, which may fail in some
Overview
The overall model structure of our TS-GAN model is shown in Fig 3. The whole model consists of three main parts: (1) the RGB-IR image generation module, (2) the ReID backbone, and (3) the RGB-IR TS module. We use subscripts “S” and “T” to distinguish blocks belonging to the Student or Teacher modules.
As shown in Fig 3, the blocks in the green dotted rectangle are the generator and joint discriminator for IR images. The block in the red dotted rectangle is the ReID feature encoder. The block in the blue
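A minimal skeleton of the three-part layout described above, with hypothetical class and argument names (the concrete generator, discriminator, and backbone architectures are placeholders, not the paper's exact designs):

```python
import torch.nn as nn

class TSGAN(nn.Module):
    """Sketch of the three main parts: image generation module, ReID backbone, TS module."""
    def __init__(self, generator, discriminator, teacher_backbone, student_backbone):
        super().__init__()
        self.g_rgb2ir = generator            # (1) RGB-IR image generation module
        self.d_joint = discriminator         #     joint discriminator for IR images
        self.student = student_backbone      # (2) ReID backbone (Student encoder)
        self.teacher = teacher_backbone      # (3) RGB-IR TS module: pretrained IR Teacher
        for p in self.teacher.parameters():  # Teacher is pretrained and kept fixed
            p.requires_grad_(False)

    def forward(self, images):
        # At test time only the Student backbone is used; the GAN is not involved.
        return self.student(images)
```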
Dataset and evaluation protocol
SYSU-MM01 ([27]) is the most popular and newest dataset for RGB-IR cross-modality person ReID. It contains images captured by six cameras, including two IR cameras and four RGB ones. RegDB ([15]) was collected with a dual-camera system. It contains 412 identities, each with 10 different thermal (IR) images and 10 different visible (RGB) images.
Our experiments follow the standard evaluation protocol of existing RGB-IR cross-modality ReID methods. For SYSU-MM01, there are two evaluation
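As an illustration of the standard metrics used by these protocols (CMC rank-1 accuracy and mAP between IR queries and RGB galleries), a minimal single-trial sketch follows; the official SYSU-MM01 protocol additionally averages over repeated random gallery selections and camera settings, which is omitted here, and the function name is hypothetical:

```python
import numpy as np

def rank1_and_map(q_feats, g_feats, q_ids, g_ids):
    # L2-normalize features and use cosine distance (smaller = more similar).
    q = q_feats / np.linalg.norm(q_feats, axis=1, keepdims=True)
    g = g_feats / np.linalg.norm(g_feats, axis=1, keepdims=True)
    dist = 1.0 - q @ g.T

    rank1_hits, aps = [], []
    for i in range(len(q_ids)):
        order = np.argsort(dist[i])                         # gallery sorted by similarity
        matches = (g_ids[order] == q_ids[i]).astype(float)  # 1 where identity matches
        rank1_hits.append(matches[0])                       # rank-1 hit or miss
        if matches.sum() > 0:                               # average precision per query
            precision = np.cumsum(matches) / (np.arange(len(matches)) + 1)
            aps.append((precision * matches).sum() / matches.sum())
    return float(np.mean(rank1_hits)), float(np.mean(aps))
```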
Conclusion
In this paper, we proposed a novel TS-GAN model to learn common representation features for RGB-IR cross-modality person images. We designed the IR joint GAN and the IR Teacher module to enhance the Student ReID backbone and reduce the domain gap between inputs from different modalities. Comprehensive experiments on the challenging cross-modality person ReID datasets SYSU-MM01 and RegDB have demonstrated that our approach outperforms state-of-the-art methods in ReID accuracy.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References (37)
- et al., Deep feature learning with relative distance comparison for person re-identification, Pattern Recognit. (2015)
- et al., Multi-level factorisation net for person re-identification, CVPR (2018)
- et al., Darkrank: accelerating deep metric learning via cross sample similarities transfer, AAAI (2018)
- et al., Cross-modality person re-identification with generative adversarial training, IJCAI (2018)
- et al., Histograms of oriented gradients for human detection, CVPR (2005)
- et al., Generative adversarial nets
- et al., Hsme: hypersphere manifold embedding for visible thermal person re-identification, Proceedings of the AAAI Conference on Artificial Intelligence (2019)
- et al., Deep residual learning for image recognition, CVPR (2016)
- A. Hermans, L. Beyer, B. Leibe, In defense of the triplet loss for person re-identification, ...
- G. Hinton, O. Vinyals, J. Dean, Distilling the knowledge in a neural network, ...
- Vrstc: occlusion-free video person re-identification, CVPR
- Recover and identify: a generative dual model for cross-resolution person re-identification, ICCV
- Person re-identification by local maximal occurrence representation and metric learning, CVPR
- Bag of tricks and a strong baseline for deep person re-identification, CVPR Workshops
- Person recognition system based on a combination of body images from visible light and thermal cameras, Sensors
- Pytorch: an imperative style, high-performance deep learning library, NeurIPS
- Pose-normalized image generation for person re-identification, ECCV
- Dual attention matching network for context-aware feature sequence based person re-identification, CVPR