Learning multi-granularity features from multi-granularity regions for person re-identification

doi:10.1016/j.neucom.2020.12.016

Neurocomputing

Volume 432, 7 April 2021, Pages 206-215

https://doi.org/10.1016/j.neucom.2020.12.016

Abstract

Part-based methods for person re-identification have been widely studied. Although existing part-based methods explore multiple parts, they use only coarse-grained features of these parts. Thus, too much fine-grained information is discarded, which limits their ability to extract detailed discriminative features. To tackle this problem, we propose a novel person re-identification network that learns discriminative features across multiple granularities from body regions which are themselves multi-grained. Specifically, we detect multi-granularity body regions at different stages of a backbone network, and multi-granularity features are learned from body regions with corresponding granularities. To overcome the severe mismatching problem of fine-grained regions and to learn discriminative features, the detection of multi-granularity body regions and the learning of multi-granularity features are jointly optimized. This joint optimization pushes the learned features to concentrate on body regions. Moreover, with the body regions well located, the multi-granularity features can be well aligned. Extensive experiments on four popular datasets show that our method achieves state-of-the-art performance.

Introduction

Person re-identification (ReID) aims at identifying a specific person across a set of surveillance cameras over time. It plays a significant role in many vision-related applications, e.g., video surveillance, content-based video retrieval, and identification from CCTV cameras. Compared to other computer vision tasks, ReID is highly challenging due to differences in background, deviations in shape, and occlusion of the subjects [1], [2].

Image representation learning plays a crucial role in person ReID. As shown in Fig. 1(a), images are usually fed to deep convolutional neural networks (CNNs) to extract the final representation. However, the final features are often too coarse and lose too much detailed information. To solve this problem, many part-based models have been proposed [2], [3], [4], [5]. By learning discriminative local features as a complement to global features, they can extract additional rich features and thus achieve better ReID performance.

According to how local regions are generated, part-based models can be divided into three categories: pose-based, attention-based, and stripe-based. In pose-based methods, prior knowledge, e.g., pose estimation or human segmentation, is used to locate local regions of a human body accurately [6], [2], [7], [8], [9]. These methods handle local regions of a human body with extra convolutional branches. The attention-based methods learn attention masks to select a focused foreground [10], [11], [12]. In the stripe-based category, the feature maps are split into several predefined horizontal stripes [4], [13], [14], [15]. However, stripe-based methods all perform ReID with the features from the last layer, which have coarse granularity and contain limited local information. Furthermore, these methods assume that person images are well aligned, so that the corresponding stripes can be matched. In practice, misalignment is very common in person ReID.
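
The stripe-based idea described above can be sketched in a few lines. This is only an illustration, not the implementation of any cited method: the function name `stripe_features`, the use of average pooling, and the toy feature map are all assumptions made here.

```python
import numpy as np

def stripe_features(feature_map: np.ndarray, num_stripes: int) -> np.ndarray:
    """Split a (C, H, W) feature map into horizontal stripes and average-pool each.

    Returns an array of shape (num_stripes, C). Average pooling is an
    assumption of this sketch, not something the text prescribes.
    """
    channels, height, _ = feature_map.shape
    if height % num_stripes != 0:
        raise ValueError("height must be divisible by num_stripes")
    stripe_h = height // num_stripes
    # View the map as (C, num_stripes, stripe_h, W), then average each stripe.
    stripes = feature_map.reshape(channels, num_stripes, stripe_h, -1)
    return stripes.mean(axis=(2, 3)).T

# Toy feature map: 2 channels, 6 rows, 4 columns.
fmap = np.arange(2 * 6 * 4, dtype=np.float64).reshape(2, 6, 4)
parts = stripe_features(fmap, num_stripes=3)
print(parts.shape)  # (3, 2)
```

Because the stripes are fixed in advance, the same stripe index only covers the same body part in two images when the person is similarly positioned in both, which is the alignment assumption discussed above.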

However, methods in all three categories have one common drawback: though multiple parts/regions are explored, only coarse-grained features of these parts are utilized, as shown in Fig. 1(b). The local regions are first cropped either at the input [6] or at different stages in backbone CNNs [2], and are then fed into subsequent convolutional branches, so the final features of these regions are coarse-grained. This limits the diversity and discriminative power of the final features.

To tackle this problem, we propose to learn multi-granularity features from multi-granularity body regions for person ReID. We detect local regions across multiple granularities at different stages of a backbone network. As shown in Fig. 1(c), we detect four fine-grained body parts in the first stage, two body parts in the second stage, the whole body region in the third stage, and the whole image in the fourth stage. For regions in each granularity, instead of feeding them to extra local branches afterwards, we directly apply a feature extraction module to learn the corresponding features. In this manner, the original granularities of the features are kept, and more detailed information is retained. Finally, multi-granularity features learned from multi-granularity body regions are fused for person ReID. Fig. 1 clearly illustrates the difference between our model and existing part-based models.
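
As a rough illustration of this layout (four parts at the first stage, two at the second, the body at the third, the whole image at the fourth), the sketch below pools one feature vector per region from the stage that matches its granularity and concatenates them. Everything concrete here is an assumption for illustration: the stage shapes, the hand-written boxes, average pooling in place of the feature extraction module, and concatenation as the fusion step. In the paper the regions are detected by the network rather than given.

```python
import numpy as np

def pool_region(fmap, box):
    """Average-pool a (C, H, W) feature map inside a normalized box (y0, x0, y1, x1)."""
    _, h, w = fmap.shape
    y0, x0, y1, x1 = box
    r0, c0 = int(y0 * h), int(x0 * w)
    r1 = max(int(round(y1 * h)), r0 + 1)  # keep at least one row
    c1 = max(int(round(x1 * w)), c0 + 1)  # keep at least one column
    return fmap[:, r0:r1, c0:c1].mean(axis=(1, 2))

rng = np.random.default_rng(0)
# Hypothetical stage outputs: resolution shrinks and channels grow with depth.
stage_shapes = [(8, 32, 16), (16, 16, 8), (32, 8, 4), (64, 4, 2)]
stages = [rng.normal(size=shape) for shape in stage_shapes]

# Hypothetical hand-written boxes, one list per stage (fine to coarse).
regions = [
    [(0.00, 0.1, 0.25, 0.9), (0.25, 0.1, 0.50, 0.9),
     (0.50, 0.1, 0.75, 0.9), (0.75, 0.1, 1.00, 0.9)],   # four fine-grained parts
    [(0.0, 0.1, 0.5, 0.9), (0.5, 0.1, 1.0, 0.9)],        # two body parts
    [(0.0, 0.1, 1.0, 0.9)],                              # whole body
    [(0.0, 0.0, 1.0, 1.0)],                              # whole image
]

# Fuse by concatenation (an assumption): one vector per region, fine to coarse.
descriptor = np.concatenate(
    [pool_region(fmap, box) for fmap, boxes in zip(stages, regions) for box in boxes]
)
print(descriptor.shape)  # (160,)
```

The point of the sketch is only that each region is described by features taken at its own stage, so fine-grained parts keep the higher-resolution, shallower features instead of being pushed through further layers.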

It is worth noting that we do not simply conduct the multi-granularity region detection and multi-granularity feature learning in a straightforward two-stage manner. There are two challenging problems in our task. First, when fine-grained features are used, misalignment becomes a serious problem, especially for fine-grained regions, since the fine-grained features are extracted from shallow layers in which the receptive field is small and are thus very sensitive to translation, pose variations, etc. This may be the reason why current works only use coarse-grained features for all parts. Second, the fine-grained features are very sensitive to noise and other image content that is not helpful for ReID. Thus, we face the problem of how to ensure that the extracted fine-grained features are discriminative for person ReID. To tackle these problems, we design a model to jointly optimize the multi-granularity region detection and multi-granularity feature learning.

In summary, the contributions of this paper are threefold:

• We learn features across multiple granularities from the backbone network without feeding them to extra local branches. In this manner, the final features are diverse: both fine-grained features with rich details and coarse-grained abstract features are well preserved.
• Our multi-granularity features are learned from multi-granularity parts. The location of multi-granularity parts and the learning of multi-granularity features are jointly optimized. This joint optimization pushes the learned features to focus on human body regions. Moreover, with the multi-granularity parts accurately located, the multi-granularity features can be well aligned, and the background noise can be substantially reduced.
• The proposed method achieves the best performance on four person ReID datasets. Extensive experiments on these datasets verify the effectiveness of our approach. MGRe achieves 90.1%/96.2% mAP/Rank-1 on Market1501 and 82.0%/91.3% mAP/Rank-1 on DukeMTMC-reID.
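
For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how Rank-1 and mAP can be computed from a query-gallery distance matrix. It is a simplification and not the evaluation code behind the numbers above: the function name `rank1_and_map` and the toy data are invented here, and the camera-based filtering of gallery images used in standard ReID protocols is deliberately omitted.

```python
import numpy as np

def rank1_and_map(dist, query_ids, gallery_ids):
    """Rank-1 accuracy and mean average precision.

    dist: (num_query, num_gallery) distances; smaller means more similar.
    Simplified: no same-camera filtering is applied.
    """
    dist = np.asarray(dist)
    gallery_ids = np.asarray(gallery_ids)
    rank1_hits, average_precisions = [], []
    for q, row in enumerate(dist):
        order = np.argsort(row)                      # gallery sorted by distance
        matches = gallery_ids[order] == query_ids[q]
        if not matches.any():
            continue                                 # skip queries with no true match
        rank1_hits.append(float(matches[0]))
        hit_positions = np.flatnonzero(matches)      # 0-based ranks of true matches
        precisions = np.arange(1, len(hit_positions) + 1) / (hit_positions + 1)
        average_precisions.append(precisions.mean())
    return float(np.mean(rank1_hits)), float(np.mean(average_precisions))

# Toy example: 2 queries, 3 gallery images.
dist = [[0.1, 0.9, 0.4],
        [0.2, 0.8, 0.3]]
rank1, mean_ap = rank1_and_map(dist, query_ids=[1, 2], gallery_ids=[1, 2, 1])
print(rank1, mean_ap)
```

In the toy example the first query retrieves both of its matches first, while the second query finds its only match at rank 3, so Rank-1 is 0.5 and mAP is (1 + 1/3) / 2.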

Section snippets

Related work

Regarding discriminative feature learning for person ReID, many methods have been proposed to enhance certain regions in the feature maps. According to how the regions are generated, these methods can be divided into three categories: pose-based, stripe-based, and attention-based.

Proposed method

In previous part-based methods, although multiple parts are utilized, only the most coarse-grained features of these parts are used to represent an image for ReID, i.e., the outputs of the last convolution layer of the feature extraction net. These coarse features are highly abstract and robust, but they discard too much detailed information. In this paper, we propose a novel method termed multiple granularities ReID (MGRe). As shown in Fig. 2, features and regions with fine-to-coarse

Datasets

We present our experiments on the following four widely used person ReID datasets.

Market1501. This dataset [42] contains 32,668 images of 1,501 identities. Bounding boxes are produced by a pedestrian detector based on the deformable part model. The dataset is divided into a training set with 12,936 images of 751 persons and a testing set of 750 persons containing 3,368 query images and 19,732 gallery images.
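
The split figures quoted above can be checked with simple arithmetic. The snippet below uses only the numbers from the paragraph; the dictionary keys are names chosen here for illustration. The identity counts add up to the stated total, and the training and gallery images together account for the 32,668 images, which suggests (as a reading of the arithmetic, not a statement from the text) that the query images are counted separately.

```python
# Figures copied from the Market1501 description above.
market1501 = {
    "total_images": 32_668,
    "total_ids": 1_501,
    "train_images": 12_936,
    "train_ids": 751,
    "test_ids": 750,
    "query_images": 3_368,
    "gallery_images": 19_732,
}

ids_sum = market1501["train_ids"] + market1501["test_ids"]
train_plus_gallery = market1501["train_images"] + market1501["gallery_images"]

print(ids_sum == market1501["total_ids"])
print(train_plus_gallery == market1501["total_images"])
```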

DukeMTMC-reID. In this dataset [43], [44], there are 1,404 identities appearing in more than two

Conclusion

In this paper, we propose a novel multiple granularities ReID approach for learning discriminative local and global features. In MGRe, features with fine-to-coarse granularities are learned from corresponding fine-to-coarse grained body regions in different stages of the backbone network. Thus, we can obtain discriminative features where both fine-grained details and coarse-grained abstract information are learned. In addition, the location of body region and the learning of ReID features are

CRediT authorship contribution statement

Kaiwen Yang: Conceptualization, Methodology, Visualization, Software, Writing - original draft. Jiwei Yang: Data curation, Software, Investigation, Validation. Xinmei Tian: Supervision, Investigation, Funding acquisition, Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

This work was supported by the National Natural Science Foundation of China under Grant 61872329.

Kaiwen Yang received the B.E. degree from Xidian University, Xi’an, China, in 2019. He is currently working towards the master's degree in the CAS Key Laboratory of Technology in Geo-Spatial Information Processing and Application System, University of Science and Technology of China, Hefei, China. His research interests lie primarily in person re-identification, representation learning and machine learning.

References (52)

F. Yang et al. Attention driven person re-identification. Pattern Recognition (2019)
S. Ding et al. Deep feature learning with relative distance comparison for person re-identification. Pattern Recognition (2015)
X. Bai et al. Deep-person: Learning discriminative deep features for person re-identification. Pattern Recognition (2020)
F. Zheng et al. Learning cross-view binary identities for fast person re-identification
H. Zhao et al. Spindle net: Person re-identification with human body region guided feature decomposition and fusion
J. Yang et al. Local convolutional neural networks for person re-identification
Y. Sun et al. Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline)
L. Zheng et al. Pose invariant embedding for deep person re-identification. IEEE Transactions on Image Processing (2019)
C. Su et al. Pose-driven deep convolutional model for person re-identification
M.M. Kalayeh et al. Human semantic parsing for person re-identification
J. Miao et al. Pose-guided feature alignment for occluded person re-identification
L. He et al. Foreground-aware pyramid reconstruction for alignment-free occluded person re-identification
W. Li et al. Harmonious attention network for person re-identification
C.-P. Tay et al. Attribute attention network for person re-identifications
M. Zheng et al. Re-identification with consistent attentive siamese networks
Y. Fu, Y. Wei, Y. Zhou, H. Shi, G. Huang, X. Wang, Z. Yao, T. Huang, Horizontal pyramid matching for person...
G. Wang et al. Learning discriminative features with multiple granularities for person re-identification
F. Zheng et al. Pyramidal person re-identification via multi-loss dynamic training
W. Li et al. Person re-identification by deep joint learning of multi-loss classification
M. Saquib Sarfraz et al. A pose-sensitive embedding for person re-identification with expanded cross neighborhood re-ranking
E. Ustinova et al. Multi-region bilinear convolutional neural networks for person re-identification
S. Gao et al. Pose-guided visible part matching for occluded person reid
R. Zhao et al. Unsupervised salience learning for person re-identification
G. Wang et al. High-order information matters: learning relation and topology for occluded person re-identification
R. Hou et al. Interaction-and-aggregation network for person re-identification
X. Chen et al. Salience-guided cascaded suppression network for person re-identification


Jiwei Yang received the PhD degree in information and communication engineering from University of Science and Technology of China, Hefei, China, in 2019. He is currently an algorithm engineer at Huawei, China. His current research interests include machine learning and its applications to computer vision.

Xinmei Tian received the B.E. and Ph.D. degrees from the University of Science and Technology of China, Hefei, China, in 2005 and 2010, respectively. She is an Associate Professor in the CAS Key Laboratory of Technology in Geo-Spatial Information Processing and Application System, University of Science and Technology of China, Hefei, China. Her current research interests include multimedia information retrieval and machine learning. She received the Excellent Doctoral Dissertation of Chinese Academy of Sciences award in 2012 and the Nomination of National Excellent Doctoral Dissertation award in 2013.
