Elsevier

Neurocomputing

Volume 432, 7 April 2021, Pages 206-215

Learning multi-granularity features from multi-granularity regions for person re-identification

https://doi.org/10.1016/j.neucom.2020.12.016

Abstract

Part-based methods for person re-identification have been widely studied. Although existing part-based methods explore multiple parts, they use only coarse-grained features of these parts. As a result, much fine-grained information is discarded, which limits their ability to extract detailed discriminative features. To tackle this problem, we propose a novel person re-identification network that learns discriminative features across multiple granularities from body regions that are themselves multi-grained. Specifically, we detect multi-granularity body regions at different stages of a backbone network, and learn multi-granularity features from the body regions of corresponding granularity. To overcome the severe mismatching problem of fine-grained regions and to learn discriminative features, the detection of multi-granularity body regions and the learning of multi-granularity features are jointly optimized. This joint optimization pushes the learned features to concentrate on body regions. Moreover, with the body regions well located, the multi-granularity features can be well aligned. Extensive experiments on four popular datasets show that our method achieves state-of-the-art performance.

Introduction

Person re-identification (ReID) aims to identify a specific person across a set of surveillance cameras over time. It plays a significant role in many vision-related applications, e.g., video surveillance, content-based video retrieval, and identification from CCTV cameras. Compared with other computer vision tasks, ReID is highly challenging because of differences in background, variations in shape, and occlusion of the subjects [1], [2].

Image representation learning plays a crucial role in person ReID. As shown in Fig. 1(a), images are usually fed to deep convolutional neural networks (CNNs) to extract the final representation. However, the final features are often too coarse and lose too much detailed information. To solve this problem, many part-based models have been proposed [2], [3], [4], [5]. By learning discriminative local features as a complement to global features, they can extract additional rich features and thus achieve better ReID performance.

According to how the local regions are generated, part-based models can be divided into three categories: pose-based, attention-based, and stripe-based. In pose-based methods, prior knowledge, e.g., pose estimation or human segmentation, is used to locate local regions of a human body accurately [6], [2], [7], [8], [9]. These methods process the local regions of a human body with extra convolutional branches. Attention-based methods learn attention masks to select a focused foreground [10], [11], [12]. In the stripe-based category, the feature maps are split into several predefined horizontal stripes [4], [13], [14], [15]. However, these methods all perform ReID with features from the last layer, which have coarse granularity and contain limited local information. Furthermore, they assume that person images are well aligned, so that the corresponding stripes can be matched, whereas misalignment is very common in person ReID.
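The stripe-based partition described above can be illustrated with a minimal NumPy sketch. It is not taken from any of the cited methods: the function name `stripe_features`, the feature map shape, and the choice of average pooling are assumptions made for illustration only.

```python
import numpy as np


def stripe_features(feature_map: np.ndarray, num_stripes: int) -> np.ndarray:
    """Split a (C, H, W) feature map into horizontal stripes and average-pool each.

    Returns an array of shape (num_stripes, C), one descriptor per stripe.
    """
    channels, height, width = feature_map.shape
    if height % num_stripes != 0:
        raise ValueError("height must be divisible by num_stripes in this sketch")
    stripe_height = height // num_stripes
    # View H as (num_stripes, stripe_height), then average within each stripe.
    stripes = feature_map.reshape(channels, num_stripes, stripe_height, width)
    return stripes.mean(axis=(2, 3)).T


# Hypothetical last-layer feature map: 256 channels, 24 rows, 8 columns.
rng = np.random.default_rng(0)
fmap = rng.standard_normal((256, 24, 8))
parts = stripe_features(fmap, num_stripes=6)
print(parts.shape)  # (6, 256)
```

Because the stripe boundaries are fixed in advance, two images of the same person are compared stripe by stripe, which is why a vertical shift of the body in the bounding box leads to the misalignment discussed above.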

Methods in all three categories share one common drawback: although multiple parts/regions are explored, only coarse-grained features of these parts are utilized, as shown in Fig. 1(b). The local regions are first cropped either at the input [6] or at different stages of the backbone CNN [2], and are then fed into convolutional branches, so the final features of these regions are coarse-grained. This limits the diversity and discriminative power of the final features.

To tackle this problem, we propose to learn multi-granularity features from multi-granularity body regions for person ReID. We detect local regions across multiple granularities at different stages of a backbone network. As shown in Fig. 1(c), we detect four fine-grained body parts in the first stage, two body parts in the second stage, the whole body region in the third stage, and the whole image in the fourth stage. For the regions at each granularity, instead of feeding them to extra local branches afterwards, we directly apply a feature extraction module to learn the corresponding features. In this manner, the original granularities of the features are kept, and more detailed information is retained. Finally, the multi-granularity features learned from the multi-granularity body regions are fused for person ReID. Fig. 1 clearly illustrates the difference between our model and existing part-based models.
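The data flow in this paragraph (four parts, two parts, the whole body, and the whole image, taken from four successive stages and then fused) can be sketched as below. This is only an illustration under stated assumptions, not the authors' implementation: the region masks are hand-made bands rather than detected regions, masked average pooling stands in for the feature extraction module, fusion is plain concatenation, and all shapes and function names are hypothetical.

```python
import numpy as np


def masked_average_pool(feature_map, mask):
    """Pool a (C, H, W) feature map over a non-negative (H, W) region mask."""
    weights = mask / (mask.sum() + 1e-8)
    return (feature_map * weights[None, :, :]).sum(axis=(1, 2))


def band_mask(height, width, top, bottom):
    """Stand-in for a detected region: a horizontal band of ones."""
    mask = np.zeros((height, width))
    mask[top:bottom, :] = 1.0
    return mask


def fuse_multi_granularity(stage_maps, stage_masks):
    """Pool every region at its own stage and concatenate the descriptors."""
    features = []
    for feature_map, masks in zip(stage_maps, stage_masks):
        for mask in masks:
            features.append(masked_average_pool(feature_map, mask))
    return np.concatenate(features)


# Assumed stage outputs of a backbone: resolution halves, channels double.
rng = np.random.default_rng(0)
shapes = [(64, 96, 32), (128, 48, 16), (256, 24, 8), (512, 12, 4)]
stage_maps = [rng.standard_normal(shape) for shape in shapes]
stage_masks = [
    [band_mask(96, 32, 24 * i, 24 * (i + 1)) for i in range(4)],  # four parts
    [band_mask(48, 16, 24 * i, 24 * (i + 1)) for i in range(2)],  # two parts
    [band_mask(24, 8, 2, 22)],                                    # whole body
    [np.ones((12, 4))],                                           # whole image
]
fused = fuse_multi_granularity(stage_maps, stage_masks)
print(fused.shape)  # (1280,)
```

The fused length is 4 x 64 + 2 x 128 + 256 + 512 = 1280 under these assumed channel counts. The point of the sketch is that the fine-grained parts are pooled from the high-resolution early stages rather than being passed through further convolutions first.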

It is worth noting that we do not simply conduct multi-granularity region detection and multi-granularity feature learning in a straightforward two-stage manner. There are two challenging problems in our task. First, when fine-grained features are used, misalignment becomes a serious problem, especially for fine-grained regions, since the fine-grained features are extracted from shallow layers in which the receptive field is small and are thus very sensitive to translation, pose variations, etc. This may be the reason why current works use only coarse-grained features for all parts. Second, fine-grained features are very sensitive to noise and other image content that is not helpful for ReID. Thus, we face the problem of how to ensure that the extracted fine-grained features are discriminative for person ReID. To tackle these problems, we design a model that jointly optimizes multi-granularity region detection and multi-granularity feature learning.

In summary, the contributions of this paper are threefold:

  • We learn features across multiple granularities from the backbone network without feeding them to extra local branches. In this manner, the final features are diverse: both fine-grained features with rich details and coarse-grained abstract features are well preserved.

  • Our multi-granularity features are learned from multi-granularity parts. The localization of the multi-granularity parts and the learning of the multi-granularity features are jointly optimized. This joint optimization pushes the learned features to focus on human body regions. Moreover, with the multi-granularity parts accurately located, the multi-granularity features can be well aligned, and the background noise can be effectively reduced.

  • The proposed method achieves the best performance on four person ReID datasets. Extensive experiments on these datasets verify the effectiveness of our approach. Our method, MGRe, achieves 90.1%/96.2% mAP/Rank-1 on Market1501 and 82.0%/91.3% mAP/Rank-1 on DukeMTMC-reID.
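For readers unfamiliar with the two metrics quoted in the last contribution, the sketch below shows how Rank-1 accuracy and mean average precision (mAP) are commonly computed from ranked gallery lists. It is a simplified illustration, not the evaluation code used in the paper: the function names and toy data are hypothetical, and the usual ReID protocol step of discarding gallery images of the same identity taken by the same camera as the query is omitted.

```python
def average_precision(ranked_matches):
    """AP for one query; ranked_matches[i] is True if the i-th ranked
    gallery image shows the same identity as the query."""
    hits = 0
    precision_sum = 0.0
    for rank, is_match in enumerate(ranked_matches, start=1):
        if is_match:
            hits += 1
            precision_sum += hits / rank  # precision at this recall point
    return precision_sum / hits if hits else 0.0


def rank1_and_map(all_ranked_matches):
    """Rank-1 accuracy and mAP over a list of per-query ranked match lists."""
    num_queries = len(all_ranked_matches)
    rank1 = sum(1 for matches in all_ranked_matches if matches[0]) / num_queries
    mean_ap = sum(average_precision(m) for m in all_ranked_matches) / num_queries
    return rank1, mean_ap


# Toy example with two queries and a four-image gallery each.
queries = [
    [True, False, True, False],   # AP = (1/1 + 2/3) / 2
    [False, True, False, False],  # AP = 1/2
]
rank1, mean_ap = rank1_and_map(queries)
print(rank1, round(mean_ap, 4))  # 0.5 0.6667
```

Rank-1 only checks whether the top-ranked gallery image is correct, whereas mAP rewards ranking all correct gallery images early, which is why the two numbers are usually reported together.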


Related work

For discriminative feature learning in person ReID, many methods have been proposed to enhance certain regions in the feature maps. According to how the regions are generated, these methods can be divided into three categories: pose-based, stripe-based, and attention-based.

Proposed method

In previous part-based methods, although multiple parts are utilized, only the most coarse-grained features of these parts are used to represent an image for ReID, i.e., the outputs of the last convolution layer of the feature extraction net. These coarse features are highly abstract and robust, but they discard too much detailed information. In this paper, we propose a novel method termed multiple granularities ReID (MGRe). As shown in Fig. 2, features and regions with fine-to-coarse

Datasets

We present our experiments on the following four widely used person ReID datasets.

Market1501. This dataset [42] contains 32,668 images of 1,501 identities. Bounding boxes are produced by a deformable part model pedestrian detector. The dataset is divided into a training set with 12,936 images of 751 persons and a testing set of 750 persons containing 3,368 query images and 19,732 gallery images.

DukeMTMC-reID. In this dataset [43], [44], there are 1,404 identities appearing in more than two

Conclusion

In this paper, we propose a novel multiple granularities ReID approach for learning discriminative local and global features. In MGRe, features with fine-to-coarse granularities are learned from corresponding fine-to-coarse grained body regions in different stages of the backbone network. Thus, we can obtain discriminative features where both fine-grained details and coarse-grained abstract information are learned. In addition, the location of body region and the learning of ReID features are

CRediT authorship contribution statement

Kaiwen Yang: Conceptualization, Methodology, Visualization, Software, Writing - original draft. Jiwei Yang: Data curation, Software, Investigation, Validation. Xinmei Tian: Supervision, Investigation, Funding acquisition, Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

This work was supported by the National Natural Science Foundation of China under Grant 61872329.


References (52)

  • J. Miao et al.

    Pose-guided feature alignment for occluded person re-identification

  • L. He et al.

    Foreground-aware pyramid reconstruction for alignment-free occluded person re-identification

  • W. Li et al.

    Harmonious attention network for person re-identification

  • C.-P. Tay et al.

    Attribute attention network for person re-identifications

  • M. Zheng et al.

    Re-identification with consistent attentive siamese networks

  • Y. Fu et al.

    Horizontal pyramid matching for person...

  • G. Wang et al.

    Learning discriminative features with multiple granularities for person re-identification

  • F. Zheng et al.

    Pyramidal person re-identification via multi-loss dynamic training

  • W. Li et al.

    Person re-identification by deep joint learning of multi-loss classification

  • M. Saquib Sarfraz et al.

    A pose-sensitive embedding for person re-identification with expanded cross neighborhood re-ranking

  • E. Ustinova et al.

    Multi-region bilinear convolutional neural networks for person re-identification

  • S. Gao et al.

    Pose-guided visible part matching for occluded person reid

  • R. Zhao et al.

    Unsupervised salience learning for person re-identification

  • G. Wang et al.

    High-order information matters: learning relation and topology for occluded person re-identification

  • R. Hou et al.

    Interaction-and-aggregation network for person re-identification

  • X. Chen et al.

    Salience-guided cascaded suppression network for person re-identification


    Kaiwen Yang received the B.E. degree from Xidian University, Xi’an, China, in 2019. He is currently working towards the master's degree in the CAS Key Laboratory of Technology in Geo-Spatial Information Processing and Application System, University of Science and Technology of China, Hefei, China. His research interests lie primarily in person re-identification, representation learning and machine learning.

    Jiwei Yang received the PhD degree in information and communication engineering from University of Science and Technology of China, Hefei, China, in 2019. He is currently an algorithm engineer at Huawei, China. His current research interests include machine learning and its applications to computer vision.

    Xinmei Tian received the B.E. and Ph.D. degrees from the University of Science and Technology of China, Hefei, China, in 2005 and 2010, respectively. She is an Associate Professor in the CAS Key Laboratory of Technology in Geo-Spatial Information Processing and Application System, University of Science and Technology of China, Hefei, China. Her current research interests include multimedia information retrieval and machine learning. She received the Excellent Doctoral Dissertation of Chinese Academy of Sciences award in 2012 and the Nomination of National Excellent Doctoral Dissertation award in 2013.
