Dynamic imposter based online instance matching for person search

doi:10.1016/j.patcog.2019.107120

Pattern Recognition

Volume 100, April 2020, 107120

https://doi.org/10.1016/j.patcog.2019.107120

Abstract

Person search aims to locate a target person matching a given query in a gallery of unconstrained whole scene images. The task is challenging because pedestrian bounding boxes are unavailable, each labeled identity has only a few samples, and existing datasets contain a large number of unlabeled persons. To address these issues, we propose a novel end-to-end learning framework for person search that handles pedestrian detection and person re-identification concurrently. To co-learn the two tasks and exploit the information in unlabeled persons, we formulate a novel and highly efficient Dynamic Imposter based Online Instance Matching (DI-OIM) loss. The DI-OIM loss is inspired by the observation that pedestrians appearing in the same image must have different identities, so we assign dynamic pseudo-labels to the unlabeled persons. The pseudo-labeled persons, together with the labeled persons, are then used to learn powerful feature representations. Experiments on the CUHK-SYSU and PRW datasets demonstrate that our method outperforms other state-of-the-art algorithms. Moreover, it is more efficient than existing methods in terms of memory usage.

Introduction

Person re-identification is the task of searching for a person of interest across non-overlapping camera views [1]. It has attracted growing research interest because of its value in applications such as criminal spotting [2], multi-pedestrian tracking [3] and intelligent security [4]. Numerous efforts on person re-identification have been made over recent decades [5], [6]. However, current person re-identification techniques are still far from being applicable in practical intelligent monitoring systems. One of the key reasons is that typical re-identification systems assume that person images are already well cropped and aligned from the scene images. In real-world applications, however, we usually need to find a target person in whole images or video frames without available pedestrian boxes.

Person search is a new and valuable topic that bridges the gap between person re-identification and real-world applications [7], [8]. We illustrate the difference between person search and conventional re-identification in Fig. 1. The new task requires close cooperation between the detector and the identifier. Recently, great effort has been devoted to person search. Existing techniques can be coarsely divided into two categories: detection-free methods and detection-based methods. Detection-free methods attempt to recursively shrink the focus area until the target is precisely localized [9], [10]. However, they become computationally prohibitive as the gallery size increases. For detection-based methods, the most common approach is to divide the problem into separate pedestrian detection and person re-identification tasks [8], [11]. However, the two tasks are highly correlated. First, feature information can be shared to avoid accumulated error and to save considerable time on images of crowds. Second, detection and re-identification can complement each other: the quality of the detections largely determines the accuracy of recognition, while the recognition results provide feedback to refine the detected locations. Therefore, it is beneficial to learn pedestrian detection and person re-identification jointly.
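To make the detection-based setting concrete, the matching stage of a two-step pipeline can be sketched as follows. This is only an illustrative sketch, not the method of any cited work: the function name `rank_gallery`, the `(image_id, box, feature)` tuple layout and the use of cosine similarity are all assumptions introduced here, and the detector and feature extractor are assumed to have already been run.

```python
import numpy as np


def rank_gallery(query_feat, gallery):
    """Rank detected pedestrians by cosine similarity to the query.

    query_feat: (d,) feature of the query person.
    gallery:    list of (image_id, box, feature) tuples, one per detected
                pedestrian in the gallery images (hypothetical layout).
    Returns a list of (similarity, image_id, box), best match first.
    """
    q = query_feat / np.linalg.norm(query_feat)
    scored = []
    for image_id, box, feat in gallery:
        # Cosine similarity between the query and this detection.
        sim = float(q @ (feat / np.linalg.norm(feat)))
        scored.append((sim, image_id, box))
    # Sort on similarity only, highest first.
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored
```

In such a pipeline every missed or badly localized detection directly limits what the matching stage can retrieve, which is one reason the text argues for learning the two tasks together.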

Despite the considerable progress achieved in recent years, learning powerful features for person matching remains challenging. The main reason is that there are very few training samples per identity, and a large number of unlabeled identities exist in person search datasets. It is difficult to learn discriminative person representations with many classes and few class-specific samples. Therefore, some approaches attempt to exploit the information in unlabeled pedestrians to strengthen the representation. For example, the Online Instance Matching (OIM) loss [7] treats all the unlabeled persons as a negative class. It pushes a labeled person away from the other labeled identities stored in a lookup table and from the unlabeled persons maintained in a circular queue. Nevertheless, the unlabeled persons do not participate in the training process. To solve this problem, the Instance Enhancing Loss (IEL) [12] was proposed to integrate unlabeled persons into the feature learning process. It selectively assigns new unlabeled persons to the labeled identities to which they are most similar. However, the selected unlabeled persons are actually hard negative samples. To learn discriminative representations, these hard negative samples should be kept away from the corresponding labeled identities.
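The OIM loss described above can be sketched in a few lines of NumPy. This is a minimal sketch, assuming L2-normalized features, a softmax over the lookup table and the circular queue, and a moving-average table update. The function names, the temperature of 0.1 and the momentum of 0.5 are illustrative assumptions, not values taken from this text.

```python
import numpy as np


def oim_loss(x, target, lookup_table, queue, temperature=0.1):
    """Negative log-probability that feature x matches identity `target`.

    x:            (d,) L2-normalized feature of one labeled person.
    target:       row index of its identity in the lookup table.
    lookup_table: (L, d) one stored feature per labeled identity.
    queue:        (Q, d) features of recently seen unlabeled persons.
    """
    # Similarity of x to every labeled identity and every queued feature.
    logits = np.concatenate([lookup_table @ x, queue @ x]) / temperature
    # Numerically stable log-softmax over the L + Q entries.
    logits = logits - logits.max()
    log_prob = logits - np.log(np.exp(logits).sum())
    return -log_prob[target]


def update_lookup_table(lookup_table, x, target, momentum=0.5):
    """Moving-average update of one table row, then re-normalization (in place)."""
    v = momentum * lookup_table[target] + (1.0 - momentum) * x
    lookup_table[target] = v / np.linalg.norm(v)
```

Note that the queued unlabeled features appear only in the softmax denominator: they act as negatives for labeled persons but never serve as training targets themselves, which is the limitation the text points out.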

To address the above issues, we propose a novel end-to-end person search framework that integrates pedestrian detection and person identification to improve overall accuracy and reduce computation. To make better use of the unlabeled persons, we propose a novel Dynamic Imposter based Online Instance Matching (DI-OIM) loss. The loss is inspired by the observation that pedestrians appearing in the same image must have different identities. Thus, we assign dynamic pseudo-labels to unlabeled persons. The representations of pseudo-labeled persons are defined as imposters, since they do not belong to any of the labeled identities. The features of all the labeled persons are stored in a lookup table. The imposters, along with the lookup table, are used to optimize the proposed framework, so that all different persons are pushed away from each other. With the proposed DI-OIM loss, our end-to-end model is both efficient and effective.
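The pseudo-labeling idea can be illustrated with the sketch below. It is not the paper's exact formulation, which is not given in this excerpt. It assumes that unlabeled persons are marked with label -1, that each of them receives a temporary class index placed after the labeled identities, and that a softmax cross-entropy is taken over the lookup table concatenated with the current image's imposter features. All names and the temperature value are hypothetical.

```python
import numpy as np


def assign_pseudo_labels(labels, num_labeled_ids):
    """Give every unlabeled person (label -1) in one image its own class index."""
    labels = np.asarray(labels).copy()
    unlabeled = np.flatnonzero(labels < 0)
    # Temporary indices start right after the labeled identities.
    labels[unlabeled] = num_labeled_ids + np.arange(len(unlabeled))
    return labels, unlabeled


def di_oim_style_loss(features, labels, lookup_table, temperature=0.1):
    """Cross-entropy over labeled identities plus this image's imposters.

    features:     (N, d) L2-normalized features of all persons in one image.
    labels:       length-N identity indices, -1 for unlabeled persons.
    lookup_table: (L, d) one stored feature per labeled identity.
    """
    pseudo, unlabeled = assign_pseudo_labels(labels, len(lookup_table))
    imposters = features[unlabeled]                      # (U, d)
    classes = np.concatenate([lookup_table, imposters])  # (L + U, d)
    logits = features @ classes.T / temperature
    # Numerically stable log-softmax per person.
    logits -= logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(len(features)), pseudo].mean()
```

Under these assumptions the imposters exist only for the current image, so no queue of unlabeled features has to be kept across iterations, and two unlabeled persons with similar features are penalized, which pushes them apart.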

In summary, our main contributions are threefold:

• An end-to-end trainable framework is proposed for person search. It integrates pedestrian detection and person re-identification in a unified model. By co-learning the two tasks, the learned features are more informative.
• A novel DI-OIM loss is proposed to exploit the information in unlabeled pedestrians. The proposed loss not only distinguishes labeled pedestrians of different identities, but also pushes the unlabeled pedestrians away from each other.
• By unifying the detection and re-identification tasks, the proposed model achieves state-of-the-art performance on the CUHK-SYSU [7] and PRW [8] datasets.

Section snippets

Pedestrian detection

Pedestrian detection aims to localize pedestrians in images and generate bounding boxes for persons. In person search systems, pedestrian detection plays an important role. A large number of efforts have been made to automatically detect pedestrians in natural scenes. Traditional methods are mainly based on handcrafted features and linear classifiers, e.g. Aggregated Channel Features (ACF) [13] and Locally Decorrelated Channel Features (LDCF) [14]. Recently, Convolutional Neural Networks (CNNs)

The proposed approach

In this section, we first describe the overall architecture of our framework. Then we briefly explain the OIM loss and the IEL. After that, we elaborate on the proposed DI-OIM loss and describe the inference process.

Experiments

In this section, we thoroughly evaluate our method on two public person search datasets. We first briefly introduce the datasets, the evaluation protocols and the implementation details. We then analyze the proposed loss and compare it with other related losses. To validate the effectiveness of our method, we also make extensive comparisons with state-of-the-art algorithms. Finally, we conduct further analysis and discussion.

Conclusion

In this work, we focus on the problem of unconstrained person search, where pedestrian bounding boxes are unavailable. We propose an end-to-end framework to simultaneously consider pedestrian detection and person re-identification. Since many unlabeled pedestrians exist in person search datasets, a novel DI-OIM loss is proposed to exploit the information of unlabeled persons. Inspired by the observation that pedestrians within the same image obviously have different identities, we assign

Acknowledgment

This work is supported in part by the National Natural Science Foundation of China (NSFC), Nos. 61725202, 61751212 and 61771088.

References (39)

X. Wang, Intelligent multi-camera video surveillance: a review, Pattern Recognit. Lett. (2013)
P. Li et al., Deep visual tracking: review and experimental comparison, Pattern Recognit. (2018)
J. Dai et al., Cross-view semantic projection learning for person re-identification, Pattern Recognit. (2018)
V.E. Liong et al., Regularized local metric learning for person re-identification, Pattern Recognit. Lett. (2015)
S. Ding et al., Deep feature learning with relative distance comparison for person re-identification, Pattern Recognit. (2015)
J. Wang et al., Deep ranking model by large adaptive margin learning for person re-identification, Pattern Recognit. (2017)
J. Xiao et al., IAN: the individual aggregation network for person search, Pattern Recognit. (2019)
S. Zhang et al., Multi-target tracking by learning local-to-global trajectory models, Pattern Recognit. (2014)
S. Liao et al., Person re-identification by local maximal occurrence representation and metric learning, Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (2015)
L. Zhang et al., Learning a discriminative null space for person re-identification, Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (2016)
T. Xiao et al., Joint detection and identification feature learning for person search, Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (2017)
L. Zheng et al., Person re-identification in the wild, Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (2017)
H. Liu et al., Neural person search machines, Proceedings of IEEE International Conference on Computer Vision (2017)
X. Chang et al., RCAA: relational context-aware agents for person search, Proceedings of European Conference on Computer Vision (2018)
D. Chen et al., Person search via a mask-guided two-stream CNN model, Proceedings of European Conference on Computer Vision (2018)
W. Shi et al., Instance enhancing loss: deep identity-sensitive feature embedding for person search, Proceedings of IEEE International Conference on Image Processing (2018)
D. Piotr et al., Fast feature pyramids for object detection (2014)
W. Nam et al., Local decorrelation for improved pedestrian detection, Advances in Neural Information Processing Systems (2014)

