Diverse receptive field network with context aggregation for fast object detection☆
Introduction
Object detection has enjoyed significant advances benefiting from the rapid progress of deep convolutional neural networks [1], [2], [3], [4], [5], [6]. The state-of-the-art object detectors [7], [8], [9], [10], [11] have achieved impressive performance on detection benchmarks. However, most of them only exploit the features inside the proposals or anchors and ignore the contextual information surrounding them. It has long been recognized that contextual information plays an important role in object detection [12], [13].
To utilize contextual information, an intuitive approach is to aggregate features from multiple regions [14], [15], [16], [17], [18], [19], [20], [21], as shown in Fig. 1(a). Such methods extract features from several regions associated with the original proposal and then fuse them to obtain contextual representations.
Another line of work creates contextual clues to guide detection [22], [23], [24], [25], [26], [27], which is popular in both traditional methods and CNN-based detectors, as shown in Fig. 1(b). Traditional methods [12], [13], [25] usually utilize local or global appearance statistics to assist detection, while in deep learning the clues are more diverse, including segmentation information [23], previous detection results [28], instance-level relationships [29], [30], and so on. Compared with the former, this kind of approach utilizes context more explicitly.
Nevertheless, both multi-region methods and contextual-cue methods rely on proposals to further exploit context information or contextual clues. This means they can only be embedded in two-stage detectors. In addition, their computational efficiency and context quality depend on the accuracy of the proposal-generating method, which limits their performance and makes real-time detection difficult. For example, if a pre-generated proposal contains only background or only part of an object, exploiting context based on that proposal is meaningless; yet this situation is currently unavoidable in two-stage methods.
So why not exploit context in the feature-extraction stage, independently of proposals? Inspired by multi-region methods [14], [15], [16], [17], [18], [19], [20], [21], we believe that features aggregated from multi-scale regions contain contextual information, and so do features with different receptive fields (RFs). Therefore, we fuse features with different RFs over the same regions during feature extraction, so that the resulting features encode context information and do not rely on proposals, as shown in Fig. 1(c). Since the contextual representations are obtained in the feature-extracting stage, this contextual module can be embedded in one-stage approaches.
Based on this idea, we propose a one-stage diverse receptive field network (DRFNet) that utilizes context information for fast object detection. Specifically, we put forward multi-branch fully-convolutional DRF modules and a parallel structure to achieve this. The multi-branch DRF modules follow three design principles, and each branch stacks conv layers and dilated conv layers [31] to exploit both detail information and context. Additionally, the parallel structure is constructed via spatial pyramid pooling to capture context at different scales along with the DRF modules. The general framework of DRFNet is shown in Fig. 2.
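To make concrete why stacking conv and dilated conv layers gives each branch a different receptive field, the following sketch computes the receptive field of a stride-1 stack of (kernel, dilation) layers. The branch configurations are hypothetical illustrations, not the paper's actual settings.

```python
def effective_kernel(k, d):
    """Effective kernel size of a conv with kernel k and dilation d: d*(k-1)+1."""
    return d * (k - 1) + 1

def receptive_field(branch):
    """Receptive field of a stride-1 stack of (kernel, dilation) conv layers.

    With stride 1, each layer grows the receptive field by
    (effective kernel size - 1).
    """
    rf = 1
    for k, d in branch:
        rf += effective_kernel(k, d) - 1
    return rf

# Hypothetical branches: a plain 3x3 conv, then 3x3 convs with growing dilation.
branches = [
    [(3, 1)],          # single 3x3 conv          -> RF 3
    [(3, 1), (3, 2)],  # 3x3 conv + dilated (d=2) -> RF 7
    [(3, 1), (3, 4)],  # 3x3 conv + dilated (d=4) -> RF 11
]
print([receptive_field(b) for b in branches])  # [3, 7, 11]
```

Fusing the outputs of such branches thus mixes information gathered over windows of several sizes, which is the sense in which the aggregated features encode local context.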
We show that the context introduced by our method improves the detection performance of most categories by 0.5% to 3% mAP on PASCAL VOC. In addition, several monotonous-background categories and small-scale objects gain substantially, such as bottle (7.8% mAP), chair (7.6% mAP), boat (6.5% mAP), airplane (4.9% mAP), and TV (4.5% mAP). These results validate that our method is effective and that the introduced local context helps the detector better distinguish background from foreground, which improves detection performance.
Contribution: To summarize, our contributions are as follows:
- 1. We propose a new way of exploiting context information for fast object detection: aggregating features with different receptive fields in the feature extraction stage. It therefore no longer relies on proposals and can be embedded in one-stage approaches with high computational efficiency.
- 2. We put forward DRF modules and their design principles, which guide how DRF modules can best exploit context information. Moreover, DRF modules can be designed into any detection framework, which facilitates follow-up research.
- 3. We design the one-stage DRFNet, which employs a parallel branching structure and DRF modules to integrate context information at different scales for fast object detection. DRFNet300 and DRFNet512 achieve 80.2% mAP (38.2 FPS) and 82.3% mAP (18.4 FPS) on PASCAL VOC 2007, and 78.3% mAP and 80.4% mAP on PASCAL VOC 2012, respectively. On COCO, DRFNet300 and DRFNet512 produce 30% AP and 33.5% AP.
The remainder of this paper is organized as follows. Section 2 reviews the related work. In Section 3 we investigate the DRF modules, and Section 4 details the proposed DRFNet. Experiments are given in Section 5, followed by the conclusion in Section 6.
Related work
Currently, state-of-the-art object detectors can be divided into two categories: two-stage approaches [7], [11], [32], [33], [34] with outstanding accuracy and one-stage approaches [8], [9], [10], [35], [36], [37], [38] with high computational efficiency. The former first generates a sparse set of proposals and then classifies them and regresses their locations, while the latter directly classifies and refines multi-scale anchors.
Two-stage approaches. The two-stage approach was first introduced by
Diverse receptive field module
To obtain contextual representations in the feature-extracting stage, we present the diverse receptive field module, as shown in Fig. 1(c). The DRF module is constructed as a multi-branch convolutional block, and each branch stacks conv layers and dilated conv layers to capture different receptive field features. Then, the features are aggregated to encode contextual information in the output feature maps.
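A minimal sketch of the branch-and-aggregate idea, in plain Python and reduced to 1-D for clarity (this is an illustration under assumed branch weights, not the paper's implementation): each branch applies a convolution with its own dilation rate, and the branch outputs are summed so the fused features mix several receptive fields.

```python
def dilated_conv1d(x, w, d):
    """'Same'-padded 1-D convolution with dilation d, stride 1."""
    k = len(w)
    pad = d * (k - 1) // 2  # zero-pad so the output length matches the input
    xp = [0.0] * pad + list(x) + [0.0] * pad
    return [sum(w[j] * xp[i + j * d] for j in range(k)) for i in range(len(x))]

def drf_fuse(x, branches):
    """Element-wise sum of branch outputs; each branch is (weights, dilation)."""
    outs = [dilated_conv1d(x, w, d) for w, d in branches]
    return [sum(vals) for vals in zip(*outs)]

# Two hypothetical branches: same 3-tap kernel, dilations 1 and 2, so the
# second branch gathers information from a wider window around each position.
x = [1, 2, 3, 4]
fused = drf_fuse(x, [([0.5, 1.0, 0.5], 1), ([0.5, 1.0, 0.5], 2)])
```

In the actual module the branches are 2-D conv stacks with learned weights; only the fusion pattern is shown here.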
In this part, we mainly investigate two problems: 1. Do the DRF modules truly introduce
Diverse receptive field network
To utilize multi-scale context information for object detection, we construct the DRFNet as shown in Fig. 2. Firstly, the DRF modules are employed in the backbone to obtain rich context as the basis. Then, the parallel structure is applied along with several DRF modules to encode contextual information at different scales.
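The pooling side of the parallel structure can be sketched as spatial pyramid pooling: the feature map is partitioned into grids of several sizes and each cell is average-pooled, so the concatenated vector summarizes context at multiple scales. The grid levels (1, 2, 4) below are illustrative assumptions, not the paper's configuration.

```python
def spatial_pyramid_pool(fmap, levels=(1, 2, 4)):
    """Average-pool a 2-D feature map over n x n grids and concatenate.

    fmap is a list of rows; each level n contributes n*n pooled values,
    from a global average (n=1) down to fine local averages.
    """
    h, w = len(fmap), len(fmap[0])

    def cell_mean(r0, r1, c0, c1):
        vals = [fmap[r][c] for r in range(r0, r1) for c in range(c0, c1)]
        return sum(vals) / len(vals)

    pooled = []
    for n in levels:
        # grid boundaries for an n x n partition of the map
        rb = [round(i * h / n) for i in range(n + 1)]
        cb = [round(j * w / n) for j in range(n + 1)]
        for i in range(n):
            for j in range(n):
                pooled.append(cell_mean(rb[i], rb[i + 1], cb[j], cb[j + 1]))
    return pooled

fmap = [[float(4 * r + c) for c in range(4)] for r in range(4)]
vec = spatial_pyramid_pool(fmap)  # 1 + 4 + 16 = 21 pooled values
```

In DRFNet the pooled features are produced per channel and combined with the DRF-module outputs; this sketch shows only the multi-scale pooling pattern.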
In this section, we first present the structure details of the parallel structure and DRF modules. Next, we introduce the application details of DRFNet, including the changes
Experiment
Our method is evaluated on three challenging datasets: PASCAL VOC 2007, PASCAL VOC 2012 [54], and MS COCO [55], which include 20, 20, and 80 categories, respectively. We mainly compare our DRFNet with context-based detectors and state-of-the-art methods. Comprehensive experiments indicate that the context introduced by our method improves detection performance by a large margin and that the proposed DRFNet offers a good trade-off between speed and accuracy.
Conclusions
In this paper, we propose the one-stage DRFNet, which aggregates context information from diverse receptive fields for fast object detection. The proposed DRF modules, together with the parallel structure, are effective in obtaining contextual representations at different scales. In addition, we put forward three design principles for DRF modules to better exploit the context. Comprehensive experiments on PASCAL VOC and MS COCO indicate that DRFNet achieves a good trade-off between speed and accuracy.
CRediT authorship contribution statement
Shaorong Xie: Conceptualization, Investigation, Supervision, Project administration. Chang Liu: Conceptualization, Methodology, Software, Writing - original draft. Jiantao Gao: Methodology, Validation, Formal analysis. Xiaomao Li: Methodology, Writing - original draft. Jun Luo: Resources, Supervision. Baojie Fan: Formal analysis. Jiahong Chen: Methodology, Data curation. Huayan Pu: Resources. Yan Peng: Methodology, Writing - review & editing.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work was supported by the National Natural Science Foundation of China [Grant Nos. 61625304, 91648119, and 61673254].
References (56)
- et al., Deep feature based contextual model for object detection, Neurocomputing (2018)
- et al., You always look again: Learning to detect the unseen objects, J. Vis. Commun. Image Represent. (2019)
- et al., Context refinement for object detection
- et al., Object recognition via contextual color attention, J. Vis. Commun. Image Represent. (2015)
- et al., Context-assisted 3d (c3d) object detection from rgb-d images, J. Vis. Commun. Image Represent. (2018)
- et al., Imagenet classification with deep convolutional neural networks, Adv. Neural Informat. Process. Syst. (2012)
- K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint...
- et al., Going deeper with convolutions
- K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference...
- K. He, X. Zhang, S. Ren, J. Sun, Identity mappings in deep residual networks, in: European Conference on Computer...
- Aggregated residual transformations for deep neural networks
- Faster r-cnn: Towards real-time object detection with region proposal networks, Adv. Neural Informat. Process. Syst.
- Ssd: Single shot multibox detector
- The role of context for object detection and semantic segmentation in the wild
- Attentive contexts for object detection, IEEE Trans. Multimedia
- Crafting gbd-net for object detection, IEEE Trans. Pattern Anal. Mach. Intell.
- Object detection via a multi-region and semantic segmentation-aware cnn model
- Couplenet: Coupling global structure with local parts for object detection
- Rotation-insensitive and context-augmented object detection in remote sensing images, IEEE Trans. Geosci. Remote Sens.
- Contextual priming and feedback for faster r-cnn
- Deepid-net: Object detection with deformable part based convolutional neural networks, IEEE Trans. Pattern Anal. Mach. Intell.
- Exploring tiny images: The roles of appearance and contextual information for machine and human object recognition, IEEE Trans. Pattern Anal. Mach. Intell.
- Spatial memory for context reasoning in object detection
☆ This paper has been recommended for acceptance by Zicheng Liu.