Diverse receptive field network with context aggregation for fast object detection

https://doi.org/10.1016/j.jvcir.2020.102770

Highlights

  • The context introduced by DRF modules improves detection performance by helping the detector better distinguish objects from the background.

  • The proposed design principles allow DRF modules to better encode contextual information.

  • Almost all categories benefit from context aggregation, and several monotonous-background categories and small-scale objects obtain large gains.

  • The proposed DRFNet achieves a good trade-off between speed and accuracy on PASCAL VOC and MS COCO datasets.

Abstract

Current context-utilizing detectors are all based on two-stage approaches. However, their computational efficiency and context quality depend heavily on the accuracy of the proposal-generating methods, which limits their performance and makes real-time detection hardly achievable. In this work, we present a context-exploiting method that integrates features with different receptive fields to obtain contextual representations. Based on this idea, we put forward multi-branch diverse receptive field modules (DRF modules) and their design principles to encode context. To further utilize contextual information for fast object detection, we propose a one-stage diverse receptive field network (DRFNet). In DRFNet, the DRF modules are first applied to capture rich context as a basis; then a parallel structure is constructed to exploit the context at different scales together with DRF modules. Comprehensive experiments indicate that the context introduced by our methods improves detection performance and that DRFNet achieves a good trade-off between speed and accuracy.

Introduction

Object detection has enjoyed significant advances benefiting from the rapid progress of deep convolutional neural networks [1], [2], [3], [4], [5], [6]. State-of-the-art object detectors [7], [8], [9], [10], [11] have achieved impressive performance on detection benchmarks. However, most of them only exploit the features inside proposals or anchors and ignore the contextual information surrounding them. It has long been recognized that contextual information plays an important role in object detection [12], [13].

To utilize contextual information, an intuitive approach is to employ multiple regions for feature aggregation [14], [15], [16], [17], [18], [19], [20], [21], as shown in Fig. 1(a). Such methods extract features from multiple regions associated with the original proposal, then fuse them to obtain contextual representations.

Another line of work creates contextual clues to guide detection [22], [23], [24], [25], [26], [27], which is popular in both traditional methods and CNN-based detectors, as shown in Fig. 1(b). Traditional methods [12], [13], [25] usually utilize local or global appearance statistics to assist detection, while in deep learning the contextual clues are more diverse, such as segmentation information [23], previous detection results [28], and instance-level relationships [29], [30]. Compared with multi-region methods, this kind of approach utilizes context more explicitly.

Nevertheless, both multi-region methods and contextual-clue methods rely on proposals to further exploit context information or contextual clues. This means that they can only be embedded in two-stage detectors. In addition, their computational efficiency and context quality depend on the accuracy of the proposal-generating methods, which limits their performance and makes real-time detection hardly achievable. For example, if a pre-generated proposal contains only background, or only part of an object, exploiting context based on such a proposal is meaningless. Yet this situation is currently unavoidable in two-stage methods.

So why not exploit context in the feature extraction stage, independently of proposals? Inspired by multi-region methods [14], [15], [16], [17], [18], [19], [20], [21], we believe that features aggregated from multi-scale regions contain contextual information, as do features with different receptive fields (RFs). Therefore, we fuse features with different RFs over the same regions during feature extraction, so that the obtained features encode contextual information without relying on proposals, as shown in Fig. 1(c). Since the contextual representations are obtained in the feature-extraction stage, this contextual module can be embedded in one-stage approaches.
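The key quantity behind this idea is the effective receptive field of a stack of convolutional layers. A minimal sketch of the arithmetic, assuming stride-1 layers throughout (the exact branch configurations of the DRF modules are not reproduced here; the layer lists below are purely illustrative):

```python
# Effective receptive field (RF) of a stack of stride-1 conv layers.
# With stride 1, each layer enlarges the RF by (kernel_size - 1) * dilation,
# which is why a dilated conv widens the RF without adding parameters.
def receptive_field(layers):
    """layers: list of (kernel_size, dilation) tuples, all with stride 1."""
    rf = 1
    for k, d in layers:
        rf += (k - 1) * d
    return rf

# Illustrative branches (not the paper's exact configuration):
plain_branch = [(3, 1), (3, 1)]    # two ordinary 3x3 convs
dilated_branch = [(3, 1), (3, 3)]  # a 3x3 conv, then a 3x3 conv with dilation 3

print(receptive_field(plain_branch))    # 5
print(receptive_field(dilated_branch))  # 9
```

Fusing the outputs of such branches over the same spatial positions therefore mixes features computed at different RF sizes, which is the source of the contextual information described above.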

Based on this idea, we propose a one-stage diverse receptive field network (DRFNet) that utilizes context information for fast object detection. Specifically, we put forward the multi-branch fully-convolutional DRF modules and a parallel structure to achieve this. The multi-branch DRF modules are designed according to three design principles, and each branch stacks conv layers and dilated conv layers [31] to exploit detail information and context. Additionally, the parallel structure is constructed via spatial pyramid pooling to capture context at different scales together with DRF modules. The general framework of DRFNet is shown in Fig. 2.

We show that the context introduced by our methods improves the detection performance of most categories by 0.5% to 3% mAP on PASCAL VOC. In addition, several monotonous-background categories and small-scale objects obtain large gains, such as bottle (7.8% mAP), chair (7.6% mAP), boat (6.5% mAP), airplane (4.9% mAP), and TV (4.5% mAP). These results validate that our method is effective and that the introduced local context helps the detector better distinguish background from foreground, which improves detection performance.

Contributions: Our contributions are summarized as follows:

  1. We propose a new idea for exploiting context information in fast object detection: aggregating features with different receptive fields in the feature extraction stage. The method therefore no longer relies on proposals and can be embedded in one-stage approaches with high computational efficiency.

  2. We put forward DRF modules and their design principles, which enable better use of DRF modules to exploit context information. Moreover, DRF modules can be designed within any detection framework, which facilitates follow-up research.

  3. We design the one-stage DRFNet, which employs a parallel branching structure and DRF modules to integrate context information at different scales for fast object detection. DRFNet300 and DRFNet512 achieve 80.2% mAP (38.2 FPS) and 82.3% mAP (18.4 FPS) on PASCAL VOC 2007, and 78.3% mAP and 80.4% mAP on PASCAL VOC 2012, respectively. On COCO, DRFNet300 and DRFNet512 produce 30.0% AP and 33.5% AP.

The remainder of this paper is organized as follows. Section 2 reviews the related work. In Section 3 we investigate the DRF modules, and Section 4 details the proposed DRFNet. Experiments are given in Section 5, followed by the conclusion in Section 6.

Section snippets

Related work

Currently, state-of-the-art object detectors can be divided into two categories: two-stage approaches [7], [11], [32], [33], [34] with outstanding accuracy, and one-stage approaches [8], [9], [10], [35], [36], [37], [38] with high computational efficiency. The former first generate a sparse set of proposals, then classify them and regress their locations, while the latter directly classify and refine multi-scale anchors.

Two-stage approaches. The two-stage approach was first introduced by

Diverse receptive field module

To obtain contextual representations in the feature-extracting stage, we present the diverse receptive field module, as shown in Fig. 1(c). The DRF module is constructed as a multi-branch convolutional block, and each branch stacks conv layers and dilated conv layers to capture different receptive field features. Then, the features are aggregated to encode contextual information in the output feature maps.
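A toy 1-D sketch of this multi-branch idea, assuming fixed box-filter weights and sum fusion purely for illustration (the paper's modules are 2-D, learned, and follow the three design principles investigated below):

```python
# Minimal 1-D sketch of a two-branch DRF-style block: each branch applies a
# toy convolution with a different dilation over the same input, so each
# branch sees a different receptive field, and the outputs are fused by
# summation to mix detail and context at every position.
def dilated_conv1d(x, w, dilation):
    """Valid-mode 1-D convolution with the given dilation (stride 1)."""
    span = (len(w) - 1) * dilation  # input span covered by the kernel
    out = []
    for i in range(len(x) - span):
        out.append(sum(w[j] * x[i + j * dilation] for j in range(len(w))))
    return out

def drf_block(x):
    k = [1.0, 1.0, 1.0]  # toy 3-tap kernel shared by both branches
    local = dilated_conv1d(x, k, dilation=1)    # small RF: detail
    context = dilated_conv1d(x, k, dilation=2)  # larger RF: context
    # Align lengths (the dilated branch covers a wider span) and fuse by sum.
    n = min(len(local), len(context))
    return [local[i] + context[i] for i in range(n)]

signal = [0, 0, 1, 0, 0, 0]
print(drf_block(signal))  # [2.0, 1.0]
```

A real implementation would pad each branch so the spatial size is preserved and would learn the kernels; the point here is only that the fused output encodes responses from diverse receptive fields at the same location.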

In this part, we mainly investigate two problems: 1. Do the DRF modules truly introduce

Diverse receptive field network

To utilize multi-scale context information for object detection, we construct the DRFNet as shown in Fig. 2. Firstly, the DRF modules are employed in the backbone to obtain rich context as a basis. Then, the parallel structure is applied along with several DRF modules to encode contextual information at different scales.
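For reference, a generic spatial-pyramid-pooling sketch, assuming average pooling into 1x1, 2x2, and 4x4 grids over a map whose sides are divisible by each grid size (the paper's parallel structure is its own design; this only illustrates how pyramid pooling yields a fixed-length multi-scale summary):

```python
# Toy spatial pyramid pooling: average-pool a 2-D feature map into 1x1, 2x2,
# and 4x4 grids and concatenate the bin averages, producing a fixed-length
# vector that summarizes the map at three scales.
def spp(fmap, grids=(1, 2, 4)):
    h, w = len(fmap), len(fmap[0])
    pooled = []
    for g in grids:
        bh, bw = h // g, w // g  # bin height / width for this grid
        for by in range(g):
            for bx in range(g):
                cells = [fmap[y][x]
                         for y in range(by * bh, (by + 1) * bh)
                         for x in range(bx * bw, (bx + 1) * bw)]
                pooled.append(sum(cells) / len(cells))
    return pooled

# A 4x4 map with values 0..15; output length is 1 + 4 + 16 = 21 bins.
fmap = [[float(y * 4 + x) for x in range(4)] for y in range(4)]
print(len(spp(fmap)))  # 21
```

The coarse 1x1 bin carries global context while the finer grids keep local statistics, which is the sense in which such a structure captures context at different scales.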

In this section, we first present the structure details of the parallel structure and DRF modules. Next, we introduce the application details of DRFNet, including the changes

Experiment

Our method is evaluated on three challenging datasets: PASCAL VOC 2007, PASCAL VOC 2012 [54], and MS COCO [55]. PASCAL VOC and MS COCO include 20 and 80 categories, respectively. We mainly compare our DRFNet with context-based detectors and state-of-the-art methods. Comprehensive experiments indicate that the context introduced by our methods improves detection performance by a large margin and that the proposed DRFNet has a good trade-off between speed and accuracy.

Conclusions

In this paper, we propose the one-stage DRFNet which aggregates context information in diverse receptive fields for fast object detection. The proposed DRF modules are effective in obtaining contextual representations at different scales along with the parallel structure. In addition, we put forward three design principles for DRF modules to better exploit the context. Comprehensive experiments on PASCAL VOC and MS COCO indicate that the DRFNet achieves a good trade-off between speed and

CRediT authorship contribution statement

Shaorong Xie: Conceptualization, Investigation, Supervision, Project administration. Chang Liu: Conceptualization, Methodology, Software, Writing - original draft. Jiantao Gao: Methodology, Validation, Formal analysis. Xiaomao Li: Methodology, Writing - original draft. Jun Luo: Resources, Supervision. Baojie Fan: Formal analysis. Jiahong Chen: Methodology, Data curation. Huayan Pu: Resources. Yan Peng: Methodology, Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the National Natural Science Foundation of China [Grant Nos. 61625304, 91648119, and 61673254].

References (56)

  • S. Xie et al., Aggregated residual transformations for deep neural networks.
  • S. Ren et al., Faster R-CNN: Towards real-time object detection with region proposal networks, Adv. Neural Informat. Process. Syst. (2015).
  • W. Liu et al., SSD: Single shot multibox detector.
  • J. Redmon, A. Farhadi, YOLOv3: An incremental improvement, arXiv preprint arXiv:1804.02767, ...
  • T.-Y. Lin, P. Goyal, R. Girshick, K. He, P. Dollár, Focal loss for dense object detection, in: Proceedings of the IEEE...
  • T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, S. Belongie, Feature pyramid networks for object detection, in: ...
  • S.K. Divvala, D. Hoiem, J.H. Hays, A.A. Efros, M. Hebert, An empirical study of context in object detection, in: 2009...
  • R. Mottaghi et al., The role of context for object detection and semantic segmentation in the wild.
  • J. Li et al., Attentive contexts for object detection, IEEE Trans. Multimedia (2016).
  • X. Zeng et al., Crafting GBD-Net for object detection, IEEE Trans. Pattern Anal. Mach. Intell. (2017).
  • S. Gidaris et al., Object detection via a multi-region and semantic segmentation-aware CNN model.
  • Y. Zhu et al., CoupleNet: Coupling global structure with local parts for object detection.
  • K. Li et al., Rotation-insensitive and context-augmented object detection in remote sensing images, IEEE Trans. Geosci. Remote Sens. (2017).
  • S. Gupta, B. Hariharan, J. Malik, Exploring person context and local scene context for object detection, arXiv preprint...
  • A. Shrivastava et al., Contextual priming and feedback for Faster R-CNN.
  • W. Ouyang et al., DeepID-Net: Object detection with deformable part based convolutional neural networks, IEEE Trans. Pattern Anal. Mach. Intell. (2016).
  • D. Parikh et al., Exploring tiny images: The roles of appearance and contextual information for machine and human object recognition, IEEE Trans. Pattern Anal. Mach. Intell. (2011).
  • X. Chen et al., Spatial memory for context reasoning in object detection.
This paper has been recommended for acceptance by Zicheng Liu.