Diverse receptive field network with context aggregation for fast object detection

https://doi.org/10.1016/j.jvcir.2020.102770

Highlights

  • The context introduced by DRF modules improves detection performance by helping the detector better distinguish objects from the background.

  • The proposed design principles allow DRF modules to better encode contextual information.

  • Almost all categories benefit from context aggregation, and several monotonous-background categories and small-scale objects obtain large gains.

  • The proposed DRFNet achieves a good trade-off between speed and accuracy on PASCAL VOC and MS COCO datasets.

Abstract

Current context-utilizing detectors are all based on two-stage approaches. However, their computational efficiency and context quality depend heavily on the accuracy of the proposal-generating methods, which limits their performance and makes real-time detection hardly achievable. In this work, we present a context-exploiting method that integrates features with different receptive fields to obtain contextual representations. Based on this idea, we put forward multi-branch diverse receptive field modules (DRF modules) and their design principles to encode context. To further utilize contextual information for fast object detection, we propose a one-stage diverse receptive field network (DRFNet). In DRFNet, the DRF modules are first applied to capture rich context as a basis; then a parallel structure is constructed to exploit the context at different scales together with DRF modules. Comprehensive experiments indicate that the context introduced by our methods improves detection performance and that DRFNet achieves a good trade-off between speed and accuracy.

Introduction

Object detection has enjoyed significant advances benefiting from the rapid progress of deep convolutional neural networks [1], [2], [3], [4], [5], [6]. State-of-the-art object detectors [7], [8], [9], [10], [11] have achieved impressive performance on detection benchmarks. However, most of them only exploit the features inside proposals or anchors and ignore the contextual information surrounding them. It has long been recognized that contextual information plays an important role in object detection [12], [13].

To utilize contextual information, an intuitive approach is to employ multiple regions for feature aggregation [14], [15], [16], [17], [18], [19], [20], [21], as shown in Fig. 1(a). Such methods extract features from multiple regions associated with the original proposal, then fuse them to obtain contextual representations.

Another line of work creates contextual clues to guide detection [22], [23], [24], [25], [26], [27], which is popular in both traditional methods and CNN-based detectors, as shown in Fig. 1(b). Traditional methods [12], [13], [25] usually utilize local or global appearance statistics to assist detection, while in deep learning the contextual clues are more diverse, such as segmentation information [23], previous detection results [28], and instance-level relationships [29], [30]. Compared with multi-region methods, this kind of approach utilizes context more explicitly.

Nevertheless, both multi-region methods and contextual-clue methods rely on proposals to further exploit context information or contextual clues. This means that they can only be embedded in two-stage detectors. In addition, their computational efficiency and context quality depend on the accuracy of the proposal-generating methods, which limits their performance and makes real-time detection hardly achievable. For example, if a pre-generated proposal contains only background, or only part of an object, exploiting context based on such a proposal is meaningless. Yet this situation is currently unavoidable in two-stage methods.

So why not exploit context in the feature extraction stage, independently of proposals? Inspired by multi-region methods [14], [15], [16], [17], [18], [19], [20], [21], we believe that features aggregated from multi-scale regions contain contextual information, as do features with different receptive fields (RFs). Therefore, we fuse features with different RFs over the same regions during feature extraction, so that the obtained features encode contextual information without relying on proposals, as shown in Fig. 1(c). Since the contextual representations are obtained in the feature-extraction stage, this contextual module can be embedded in one-stage approaches.
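The key quantity behind this idea is the effective receptive field of a stack of convolutional layers. A minimal sketch of the arithmetic, assuming stride-1 layers throughout (the exact branch configurations of the DRF modules are not reproduced here; the layer lists below are purely illustrative):

```python
# Effective receptive field (RF) of a stack of stride-1 conv layers.
# With stride 1, each layer enlarges the RF by (kernel_size - 1) * dilation,
# which is why a dilated conv widens the RF without adding parameters.
def receptive_field(layers):
    """layers: list of (kernel_size, dilation) tuples, all with stride 1."""
    rf = 1
    for k, d in layers:
        rf += (k - 1) * d
    return rf

# Illustrative branches (not the paper's exact configuration):
plain_branch = [(3, 1), (3, 1)]    # two ordinary 3x3 convs
dilated_branch = [(3, 1), (3, 3)]  # a 3x3 conv, then a 3x3 conv with dilation 3

print(receptive_field(plain_branch))    # 5
print(receptive_field(dilated_branch))  # 9
```

Fusing the outputs of such branches over the same spatial positions therefore mixes features computed at different RF sizes, which is the source of the contextual information described above.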

Based on this idea, we propose a one-stage diverse receptive field network (DRFNet) that utilizes context information for fast object detection. Specifically, we put forward the multi-branch fully-convolutional DRF modules and a parallel structure to achieve this. The multi-branch DRF modules are designed according to three design principles, and each branch stacks conv layers and dilated conv layers [31] to exploit detail information and context. Additionally, the parallel structure is constructed via spatial pyramid pooling to capture context at different scales together with DRF modules. The general framework of DRFNet is shown in Fig. 2.

We show that the context introduced by our methods improves the detection performance of most categories by 0.5% to 3% mAP on PASCAL VOC. In addition, several monotonous-background categories and small-scale objects obtain large gains, such as bottle (7.8% mAP), chair (7.6% mAP), boat (6.5% mAP), airplane (4.9% mAP), and TV (4.5% mAP). These results validate that our method is effective and that the introduced local context helps the detector better distinguish background from foreground, which improves detection performance.

Contributions: Our contributions are summarized as follows:

  1. We propose a new idea for exploiting context information in fast object detection: aggregating features with different receptive fields in the feature extraction stage. The method therefore no longer relies on proposals and can be embedded in one-stage approaches with high computational efficiency.

  2. We put forward DRF modules and their design principles, which enable better use of DRF modules to exploit context information. Moreover, DRF modules can be designed within any detection framework, which facilitates follow-up research.

  3. We design the one-stage DRFNet, which employs a parallel branching structure and DRF modules to integrate context information at different scales for fast object detection. DRFNet300 and DRFNet512 achieve 80.2% mAP (38.2 FPS) and 82.3% mAP (18.4 FPS) on PASCAL VOC 2007, and 78.3% mAP and 80.4% mAP on PASCAL VOC 2012, respectively. On COCO, DRFNet300 and DRFNet512 produce 30.0% AP and 33.5% AP.

The remainder of this paper is organized as follows. Section 2 reviews the related work. In Section 3 we investigate the DRF modules, and Section 4 details the proposed DRFNet. Experiments are given in Section 5, followed by the conclusion in Section 6.

Section snippets

Related work

Currently, state-of-the-art object detectors can be divided into two categories: two-stage approaches [7], [11], [32], [33], [34] with outstanding accuracy, and one-stage approaches [8], [9], [10], [35], [36], [37], [38] with high computational efficiency. The former first generate a sparse set of proposals, then classify them and regress their locations, while the latter directly classify and refine multi-scale anchors.

Two-stage approaches. The two-stage approach was first introduced by

Diverse receptive field module

To obtain contextual representations in the feature-extracting stage, we present the diverse receptive field module, as shown in Fig. 1(c). The DRF module is constructed as a multi-branch convolutional block, and each branch stacks conv layers and dilated conv layers to capture different receptive field features. Then, the features are aggregated to encode contextual information in the output feature maps.
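A toy 1-D sketch of this multi-branch idea, assuming fixed box-filter weights and sum fusion purely for illustration (the paper's modules are 2-D, learned, and follow the three design principles investigated below):

```python
# Minimal 1-D sketch of a two-branch DRF-style block: each branch applies a
# toy convolution with a different dilation over the same input, so each
# branch sees a different receptive field, and the outputs are fused by
# summation to mix detail and context at every position.
def dilated_conv1d(x, w, dilation):
    """Valid-mode 1-D convolution with the given dilation (stride 1)."""
    span = (len(w) - 1) * dilation  # input span covered by the kernel
    out = []
    for i in range(len(x) - span):
        out.append(sum(w[j] * x[i + j * dilation] for j in range(len(w))))
    return out

def drf_block(x):
    k = [1.0, 1.0, 1.0]  # toy 3-tap kernel shared by both branches
    local = dilated_conv1d(x, k, dilation=1)    # small RF: detail
    context = dilated_conv1d(x, k, dilation=2)  # larger RF: context
    # Align lengths (the dilated branch covers a wider span) and fuse by sum.
    n = min(len(local), len(context))
    return [local[i] + context[i] for i in range(n)]

signal = [0, 0, 1, 0, 0, 0]
print(drf_block(signal))  # [2.0, 1.0]
```

A real implementation would pad each branch so the spatial size is preserved and would learn the kernels; the point here is only that the fused output encodes responses from diverse receptive fields at the same location.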

In this part, we mainly investigate two problems: 1. Do the DRF modules truly introduce

Diverse receptive field network

To utilize multi-scale context information for object detection, we construct the DRFNet as shown in Fig. 2. Firstly, the DRF modules are employed in the backbone to obtain rich context as a basis. Then, the parallel structure is applied along with several DRF modules to encode contextual information at different scales.
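For reference, a generic spatial-pyramid-pooling sketch, assuming average pooling into 1x1, 2x2, and 4x4 grids over a map whose sides are divisible by each grid size (the paper's parallel structure is its own design; this only illustrates how pyramid pooling yields a fixed-length multi-scale summary):

```python
# Toy spatial pyramid pooling: average-pool a 2-D feature map into 1x1, 2x2,
# and 4x4 grids and concatenate the bin averages, producing a fixed-length
# vector that summarizes the map at three scales.
def spp(fmap, grids=(1, 2, 4)):
    h, w = len(fmap), len(fmap[0])
    pooled = []
    for g in grids:
        bh, bw = h // g, w // g  # bin height / width for this grid
        for by in range(g):
            for bx in range(g):
                cells = [fmap[y][x]
                         for y in range(by * bh, (by + 1) * bh)
                         for x in range(bx * bw, (bx + 1) * bw)]
                pooled.append(sum(cells) / len(cells))
    return pooled

# A 4x4 map with values 0..15; output length is 1 + 4 + 16 = 21 bins.
fmap = [[float(y * 4 + x) for x in range(4)] for y in range(4)]
print(len(spp(fmap)))  # 21
```

The coarse 1x1 bin carries global context while the finer grids keep local statistics, which is the sense in which such a structure captures context at different scales.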

In this section, we first present the structure details of the parallel structure and DRF modules. Next, we introduce the application details of DRFNet, including the changes

Experiment

Our method is evaluated on three challenging datasets: PASCAL VOC 2007, PASCAL VOC 2012 [54], and MS COCO [55]. PASCAL VOC and MS COCO include 20 and 80 categories, respectively. We mainly compare our DRFNet with context-based detectors and state-of-the-art methods. Comprehensive experiments indicate that the context introduced by our methods improves detection performance by a large margin and that the proposed DRFNet has a good trade-off between speed and accuracy.

Conclusions

In this paper, we propose the one-stage DRFNet which aggregates context information in diverse receptive fields for fast object detection. The proposed DRF modules are effective in obtaining contextual representations at different scales along with the parallel structure. In addition, we put forward three design principles for DRF modules to better exploit the context. Comprehensive experiments on PASCAL VOC and MS COCO indicate that the DRFNet achieves a good trade-off between speed and

CRediT authorship contribution statement

Shaorong Xie: Conceptualization, Investigation, Supervision, Project administration. Chang Liu: Conceptualization, Methodology, Software, Writing - original draft. Jiantao Gao: Methodology, Validation, Formal analysis. Xiaomao Li: Methodology, Writing - original draft. Jun Luo: Resources, Supervision. Baojie Fan: Formal analysis. Jiahong Chen: Methodology, Data curation. Huayan Pu: Resources. Yan Peng: Methodology, Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the National Natural Science Foundation of China [Grant Nos. 61625304, 91648119, and 61673254].

References (56)

  • S. Xie et al., Aggregated residual transformations for deep neural networks.
  • S. Ren et al., Faster R-CNN: Towards real-time object detection with region proposal networks, Adv. Neural Informat. Process. Syst. (2015).
  • W. Liu et al., SSD: Single shot multibox detector.
  • J. Redmon, A. Farhadi, YOLOv3: An incremental improvement, arXiv preprint arXiv:1804.02767, ...
  • T.-Y. Lin, P. Goyal, R. Girshick, K. He, P. Dollár, Focal loss for dense object detection, in: Proceedings of the IEEE...
  • T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, S. Belongie, Feature pyramid networks for object detection, in: ...
  • S.K. Divvala, D. Hoiem, J.H. Hays, A.A. Efros, M. Hebert, An empirical study of context in object detection, in: 2009...
  • R. Mottaghi et al., The role of context for object detection and semantic segmentation in the wild.
  • J. Li et al., Attentive contexts for object detection, IEEE Trans. Multimedia (2016).
  • X. Zeng et al., Crafting GBD-Net for object detection, IEEE Trans. Pattern Anal. Mach. Intell. (2017).
  • S. Gidaris et al., Object detection via a multi-region and semantic segmentation-aware CNN model.
  • Y. Zhu et al., CoupleNet: Coupling global structure with local parts for object detection.
  • K. Li et al., Rotation-insensitive and context-augmented object detection in remote sensing images, IEEE Trans. Geosci. Remote Sens. (2017).
  • S. Gupta, B. Hariharan, J. Malik, Exploring person context and local scene context for object detection, arXiv preprint...
  • A. Shrivastava et al., Contextual priming and feedback for Faster R-CNN.
  • W. Ouyang et al., DeepID-Net: Object detection with deformable part based convolutional neural networks, IEEE Trans. Pattern Anal. Mach. Intell. (2016).
  • D. Parikh et al., Exploring tiny images: The roles of appearance and contextual information for machine and human object recognition, IEEE Trans. Pattern Anal. Mach. Intell. (2011).
  • X. Chen et al., Spatial memory for context reasoning in object detection.
This paper has been recommended for acceptance by Zicheng Liu.