Interactive multi-scale feature representation enhancement for small object detection

https://doi.org/10.1016/j.imavis.2021.104128

Highlights

  • An auxiliary multi-scale input network for enhancing shallow features.

  • Interactive multi-scale features for aggregating contextual information.

  • An adaptive factor embedded in the interactive module to control the fusion ratio.

Abstract

In the field of object detection, there is a wide gap between the performance on small objects and that on medium and large objects. Some studies show that this gap stems from the contradiction between the classification-oriented backbone and the localization task. Although reducing the feature map size is beneficial for extracting abstract features, it causes the loss of detailed features needed for localization as the image traverses the backbone. Therefore, an interactive multi-scale feature representation enhancement strategy is proposed. This strategy includes two modules. First, a multi-scale auxiliary enhancement network is proposed for feature interaction under multiple inputs: we scale the input to multiple scales corresponding to the prediction layers, and each scaled input passes only through a lightweight extraction module that extracts detailed features for enhancing the original features. Second, an adaptive interaction module is designed to aggregate the features of adjacent layers. This approach flexibly improves small object detection without changing the original network structure. Comprehensive experimental results on the PASCAL VOC and MS COCO datasets show the effectiveness of the proposed method.

Introduction

At present, deep learning-based object detection models all rely on CNN [1] classification networks as feature extractors. For example, the Single Shot Detector (SSD) [2] uses VGG [3], YOLO [4] uses DarkNet [5], and Faster R-CNN [6] uses ResNet [7]. We generally call these networks the backbone of the object detection model. For classification tasks, the accuracy of classification depends on the quality of feature extraction. Strided convolutions and max-pooling are considered an essential part of a convolutional backbone: they let CNNs perform feature extraction over a larger image region and increase the receptive field, which works quite well in practice for classification. Since the object detection task also requires identifying the classes of objects in an image, a very straightforward idea is to use an image classification network directly for detection. However, beyond classification, object detection also depends on localization, which is precisely the capability that classification networks lack.

As mentioned above, most current object detection networks simply connect the features produced by the backbone to subsequent networks in a top-down fashion [8] for joint training, making the extracted features suitable for the detection task. However, a large amount of feature information is lost through the repeated convolution and pooling operations of the feature extraction stage, which seriously affects localization accuracy [9]. Therefore, detecting objects of different scales has become an urgent problem in object detection, especially improving the weak accuracy on small instances [10]. This phenomenon is more serious in one-stage object detectors [2], [4], [5], [11], which do not rely on candidate-region extraction but directly produce the final detection result through regression.

Among single-stage methods, SSD [2] achieves a good balance between speed and accuracy. VGG-16 [3] is adopted as the basic extraction network in SSD, which reduces the burden on single-scale features by constructing a multi-scale feature pyramid (see Fig. 1(a)). However, its detection accuracy on small objects is not as satisfactory as expected. This may be limited by the information in the shallow layers [12]. To alleviate the lack of detailed information caused by insufficient shallow representation capability, possible solutions have been suggested in the literature [8], [9], [13], [14], [15], [16], [17]. Networks such as STDN [13] and DetNet [9] address the insufficient representation of shallow feature maps by using stronger backbones (ResNet [7] and DenseNet [18]), which exploit skip connections to introduce the details of previous layers. However, a deeper backbone makes the network more complicated and costs speed. SNIP [14] introduces the idea of image pyramids, letting the network learn the effects of scale changes by using images at different scales as inputs (see Fig. 1(b)), but it is extremely computationally expensive because it requires multiple passes over an image with different parameters. Networks such as DSSD [16] and RSSD [17] inherit the FPN [8] concept to establish interactive pyramids (see Fig. 1(c)), obtaining a more efficient representation of contextual information by fusing up-sampled feature maps of different resolutions. However, the high-level layers have already lost the semantic information of small objects through repeated downsampling, so even with a top-down structure, little effective semantic information remains for small object classification [14].
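The FPN-style fusion mentioned above can be sketched in a few lines. This is a minimal illustration, not the authors' code: nearest-neighbour upsampling stands in for the learned deconvolution that DSSD uses, and feature maps are plain 2D lists rather than tensors.

```python
# Sketch of top-down pyramid fusion: a coarser feature map is upsampled
# to the finer map's resolution and merged element-wise, so the fine layer
# receives contextual information from the layer above it.

def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a 2D feature map."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # duplicate each column
        out.append(wide)
        out.append(list(wide))                     # duplicate each row
    return out

def fuse_top_down(fine, coarse):
    """Add the upsampled coarse map to the fine map, element-wise."""
    up = upsample2x(coarse)
    return [[f + u for f, u in zip(fr, ur)] for fr, ur in zip(fine, up)]

fine = [[1.0, 2.0], [3.0, 4.0]]     # a 2x2 patch of a high-resolution layer
coarse = [[0.5]]                    # the corresponding 1x1 patch above it
print(fuse_top_down(fine, coarse))  # [[1.5, 2.5], [3.5, 4.5]]
```

The limitation the paragraph points out is visible even here: whatever detail the coarse map has already lost through downsampling cannot be recovered by upsampling; fusion only redistributes the information that survived.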

In this work, to improve the small object performance of the single shot detector, an auxiliary interactive pyramid network is presented to obtain highly representative features for small object detection. First, we scale the input to multiple scales, each passing through a lightweight module that extracts detailed features, which are used to enhance the representation of the low-/mid-layer features of the baseline. This builds an interactive inter-layer feature pyramid with multiple inputs (see Fig. 1(d)) to realize effective multi-scale feature communication. Furthermore, an adaptive interaction module is designed that performs multi-scale fusion on the enhanced low and middle layers to realize the interaction of contextual information. It is portable, leaves the original network structure intact, and compensates for the information lost through repeated downsampling.
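The core of the adaptive interaction module is a fusion ratio learned by the network rather than fixed by hand. The sketch below illustrates that idea only; the scalar `w`, the sigmoid squashing, and the two-branch form are illustrative assumptions, not the paper's exact formulation (in training, `w` would be a parameter updated by backpropagation).

```python
import math

def adaptive_fuse(feat_a, feat_b, w):
    """Fuse two same-sized feature maps with a learned ratio.

    alpha = sigmoid(w) keeps the ratio in (0, 1), so neither branch is
    ever switched off entirely; the network can shift alpha toward
    whichever branch helps detection more.
    """
    alpha = 1.0 / (1.0 + math.exp(-w))
    return [[alpha * a + (1.0 - alpha) * b for a, b in zip(ra, rb)]
            for ra, rb in zip(feat_a, feat_b)]

enhanced_low = [[2.0, 4.0]]   # e.g. the enhanced shallow-layer features
neighbor_mid = [[0.0, 0.0]]   # e.g. the adjacent mid-layer features
print(adaptive_fuse(enhanced_low, neighbor_mid, 0.0))  # w=0 -> alpha=0.5 -> [[1.0, 2.0]]
```

With `w = 0` the two branches are weighted equally; a large positive `w` drives the output toward the first branch, which is how a learned scalar can control the fusion ratio end-to-end.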

In summary, the main contributions of the work are as follows:

  • (1)

An auxiliary multi-scale enhancement network is designed to supplement the low-/mid-level information through interactive multi-input information. Without changing the original network structure, it significantly improves the model's detection of small and medium objects.

  • (2)

An adaptive interaction module is designed for the interaction of the enhanced multi-scale features, with the fusion ratio learned by the network. It further enriches the feature representation by aggregating the contextual information of adjacent layers.

  • (3)

    Comprehensive experiments are carried out on PASCAL VOC and MS COCO, which show the effectiveness and portability of the proposed method.

Section snippets

Related work

Popular object detection algorithms can be divided into two categories. The first is the candidate-region-based, or two-stage, algorithm, which first uses a region proposal network (RPN) to generate candidate regions and then performs classification and regression on those regions to obtain the final detection results. Standard two-stage models include RCNN [19], SPPNet [20], Fast RCNN [21], Faster RCNN [6] and R-FCN [22], which have high

Method

SSD uses a VGG-16 [3] backbone for feature extraction, and the extra network adds four cascaded convolution layers. The image passes through the network from left to right, generating a series of feature maps. There are six prediction layers: the last two block layers of the original VGG-16 and the four newly added extra layers, with feature sizes of 38 × 38, 19 × 19, 10 × 10, 5 × 5, 3 × 3 and 1 × 1, and channel counts of 512, 1024, 512, 256, 256 and 256, respectively. Finally, the detection results
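The six-layer pyramid above can be written down as simple bookkeeping. The layer names below (`conv4_3`, `conv7`, etc.) are the standard SSD300 names and are an assumption here, since the excerpt does not name the layers; the sizes and channels are those stated in the text.

```python
# The SSD300 prediction pyramid described in the text:
# (layer name, spatial size, channel count) for a 300x300 input.
PYRAMID = [
    ("conv4_3",  38,  512),
    ("conv7",    19, 1024),
    ("conv8_2",  10,  512),
    ("conv9_2",   5,  256),
    ("conv10_2",  3,  256),
    ("conv11_2",  1,  256),
]

def total_cells(pyramid):
    """Total number of spatial cells across all prediction layers."""
    return sum(size * size for _, size, _ in pyramid)

# 38^2 + 19^2 + 10^2 + 5^2 + 3^2 + 1^2 = 1940 cells
print(total_cells(PYRAMID))  # 1940
```

The count makes the scale imbalance concrete: the single 38 × 38 layer contributes 1444 of the 1940 cells, so the quality of that one shallow layer dominates small object detection, which is exactly the layer the proposed auxiliary network enhances.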

Experiments

In this section, we conduct comprehensive experiments on PASCAL VOC [27] and MS COCO [28]. The PASCAL VOC dataset contains 20 object categories. We merge the VOC 2007 trainval set and the VOC 2012 trainval set, a total of 16,551 images, for combined training, and evaluate on the PASCAL VOC 2007 test set (4952 images). Mean average precision (mAP) is used as the measure. The MS COCO dataset contains 80 object categories, divided into 80 k training, 40 k validation and 40 k

Conclusion

In this paper, without changing the original network structure, a lightweight multi-scale feature representation enhancement module is introduced. It interacts with multiple input features of the same sizes as the prediction layers, supplementing the original features with more detailed features after only a small amount of convolution. Furthermore, an adaptive interaction module is designed to aggregate the enhanced features of adjacent scales to improve the baseline's ability to detect small

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

This work was supported by the National Natural Science Foundation of China (grant no. 61573168).

References (49)

  • T.Y. Lin et al., Feature pyramid networks for object detection

  • Z. Li et al., DetNet: A backbone network for object detection, arXiv preprint arXiv:1804.06215 (2018)

  • Y. Bai et al., SOD-MTGAN: Small object detection via multi-task generative adversarial network

  • J. Redmon et al., YOLOv3: An incremental improvement, arXiv preprint arXiv:1804.02767 (2018)

  • J. Huang et al., Speed/accuracy trade-offs for modern convolutional object detectors

  • P. Zhou et al., Scale-transferrable object detection

  • B. Singh et al., An analysis of scale invariance in object detection - SNIP

  • E.H. Adelson et al., Pyramid methods in image processing, RCA Engineer (1984)

  • C.Y. Fu et al., DSSD: Deconvolutional single shot detector, arXiv preprint arXiv:1701.06659 (2017)

  • J. Jeong et al., Enhancement of SSD by concatenating feature maps for object detection, arXiv preprint arXiv:1705.09587 (2017)

  • G. Huang et al., Densely connected convolutional networks

  • R. Girshick et al., Rich feature hierarchies for accurate object detection and semantic segmentation

  • K. He et al., Spatial pyramid pooling in deep convolutional networks for visual recognition, IEEE Trans. Pattern Anal. Mach. Intell. (2015)

  • R. Girshick, Fast R-CNN
Cited by (13)

  • Deep learning-based detection from the perspective of small or tiny objects: A survey, Image and Vision Computing, 2022. Citation excerpt: "It can further improve detection performance on large, medium and small objects. In particular, through using multi-scale testing, IMFRE512 [129] improves the detection performance of small objects by 6%. In Table 27 and Table 28, we compare some detectors from the SOD and USC-GRAD-STDdb dataset, respectively."

  • An Intelligent Optimization Based Yolov5 Framework to Detect the Rice Leaf Disease, 2023 3rd Asian Conference on Innovation in Technology, ASIANCON 2023