Interactive multi-scale feature representation enhancement for small object detection
Introduction
At present, deep learning-based object detection models all rely on CNN [1] classification networks as feature extractors. For example, the Single Shot Detector (SSD) [2] uses VGG [3], YOLO [4] uses DarkNet [5], and Faster R-CNN [6] uses ResNet [7]. These networks are generally called the backbone of the object detection model. For classification tasks, accuracy depends on the quality of feature extraction. Strided convolutions and max-pooling are considered essential parts of a convolutional backbone: they let CNNs extract features over larger image regions and increase the receptive field, which works quite well in practice for classification. Since object detection also requires identifying the classes of objects in an image, a very straightforward idea is to use an image classification network directly for detection. However, beyond classification, object detection also depends heavily on localization, which is precisely the capability a classification network lacks.
As mentioned above, most current object detection networks simply connect the features obtained by the backbone to subsequent networks in a top-down manner [8] for joint training, making the extracted features suitable for the detection task. However, a large amount of feature information is lost after multiple convolution and pooling operations during the feature extraction stage, which seriously affects localization accuracy [9]. Detecting objects of different scales has therefore become a pressing problem in object detection, especially for improving the weak accuracy on small instances [10]. This phenomenon is more serious in one-stage object detectors [2], [4], [5], [11], which do not rely on extracting candidate regions but produce the final detection result directly through regression.
Among single-stage methods, SSD [2] achieves a good balance between speed and accuracy. SSD adopts VGG-16 [3] as its base extraction network and reduces the burden on single-scale features by constructing a multi-scale feature pyramid (see Fig. 1(a)). However, its detection accuracy for small objects is not as satisfactory as expected, which may be limited by the weak representation of the shallow, earlier layers [12]. To alleviate the lack of detailed information caused by insufficient shallow representation, several solutions have been suggested in the literature [8], [9], [13], [14], [15], [16], [17]. Networks such as STDN [13] and DetNet [9] address the insufficient representation of shallow feature maps by adopting stronger backbones (ResNet [7] and DenseNet [18]), which exploit skip connections to carry over details from earlier layers. However, a deeper backbone makes the network more complicated and costs speed. SNIP [14] introduces the idea of image pyramids, allowing the network to learn the effects of scale changes by taking images at different scales as inputs (see Fig. 1(b)), but it is extremely expensive computationally because it requires multiple passes over an image with different parameters. Networks such as DSSD [16] and RSSD [17] inherit the FPN [8] concept to build interacting pyramids (see Fig. 1(c)), obtaining a more efficient representation of contextual information by fusing up-sampled feature maps of different resolutions. However, after multiple downsampling steps the high-level maps have already lost the semantic information of small objects, so even with a top-down structure there are few effective semantic features left for small object classification [14].
In this work, to improve the small-object performance of the single shot detector, an auxiliary interactive pyramid network is presented to obtain highly representative features for small object detection. First, we scale the input to multiple scales and pass each through a lightweight module to extract detailed features, which are used to enhance the representation of the low-/mid-layer features from the baseline. This operation builds an interactive inter-layer feature pyramid with multiple inputs (see Fig. 1(d)) to realize effective multi-scale feature communication. Furthermore, an adaptive interaction module is designed that performs multi-scale fusion on the enhanced low and middle layers to realize the interaction of contextual information. It is portable, leaves the original network structure intact, and compensates for the information lost through multiple downsampling operations.
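The multi-input idea above can be illustrated with a minimal sketch: downscale the image to several resolutions, run each through a lightweight module, and fuse each result into the baseline feature map of matching size. All names here (`downscale`, `lightweight_feature`, `enhance`) are hypothetical stand-ins, not the paper's implementation; real versions would use convolutions rather than identity maps.

```python
# Illustrative sketch of the multi-scale input enhancement (assumed names,
# not the authors' code). Feature maps are plain 2D lists for clarity.

def downscale(image, factor):
    """Nearest-neighbour downscale of a 2D grid by an integer factor."""
    return [row[::factor] for row in image[::factor]]

def lightweight_feature(image):
    """Stand-in for the paper's lightweight conv module (identity here)."""
    return image

def enhance(baseline, auxiliary):
    """Element-wise fusion of two same-sized feature maps."""
    return [[b + a for b, a in zip(brow, arow)]
            for brow, arow in zip(baseline, auxiliary)]

# A 4x4 "image" scaled to two resolutions, matching two pyramid levels.
image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
inputs = [downscale(image, f) for f in (1, 2)]      # multi-scale inputs
feats = [lightweight_feature(x) for x in inputs]     # detailed features
```

Each element of `feats` would then enhance the baseline feature map of the same resolution via `enhance`, leaving the original network untouched.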
In summary, the main contributions of the work are as follows:
- (1)
An auxiliary multi-scale enhancement network is designed to supplement the low-/mid-level information through interactive multi-input information. Without changing the original network structure, it significantly improves the model's detection performance on small and medium objects.
- (2)
An adaptive interaction module is designed for the interaction of the enhanced multi-scale features, with the fusion ratio learned by the network. It further enriches the feature representation by aggregating the contextual information of adjacent layers.
- (3)
Comprehensive experiments are carried out on PASCAL VOC and MS COCO, which show the effectiveness and portability of the proposed method.
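The adaptive interaction of contribution (2) can be sketched as a weighted sum of adjacent-layer feature maps whose weights are learnable scalars normalised with a softmax, so the fusion ratio always sums to one. This is only a plausible reading of "the fusion ratio is learned from the network"; the function names and the softmax normalisation are assumptions, not the paper's stated design.

```python
import math

def softmax(weights):
    """Normalise raw learnable scalars into fusion ratios summing to 1."""
    exps = [math.exp(w) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_adjacent(feature_maps, weights):
    """Adaptive fusion: weighted sum of same-sized adjacent-level features.

    feature_maps: list of flattened feature maps (lists of floats).
    weights: raw learnable scalars, one per input map.
    """
    alphas = softmax(weights)
    fused = [0.0] * len(feature_maps[0])
    for alpha, fmap in zip(alphas, feature_maps):
        fused = [f + alpha * v for f, v in zip(fused, fmap)]
    return fused
```

During training the raw `weights` would be updated by backpropagation like any other parameter, letting the network decide how much each adjacent scale contributes.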
Section snippets
Related work
Popular object detection algorithms can be divided into two categories. The first is the candidate-region-based, or two-stage, algorithm, which first uses a region proposal network (RPN) to generate candidate regions and then performs classification and regression on them to obtain the final detection results. Standard two-stage models include RCNN [19], SPPNet [20], Fast RCNN [21], Faster RCNN [6] and R-FCN [22], which have high
Method
SSD uses a VGG-16 [3] backbone for feature extraction, and the extra network adds four cascading convolution layers. The image passes through the network from left to right, generating a series of feature maps. There are six prediction layers: the last two blocks of the original VGG-16 and the four newly added extra layers, with feature sizes of 38 × 38, 19 × 19, 10 × 10, 5 × 5, 3 × 3 and 1 × 1, respectively, and 512, 1024, 512, 256, 256 and 256 channels. Finally, the detection results
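As a quick sanity check on the numbers above, this illustrative snippet tabulates the six SSD prediction levels and counts the per-cell prediction locations across the pyramid (one per spatial cell; the actual number of default boxes is larger, since SSD places several aspect ratios per cell).

```python
# SSD-style prediction pyramid from the text: (spatial size, channels) pairs.
feature_sizes = [38, 19, 10, 5, 3, 1]
channels = [512, 1024, 512, 256, 256, 256]
pyramid = list(zip(feature_sizes, channels))

# Spatial prediction locations, one per cell per level:
locations = sum(s * s for s in feature_sizes)  # 38^2 + 19^2 + ... + 1^2
```

The count makes the scale imbalance concrete: the 38 × 38 map alone contributes 1444 of the locations, which is why enhancing that shallow level matters most for small objects.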
Experiments
In this section, we conduct comprehensive experiments on PASCAL VOC [27] and MS COCO [28]. The PASCAL VOC dataset contains 20 object categories. We merge the VOC 2007 trainval set and the VOC 2012 trainval set, a total of 16,551 images, for combined training, and evaluate on the PASCAL VOC 2007 test set (4952 images). Mean average precision (mAP) is used as the metric. The MS COCO dataset contains 80 object categories, divided into 80k training, 40k validation and 40k
Conclusion
In this paper, without changing the original network structure, a lightweight multi-scale feature representation enhancement module is introduced. It interacts with multiple input features of the same size as the prediction layers, supplementing the original features with more detailed ones after only a small amount of convolution. Furthermore, an adaptive interaction module is designed to aggregate the enhanced features of adjacent scales, improving the baseline's ability to detect small
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgement
This work was supported by the National Natural Science Foundation of China (grant no. 61573168).
References (49)
- et al., Object detection network based on feature fusion and attention mechanism, Future Internet (2019)
- et al., Single-shot bidirectional pyramid networks for high-quality object detection, Neurocomputing (2020)
- et al., TDFSSD: top-down feature fusion single shot multibox detector, Signal Process. Image Commun. (2020)
- et al., ImageNet classification with deep convolutional neural networks
- et al., SSD: single shot multibox detector
- et al., Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556 (2014)
- et al., You only look once: unified, real-time object detection
- et al., YOLO9000: better, faster, stronger
- et al., Faster R-CNN: towards real-time object detection with region proposal networks, IEEE Trans. Pattern Anal. Mach. Intell. (2016)
- et al., Deep residual learning for image recognition
- Feature pyramid networks for object detection
- DetNet: a backbone network for object detection, arXiv preprint arXiv:1804.06215
- SOD-MTGAN: small object detection via multi-task generative adversarial network
- YOLOv3: an incremental improvement, arXiv preprint arXiv:1804.02767
- Speed/accuracy trade-offs for modern convolutional object detectors
- Scale-transferrable object detection
- An analysis of scale invariance in object detection - SNIP
- Pyramid methods in image processing, RCA Engineer
- DSSD: deconvolutional single shot detector, arXiv preprint arXiv:1701.06659
- Enhancement of SSD by concatenating feature maps for object detection, arXiv preprint arXiv:1705.09587
- Densely connected convolutional networks
- Rich feature hierarchies for accurate object detection and semantic segmentation
- Spatial pyramid pooling in deep convolutional networks for visual recognition, IEEE Trans. Pattern Anal. Mach. Intell.
- Fast R-CNN