Interactive multi-scale feature representation enhancement for small object detection

https://doi.org/10.1016/j.imavis.2021.104128

Highlights

  • An auxiliary multi-scale input network for enhancing shallow features.

  • Interactive multi-scale features for aggregating contextual information.

  • An adaptive factor embedded in the interactive module to control the fusion ratio.

Abstract

In the field of object detection, there is a wide gap between the performance on small objects and that on medium and large objects. Some studies show that this gap stems from the contradiction between the classification-oriented backbone and the localization task. Although reducing the feature map size is beneficial for extracting abstract features, it causes the loss of detailed features needed for localization as the image traverses the backbone. Therefore, an interactive multi-scale feature representation enhancement strategy is proposed. This strategy includes two modules. First, a multi-scale auxiliary enhancement network is proposed for feature interaction under multiple inputs: we scale the input to multiple scales corresponding to the prediction layers, and each scaled input passes only through a lightweight extraction module that extracts detailed features for enhancing the original features. Second, an adaptive interaction module is designed to aggregate the features of adjacent layers. This approach flexibly improves small object detection without changing the original network structure. Comprehensive experimental results on the PASCAL VOC and MS COCO datasets show the effectiveness of the proposed method.

Introduction

At present, deep learning-based object detection models all rely on CNN [1] classification networks as feature extractors. For example, the Single Shot Detector (SSD) [2] uses VGG [3], YOLO [4] uses DarkNet [5], and Faster R-CNN [6] uses ResNet [7]. We generally call these networks the backbone of the object detection model. For classification tasks, the accuracy of classification depends on the quality of feature extraction. Strided convolutions and max-pooling are considered an essential part of a convolutional backbone: they let CNNs perform feature extraction over a larger image region and increase the receptive field, which works quite well in practice for classification. Since the object detection task also requires identifying the classes of objects in an image, a very straightforward idea is to use an image classification network directly for detection. However, beyond classification, object detection also depends on localization, which is precisely the capability that classification networks lack.

As mentioned above, most current object detection networks simply connect the features produced by the backbone to subsequent networks in a top-down fashion [8] for joint training, making the extracted features suitable for the detection task. However, a large amount of feature information is lost through the repeated convolution and pooling operations of the feature extraction stage, which seriously affects localization accuracy [9]. Therefore, detecting objects of different scales has become an urgent problem in object detection, especially improving the weak accuracy on small instances [10]. This phenomenon is more serious in one-stage object detectors [2], [4], [5], [11], which do not rely on candidate-region extraction but directly produce the final detection result through regression.

Among single-stage methods, SSD [2] achieves a good balance between speed and accuracy. VGG-16 [3] is adopted as the basic extraction network in SSD, which reduces the burden on single-scale features by constructing a multi-scale feature pyramid (see Fig. 1(a)). However, its detection accuracy on small objects is not as satisfactory as expected. This may be limited by the information in the shallow layers [12]. To alleviate the lack of detailed information caused by insufficient shallow representation capability, possible solutions have been suggested in the literature [8], [9], [13], [14], [15], [16], [17]. Networks such as STDN [13] and DetNet [9] address the insufficient representation of shallow feature maps by using stronger backbones (ResNet [7] and DenseNet [18]), which exploit skip connections to introduce the details of previous layers. However, a deeper backbone makes the network more complicated and costs speed. SNIP [14] introduces the idea of image pyramids, letting the network learn the effects of scale changes by using images at different scales as inputs (see Fig. 1(b)), but it is extremely computationally expensive because it requires multiple passes over an image with different parameters. Networks such as DSSD [16] and RSSD [17] inherit the FPN [8] concept to establish interactive pyramids (see Fig. 1(c)), obtaining a more efficient representation of contextual information by fusing up-sampled feature maps of different resolutions. However, the high-level layers have already lost the semantic information of small objects through repeated downsampling, so even with a top-down structure, little effective semantic information remains for small object classification [14].
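The FPN-style fusion mentioned above can be sketched in a few lines. This is a minimal illustration, not the authors' code: nearest-neighbour upsampling stands in for the learned deconvolution that DSSD uses, and feature maps are plain 2D lists rather than tensors.

```python
# Sketch of top-down pyramid fusion: a coarser feature map is upsampled
# to the finer map's resolution and merged element-wise, so the fine layer
# receives contextual information from the layer above it.

def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a 2D feature map."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # duplicate each column
        out.append(wide)
        out.append(list(wide))                     # duplicate each row
    return out

def fuse_top_down(fine, coarse):
    """Add the upsampled coarse map to the fine map, element-wise."""
    up = upsample2x(coarse)
    return [[f + u for f, u in zip(fr, ur)] for fr, ur in zip(fine, up)]

fine = [[1.0, 2.0], [3.0, 4.0]]     # a 2x2 patch of a high-resolution layer
coarse = [[0.5]]                    # the corresponding 1x1 patch above it
print(fuse_top_down(fine, coarse))  # [[1.5, 2.5], [3.5, 4.5]]
```

The limitation the paragraph points out is visible even here: whatever detail the coarse map has already lost through downsampling cannot be recovered by upsampling; fusion only redistributes the information that survived.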

In this work, to improve the small object performance of the single shot detector, an auxiliary interactive pyramid network is presented to obtain highly representative features for small object detection. First, we scale the input to multiple scales, each passing through a lightweight module that extracts detailed features, which are used to enhance the representation of the low-/mid-layer features of the baseline. This builds an interactive inter-layer feature pyramid with multiple inputs (see Fig. 1(d)) to realize effective multi-scale feature communication. Furthermore, an adaptive interaction module is designed that performs multi-scale fusion on the enhanced low and middle layers to realize the interaction of contextual information. It is portable, leaves the original network structure intact, and compensates for the information lost through repeated downsampling.
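The core of the adaptive interaction module is a fusion ratio learned by the network rather than fixed by hand. The sketch below illustrates that idea only; the scalar `w`, the sigmoid squashing, and the two-branch form are illustrative assumptions, not the paper's exact formulation (in training, `w` would be a parameter updated by backpropagation).

```python
import math

def adaptive_fuse(feat_a, feat_b, w):
    """Fuse two same-sized feature maps with a learned ratio.

    alpha = sigmoid(w) keeps the ratio in (0, 1), so neither branch is
    ever switched off entirely; the network can shift alpha toward
    whichever branch helps detection more.
    """
    alpha = 1.0 / (1.0 + math.exp(-w))
    return [[alpha * a + (1.0 - alpha) * b for a, b in zip(ra, rb)]
            for ra, rb in zip(feat_a, feat_b)]

enhanced_low = [[2.0, 4.0]]   # e.g. the enhanced shallow-layer features
neighbor_mid = [[0.0, 0.0]]   # e.g. the adjacent mid-layer features
print(adaptive_fuse(enhanced_low, neighbor_mid, 0.0))  # w=0 -> alpha=0.5 -> [[1.0, 2.0]]
```

With `w = 0` the two branches are weighted equally; a large positive `w` drives the output toward the first branch, which is how a learned scalar can control the fusion ratio end-to-end.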

In summary, the main contributions of the work are as follows:

  • (1)

An auxiliary multi-scale enhancement network is designed to supplement the low-/mid-level information through interactive multi-input information. Without changing the original network structure, it significantly improves the model's detection of small and medium objects.

  • (2)

An adaptive interaction module is designed for the interaction of the enhanced multi-scale features, with the fusion ratio learned by the network. It further enriches the feature representation by aggregating the contextual information of adjacent layers.

  • (3)

    Comprehensive experiments are carried out on PASCAL VOC and MS COCO, which show the effectiveness and portability of the proposed method.

Section snippets

Related work

Popular object detection algorithms can be divided into two categories. The first is the candidate-region-based, or two-stage, algorithm, which first uses a region proposal network (RPN) to generate candidate regions and then performs classification and regression on those regions to obtain the final detection results. Standard two-stage models include RCNN [19], SPPNet [20], Fast RCNN [21], Faster RCNN [6] and R-FCN [22], which have high

Method

SSD uses a VGG-16 [3] backbone for feature extraction, and the extra network adds four cascaded convolution layers. The image passes through the network from left to right, generating a series of feature maps. There are six prediction layers: the last two block layers of the original VGG-16 and the four newly added extra layers, with feature sizes of 38 × 38, 19 × 19, 10 × 10, 5 × 5, 3 × 3 and 1 × 1, and channel counts of 512, 1024, 512, 256, 256 and 256, respectively. Finally, the detection results
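The six-layer pyramid above can be written down as simple bookkeeping. The layer names below (`conv4_3`, `conv7`, etc.) are the standard SSD300 names and are an assumption here, since the excerpt does not name the layers; the sizes and channels are those stated in the text.

```python
# The SSD300 prediction pyramid described in the text:
# (layer name, spatial size, channel count) for a 300x300 input.
PYRAMID = [
    ("conv4_3",  38,  512),
    ("conv7",    19, 1024),
    ("conv8_2",  10,  512),
    ("conv9_2",   5,  256),
    ("conv10_2",  3,  256),
    ("conv11_2",  1,  256),
]

def total_cells(pyramid):
    """Total number of spatial cells across all prediction layers."""
    return sum(size * size for _, size, _ in pyramid)

# 38^2 + 19^2 + 10^2 + 5^2 + 3^2 + 1^2 = 1940 cells
print(total_cells(PYRAMID))  # 1940
```

The count makes the scale imbalance concrete: the single 38 × 38 layer contributes 1444 of the 1940 cells, so the quality of that one shallow layer dominates small object detection, which is exactly the layer the proposed auxiliary network enhances.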

Experiments

In this section, we conduct comprehensive experiments on PASCAL VOC [27] and MS COCO [28]. The PASCAL VOC dataset contains 20 object categories. We merge the VOC 2007 trainval set and the VOC 2012 trainval set, a total of 16,551 images, for combined training, and evaluate on the PASCAL VOC 2007 test set (4952 images). Mean average precision (mAP) is used as the measure. The MS COCO dataset contains 80 object categories, divided into 80 k training, 40 k validation and 40 k

Conclusion

In this paper, without changing the original network structure, a lightweight multi-scale feature representation enhancement module is introduced. It interacts with multiple input features of the same sizes as the prediction layers, supplementing the original features with more detailed features after only a small amount of convolution. Furthermore, an adaptive interaction module is designed to aggregate the enhanced features of adjacent scales to improve the baseline's ability to detect small

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

This work was supported by the National Natural Science Foundation of China (grant no. 61573168).

References (49)

  • T.Y. Lin et al., Feature pyramid networks for object detection

  • Z. Li et al., DetNet: A backbone network for object detection, arXiv preprint arXiv:1804.06215 (2018)

  • Y. Bai et al., SOD-MTGAN: Small object detection via multi-task generative adversarial network

  • J. Redmon et al., YOLOv3: An incremental improvement, arXiv preprint arXiv:1804.02767 (2018)

  • J. Huang et al., Speed/accuracy trade-offs for modern convolutional object detectors

  • P. Zhou et al., Scale-transferrable object detection

  • B. Singh et al., An analysis of scale invariance in object detection - SNIP

  • E.H. Adelson et al., Pyramid methods in image processing, RCA Engineer (1984)

  • C.Y. Fu et al., DSSD: Deconvolutional single shot detector, arXiv preprint arXiv:1701.06659 (2017)

  • J. Jeong et al., Enhancement of SSD by concatenating feature maps for object detection, arXiv preprint arXiv:1705.09587 (2017)

  • G. Huang et al., Densely connected convolutional networks

  • R. Girshick et al., Rich feature hierarchies for accurate object detection and semantic segmentation

  • K. He et al., Spatial pyramid pooling in deep convolutional networks for visual recognition, IEEE Trans. Pattern Anal. Mach. Intell. (2015)

  • R. Girshick, Fast R-CNN
Cited by (13)

  • Deep learning-based detection from the perspective of small or tiny objects: A survey, Image and Vision Computing, 2022. Citation excerpt: "It can further improve detection performance on large, medium and small objects. In particular, through using multi-scale testing, IMFRE512 [129] improves the detection performance of small objects by 6%. In Table 27 and Table 28, we compare some detectors from the SOD and USC-GRAD-STDdb dataset, respectively."

  • An Intelligent Optimization Based Yolov5 Framework to Detect the Rice Leaf Disease, 2023 3rd Asian Conference on Innovation in Technology, ASIANCON 2023