Abstract
The heavy computational overhead of convolutional neural networks limits their inference on mobile devices for object detection, a task that plays a critical role in many real-world applications such as face identification, autonomous driving, and video surveillance. To address this problem, this paper introduces a lightweight convolutional neural network called RSANet, which targets real-time object detection with a Residual Semantic-guided Attention Feature Pyramid Network. RSANet consists of two parts: (a) a Lightweight Convolutional Network (LCNet) as the backbone, and (b) a Residual Semantic-guided Attention Feature Pyramid Network (RSAFPN) as the detection head. In LCNet, in contrast to recent lightweight networks that rely on pointwise convolution to change the number of feature maps, we design a Constant Channel Module (CCM) to reduce the Memory Access Cost (MAC) and a Down Sampling Module (DSM) to reduce the computational cost. In RSAFPN, we employ a Residual Semantic-guided Attention Mechanism (RSAM) to fuse the multi-scale features from LCNet and improve detection performance efficiently. Experimental results show that, on the PASCAL VOC 2007 dataset, RSANet has a model size of only 3.24 M and needs only 3.54B FLOPs with a 416×416 input image. Compared to YOLO Nano, our method obtains a 6.7% improvement in accuracy and requires less computation. On the MS COCO dataset, RSANet has a model size of only 4.35 M and needs only 2.34B FLOPs with a 320×320 input image, and it obtains a 1.3% improvement in accuracy compared to Pelee. These comprehensive experiments demonstrate that our model achieves a promising trade-off between speed and accuracy.
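The abstract contrasts a constant channel count with pointwise convolutions that change the number of feature maps. The sketch below is not RSANet code. It uses the standard memory-access-cost model for a 1×1 convolution from the ShuffleNet V2 design guidelines, MAC = hw(c_in + c_out) + c_in·c_out, to show why equal input and output widths minimize MAC at fixed FLOPs. The function name `pointwise_conv_cost`, the 52×52 feature-map size, and the channel counts are assumptions chosen for illustration only.

```python
def pointwise_conv_cost(h, w, c_in, c_out):
    """Return (flops, mac) for a 1x1 convolution on an h x w feature map.

    FLOPs are counted as multiply-accumulates: h * w * c_in * c_out.
    MAC counts the elements read and written: the input feature map,
    the output feature map, and the weights.
    """
    flops = h * w * c_in * c_out
    mac = h * w * (c_in + c_out) + c_in * c_out
    return flops, mac


# Three hypothetical layers with identical FLOPs (c_in * c_out = 4096)
# but increasingly unbalanced channel ratios.
for c_in, c_out in [(64, 64), (32, 128), (16, 256)]:
    flops, mac = pointwise_conv_cost(52, 52, c_in, c_out)
    print(f"{c_in:>3} -> {c_out:<3}  FLOPs={flops:,}  MAC={mac:,}")
```

Under this model the balanced 64 → 64 layer has the lowest MAC of the three, and the cost grows as the channel ratio moves away from 1:1 even though the FLOPs are unchanged.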
Acknowledgments
National Natural Science Foundation of China (61876093, 61671253), National Natural Science Foundation of Jiangsu Province (BK20181393), and China Scholarship Council (201908320072).
About this article
Cite this article
Zhou, Q., Wang, J., Liu, J. et al. RSANet: Towards Real-Time Object Detection with Residual Semantic-Guided Attention Feature Pyramid Network. Mobile Netw Appl 26, 77–87 (2021). https://doi.org/10.1007/s11036-020-01723-z
DOI: https://doi.org/10.1007/s11036-020-01723-z