RSANet: Towards Real-Time Object Detection with Residual Semantic-Guided Attention Feature Pyramid Network

Mobile Networks and Applications

Abstract

The heavy computational overhead of convolutional neural networks limits their inference on mobile devices for object detection, which plays a critical role in many real-world scenarios such as face identification, autonomous driving, and video surveillance. To address this problem, this paper introduces a lightweight convolutional neural network called RSANet (Towards Real-time Object Detection with Residual Semantic-guided Attention Feature Pyramid Network). RSANet consists of two parts: (a) a Lightweight Convolutional Network (LCNet) as the backbone, and (b) a Residual Semantic-guided Attention Feature Pyramid Network (RSAFPN) as the detection head. In LCNet, in contrast to recent lightweight networks that rely on pointwise convolution to change the number of feature maps, we design a Constant Channel Module (CCM) to reduce the Memory Access Cost (MAC) and a Down Sampling Module (DSM) to reduce the computational cost. In RSAFPN, we employ a Residual Semantic-guided Attention Mechanism (RSAM) to fuse the multi-scale features from LCNet, improving detection performance efficiently. Experimental results show that, on the PASCAL VOC 2007 dataset, RSANet has a model size of only 3.24 M and requires only 3.54B FLOPs with a 416×416 input image. Compared to YOLO Nano, our method obtains a 6.7% improvement in accuracy and requires less computation. On the MS COCO dataset, RSANet has a model size of only 4.35 M and requires only 2.34B FLOPs with a 320×320 input image, and it obtains a 1.3% improvement in accuracy compared to Pelee. These comprehensive experimental results demonstrate that our model achieves a promising trade-off between speed and accuracy.
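The reason a constant channel count can reduce memory access cost can be illustrated with the standard cost model for a 1×1 (pointwise) convolution used in the ShuffleNet V2 design guidelines: for an h×w feature map with c_in input and c_out output channels, FLOPs ≈ h·w·c_in·c_out and MAC ≈ h·w·(c_in + c_out) + c_in·c_out. The sketch below is not the authors' CCM, whose details the abstract does not give; the function name, the 52×52 feature map, and the channel configurations are hypothetical values chosen so that all three layers have the same FLOPs.

```python
def pointwise_conv_cost(h, w, c_in, c_out):
    """Return (flops, mac) for a 1x1 convolution on an h x w feature map.

    flops: multiply-accumulate operations.
    mac:   memory accesses = read input + write output + read weights.
    """
    flops = h * w * c_in * c_out
    mac = h * w * (c_in + c_out) + c_in * c_out
    return flops, mac


# Hypothetical configurations with identical c_in * c_out (same FLOPs).
configs = [(128, 128), (64, 256), (32, 512)]

for c_in, c_out in configs:
    flops, mac = pointwise_conv_cost(52, 52, c_in, c_out)
    print(f"c_in={c_in:4d} c_out={c_out:4d} flops={flops:,} mac={mac:,}")
```

Under this model, the balanced configuration (equal input and output channels) has the lowest MAC of the three even though the FLOPs are identical, which is the kind of effect a constant-channel design would exploit.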

Acknowledgments

National Natural Science Foundation of China (61876093, 61671253), National Natural Science Foundation of Jiangsu Province (BK20181393), and China Scholarship Council (201908320072).

Author information

Corresponding author

Correspondence to Quan Zhou.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Zhou, Q., Wang, J., Liu, J. et al. RSANet: Towards Real-Time Object Detection with Residual Semantic-Guided Attention Feature Pyramid Network. Mobile Netw Appl 26, 77–87 (2021). https://doi.org/10.1007/s11036-020-01723-z

  • DOI: https://doi.org/10.1007/s11036-020-01723-z
