Abstract
Object detection is a valuable but challenging technology in computer vision research. Although existing methods attain satisfactory results on high-performance computers, their large number of network parameters places a heavy computational burden on mobile devices with limited computing power. Existing methods therefore usually face a dilemma between accuracy and speed, and poor detection performance makes detection tasks difficult to deploy. This paper optimizes the classic YOLOv4 and proposes the SlimYOLOv4 network structure. First, we change the feature extraction network from CSPDarknet53 to MobileNetV2. Second, DO-DConv (depthwise over-parameterized depthwise convolutional layer) and DSC (depthwise separable convolution) are selected to replace the standard convolution in the network structure, which greatly reduces computation and improves network performance. Finally, Leaky ReLU is replaced by ReLU6 to improve the numerical resolution. We evaluate SlimYOLOv4 on the PASCAL VOC07+12 and MS COCO datasets. The experimental results demonstrate that our method has only 12.6% of the parameters of YOLOv4 and runs 1.59 times as fast, reaching 60.19 frames per second (FPS), which is suitable for real-time detection. It achieves 70.83% mean average precision (mAP) on PASCAL VOC07+12 and 29.2% mAP on MS COCO. As a lightweight object detector, it balances speed and accuracy and is comparable to state-of-the-art detectors.
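To make the parameter saving from depthwise separable convolution concrete, the following minimal sketch counts the weights of a single standard convolution and of its depthwise separable counterpart, and shows the clipping behaviour of ReLU6. The layer sizes (256 input channels, 512 output channels, 3×3 kernel) and all function names are hypothetical choices for illustration and are not taken from the paper; DO-DConv is not modelled here, and biases are ignored. The last lines derive the YOLOv4 frame rate implied by the figures quoted in the abstract.

```python
def standard_conv_params(c_in, c_out, k):
    """Weight count of a k x k standard convolution (bias omitted)."""
    return k * k * c_in * c_out


def dsc_params(c_in, c_out, k):
    """Weight count of a depthwise separable convolution (bias omitted):
    one k x k depthwise filter per input channel, then a 1 x 1 pointwise conv."""
    depthwise = k * k * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


def relu6(x):
    """ReLU6 activation: clip the input to the range [0, 6]."""
    return min(max(x, 0.0), 6.0)


# Hypothetical layer sizes, chosen only for illustration.
c_in, c_out, k = 256, 512, 3

std = standard_conv_params(c_in, c_out, k)
dsc = dsc_params(c_in, c_out, k)
ratio = dsc / std  # equals 1/c_out + 1/k**2 for any layer shape

print(f"standard conv weights:  {std}")
print(f"separable conv weights: {dsc}")
print(f"ratio: {ratio:.4f}")

# YOLOv4 speed implied by the abstract: 60.19 FPS is 1.59 times the baseline.
baseline_fps = 60.19 / 1.59
print(f"implied YOLOv4 speed: {baseline_fps:.1f} FPS")
```

For this hypothetical layer the separable form keeps roughly 11% of the weights. That is a per-layer figure only; the 12.6% reported in the abstract covers the whole network, including the change of backbone, so the two numbers are not directly comparable.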
Acknowledgements
This work is supported by the Key-Area Research and Development Program of Guangdong Province under Grant 2020B0909020001 and the National Natural Science Foundation of China under Grant No. 61573113.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Ding, P., Qian, H. & Chu, S. SlimYOLOv4: lightweight object detector based on YOLOv4. J Real-Time Image Proc 19, 487–498 (2022). https://doi.org/10.1007/s11554-022-01201-7
DOI: https://doi.org/10.1007/s11554-022-01201-7