Skip to main content
Log in

Multi-view aggregation for real-time accurate object detection of a moving camera

  • Original Research Paper
  • Published:
Journal of Real-Time Image Processing Aims and scope Submit manuscript

Abstract

Object detection plays an important role on various mobile robot tasks. However, directly applying existing detectors on videos from a mobile robot will cause a sharp accuracy decline, because such videos introduce some extra difficulties on accurate detection. This paper proposes a viewpoint-based memory mechanism to handle detection performance deterioration and improve detection accuracy of the videos in real time. The mechanism positively organizes previous results from multiple viewpoints of target objects as prior knowledge to enhance detection accuracy for succeeding frames, and it is designed as an extension module of an existing image detector. In experiments, we collect testing dataset from an indoor mobile robot, and compare performance of several sole image detectors and the same detectors extended by the extension module. The result shows the mechanism module achieves 20.7% object localization rate margin in average at a cost of 18.1 ms, and the mechanism can give positive impact on various existing detectors. The result indicates the proposed method achieves good accuracy margin, has acceptable time cost, and gets a degree of universal applicability.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9

Similar content being viewed by others

References

  1. Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: common objects in context. In: European conference on computer vision, pp. 740–755. Springer (2014)

  2. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., Berg, A.C.: Ssd: single shot multibox detector. In: European conference on computer vision, pp. 21–37. Springer (2016)

  3. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: unified, real-time object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788 (2016)

  4. Redmon, J., Farhadi, A.: Yolo9000: better, faster, stronger. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7263–7271 (2017)

  5. Redmon, J., Farhadi, A.: Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767 (2018)

  6. Bochkovskiy, A., Wang, C.Y., Liao, H.Y.M.: Yolov4: optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934 (2020)

  7. Dai, J., Li, Y., He, K., Sun, J.: R-fcn: object detection via region-based fully convolutional networks. In: Proceedings of the 30th International Conference on Neural Information Processing Systems, pp. 379–387 (2016)

  8. Kang, K., Ouyang, W., Li, H., Wang, X.: Object detection from video tubelets with convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 817–825 (2016)

  9. Kang, K., Li, H., Yan, J., Zeng, X., Yang, B., Xiao, T., Zhang, C., Wang, Z., Wang, R., Wang, X., et al.: T-cnn: tubelets with convolutional neural networks for object detection from videos. IEEE Trans. Circuits Syst. Video Technol. 28(10), 2896–2907 (2017)

    Article  Google Scholar 

  10. Feichtenhofer, C., Pinz, A., Zisserman, A.: Detect to track and track to detect. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 3038–3046 (2017)

  11. Deng, H., Hua, Y., Song, T., Zhang, Z., Xue, Z., Ma, R., Robertson, N., Guan, H.: Object guided external memory network for video object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6678–6687 (2019)

  12. Chen, Y., Cao, Y., Hu, H., Wang, L.: Memory enhanced global-local aggregation for video object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10337–10346 (2020)

  13. Zhu, X., Xiong, Y., Dai, J., Yuan, L., Wei, Y.: Deep feature flow for video recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2349–2358 (2017)

  14. Zhu, X., Wang, Y., Dai, J., Yuan, L., Wei, Y.: Flow-guided feature aggregation for video object detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 408–417 (2017)

  15. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. Ieee (2009)

  16. Zhu, X., Dai, J., Yuan, L., Wei, Y.: Towards high performance video object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7210–7218 (2018)

  17. Hu, J., Wang, T., Li, Y., Zhu, S.: A novel memory mechanism for video object detection from indoor mobile robots. SIViP 15(8), 1785–1795 (2021)

    Article  Google Scholar 

  18. Kang, K., Li, H., Xiao, T., Ouyang, W., Yan, J., Liu, X., Wang, X.: Object detection in videos with tubelet proposal networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 727–735 (2017)

  19. Hu, J., Wang, T., Zhu, S.: Viewpoint-based memory mechanism for object detection of moving sensors. In: Proceedings of the IEEE Conference on Intelligent Systems and Computer Vision, pp. 1–7 (2022)

  20. Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 580–587 (2014)

  21. Girshick, R.: Fast r-cnn. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 1440–1448 (2015)

  22. Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: towards real-time object detection with region proposal networks. IEEE transactions on pattern analysis and machine intelligence. 39(6), 1137–1149 (2017)

  23. Han, W., Khorrami, P., Paine, T.L., Ramachandran, P., Babaeizadeh, M., Shi, H., Li, J., Yan, S., Huang, T.S.: Seq-nms for video object detection. arXiv preprint arXiv:1602.08465 (2016)

  24. Konno, T., Amma, A., Kanezaki, A.: Incremental multi-view object detection from a moving camera. In: Proceedings of the 2nd ACM International Conference on Multimedia in Asia, pp. 1–7 (2021)

  25. Tripathi, S., Lipton, Z.C., Belongie, S., Nguyen, T.: Context matters: refining object detection in video with recurrent neural networks. arXiv preprint arXiv:1607.04648 (2016)

  26. Broad, A., Jones, M., Lee, T.Y.: Recurrent multi-frame single shot detector for video object detection. In: BMVC, p. 94 (2018)

  27. Bertasius, G., Torresani, L., Shi, J.: Object detection in video with spatiotemporal sampling networks. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 331–346 (2018)

  28. Xiao, F., Lee, Y.J.: Video object detection with an aligned spatial-temporal memory. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 485–501 (2018)

  29. Liu, M., Zhu, M.: Mobile video object detection with temporally-aware feature maps. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5686–5695 (2018)

  30. Liu, M., Zhu, M., White, M., Li, Y., Kalenichenko, D.: Looking fast and slow: memory-guided mobile video object detection. arXiv preprint arXiv:1903.10172 (2019)

  31. Melonee W., T.F.: Specification for turtlebot compatible platforms. ROSWeb. https://www.ros.org/reps/rep-0119.html (2021). (Accessed 1 March 2022)

  32. Deselaers, T., Alexe, B., Ferrari, V.: Localizing objects while learning their appearance. In: European conference on computer vision, pp. 452–466. Springer (2010)

Download references

Acknowledgements

This work was supported by Robotics Institute of Zhejiang University under Grant 110202-I21707.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Tao Wang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Hu, J., Wang, T. & Zhu, S. Multi-view aggregation for real-time accurate object detection of a moving camera. J Real-Time Image Proc 19, 1169–1179 (2022). https://doi.org/10.1007/s11554-022-01253-9

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s11554-022-01253-9

Keywords

Navigation