
Integration of top-down and bottom-up visual processing using a recurrent convolutional–deconvolutional neural network for semantic segmentation

  • Original Research Paper
Intelligent Service Robotics

Abstract

Semantic segmentation has a wide range of applications, including scene understanding, autonomous driving, and robot manipulation. While existing segmentation models have achieved good performance using bottom-up deep neural processing alone, this paper describes a novel deep learning architecture that integrates top-down and bottom-up processing. The resulting model achieves higher accuracy at a relatively low computational cost. In the proposed model, higher-level top-down information is transmitted to the lower layers through recurrent connections in both the encoder and the decoder, and the recurrent connection weights are trained using backpropagation. Experiments on the CamVid, SUN-RGBD and PASCAL VOC 2012 benchmark datasets demonstrate that this use of top-down information improves the mean intersection over union by more than 3% compared with a state-of-the-art bottom-up-only network. Additionally, the proposed model is successfully applied to a dataset designed for robotic grasping tasks.
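The headline result above is stated in terms of mean intersection over union (mIoU). As a minimal sketch of how that metric is commonly computed from per-pixel labels, the following pure-Python example builds a confusion matrix and averages the per-class IoU. The function names `confusion_matrix` and `mean_iou` are hypothetical, and the evaluation protocol is an assumption: the abstract does not say how the authors handle void labels or classes absent from an image, so this sketch simply skips classes that appear in neither the ground truth nor the prediction.

```python
from typing import List, Sequence


def confusion_matrix(y_true: Sequence[int], y_pred: Sequence[int],
                     num_classes: int) -> List[List[int]]:
    """Count pixels; rows index ground-truth classes, columns index predictions."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def mean_iou(y_true: Sequence[int], y_pred: Sequence[int],
             num_classes: int) -> float:
    """Average of per-class IoU = TP / (TP + FP + FN).

    Assumption: classes absent from both ground truth and prediction
    are excluded from the average rather than counted as 0 or 1.
    """
    cm = confusion_matrix(y_true, y_pred, num_classes)
    ious = []
    for c in range(num_classes):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                   # class-c pixels predicted as something else
        fp = sum(row[c] for row in cm) - tp    # other pixels predicted as class c
        denom = tp + fp + fn
        if denom > 0:
            ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else 0.0


# Flattened toy "image" with six pixels and three classes.
truth = [0, 0, 1, 1, 2, 2]
pred = [0, 1, 1, 1, 2, 0]
print(mean_iou(truth, pred, num_classes=3))
```

In this toy case the per-class IoUs are 1/3, 2/3 and 1/2, so the mean is 0.5. An improvement of "more than 3%" in the abstract refers to a difference in this averaged score between two models evaluated on the same dataset.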





Acknowledgements

This work was supported by the Technology Innovation Industrial Program funded by the Ministry of Trade (MI, South Korea) [10073161, Technology Innovation Program], as well as by an Institute for Information and Communications Technology Promotion (IITP) grant funded by MSIT (No. 2018-0-00622).

Author information


Corresponding author

Correspondence to Il Hong Suh.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Kim, B.W., Park, Y. & Suh, I.H. Integration of top-down and bottom-up visual processing using a recurrent convolutional–deconvolutional neural network for semantic segmentation. Intel Serv Robotics 13, 87–97 (2020). https://doi.org/10.1007/s11370-019-00296-5


  • DOI: https://doi.org/10.1007/s11370-019-00296-5
