Abstract
Semantic segmentation has a wide range of applications, such as scene understanding, autonomous driving, and robotic manipulation. Existing segmentation models achieve good performance using purely bottom-up deep neural processing. This paper instead describes a novel deep learning architecture that integrates top-down and bottom-up processing, and the resulting model achieves higher accuracy at a relatively low computational cost. In the proposed model, higher-level top-down information is transmitted to lower layers through recurrent connections in both the encoder and the decoder, and the recurrent connection weights are trained using backpropagation. Experiments on the CamVid, SUN RGB-D, and PASCAL VOC 2012 benchmark datasets show that this use of top-down information improves the mean intersection over union (mIoU) by more than 3% compared with a state-of-the-art bottom-up-only network. Additionally, the proposed model is successfully applied to a dataset designed for robotic grasping tasks.
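The headline result is reported in mean intersection over union (mIoU). The exact evaluation code behind those numbers is not given here, so the following is only a minimal sketch of how mIoU is commonly computed. It assumes flattened per-pixel label lists, a confusion matrix accumulated over the whole evaluation set, an optional "void" label that is skipped (PASCAL VOC uses 255 for this), and the convention of excluding classes that appear in neither the ground truth nor the prediction. The function names `confusion_matrix` and `mean_iou` are hypothetical.

```python
from typing import List, Optional, Sequence


def confusion_matrix(gt: Sequence[int], pred: Sequence[int], num_classes: int,
                     ignore_index: Optional[int] = None) -> List[List[int]]:
    """Count pixels: rows are ground-truth classes, columns are predicted classes."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for g, p in zip(gt, pred):
        if ignore_index is not None and g == ignore_index:
            continue  # skip unlabeled / void pixels
        cm[g][p] += 1
    return cm


def mean_iou(cm: List[List[int]]) -> float:
    """Average of per-class IoU = TP / (TP + FP + FN)."""
    n = len(cm)
    ious = []
    for c in range(n):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                         # class-c pixels predicted as something else
        fp = sum(cm[r][c] for r in range(n)) - tp    # other pixels predicted as class c
        denom = tp + fp + fn
        if denom == 0:
            continue  # assumption: classes absent from both GT and prediction are excluded
        ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else 0.0


if __name__ == "__main__":
    gt = [0, 0, 1, 1, 2, 2]
    pred = [0, 1, 1, 1, 2, 0]
    cm = confusion_matrix(gt, pred, num_classes=3)
    print(mean_iou(cm))  # per-class IoU 1/3, 2/3, 1/2 -> mean 0.5
```

Accumulating a single confusion matrix over all test images, rather than averaging per-image scores, keeps small or rare objects from being over- or under-weighted by individual images. Whether a reported gain such as "more than 3%" refers to absolute percentage points or a relative improvement depends on the paper's own reporting.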
Acknowledgements
This work was supported by the Technology Innovation Industrial Program funded by the Ministry of Trade (MI, South Korea) [10073161, Technology Innovation Program], as well as by an Institute for Information and Communications Technology Promotion (IITP) grant funded by MSIT (No. 2018-0-00622).
About this article
Cite this article
Kim, B.W., Park, Y. & Suh, I.H. Integration of top-down and bottom-up visual processing using a recurrent convolutional–deconvolutional neural network for semantic segmentation. Intel Serv Robotics 13, 87–97 (2020). https://doi.org/10.1007/s11370-019-00296-5