
CNN-Based RGB-D Salient Object Detection: Learn, Select, and Fuse

Published in: International Journal of Computer Vision

Abstract

The goal of this work is to present a systematic solution for RGB-D salient object detection that addresses three aspects within a unified framework: modality-specific representation learning, complementary cue selection, and cross-modal complement fusion. To learn discriminative modality-specific features, we propose a hierarchical cross-modal distillation scheme in which the progressive predictions of the well-learned source modality supervise both feature-hierarchy learning and inference in the new modality. To better select complementary cues, we formulate a residual function that adaptively incorporates complements from the paired modality. Furthermore, a top-down fusion structure is constructed for sufficient cross-modal, cross-level interactions. The experimental results demonstrate the effectiveness of the proposed cross-modal distillation scheme in learning from a new modality, the advantages of the proposed multi-modal fusion pattern in selecting and fusing cross-modal complements, and the generalization of the proposed designs across different tasks.
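The residual selection described above can be pictured as the depth stream contributing only a gated residual on top of the RGB feature, rather than the two streams being concatenated wholesale. The following is a minimal NumPy sketch of that idea, not the paper's learned architecture: the fixed scalar `w_gate` and the sigmoid gating are illustrative assumptions standing in for learned parameters.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def residual_complement_fusion(f_rgb, f_depth, w_gate=0.5):
    """Sketch of residual complement selection between two modalities.

    The paired (depth) feature map does not replace the RGB feature;
    it only adds an element-wise gated residual on top of it.
    """
    # Element-wise selection weights derived from the depth feature.
    # In a trained network this gating would be learned; here it is fixed.
    gate = sigmoid(w_gate * f_depth)
    # RGB feature plus the selected complement from the paired modality.
    return f_rgb + gate * f_depth

# Toy feature maps standing in for one channel of a CNN feature hierarchy.
f_rgb = np.ones((4, 4))
f_depth = np.full((4, 4), 2.0)
fused = residual_complement_fusion(f_rgb, f_depth)
```

Because the gate is bounded in (0, 1), the fused response stays anchored to the RGB feature and the depth stream can only contribute, never dominate, which is the intuition behind formulating the selection as a residual function.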




Acknowledgements

This work was supported in part by the Research Grants Council of Hong Kong under Project CityU 11203619, 11213420; in part by the Delta-NTU Corporate Lab for Cyber-Physical Systems with Delta Electronics Inc.; in part by the National Research Foundation (NRF) Singapore under the Corp Lab@University Scheme; in part by the National Research Foundation Singapore under its AI Singapore Program under Award AISG-RP-2018-003; and in part by the MOE Tier-1 Research under Grant RG22/19 (S).

Author information


Corresponding author

Correspondence to Youfu Li.

Additional information

Communicated by Jian Sun.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Chen, H., Li, Y., Deng, Y. et al. CNN-Based RGB-D Salient Object Detection: Learn, Select, and Fuse. Int J Comput Vis 129, 2076–2096 (2021). https://doi.org/10.1007/s11263-021-01452-0

