
CNN-Based RGB-D Salient Object Detection: Learn, Select, and Fuse

Published in: International Journal of Computer Vision

Abstract

The goal of this work is to present a systematic solution for RGB-D salient object detection that addresses three aspects within a unified framework: modality-specific representation learning, complementary cue selection, and cross-modal complement fusion. To learn discriminative modality-specific features, we propose a hierarchical cross-modal distillation scheme in which the progressive predictions of the well-learned source modality supervise both feature-hierarchy learning and inference in the new modality. To better select complementary cues, we formulate a residual function that adaptively incorporates complements from the paired modality. Furthermore, a top-down fusion structure is constructed for sufficient cross-modal, cross-level interactions. The experimental results demonstrate the effectiveness of the proposed cross-modal distillation scheme in learning from a new modality, the advantages of the proposed multi-modal fusion pattern in selecting and fusing cross-modal complements, and the generalization of the proposed designs across different tasks.
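The residual selection described above can be pictured as the depth stream contributing only a gated residual on top of the RGB feature, rather than the two streams being concatenated wholesale. The following is a minimal NumPy sketch of that idea, not the paper's learned architecture: the fixed scalar `w_gate` and the sigmoid gating are illustrative assumptions standing in for learned parameters.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def residual_complement_fusion(f_rgb, f_depth, w_gate=0.5):
    """Sketch of residual complement selection between two modalities.

    The paired (depth) feature map does not replace the RGB feature;
    it only adds an element-wise gated residual on top of it.
    """
    # Element-wise selection weights derived from the depth feature.
    # In a trained network this gating would be learned; here it is fixed.
    gate = sigmoid(w_gate * f_depth)
    # RGB feature plus the selected complement from the paired modality.
    return f_rgb + gate * f_depth

# Toy feature maps standing in for one channel of a CNN feature hierarchy.
f_rgb = np.ones((4, 4))
f_depth = np.full((4, 4), 2.0)
fused = residual_complement_fusion(f_rgb, f_depth)
```

Because the gate is bounded in (0, 1), the fused response stays anchored to the RGB feature and the depth stream can only contribute, never dominate, which is the intuition behind formulating the selection as a residual function.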




Acknowledgements

This work was supported in part by the Research Grants Council of Hong Kong under Project CityU 11203619, 11213420; in part by the Delta-NTU Corporate Lab for Cyber-Physical Systems with Delta Electronics Inc.; in part by the National Research Foundation (NRF) Singapore under the Corp Lab@University Scheme; in part by the National Research Foundation Singapore under its AI Singapore Program under Award AISG-RP-2018-003; and in part by the MOE Tier-1 Research under Grant RG22/19 (S).

Author information


Corresponding author

Correspondence to Youfu Li.

Additional information

Communicated by Jian Sun.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Chen, H., Li, Y., Deng, Y. et al. CNN-Based RGB-D Salient Object Detection: Learn, Select, and Fuse. Int J Comput Vis 129, 2076–2096 (2021). https://doi.org/10.1007/s11263-021-01452-0

