
Cross-Modal Fusion and Progressive Decoding Network for RGB-D Salient Object Detection

Published in: International Journal of Computer Vision

Abstract

Most existing RGB-D salient object detection (SOD) methods pursue higher performance by integrating auxiliary modules, such as feature enhancement and edge generation. However, these modules inevitably introduce feature redundancy and can degrade detection performance. To this end, we design a cross-modal fusion and progressive decoding network (termed CPNet) for the RGB-D SOD task. The network comprises only three indispensable parts: feature encoding, feature fusion, and feature decoding. Specifically, in the feature encoding part, we adopt a two-stream Swin Transformer encoder to extract multi-level, multi-scale features from RGB and depth images respectively and to model global information. In the feature fusion part, we design a cross-modal attention fusion module that leverages the attention mechanism to fuse multi-modality, multi-level features. In the feature decoding part, we design a progressive decoder that gradually fuses low-level features and filters noise to accurately predict salient objects. Extensive experiments on 6 benchmarks demonstrate that our network surpasses 12 state-of-the-art methods on four metrics. In addition, we verify that, under this framework, adding a feature enhancement module or an edge generation module does not improve detection performance for the RGB-D SOD task, which provides new insights into salient object detection. Our code is available at https://github.com/hu-xh/CPNet.
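The cross-modal attention fusion idea described above can be illustrated with a minimal, framework-free sketch in plain Python (toy dimensions, single attention head; the actual CPNet module operates on multi-level Swin Transformer feature maps and is more elaborate — the function and token shapes below are illustrative assumptions, not the paper's implementation). Queries come from the RGB stream, keys/values from the depth stream, and the attended depth context is added back to the RGB features:

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_modal_attention(rgb_tokens, depth_tokens):
    """Fuse depth tokens into each RGB token via scaled dot-product
    cross-attention: queries from RGB, keys/values from depth."""
    d = len(rgb_tokens[0])  # feature dimension
    fused = []
    for q in rgb_tokens:
        # Attention scores of this RGB query against every depth key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in depth_tokens]
        weights = softmax(scores)
        # Weighted sum of depth values (keys double as values here).
        attended = [sum(w * v[j] for w, v in zip(weights, depth_tokens))
                    for j in range(d)]
        # Residual-style fusion: add attended depth context to the RGB token.
        fused.append([qi + ai for qi, ai in zip(q, attended)])
    return fused
```

In a real network this would run on batched tensors with learned query/key/value projections (e.g. via a deep-learning framework's multi-head attention), applied at each encoder level before progressive decoding.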



Acknowledgements

This work was supported by the National Natural Science Foundation of China under Grants 61976042 and 61972068 and by the Liaoning Revitalization Talents Program under Grant XLYC2007023.

Author information

Corresponding author: Fuming Sun.

Additional information

Communicated by Massimiliano Mancini.


About this article


Cite this article

Hu, X., Sun, F., Sun, J. et al. Cross-Modal Fusion and Progressive Decoding Network for RGB-D Salient Object Detection. Int J Comput Vis (2024). https://doi.org/10.1007/s11263-024-02020-y
