Abstract
Most existing RGB-D salient object detection (SOD) methods pursue higher performance by integrating additional modules, such as feature enhancement and edge generation, yet these modules inevitably introduce feature redundancy and can degrade performance. To this end, we design a cross-modal fusion and progressive decoding network (termed CPNet) for the RGB-D SOD task. The network consists of only three indispensable parts: feature encoding, feature fusion, and feature decoding. Specifically, in the feature encoding part, we adopt a two-stream Swin Transformer encoder to extract multi-level, multi-scale features from RGB images and depth images respectively and to model global information. In the feature fusion part, we design a cross-modal attention fusion module that leverages the attention mechanism to fuse multi-modality, multi-level features. In the feature decoding part, we design a progressive decoder that gradually integrates low-level features and filters out noise to predict salient objects accurately. Extensive experiments on 6 benchmarks demonstrate that our network surpasses 12 state-of-the-art methods on four metrics. We further verify that, under this framework, adding a feature enhancement module or an edge generation module does not improve detection performance on the RGB-D SOD task, which provides new insights into salient object detection. Our code is available at https://github.com/hu-xh/CPNet.
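To make the fusion idea concrete, the sketch below illustrates one common form of cross-modal attention fusion: each modality's feature map is re-weighted by a channel-attention vector computed from the other modality, and the two re-calibrated maps are summed. This is a minimal NumPy sketch of the general mechanism only; the function names and the exact attention design are illustrative assumptions, not the paper's actual module, which is more elaborate.

```python
import numpy as np

def _sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat):
    # feat: (C, H, W); global average pooling gives one value per channel,
    # squashed to (0, 1) to act as a per-channel gate of shape (C, 1, 1).
    pooled = feat.mean(axis=(1, 2))
    return _sigmoid(pooled)[:, None, None]

def cross_modal_fusion(rgb_feat, depth_feat):
    """Re-weight each modality by attention derived from the other,
    then merge the two calibrated maps by element-wise addition."""
    rgb_enhanced = rgb_feat * channel_attention(depth_feat)
    depth_enhanced = depth_feat * channel_attention(rgb_feat)
    return rgb_enhanced + depth_enhanced

# Toy features: 64 channels at 8x8 spatial resolution for each modality.
rgb = np.random.rand(64, 8, 8)
depth = np.random.rand(64, 8, 8)
fused = cross_modal_fusion(rgb, depth)
print(fused.shape)  # (64, 8, 8)
```

In a real two-stream network this operation would run at every encoder level, so that multi-level features from both streams are fused before progressive decoding.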
Acknowledgements
This work was supported by the National Natural Science Foundation of China under Grants 61976042 and 61972068 and by the Liaoning Revitalization Talents Program under Grant XLYC2007023.
Communicated by Massimiliano Mancini.
Cite this article
Hu, X., Sun, F., Sun, J. et al. Cross-Modal Fusion and Progressive Decoding Network for RGB-D Salient Object Detection. Int J Comput Vis (2024). https://doi.org/10.1007/s11263-024-02020-y