A deep multimodal feature learning network for RGB-D salient object detection

https://doi.org/10.1016/j.compeleceng.2021.107006

Abstract

In this paper, we propose a deep multimodal feature learning (DMFL) network for RGB-D salient object detection. Color and depth features are first extracted, from low level to high level, using CNNs. The features at the high layers are then shared and concatenated to construct a joint feature representation of the two modalities. The fused features are embedded into a high-dimensional metric space to express the salient and non-salient parts. A new objective function, consisting of a cross-entropy loss and a metric loss, is also proposed to optimize the model. Discriminative features at both the pixel and attribute levels are learned for semantic grouping to detect the salient objects. Experimental results show that the proposed model achieves promising performance, with an improvement of about 1% to 2% over conventional methods.

Introduction

Visual saliency aims to determine the parts of an image to which human attention is attracted [1]. As depth images provide additional cues compared with color images, using depth information to detect salient objects in RGB-D images has gradually aroused interest. Many efforts have been made on this issue, and the research has been successfully applied to a broad range of practical tasks [2], [3], [4].

Many methods have been developed to promote the advancement of artificial intelligence [5], [6], [7], [8]. Among them, CNNs are popularly used for feature extraction [9], [10]. The low-level features used in conventional saliency detection models cannot reflect high-level cues of objects, so CNNs are widely adopted to locate salient objects by extracting high-level information [11], [12], [13], [14], [15]. CNN-based saliency detection models tend to learn a representation of the concatenated features of the two modalities of RGB-D images. These models generally identify a pixel as salient according to discriminative features obtained by minimizing the cross-entropy loss. With the concatenation operation, these models have limited capability to discover common and complementary features; with the cross-entropy loss, they have limited capability to assign homogeneous labels to pixels [16]. Therefore, on the one hand, it is necessary to find an effective way of extracting features; on the other hand, it is necessary to describe the semantic attributes of salient and non-salient parts.
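To make this critique concrete, the snippet below is a minimal, hypothetical sketch (not the authors' model) of the conventional scheme described above: RGB and depth features are simply concatenated and each pixel is classified with a cross-entropy loss alone. All module and function names are illustrative assumptions.

```python
# Illustrative baseline only: concatenation + per-pixel cross-entropy,
# i.e. the conventional scheme whose limitations are discussed above.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConcatSaliencyBaseline(nn.Module):
    def __init__(self, feat_channels=64):
        super().__init__()
        # two lightweight encoders stand in for the RGB and depth CNN streams
        self.rgb_encoder = nn.Sequential(nn.Conv2d(3, feat_channels, 3, padding=1), nn.ReLU())
        self.depth_encoder = nn.Sequential(nn.Conv2d(1, feat_channels, 3, padding=1), nn.ReLU())
        # a 1x1 convolution predicts a per-pixel saliency logit from the concatenation
        self.classifier = nn.Conv2d(2 * feat_channels, 1, 1)

    def forward(self, rgb, depth):
        fused = torch.cat([self.rgb_encoder(rgb), self.depth_encoder(depth)], dim=1)
        return self.classifier(fused)  # (B, 1, H, W) saliency logits

def pixelwise_ce_loss(logits, gt_mask):
    # gt_mask: (B, 1, H, W) binary ground-truth saliency map
    return F.binary_cross_entropy_with_logits(logits, gt_mask.float())
```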

To extract such effective features, several models have been developed. As separately learning the complementary features of each modality is laborious, in other computer vision tasks (for example, classification) features are commonly decomposed and distributed to modality-specific models [17], [18]. However, this way of feature decomposition has not been developed for the task of saliency detection.

To describe the semantic attributes of regions, metric learning has been introduced in some recent works. Under metric learning [19], [20], samples separated by short distances are grouped together, whereas samples separated by long distances are pushed apart. The idea has also been introduced into saliency detection models for 2D images [21], in which pixels sharing the attributes of the salient and non-salient regions are assumed to be salient and non-salient, respectively. However, whether an object is salient or not is determined, to a certain extent, by the depth at which it is located. That is to say, a saliency detection model for RGB-D images should consider the feature distances from both depth and color when detecting salient regions. Hence, learning a metric distance over multimodal feature representations is important for RGB-D saliency detection.
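As a concrete illustration of this grouping behavior, the following is a minimal sketch, under our own assumptions rather than the exact formulation of [19], [20] or [21], of a contrastive-style metric loss over per-pixel embeddings: pairs with the same saliency label are pulled together, pairs with different labels are pushed at least a margin apart. The margin value and function names are placeholders.

```python
# Sketch of a contrastive-style metric loss on per-pixel embeddings (assumed form).
import torch
import torch.nn.functional as F

def pairwise_metric_loss(embeddings, labels, margin=1.0):
    """embeddings: (N, D) per-pixel feature vectors; labels: (N,) 0/1 saliency labels."""
    dist = torch.cdist(embeddings, embeddings)                     # (N, N) Euclidean distances
    same = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()    # 1 if pair shares a label
    pull = same * dist.pow(2)                                      # same label: shrink distance
    push = (1.0 - same) * F.relu(margin - dist).pow(2)             # different label: enforce margin
    return (pull + push).mean()
```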

In this paper, we propose a deep multimodal feature learning (DMFL) network for RGB-D salient object detection in an implicit metric space. Inspired by the work in [22], we extract both the features of the individual modalities and the sharable features across the two modalities as complementary information. Different from traditional methods that map pixels to labels, we map the extracted features into a metric space to improve model generalization. A new objective function is designed and optimized to extract both the modality-specific and the sharable features by imposing a metric loss on top of the cross-entropy loss (a sketch of this combined objective is given after the contribution list below). The metric-loss regularization term ensures that the proposed model can judge whether a pixel is salient or non-salient by a metric distance. The experiments demonstrate that the discriminative features improve the performance of RGB-D saliency detection. In a nutshell, the contributions of this paper are summarized as follows:

  • We propose a deep multimodal feature learning network for RGB-D salient object detection, which considers the attributes of pixels and regions within the image itself;

  • We make an early effort to introduce metric learning for representing the implicit attributes of regions using color and depth cues;

  • Intra- and inter-modality features are combined to explore complementary information for multimodal saliency cues;

  • Experimental results from public datasets demonstrate the effectiveness of the proposed model.
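As referenced above, the following is a hedged sketch of how the combined objective could be assembled: a per-pixel cross-entropy term plus a weighted metric-loss regularization term. The weighting factor and helper names are illustrative assumptions, the metric term reuses the pairwise_metric_loss sketch shown earlier, and the paper's exact formulation may differ.

```python
# Assumed sketch of the combined objective: cross-entropy + weighted metric regularizer.
import torch.nn.functional as F

def dmfl_objective(logits, gt_mask, embeddings, labels, metric_weight=0.1):
    ce = F.binary_cross_entropy_with_logits(logits, gt_mask.float())
    metric = pairwise_metric_loss(embeddings, labels)  # helper from the earlier sketch
    return ce + metric_weight * metric
```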

The rest of this paper is organized as follows. In Section 2, a review of the related saliency models is given. In Section 3, the proposed model is described in detail. In Section 4, the experimental results are presented and analyzed. Conclusions are provided in Section 5.

Section snippets

Related works

Existing methods for RGB-D saliency detection are roughly classified into two categories: methods based on handcrafted cues and learning-based methods. We briefly review these two kinds of saliency detection models for RGB-D images. As the feature learning in the proposed model involves metric learning, we also present a short review of related works on metric learning of features.

The proposed model

Fig. 1 illustrates the proposed model, which explores a metric space over the semantics of the image itself to boost saliency detection for RGB-D images. As shown in Fig. 1, the proposed DMFL consists of three main steps: feature extraction, feature fusion and joint feature learning. In the first step, given RGB and depth images, CNN features are extracted from each modality to represent the single-modal properties. In the second step, features from the color and depth modalities
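For orientation, the sketch below is an illustrative, assumed rendering of the three-step pipeline (feature extraction, feature fusion, joint feature learning). Layer configurations, channel widths and the embedding dimension are placeholders, not the exact design of DMFL or of Fig. 1.

```python
# Illustrative three-step pipeline sketch (assumed structure, not the paper's exact network).
import torch
import torch.nn as nn

class DMFLSketch(nn.Module):
    def __init__(self, feat_channels=64, embed_dim=32):
        super().__init__()
        # Step 1: modality-specific CNN feature extraction (low- to high-level)
        self.rgb_stream = nn.Sequential(
            nn.Conv2d(3, feat_channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_channels, feat_channels, 3, padding=1), nn.ReLU())
        self.depth_stream = nn.Sequential(
            nn.Conv2d(1, feat_channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_channels, feat_channels, 3, padding=1), nn.ReLU())
        # Step 2: fuse the high-level features of the two modalities
        self.fusion = nn.Sequential(
            nn.Conv2d(2 * feat_channels, feat_channels, 3, padding=1), nn.ReLU())
        # Step 3: joint feature learning - embed pixels into a metric space
        # and predict a per-pixel saliency score from the same fused features
        self.embedding_head = nn.Conv2d(feat_channels, embed_dim, 1)
        self.saliency_head = nn.Conv2d(feat_channels, 1, 1)

    def forward(self, rgb, depth):
        fused = self.fusion(torch.cat([self.rgb_stream(rgb), self.depth_stream(depth)], dim=1))
        return self.saliency_head(fused), self.embedding_head(fused)
```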

Dataset description

The proposed DMFL model is evaluated on two public RGB-D datasets: NJUDS2000 [11] and SSB [41]. These two datasets contain 2000 and 1000 pairs of stereoscopic images, respectively, together with the corresponding depth maps and ground-truth maps. The stereo images are collected from the Internet and from 3D movies. The depth maps are computed using optical flow, and the ground-truth maps are labeled manually. An image from the SSB dataset mainly contains one salient object, whereas an image from NJUDS generally

Conclusions

Great progress has been made on RGB-D saliency detection using Convolutional Neural Networks (CNNs). However, two issues still need to be considered. First, with the cross-entropy loss alone, the data available for training RGB-D saliency detection models is not large enough to make the models robust to various scenarios. Second, complementary information between different modalities is not exploited effectively by simply concatenating the multimodal features. To address these problems, in this paper we made an early

CRediT authorship contribution statement

Fangfang Liang: Conceptualization, Methodology, Writing - original draft. Lijuan Duan: Supervision, Investigation. Wei Ma: Visualization. Yuanhua Qiao: Writing - review & editing. Jun Miao: Validation.

Declaration of Competing Interest

No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.compeleceng.2021.107006.

Acknowledgments

This research was partially supported by Projects of the Beijing Municipal Education Commission, China [grant number KZ201910005008], National Natural Science Foundation of China [grant numbers 61672070 and 61771026], the Beijing Municipal Education Commission, China [grant number KM201911232003], the Research Fund of Beijing Innovation Center for Future Chips, China [grant number KYJ2018004] and the Beijing Natural Science Foundation, China [grant number 4202025].

References (44)

  • Krizhevsky A, et al. ImageNet classification with deep convolutional neural networks.

  • Wang J, et al. CNN-based color image encryption algorithm using DNA sequence operations.

  • Zhang Y, et al. Chinese medical question answer matching with stack-CNN.

  • Song H, et al. Depth-aware salient object detection and segmentation via multiscale discriminative saliency fusion and bootstrap learning. IEEE Trans Image Process (2017).

  • Qu L, et al. RGBD salient object detection via deep fusion. IEEE Trans Image Process (2017).

  • Xiao X, et al. RGB-"D" saliency detection with pseudo depth. IEEE Trans Image Process (2019).

  • Zeng Y, et al. Learning to promote saliency detectors (2018).

  • Li Y, Zhang J, Cheng Y, Huang K, Tan T. DF²Net: Discriminative feature learning and fusion network for RGB-D indoor...

  • Zhu H, Weibel J-B, Lu S. Discriminative multi-modal feature fusion for RGBD indoor scene recognition. In: Proceedings...

  • Han J, et al. A unified metric learning-based framework for co-saliency detection. IEEE Trans Circuits Syst Video Technol (2017).

  • Xu X, et al. Ternary adversarial networks with self-supervision for zero-shot cross-modal retrieval. IEEE Trans Cybern (2019).

  • Cai S, Huang J, Zeng D, Ding X, Paisley JW. MEnet: A metric expression network for salient object segmentation. arXiv:...

    This paper was recommended for publication by associate editor Manu Malek.


    This paper is for CAEE special section VSI-aicv4. Reviews processed and recommended for publication to the Editor-in-Chief by Guest Editor Dr. Yujie Li.
