A deep multimodal feature learning network for RGB-D salient object detection☆,☆☆
Introduction
Visual saliency detection aims to determine the parts of an image to which human attention is drawn [1]. Because depth images provide additional cues beyond color images, using depth information to detect salient objects in RGB-D images has gradually attracted interest. Many efforts have been devoted to this problem, and the resulting research has been successfully applied to a broad range of practical applications [2], [3], [4].
Many methods have been developed to advance artificial intelligence [5], [6], [7], [8]. Among them, convolutional neural networks (CNNs) are widely used for feature extraction [9], [10]. The low-level features used in conventional saliency detection models cannot capture high-level object cues, so CNNs are widely adopted to locate salient objects by extracting high-level information [11], [12], [13], [14], [15]. CNN-based saliency detection models typically learn representations of the concatenated features of the two modalities of RGB-D images. These models generally identify a pixel as salient according to discriminative features obtained by minimizing the cross-entropy loss. With the concatenation operation, however, such models have limited capability to discover common and complementary features; with the cross-entropy loss, they have limited capability to assign homogeneous labels to pixels [16]. Therefore, it is necessary both to find an effective way to extract features and to describe the semantic attributes of salient and non-salient regions.
To extract such effective features, several models have been developed. Since learning the complementary features of each modality separately is laborious, in other computer vision tasks (for example, classification) features are commonly decomposed and distributed to modality-specific models [17], [18]. However, this form of feature decomposition has not yet been explored for saliency detection.
To describe the semantic attributes of regions, metric learning has been introduced in some recent works. Under the metric learning paradigm [19], [20], samples separated by short distances are grouped together, whereas samples separated by long distances are pushed apart. This idea has also been introduced into saliency detection models for 2D images [21], in which pixels whose attributes are similar to those of salient or non-salient regions are labeled salient or non-salient, respectively. However, whether an object is salient is determined, to a certain extent, by the depth at which it is located. In other words, a saliency detection model for RGB-D images should consider feature distances in both the depth and color modalities when detecting salient regions. Hence, learning a metric distance over multimodal feature representations is important for RGB-D saliency detection.
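The nearest-prototype intuition behind metric-based labeling can be sketched as follows. This is a minimal illustration, not the paper's actual classifier: the prototype vectors and the Euclidean distance are assumptions chosen for clarity.

```python
import numpy as np

def metric_label(feat, proto_salient, proto_background):
    """Label a pixel feature by its nearest class prototype.

    A pixel is judged salient (1) when its feature lies closer, in
    Euclidean distance, to the salient prototype than to the
    non-salient one; otherwise it is judged non-salient (0).
    """
    d_sal = np.linalg.norm(feat - proto_salient)
    d_bg = np.linalg.norm(feat - proto_background)
    return 1 if d_sal < d_bg else 0
```

In a learned metric space, the extracted RGB-D features would play the role of `feat`, and the two class prototypes would be estimated from labeled salient and non-salient pixels.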
In this paper, we propose a deep multimodal feature learning (DMFL) network for RGB-D salient object detection in an implicit metric space. Inspired by [22], we extract both modality-specific features and sharable features across the two modalities as complementary information. Unlike traditional methods that map pixels directly to labels, we map the extracted features into a metric space to improve model generalization. A new objective function is designed and optimized to extract both the modality-specific and the sharable features by imposing a metric loss on top of the cross-entropy loss. The metric-loss regularization term ensures that the proposed model can judge whether a pixel is salient by a metric distance. The experiments demonstrate that the resulting discriminative features improve RGB-D saliency detection performance. In a nutshell, the contributions of this paper are summarized as follows:
- We propose a deep multimodal feature learning network for RGB-D salient object detection, which considers the attributes of pixels and regions within the image itself;
- We make an early effort to introduce metric learning for representing the implicit attributes of regions using color and depth cues;
- Intra- and inter-modality features are combined to explore complementary information among the saliency cues of the multiple modalities;
- Experimental results on public datasets demonstrate the effectiveness of the proposed model.
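The joint objective described above, cross-entropy plus a metric-loss regularizer, can be sketched numerically. The paper does not state the exact form of its metric term, so the contrastive pairwise loss, the margin, and the weight `lam` below are illustrative assumptions, not the authors' definition.

```python
import numpy as np

def cross_entropy(p, y, eps=1e-8):
    """Binary cross-entropy between predicted saliency p and labels y."""
    return float(-(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)).mean())

def contrastive_metric_loss(feats, labels, margin=1.0):
    """Pull same-label feature pairs together, push different-label
    pairs at least `margin` apart (a stand-in for the paper's metric loss)."""
    loss, n = 0.0, 0
    for i in range(len(feats)):
        for j in range(i + 1, len(feats)):
            d = np.linalg.norm(feats[i] - feats[j])
            if labels[i] == labels[j]:
                loss += d ** 2
            else:
                loss += max(0.0, margin - d) ** 2
            n += 1
    return loss / max(n, 1)

def joint_loss(p, y, feats, labels, lam=0.1):
    """Cross-entropy plus a weighted metric-loss regularization term."""
    return cross_entropy(p, y) + lam * contrastive_metric_loss(feats, labels)
```

Minimizing the second term shapes the feature space so that the distance-based saliency judgment described earlier becomes reliable, while the first term keeps the per-pixel predictions accurate.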
The rest of this paper is organized as follows. In Section 2, a review of the related saliency models is given. In Section 3, the proposed model is described in detail. In Section 4, the experimental results are presented and analyzed. Conclusions are provided in Section 5.
Related works
The existing methods for RGB-D saliency detection are roughly classified into two categories: methods based on handcrafted cues and the learning-based methods. We briefly review these two kinds of saliency detection models for RGB-D images. As the feature learning in the proposed model involves metric learning, we also present a short review of the related works involving metric learning of features.
The proposed model
Fig. 1 illustrates the proposed model, which explores a metric space over the semantic content of the image itself to boost saliency detection for RGB-D images. As shown in Fig. 1, the proposed DMFL consists of three main steps: feature extraction, feature fusion, and joint feature learning. In the first step, given an RGB image and a depth image, CNN features are extracted from each modality to represent its single-modal properties. In the second step, the features from the color and depth modalities are fused.
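The three-step pipeline can be sketched schematically. The linear-plus-ReLU "extractor", the averaging fusion, and all shapes below are placeholders standing in for the actual CNN branches in Fig. 1; they show only how the data flows through the stages.

```python
import numpy as np

def extract_features(x, W):
    # Stand-in for a CNN branch: one linear map followed by ReLU.
    return np.maximum(0.0, x @ W)

def fuse(f_rgb, f_depth):
    # Modality-specific parts plus a simple sharable component
    # (here, the elementwise mean of the two modalities).
    shared = 0.5 * (f_rgb + f_depth)
    return np.concatenate([f_rgb, f_depth, shared], axis=-1)

def dmfl_forward(rgb, depth, W_rgb, W_depth, W_out):
    f_rgb = extract_features(rgb, W_rgb)      # step 1: per-modality features
    f_depth = extract_features(depth, W_depth)
    fused = fuse(f_rgb, f_depth)              # step 2: feature fusion
    logits = fused @ W_out                    # step 3: joint prediction
    return 1.0 / (1.0 + np.exp(-logits))     # per-pixel saliency probability
```

In the real model, `W_rgb` and `W_depth` correspond to the two CNN branches, and the joint feature learning step would be trained with the combined cross-entropy and metric losses.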
Dataset description
The proposed DMFL model is evaluated on two public RGB-D datasets: NJUDS2000 [11] and SSB [41]. These datasets contain 2000 and 1000 pairs of stereoscopic images, respectively, along with the corresponding depth maps and ground-truth maps. The stereo images are collected from the Internet and from 3D movies. The depth maps are computed with an optical flow method, and the ground-truth maps are labeled manually. An image from the SSB dataset mainly contains one salient object, whereas an image from NJUDS generally
Conclusions
Great progress has been made in RGB-D saliency detection using convolutional neural networks (CNNs). However, two issues still need to be considered. First, with the cross-entropy loss, the data available for training an RGB-D saliency detection model are not large enough to make the model robust to diverse scenarios. Second, complementary information between the different modalities is not exploited effectively by simply concatenating the multimodal features. To address these problems, in this paper we made the early
CRediT authorship contribution statement
Fangfang Liang: Conceptualization, Methodology, Writing - original draft. Lijuan Duan: Supervision, Investigation. Wei Ma: Visualization. Yuanhua Qiao: Writing - review & editing. Jun Miao: Validation.
Declaration of Competing Interest
No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.compeleceng.2021.107006.
Acknowledgments
This research was partially supported by Projects of the Beijing Municipal Education Commission, China [grant number KZ201910005008], National Natural Science Foundation of China [grant numbers 61672070 and 61771026], the Beijing Municipal Education Commission, China [grant number KM201911232003], the Research Fund of Beijing Innovation Center for Future Chips, China [grant number KYJ2018004] and the Beijing Natural Science Foundation, China [grant number 4202025].
References (44)
- et al. Saliency-based stereoscopic image retargeting. Inform Sci (2016)
- et al. Depth-aware salient object detection using anisotropic center-surround difference. Signal Process., Image Commun. (2015)
- et al. Stereoscopic saliency model using contrast and depth-guided-background prior. Neurocomputing (2018)
- et al. Salient object detection for RGB-D image via saliency evolution
- et al. Interaction between bottom-up saliency and top-down control: How saliency maps are created in the human brain. Cerebral Cortex (2012)
- et al. Stereoscopic thumbnail creation via efficient stereo saliency detection. IEEE Trans Vis Comput Graphics (2017)
- Khan S, Channappayya SS. Estimating depth-salient edges and its application to stereoscopic image quality assessment....
- et al. Brain intelligence: go beyond artificial intelligence. Mob Netw Appl (2018)
- et al. Motor anomaly detection for unmanned aerial vehicles using reinforcement learning. IEEE Internet Things J. (2017)
- et al. CONet: A cognitive ocean network (2019)
- Imagenet classification with deep convolutional neural networks
- CNN-based color image encryption algorithm using DNA sequence operations
- Chinese medical question answer matching with stack-CNN
- Depth-aware salient object detection and segmentation via multiscale discriminative saliency fusion and bootstrap learning. IEEE Trans Image Process
- RGBD salient object detection via deep fusion. IEEE Trans Image Process
- RGB-"D" saliency detection with pseudo depth. IEEE Trans Image Process
- Learning to promote saliency detectors
- A unified metric learning-based framework for co-saliency detection. IEEE Trans Circuits Syst Video Technol
- Ternary adversarial networks with self-supervision for zero-shot cross-modal retrieval. IEEE Trans Cybern
- ☆ This paper was recommended for publication by associate editor Manu Malek.
- ☆☆ This paper is for the CAEE special section VSI-aicv4. Reviews processed and recommended for publication to the Editor-in-Chief by Guest Editor Dr. Yujie Li.