Abstract

This article proposes an innovative RGBD saliency model, namely, the attention-guided feature integration network (AFI-Net), which can extract and fuse features and perform saliency inference. Specifically, the model first extracts multimodal and multilevel deep features. Then, a series of attention modules is deployed on the multilevel RGB and depth features, yielding enhanced deep features. Next, the enhanced multimodal deep features are hierarchically fused. Lastly, the RGB and depth boundary features, that is, low-level spatial details, are added to the integrated feature to perform saliency inference. The key points of AFI-Net are the attention-guided feature enhancement and the boundary-aware saliency inference, where the attention module indicates salient objects coarsely, and the boundary information equips the deep features with more spatial details. Therefore, salient objects are well characterized, that is, well highlighted. Comprehensive experiments on five challenging public RGBD datasets clearly exhibit the superiority and effectiveness of the proposed AFI-Net.

1. Introduction

RGBD salient object detection aims to exploit paired RGB and depth images to highlight the visually attractive regions in RGBD scenes. With the rapid progress of RGBD sensing hardware, such as the Microsoft Kinect, modern smartphones, time-of-flight cameras, and motion capture systems [1, 2], depth information can be acquired conveniently and has played a crucial role in many related areas, such as scene understanding [3], semantic segmentation [4], RGBD saliency detection [5], ship detection [6–8], traffic sign detection [9, 10], image thumbnails [11], and hand posture detection [12]. Thus, RGBD saliency detection has also received considerable attention, and significant efforts [5, 13–23] have been devoted to this research area.

However, early RGBD saliency models mainly rely on hand-crafted features, such as contrast computation [5, 13], minimum barrier distance computation [22], and the cellular automata model [20]. The performance of these RGBD saliency models degrades considerably when handling complex RGBD scenes with attributes such as small salient objects, heterogeneous objects, cluttered backgrounds, and low contrast. This phenomenon can be attributed to the weak representation ability of hand-crafted features. Fortunately, significant progress has been achieved in deep learning in the past few years. In particular, convolutional neural networks (CNNs), which provide high-level semantic cues, have been applied to RGBD saliency detection successfully [24–35], such as the three-stream structure in [34], the fluid pyramid integration in [35], and the complementary fusion in [28].

Although the performance of existing deep learning-based RGBD saliency models is encouraging, they still struggle with complex RGBD scenes. Thus, the performance of RGBD saliency detection can still be improved. In addition, some fusion-based RGBD saliency models [5, 14, 15, 19, 27, 28, 32, 33] integrate the two modalities, namely, RGB and depth information, through early fusion, middle fusion, or result fusion. These models often suffer from the cross-modal distribution gap or information loss, leading to performance degradation. Meanwhile, the attention mechanism [36] has been widely adopted in many saliency models [37–39], enhancing saliency detection performance in RGB scenes. Furthermore, boundary information has been applied in salient object detection [40, 41], providing more spatial details for salient objects.

Thus, this work proposes an innovative end-to-end RGBD saliency model, that is, the attention-guided feature integration network (AFI-Net). AFI-Net can extract and fuse features and perform saliency inference. Specifically, our model first extracts multimodal and multilevel deep features, with a pair of RGB and depth images as the input. Then, attention modules, in which the attention mechanism [36] is adopted to generate attention maps, enhance the multilevel RGB and depth features, yielding enhanced deep features. Next, these enhanced features (originating from different modalities and levels) are fused hierarchically. Lastly, the RGB and depth boundary features, that is, low-level spatial details, are combined with the integrated feature to perform saliency inference, yielding a high-quality saliency map. Our model focuses on RGBD saliency detection, whereas the existing boundary-aware saliency models [40, 41] perform saliency detection in RGB images.

More importantly, the key advantages of AFI-Net are the attention module, which indicates the salient objects coarsely, and the boundary information, which provides more spatial details for the deep features. Thus, salient objects in RGBD scenes can be characterized accurately. The main contributions of AFI-Net are summarized as follows:

(1) We propose AFI-Net to highlight the salient objects in RGBD images. AFI-Net consists of three components: feature extraction, feature fusion, and saliency inference.

(2) To sufficiently exploit deep features from different modalities and levels, attention modules are employed to enhance the deep features and guide the hierarchical feature fusion. Furthermore, low-level spatial details are embedded in the saliency inference step to obtain accurate boundaries.

(3) We perform exhaustive experiments on five public RGBD datasets, and our model achieves state-of-the-art performance. The experiments also validate the effectiveness of the proposed AFI-Net.

2. Related Work

The pioneering work [42] defined saliency detection using the center-surround difference mechanism, and succeeding works constructed many saliency models to detect salient objects in natural RGB images. Meanwhile, research on RGBD saliency detection [43, 44] has also been pushed forward significantly in recent decades. Many RGBD saliency models exist, including heuristic models [5, 13–23] and deep learning-based models [24–35, 45], which have achieved encouraging performance. In the following, we review some of the existing RGBD saliency models.

In [14], color contrast, depth contrast, and spatial bias are combined to generate saliency maps. In [5], luminance-, color-, depth-, and texture-based features are used to produce contrast maps, which are combined via weighted summation to compute the final saliency map. In [15], feature maps computed using region grouping, contrast, location, and scale are combined to conduct RGBD salient object detection. In [19], compactness saliency maps computed using color and depth information are aggregated into a saliency map via weighted summation. In [20], color- and depth-based saliency maps are integrated and refined via linear summation and cellular automata. In [24], various feature vectors, such as local contrast, global contrast, and background prior, are generated and fused to infer the saliency value of each superpixel.

With the wide deployment of CNNs, the performance of RGBD saliency models has been significantly advanced. In [25], depth features are combined with appearance features using fully connected layers, generating high-performance saliency maps. In [27, 28], a two-stream architecture with a fusion network is proposed to exploit the complementarity of RGB and depth cues. In [31], two networks, namely, a master network and a subnetwork, are used to obtain deep RGB and depth features, respectively. In [29], RGBD salient object detection is performed using a recurrent CNN. In [35], multilevel features are fused and used to detect salient objects using a fluid pyramid network. In [33], two-stream networks interact to further explore the complementarity of multimodal deep features. In [32], a fusion module is employed to fuse the RGB- and depth-based saliency results.

3. Methodology

First, the proposed AFI-Net is introduced in Section 3.1. Then, the feature extraction is presented in Section 3.2. Subsequently, feature fusion and saliency inference are described in Section 3.3. Finally, in Section 3.4, some implementation details are introduced.

3.1. Overall Architecture

Figure 1 summarizes our RGBD saliency model, AFI-Net, which includes a two-stream encoder (i.e., feature extraction), a single-branch decoder (i.e., feature fusion), and saliency inference. Specifically, the entire network is constructed on VGG-16 [46] with an end-to-end structure. An RGB image and a depth map are used as the input to AFI-Net. Here, the initial depth map is encoded into an HHA map using [47]. Then, the RGB image and the depth map are fed to the two-stream network. Thus, we obtain multilevel initial deep RGB features $\{F_i^{rgb}\}_{i=1}^{5}$ and deep depth features $\{F_i^{d}\}_{i=1}^{5}$, which correspond to the different convolutional blocks in each branch. Subsequently, a series of attention modules is deployed to enhance the initial deep features, yielding enhanced deep RGB features $\{\hat{F}_i^{rgb}\}_{i=1}^{5}$ and enhanced deep depth features $\{\hat{F}_i^{d}\}_{i=1}^{5}$. Next, the fusion branch integrates the enhanced RGB and depth features hierarchically, producing the integrated deep features $\{F_i^{I}\}$. Finally, the saliency inference module generates a saliency map by aggregating the boundary information, that is, the low-level spatial details. In the following sections, we elaborate each component of the proposed AFI-Net.

3.2. Feature Extraction

The feature extraction branch, namely, the encoder, is a two-stream network containing an RGB branch and a depth branch, both constructed on VGG-16 [46]. Specifically, each branch has five convolutional blocks with 13 convolutional layers (kernel size = 3 × 3 and stride = 1) and 4 max pooling layers (pooling size = 2 × 2 and stride = 2). Here, considering the inherent difference between the RGB and depth modalities, the two branches have the same structure but different weights. Following this pipeline, we obtain the initial multimodal and multilevel features, including the deep RGB features $\{F_i^{rgb}\}_{i=1}^{5}$ and the deep depth features $\{F_i^{d}\}_{i=1}^{5}$, as shown in Figure 1.
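
To make the encoder concrete, the following PyTorch sketch shows one way to realize such a two-stream VGG-16 backbone (the paper itself is implemented in Caffe, so this is only an illustrative re-implementation; the block split indices and class names are our own assumptions):

```python
import torch
import torch.nn as nn
import torchvision

class VGG16Branch(nn.Module):
    """One encoder stream (RGB or HHA-encoded depth) based on VGG-16.

    The 13 conv layers (3x3, stride 1) and 4 max pooling layers (2x2, stride 2)
    are split into the five standard VGG-16 blocks so that the block-wise
    features F_1..F_5 can be tapped for attention and fusion.
    """

    # indices that split torchvision's vgg16().features into conv1..conv5
    _SPLITS = [(0, 4), (4, 9), (9, 16), (16, 23), (23, 30)]

    def __init__(self):
        super().__init__()
        features = torchvision.models.vgg16().features  # pretrained weights would be loaded here (omitted)
        self.blocks = nn.ModuleList(
            [nn.Sequential(*list(features.children())[s:e]) for s, e in self._SPLITS]
        )

    def forward(self, x):
        feats = []
        for block in self.blocks:
            x = block(x)
            feats.append(x)          # F_1 ... F_5
        return feats

class TwoStreamEncoder(nn.Module):
    """RGB and depth branches with identical structure but separate weights."""

    def __init__(self):
        super().__init__()
        self.rgb_branch = VGG16Branch()
        self.depth_branch = VGG16Branch()

    def forward(self, rgb, hha):
        return self.rgb_branch(rgb), self.depth_branch(hha)

# usage: the RGB image and the 3-channel HHA map share the same spatial size
encoder = TwoStreamEncoder()
rgb_feats, depth_feats = encoder(torch.randn(1, 3, 288, 288), torch.randn(1, 3, 288, 288))
```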

On the basis of the multimodal features $\{F_i^{rgb}\}$ and $\{F_i^{d}\}$, we first deploy the attention module, as shown in Figure 2, to further enhance the initial deep features. Formally, we denote each initial deep RGB feature or deep depth feature as $F$ for convenience. According to Figure 2(b), the attention feature $A$ is formulated as follows:

$$A = \operatorname{Conv}(F),$$

where $\operatorname{Conv}(\cdot)$ denotes a convolutional layer. Then, we compute the attention weight at each spatial location using softmax, as shown in Figure 2(b):

$$\alpha_{(x,y)} = \frac{\exp\left(A_{(x,y)}\right)}{\sum_{x=1}^{W}\sum_{y=1}^{H}\exp\left(A_{(x,y)}\right)},$$

where $(x, y)$ denotes the spatial coordinates of the attention feature $A$, and the width and height of $A$ are denoted as $W$ and $H$. Notably, $\sum_{x=1}^{W}\sum_{y=1}^{H}\alpha_{(x,y)} = 1$.

After obtaining the attention map $\alpha$, the initial deep feature $F$ is selected as follows:

$$\hat{F} = \alpha \odot F,$$

where $\odot$ is the Hadamard product applied along the channel direction and $\hat{F}$ is the enhanced deep feature. Thus, we generate the enhanced deep RGB features $\{\hat{F}_i^{rgb}\}_{i=1}^{5}$ and the enhanced deep depth features $\{\hat{F}_i^{d}\}_{i=1}^{5}$.
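
As an illustration, a minimal PyTorch sketch of this attention module is given below, assuming the attention feature is produced by a single-channel convolution, normalized by a softmax over all spatial locations, and broadcast over the channels of the input feature; the kernel size is an illustrative choice:

```python
import torch
import torch.nn as nn

class AttentionModule(nn.Module):
    """Spatial attention: A = Conv(F), alpha = softmax over all locations,
    enhanced feature = alpha (Hadamard) F, broadcast along the channel direction."""

    def __init__(self, in_channels, kernel_size=3):
        super().__init__()
        # single-channel attention feature A; kernel size is an illustrative choice
        self.conv = nn.Conv2d(in_channels, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, feat):                       # feat: (B, C, H, W)
        b, _, h, w = feat.shape
        attn = self.conv(feat)                     # (B, 1, H, W)
        # softmax over the H*W spatial locations, so the weights sum to 1
        alpha = torch.softmax(attn.view(b, 1, -1), dim=-1).view(b, 1, h, w)
        return alpha * feat                        # Hadamard product, broadcast over channels

# usage on one level of RGB features
f_rgb = torch.randn(2, 256, 36, 36)
enhanced = AttentionModule(256)(f_rgb)
print(enhanced.shape)   # torch.Size([2, 256, 36, 36])
```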

3.3. Feature Fusion and Saliency Inference

To integrate the enhanced RGB features $\{\hat{F}_i^{rgb}\}$ and the enhanced depth features $\{\hat{F}_i^{d}\}$, the fusion branch, that is, the decoder, fuses the multimodal and multilevel deep features hierarchically, as shown in Figure 1. Specifically, the hierarchical integration operation is performed as follows:

$$F_i^{I} = H\left(\left[\hat{F}_i^{rgb}, \hat{F}_i^{d}, F_{i+1}^{I}\right]\right),$$

where $H(\cdot)$ denotes the fusion operation and contains one convolutional layer and one upsampling layer, $[\cdot]$ denotes channel-wise concatenation, and $F_i^{I}$ is the integrated deep feature.
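
A minimal sketch of one fusion step is given below, under the assumption that the coarser integrated feature is concatenated with the two enhanced features of the same level before the convolution and upsampling; channel counts and the upsampling factor are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionBlock(nn.Module):
    """One step of the hierarchical decoder:
    F_i^I = H([F_i^rgb_hat, F_i^d_hat, F_{i+1}^I]), with H = one conv + one upsampling."""

    def __init__(self, in_channels, out_channels, upsample=2):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, 3, padding=1)
        self.upsample = upsample

    def forward(self, rgb_feat, depth_feat, coarser=None):
        feats = [rgb_feat, depth_feat]
        if coarser is not None:                   # integrated feature from level i+1
            feats.append(coarser)
        x = self.conv(torch.cat(feats, dim=1))    # channel-wise concatenation [.] + conv
        # upsample so the result matches the spatial size of the next finer level
        return F.interpolate(x, scale_factor=self.upsample, mode='bilinear',
                             align_corners=False)

# usage at the deepest level (no coarser integrated feature yet)
rgb5, d5 = torch.randn(1, 512, 18, 18), torch.randn(1, 512, 18, 18)
f5_int = FusionBlock(1024, 256)(rgb5, d5)         # -> (1, 256, 36, 36)
```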

According to the above descriptions, we obtain the first integrated deep feature $F_1^{I}$. On the basis of $F_1^{I}$, we aggregate it with the low-level spatial detail features, that is, the boundary information, to obtain a saliency map. Specifically, as shown in Figure 1, the boundary information $B^{rgb}$ and $B^{d}$ is obtained from the bottom layer conv1-2 of the RGB and depth branches, respectively, by using a convolutional layer, that is, a boundary module (the BI box marked in yellow). Subsequently, saliency prediction is performed using two convolutional layers and one softmax layer. Thus, the saliency inference is written as follows:

$$S = \operatorname{SP}\left(\left[F_1^{I}, B^{rgb}, B^{d}\right]\right),$$

where $S$ is the RGBD saliency map, $[\cdot]$ denotes the channel-wise concatenation operation, and $\operatorname{SP}(\cdot)$ refers to the two convolutional layers and the one softmax layer.
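
The boundary module and the prediction head could be sketched as follows; the channel sizes, kernel sizes, and the two-channel softmax formulation are assumptions made for illustration:

```python
import torch
import torch.nn as nn

class BoundaryModule(nn.Module):
    """BI box in Figure 1: one convolutional layer applied to the conv1-2
    features of a branch to extract low-level boundary/spatial details."""

    def __init__(self, in_channels=64, out_channels=32):   # channel sizes are assumptions
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, 3, padding=1)

    def forward(self, conv1_2_feat):
        return self.conv(conv1_2_feat)

class SaliencyHead(nn.Module):
    """Saliency inference SP: concatenate the integrated feature F_1^I with the
    RGB and depth boundary features, then apply two conv layers and a softmax."""

    def __init__(self, in_channels, mid_channels=64):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, mid_channels, 3, padding=1)
        self.conv2 = nn.Conv2d(mid_channels, 2, 3, padding=1)  # background / salient
    def forward(self, f1_int, b_rgb, b_depth):
        x = torch.cat([f1_int, b_rgb, b_depth], dim=1)      # channel-wise concatenation
        prob = torch.softmax(self.conv2(self.conv1(x)), dim=1)
        return prob[:, 1:2]                                  # saliency map S

# usage with illustrative channel sizes
b_rgb = BoundaryModule()(torch.randn(1, 64, 288, 288))       # conv1-2 of the RGB branch
b_d = BoundaryModule()(torch.randn(1, 64, 288, 288))         # conv1-2 of the depth branch
head = SaliencyHead(in_channels=128 + 32 + 32)
s = head(torch.randn(1, 128, 288, 288), b_rgb, b_d)          # (1, 1, 288, 288)
```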

3.4. Implementation Details

AFI-Net includes feature extraction, feature fusion, and saliency inference. Concretely, $\{(X_n, D_n, G_n)\}_{n=1}^{N}$ is the training dataset, where $X_n$, $D_n$, and $G_n$ refer to the RGB image, the depth map, and the ground truth with $W \times H$ pixels, respectively. Here, the subscript $n$ is dropped, and $(X, D, G)$ corresponds to each RGB image and depth map pair. Thus, the total loss can be written as follows:

$$L(\mathbf{W}, b) = -(1-\beta)\sum_{i \in Y_{+}} \log P\left(y_i = 1 \mid X, D; \mathbf{W}, b\right) - \beta \sum_{i \in Y_{-}} \log P\left(y_i = 0 \mid X, D; \mathbf{W}, b\right),$$

where the kernel weights and biases of the convolutional layers are denoted as $\mathbf{W}$ and $b$, respectively; $Y_{+}$ and $Y_{-}$ denote the salient object and background pixels, respectively; and $\beta$ is the ratio of salient object pixels in $G$, that is, $\beta = |Y_{+}|/|Y|$. Furthermore, $P(y_i = 1 \mid X, D; \mathbf{W}, b)$ is the saliency value of each pixel.
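
The loss above is a class-balanced cross-entropy; a possible implementation is sketched below, where the exact placement of the balancing weights $\beta$ and $1-\beta$ follows the common convention and is our assumption rather than a statement of the paper's precise form:

```python
import torch

def balanced_bce_loss(pred, gt, eps=1e-6):
    """Class-balanced cross-entropy for saliency training.

    pred: predicted saliency probabilities in [0, 1], shape (B, 1, H, W)
    gt:   binary ground truth of the same shape
    beta: per-image ratio of salient pixels, |Y+| / |Y|, as in the text.
    """
    beta = gt.mean(dim=(1, 2, 3), keepdim=True)            # |Y+| / |Y|
    pos_term = (1 - beta) * gt * torch.log(pred + eps)      # salient pixels
    neg_term = beta * (1 - gt) * torch.log(1 - pred + eps)  # background pixels
    return -(pos_term + neg_term).sum(dim=(1, 2, 3)).mean()

# usage
pred = torch.rand(2, 1, 288, 288)
gt = (torch.rand(2, 1, 288, 288) > 0.8).float()
print(balanced_bce_loss(pred, gt))
```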

AFI-Net is implemented using the Caffe toolbox [48]. During the training phase, the momentum, minibatch size, and weight decay of the SGD algorithm are set to 0.9, 32, and 0.0001, respectively, and the learning rate is divided by 10 at fixed intervals during training. The VGG-16 model is used to initialize the weights of the RGB and depth branches, and the fusion branch is initialized using the "msra" method [49]. In addition, the training data used by CPFP [35] are also employed to train our model, containing 1400 pairs of RGB and depth images from NJU2K [16] and 650 pairs from NLPR [15]. Data augmentation operations, namely, rotation and mirroring, are also adopted. Finally, the number of training samples is 10,250. After the training phase, we obtain the final model, whose size is 131.2 MB. During the test phase, the average processing time per 288 × 288 image is 0.2512 s.
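
For reference, an equivalent training configuration could be set up in PyTorch as follows; the base learning rate and the decay interval are not given in the text above, so the values below are placeholders only:

```python
import torch
import torch.nn as nn

# base_lr and decay_every are NOT specified in the excerpt above; placeholders only.
base_lr = 1e-3          # placeholder
decay_every = 10000     # placeholder (iterations between /10 steps)

model = nn.Conv2d(3, 1, 3, padding=1)            # stand-in for the full AFI-Net
optimizer = torch.optim.SGD(model.parameters(), lr=base_lr,
                            momentum=0.9, weight_decay=0.0001)
# divide the learning rate by 10 at fixed intervals (scheduler stepped per iteration)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=decay_every, gamma=0.1)

# Note: rotation- and mirroring-based augmentation is applied identically to the
# RGB image, the depth map, and the ground truth (offline, not shown here).
```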

4. Experiments

The public RGBD datasets and the comprehensive evaluation metrics are described in Section 4.1. In Section 4.2, exhaustive quantitative and qualitative comparisons are performed successively. Lastly, the ablation analysis is presented in Section 4.3.

4.1. Experimental Setup

To validate the proposed AFI-Net, we perform comprehensive experiments on five challenging RGBD datasets, namely, NJU2K [16], NLPR [15], STEREO [13], LFSD [50], and DES [14]. NJU2K includes 2003 samples collected from the Internet, daily life scenes, and so on; 1400 samples are employed for training and 485 samples for testing (denoted as "NJU2K-TE"). NLPR was constructed with a Microsoft Kinect and consists of 1000 samples, some of which contain more than one salient object. For training AFI-Net, 650 samples are selected from NLPR to construct the training set, and 300 samples are used to build the testing set (denoted as "NLPR-TE"). STEREO has 1000 samples, all of which are used for testing. LFSD and DES consist of 100 and 135 samples, respectively, and both are used entirely for testing. All datasets are equipped with pixelwise annotations. To compare the RGBD saliency models quantitatively, the max F-measure (max F), S-measure (S) [51], mean absolute error (MAE), and max E-measure (max E) [52] are utilized in this paper.

The S-measure considers the region-aware value $S_r$ and the object-aware value $S_o$ simultaneously, measuring the structural similarity between the ground truth and the saliency map. Referring to [51], the formulation is defined as follows:

$$S = \alpha \times S_o + (1 - \alpha) \times S_r,$$

where $\alpha$ is a balance parameter (here, we set it to 0.5).

The F-measure is the weighted harmonic mean of precision and recall and is formulated as follows:

$$F_{\beta} = \frac{\left(1 + \beta^{2}\right) \times \mathrm{Precision} \times \mathrm{Recall}}{\beta^{2} \times \mathrm{Precision} + \mathrm{Recall}},$$

where $\beta^{2}$ is set to 0.3. The max F-measure is obtained by varying the binarization threshold over [0, 255].
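
For clarity, the max F-measure can be computed by sweeping the binarization threshold, as in the short sketch below (an illustrative implementation, not the authors' evaluation code):

```python
import numpy as np

def max_f_measure(sal_map, gt, beta_sq=0.3):
    """Sweep thresholds 0..255 and return the maximum F-measure.

    sal_map: saliency map scaled to [0, 255]; gt: binary ground truth {0, 1}.
    """
    best_f = 0.0
    gt = gt.astype(bool)
    for t in range(256):
        pred = sal_map >= t
        tp = np.logical_and(pred, gt).sum()
        precision = tp / (pred.sum() + 1e-8)
        recall = tp / (gt.sum() + 1e-8)
        f = (1 + beta_sq) * precision * recall / (beta_sq * precision + recall + 1e-8)
        best_f = max(best_f, f)
    return best_f
```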

The E-measure denotes the enhanced-alignment measure, which considers the local details and the global information simultaneously. Referring to [52], the E-measure can be written as follows:

$$E = \frac{1}{W \times H}\sum_{x=1}^{W}\sum_{y=1}^{H} f\left(\xi(x, y)\right),$$

where $f(\cdot)$ denotes the convex function, $\xi$ is the alignment matrix, computed using the Hadamard product $\circ$, and $W$ and $H$ are the width and height of the saliency map.

MAE measures the average pixel-wise difference between the ground truth $G$ and the saliency map $S$:

$$\mathrm{MAE} = \frac{1}{W \times H}\sum_{x=1}^{W}\sum_{y=1}^{H}\left|S(x, y) - G(x, y)\right|,$$

where the obtained saliency maps are scaled to [0, 1], $W$ is the width of the saliency map, and $H$ denotes the height.
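
MAE is straightforward to compute; a minimal sketch, assuming both maps are already scaled to [0, 1], is:

```python
import numpy as np

def mae(sal_map, gt):
    """Mean absolute error; both maps are assumed to be scaled to [0, 1]."""
    return np.abs(sal_map.astype(np.float64) - gt.astype(np.float64)).mean()

# usage
s = np.random.rand(288, 288)
g = (np.random.rand(288, 288) > 0.7).astype(np.float64)
print(mae(s, g))
```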

4.2. Comparison with the State-of-the-Art Models

A comparison is first made on NJU2K-TE, NLPR-TE, STEREO, LFSD, and DES between AFI-Net and nine state-of-the-art RGBD saliency models, namely, CDCP [23], ACSD [16], LBE [18], DCMC [19], SE [20], MDSF [21], DF [24], AFNet [32], and CTMF [27]. The first six are traditional heuristic RGBD saliency models, and the last three are CNN-based models. Here, the saliency maps of the other models are provided by the authors or obtained by running their source codes. Next, the quantitative and qualitative comparisons are presented. Specifically, Table 1 shows the quantitative comparison results on the five RGBD datasets. AFI-Net outperforms the nine state-of-the-art RGBD saliency models in terms of all the evaluation metrics.

Figure 3 presents qualitative comparisons on some complex scenes, where AFI-Net achieves superior performance over the nine state-of-the-art models. Specifically, the first example presents a box on the ground, where the box is indistinct in the depth map. The other models, shown in Figures 3(e)–3(m), falsely highlight some background regions and cannot pop out the box completely. In the second example, the vase is a heterogeneous object, and its bottom is unclear in the depth map. Our model (Figure 3(d)) performs better than the other models, although the top part is not popped out completely. Like the first example, the third and fourth examples not only show unclear depth maps but also present cluttered backgrounds. Nevertheless, our model still highlights the bird and the cow completely and accurately. The fifth and sixth examples contain multiple salient objects, and AFI-Net exhibits the best performance, as shown in Figure 3(d). In the seventh example, the man lies at the image boundary, and the corresponding depth is also unclear. Under this condition, our model still performs better than the others, although some background regions are also highlighted mistakenly. In the next two examples, the salient objects occupy most of the image, and our model and AFNet achieve comparable performance, as shown in Figures 3(d) and 3(m). The remaining examples also show cluttered backgrounds; AFI-Net still exhibits the best performance, as shown in Figure 3(d).

Overall, the extensive comparisons between AFI-Net and the nine state-of-the-art models demonstrate the effectiveness of the proposed AFI-Net.

4.3. Ablation Studies

Here, an intensive study of the key components in AFI-Net is presented quantitatively and qualitatively. Specifically, the crucial components of AFI-Net are the attention module (AM) and the boundary module (BI), as shown in Figure 1. Therefore, we design two variants of our model, namely, AFI-Net without the attention module (denoted as "w/oA") and AFI-Net without the boundary module (denoted as "w/oB"). Correspondingly, we perform comprehensive comparisons between our model and the two variants.

First, the quantitative comparison results are presented in Table 2. Clearly, AFI-Net consistently outperforms the two variants, "w/oA" and "w/oB," on the two RGBD datasets. Second, the qualitative comparison results are presented in Figure 4. AFI-Net (Figure 4(d)) performs better than the two variants (Figures 4(e) and 4(f)): its results have well-defined boundaries and highlight the salient objects completely, whereas the two variants falsely highlight some background regions and cannot detect the salient objects completely.

Overall, the attention and boundary modules play an important role in AFI-Net, enhancing the deep features and equipping them with more spatial details. Meanwhile, the results clearly validate the rationality and effectiveness of the two components in the proposed AFI-Net.

4.4. Failure Case Analysis

The experiments demonstrate the effectiveness and rationality of the proposed AFI-Net. In particular, Figure 3 shows the qualitative comparison between the proposed AFI-Net and the state-of-the-art saliency models, highlighting the effectiveness of the proposed model. However, for some challenging images, our model cannot detect salient objects well, as shown in Figure 5(d). Specifically, in Figure 5, the first example shows a traffic sign, and all models fail to pop out the salient object. The second example is a pendant, which is highlighted incompletely by most of the models. In the third example, which presents a pavilion, all models falsely pop out background regions. In the last two examples, the car and the potted plant cannot be detected accurately and completely. Although our model fails to highlight the salient objects of these examples fully, it can still pop out the main parts of the salient objects (Figure 5(d)) better than the other models (Figures 5(e)–5(m)) because it contains an effective attention module that covers the main parts of the salient objects. Generally, RGBD saliency detection still faces many difficulties, and research on complex scene images deserves further attention.

5. Conclusion

This work proposes an innovative RGBD saliency model, AFI-Net, which performs feature extraction, feature fusion, and saliency inference. Specifically, the initial multimodal and multilevel features are first enhanced by a series of attention modules, which select the initial deep features and coarsely indicate the locations of salient objects. Then, the hierarchical fusion branch fuses the enhanced deep features, which are further combined with the low-level spatial detail features (i.e., the boundary information) to perform saliency inference. Thus, the generated saliency maps highlight salient objects and preserve sharp boundaries. The experimental results on five public RGBD datasets indicate that the proposed AFI-Net achieves superior performance over nine state-of-the-art models.

Data Availability

Previously reported data were used to support this study and are available at https://doi.org/10.1109/cvpr.2012.6247708; https://doi.org/10.1145/2632856.2632866; https://doi.org/10.1007/978-3-319-10578-9_7; https://doi.org/10.1109/icip.2014.7025222; and https://doi.org/10.1109/cvpr.2014.359. These prior studies (and datasets) are cited at relevant places within the text as references [13, 14, 15, 16, 50].

Conflicts of Interest

The authors declare no conflicts of interest regarding the publication of this article.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China (nos. 51975347 and 51907117) and the Key Science and Technology Support Project of the Shanghai Science and Technology Commission (no. 18030501300).