Salient object detection via a boundary-guided graph structure

https://doi.org/10.1016/j.jvcir.2021.103048

Highlights

  • We construct an effective coarse saliency map by proposing a novel weighting mechanism to integrate two prior maps.

  • A novel boundary-guided graph is proposed to better explore the intrinsic relevance between superpixels.

  • We propose an iterative propagation mechanism to further refine the coarse saliency map.

  • Experimental results on four datasets demonstrate the superiority of the proposed method over other graph-based methods.

Abstract

Graph-based salient object detection methods have gained increasing attention in recent years. However, existing works fail to effectively separate the salient object from the background in some challenging scenes. Motivated by this observation, we propose an effective salient object detection method based on a novel boundary-guided graph structure. More specifically, the input image is first segmented into a series of superpixels. We then integrate two prior cues to generate a coarse saliency map, proposing a novel weighting mechanism that balances the proportion of the two prior cues according to their performance. Next, we propose a novel boundary-guided graph structure to deeply explore the intrinsic relevance between superpixels. Based on the proposed graph structure, an iterative propagation mechanism is constructed to refine the coarse saliency map. Experimental results on four datasets fully demonstrate the superiority of the proposed method over other state-of-the-art methods.

Introduction

When observing an image, the human visual attention mechanism tends to focus on the most interesting regions rather than the whole image. This phenomenon is usually called image saliency in the computer vision field. Correspondingly, the saliency detection task aims at locating the regions of an image that are most interesting to the human eye. Specifically, given an image, saliency detection attempts to construct an effective model that generates a corresponding saliency map, in which the value of each pixel indicates the likelihood that it belongs to the most interesting region. This capability undoubtedly benefits subsequent computer vision tasks, such as the Internet of Things [1], [46] and visual tracking [48]. Generally, saliency detection research follows two different directions: eye fixation prediction and salient object detection. The former focuses on matching human eye movements, so observing the entire salient object is not necessary. In contrast, the goal of the latter is to detect salient objects completely, ensuring that the entire object is uniformly highlighted. In this work, we focus on the problem of salient object detection. Existing salient object detection algorithms can be categorized into top-down and bottom-up strategies.

Top-down methods are usually task-driven: they learn a salient object detection model from numerous training images, so supervision information is necessary. Deep learning-based methods, the main branch of top-down salient object detection, have achieved great progress in recent years. For example, Wang et al. [2] generate a saliency map by building two complementary CNN frameworks based on local patch features and global contrast-geometric features, respectively. Li et al. [45] construct an effective deep contrast learning network (DCL) that is discriminative enough to capture the feature contrast between the salient object and the background. He et al. [3] propose a CNN framework named Super-CNN, in which the superpixel is regarded as the basic unit of saliency computation. Zeng et al. [29] use a weakly supervised learning framework to train an effective CNN model from multiple sources of information. For saliency cue extraction, Qian et al. [37] propose a novel feature matching network incorporating natural language and image information. In addition, Mohammadi et al. [44] present a novel feature-guided network that effectively integrates low-level and high-level features, and a contextual encoder-decoder network is built to predict the saliency cue of an image in [47]. To reduce computational and memory consumption, Gao et al. [31] build a specially designed cloud-edge distributed network to boost detection performance. Furthermore, Qin et al. [4] obtain low-level color features and high-level deep features from a pre-trained Fully Convolutional Network (FCN) [5] and fuse them to compute the feature contrast between image regions during the saliency propagation process. To achieve a more favorable integration result, Zeng et al. [6] integrate the low-level features and the high-level features extracted from the pre-trained VGG16 net [7] through an unsupervised iterative random walk model. However, training a CNN model is less economical than bottom-up approaches: large amounts of manually annotated training data are hard to obtain, and the training process is usually very time-consuming.

In contrast, bottom-up methods involve neither complicated annotation work nor a training process. Generally, they extract saliency cues directly using various kinds of prior knowledge, such as the contrast prior, the background prior, and the center prior. More specifically, the contrast prior is based on the observation that the human eye tends to notice regions with high feature contrast in the whole scene. For example, Itti et al. [8] compute the local contrast among several low-level visual features to represent saliency. Cheng et al. [9] first divide all pixels into different categories according to their color information and then compute each category's contrast and spatial information to generate the final saliency map. Achanta et al. [10] use statistical information to find regions with large feature differences and define them as salient regions. Unlike the contrast prior, the background prior usually selects image boundary regions as initial background seeds; the saliency value of each image region is then defined by its relationship to these seeds. For example, Yang et al. [12] use a manifold ranking score to represent the feature difference between image regions when constructing the background prior model. In [13], each image region is reconstructed from all boundary regions, and the reconstruction error serves as its saliency value. In addition, Zhang et al. [14] remove the bottom boundary and employ the boundary regions in the other directions as background seeds. Li et al. [15] present an effective label propagation mechanism that diffuses saliency values from the image boundary regions to the other regions. Besides, the center prior is also widely applied in existing works: it establishes various functions that give center regions higher saliency values, since the human eye tends to look at the center of a natural scene rather than the surrounding regions. Meanwhile, some works apply classic machine learning algorithms and mathematical theories to bottom-up methods. For example, Zeng et al. [6] propose a so-called saliency game to detect salient objects. Tong et al. [16] use multiple kernel functions and features to train multiple weak classifiers and then apply bootstrap learning to obtain a strong classifier that separates the salient object from the background. Kong et al. [17] propose a saliency pattern mining mechanism to select robust foreground seeds from an existing saliency map, providing a vital indicator for the subsequent optimization step.

Recently, graph-based salient object detection methods have become a major branch of the bottom-up strategy. These methods usually generate the saliency map by exploring the intrinsic relationships between image regions. Overall, graph-based methods follow a unified process. First, the input image is represented as a graph whose nodes are segmented superpixels connected by weighted edges. Most graph-based methods employ a two-ring graph structure, in which each node connects to its two-layer neighborhood, and the edge weight between connected nodes is a distance metric between their feature vectors. Second, each node is given a coarse saliency value according to various kinds of prior knowledge, and then saliency values are diffused among connected nodes by different propagation mechanisms, such as manifold ranking [12], absorbing Markov chains [19], [40], label propagation [15], and cellular automata [23]. A refined saliency map is generated after propagation. More recently, Chen et al. [30] attempt to reduce propagation errors by introducing the concept of sink points into the ranking-based propagation process. Zhang et al. [11] propose a so-called local structure propagation mechanism, which overcomes the limitations of cellular automata in some challenging scenes. In addition, Wang et al. [18] propose a novel graph model that simultaneously considers local and global cues to represent the relationships between superpixels.
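To make this conventional pipeline concrete, the following is a minimal sketch (not the authors' code) of building a two-ring superpixel graph with Gaussian edge weights on mean Lab color features; the segmentation uses SLIC from scikit-image, and all parameter values are illustrative.

```python
import numpy as np
from skimage import color
from skimage.segmentation import slic

def build_two_ring_graph(image_rgb, n_segments=200, sigma=0.1):
    """Conventional two-ring superpixel graph with Gaussian edge weights."""
    labels = slic(image_rgb, n_segments=n_segments, compactness=10, start_label=0)
    n = labels.max() + 1
    lab = color.rgb2lab(image_rgb)
    # Mean Lab feature per superpixel, scaled to a comparable range.
    feats = np.array([lab[labels == i].mean(axis=0) for i in range(n)]) / 100.0

    # One-ring adjacency from horizontally / vertically touching pixels.
    adj = np.zeros((n, n), dtype=bool)
    for a, b in [(labels[:, :-1], labels[:, 1:]), (labels[:-1, :], labels[1:, :])]:
        mask = a != b
        adj[a[mask], b[mask]] = True
        adj[b[mask], a[mask]] = True

    # Two-ring: also connect each node to its neighbors' neighbors.
    two_ring = adj | ((adj.astype(int) @ adj.astype(int)) > 0)
    np.fill_diagonal(two_ring, False)

    # Edge weight: Gaussian of the feature distance between connected nodes.
    dist = np.linalg.norm(feats[:, None] - feats[None, :], axis=2)
    W = np.where(two_ring, np.exp(-dist ** 2 / sigma ** 2), 0.0)
    return labels, feats, W
```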

Nevertheless, such methods still suffer from several unsolved problems. First, graph-based methods usually use one or several kinds of prior knowledge to give nodes initial saliency values, which directly influence the subsequent propagation result. However, different kinds of prior knowledge have their own strengths in different types of scenes, as illustrated in Fig. 1. In other words, it is difficult to find a unified prior that adapts to all images, and fusion errors are inevitable if we simply integrate the priors. Thus, how to select appropriate prior knowledge for each image is vital in the first stage of graph-based methods.

Second, the graph structures used in existing graph-based methods usually aim to assign similar saliency values to adjacent superpixels with similar features. However, this relies on the assumption that both the salient object and the background have uniform internal features while there is a large feature contrast between them. These graph structures therefore easily lose effectiveness in some complex scenes. (1) The salient object and the background share similar features. In this situation, two adjacent superpixels may cross the object boundary even if their features are very similar, so they should not be assigned similar saliency values. This means that the salient object is hard to highlight against the background with conventional graph structures when the feature contrast between them is low, as in the first image in Fig. 2. (2) The salient object (or the background) contains diverse regions with different appearances. In this situation, two adjacent superpixels may both belong to the salient object (or the background) even if their features differ greatly, so they should be assigned similar saliency values. This means that the salient object cannot be fully highlighted (or the background cannot be completely suppressed) by conventional graph structures, as in the second image in Fig. 2. In summary, existing graph structures have limitations in exploring the relationships between superpixels, especially in complex scenes.

Third, once an effective graph structure is obtained, a propagation mechanism is needed to refine the coarse saliency map. However, complicated background regions are hard to suppress with existing propagation mechanisms if they receive very high saliency values in the initial stage (i.e., the prior knowledge labeling stage). As a result, background noise remains after propagation even when a good graph structure has been built.

To address the drawbacks of existing works, we develop a novel graph-based salient object detection method, whose framework is shown in Fig. 3. The input image is first segmented into N superpixels, which serve as the basic units of the whole framework. Then, the background prior and the objectness prior are integrated to obtain a coarse saliency map. Considering the different performance of the two kinds of prior knowledge in different scenes, and unlike previous works that fuse them directly, we present a novel weighting mechanism that balances the proportion of the two priors in the integration stage; this proportion varies with the image content. In other words, our method adjusts each prior's contribution to the coarse saliency map according to the content of the image, which produces a more accurate coarse saliency map than previous works.
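The introduction does not spell out the weighting function, so the following is a hedged sketch of one plausible instantiation: each prior map is scored by the spatial compactness of its saliency mass (a compact, peaked map is usually more reliable than a diffuse one), and the two maps are fused by a content-adaptive convex combination. The compactness score and the fusion form are illustrative assumptions, not the authors' formulation.

```python
import numpy as np

def compactness_score(prior):
    """Hypothetical quality score: higher when saliency mass is concentrated."""
    h, w = prior.shape
    ys, xs = np.mgrid[0:h, 0:w]
    p = prior / (prior.sum() + 1e-8)
    cy, cx = (p * ys).sum(), (p * xs).sum()
    # Saliency-weighted spatial variance, normalized by image area.
    spread = (p * ((ys - cy) ** 2 + (xs - cx) ** 2)).sum()
    return 1.0 / (1.0 + spread / (h * w))

def fuse_priors(bg_prior, obj_prior):
    """Content-adaptive convex combination of two prior maps."""
    a = compactness_score(bg_prior)
    b = compactness_score(obj_prior)
    w = a / (a + b + 1e-8)  # proportion varies with the image content
    coarse = w * bg_prior + (1.0 - w) * obj_prior
    return (coarse - coarse.min()) / (coarse.max() - coarse.min() + 1e-8)
```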

Then we use a graph-based propagation mechanism to further refine the coarse saliency map. To better explore the intrinsic relevance between superpixels, we propose a novel boundary-guided graph model. Unlike previous graph structures, which assign similar saliency values to adjacent superpixels with similar features, our graph model emphasizes the importance of the object boundary in exploring the relationships between superpixels. That is, we enforce saliency consistency between adjacent superpixels belonging to the same class ('class' here means salient object or background), while ensuring that adjacent superpixels crossing the object boundary have a large saliency difference. This is clearly more effective than previous graph structures. To achieve it, we propose novel edge set construction and edge weight computation methods.

Edge set construction: Instead of the previous rule that connects each superpixel to its two-layer neighbors, we propose two new connection rules. First, the high similarity rule constructs a feature-similarity-based filtering mechanism that lets each superpixel find, among all neighbors with similar features, those belonging to the same class, and connect to them. This rule effectively avoids connections between adjacent superpixels crossing the object boundary when the salient object and the background share similar features, as in the first image in Fig. 2. Second, the prior guided rule helps each superpixel find, among all neighbors with large feature differences, those belonging to the same class, and connect to them. This rule avoids losing necessary edges when diverse regions with different appearances exist inside the same salient object or background, as in the second image in Fig. 2; in other words, it ensures that two adjacent superpixels inside the same salient object (or background) are connected even if their features are very dissimilar. Since the two connection rules have their own advantages for different types of superpixels, we also construct an adaptive strategy that helps each superpixel select the appropriate rule according to its local environment. As a result, we can connect any two adjacent superpixels inside the salient object or background while avoiding connections between adjacent superpixels crossing the object boundary.

Edge weight computation: Given the constructed edge set, we also need to compute the edge weight between connected superpixels. Previous works directly use the Euclidean distance between the feature vectors of connected superpixels as the edge weight. In contrast, we observe that the components of a feature vector contribute differently to the similarity metric between superpixels, and we therefore compute a more accurate edge weight by introducing a statistical evaluation method that adaptively adjusts the power of each feature component in the similarity metric. As a result, the contrast between the salient object and the background is better captured by our edge weight computation strategy.

In summary, the proposed boundary-guided graph model represents the relationships between superpixels better than conventional graph structures, especially in complex scenes such as those in Fig. 2.
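As an illustration only, the two connection rules and the component-weighted edge weight could be sketched as follows, reusing the adjacency, features, and superpixel-level coarse prior from the earlier sketches. The thresholds and the Fisher-style per-dimension importance measure are assumptions standing in for the paper's filtering mechanism and statistical evaluation, not the authors' exact definitions.

```python
import numpy as np

def boundary_guided_edges(adj, feats, prior_sp, tau_sim=0.15,
                          tau_loose=0.3, tau_tight=0.1):
    """Edge set from the two connection rules; all thresholds are illustrative."""
    dist = np.linalg.norm(feats[:, None] - feats[None, :], axis=2)
    E = np.zeros_like(adj)
    for i in range(adj.shape[0]):
        for j in np.flatnonzero(adj[i]):
            prior_gap = abs(prior_sp[i] - prior_sp[j])
            if dist[i, j] < tau_sim:
                # High similarity rule: keep a similar-feature edge only if the
                # coarse prior agrees the pair does not cross the boundary.
                keep = prior_gap < tau_loose
            else:
                # Prior guided rule: connect dissimilar neighbors only when the
                # prior strongly suggests they belong to the same class.
                keep = prior_gap < tau_tight
            if keep:
                E[i, j] = E[j, i] = True
    return E

def adaptive_edge_weights(E, feats, prior_sp, sigma=0.1):
    """Component-weighted edge weights: dimensions that better separate
    likely-foreground from likely-background superpixels get more weight."""
    fg, bg = prior_sp > 0.5, prior_sp <= 0.5
    if fg.sum() == 0 or bg.sum() == 0:
        w_dim = np.full(feats.shape[1], 1.0 / feats.shape[1])  # fallback: uniform
    else:
        num = (feats[fg].mean(0) - feats[bg].mean(0)) ** 2
        den = feats[fg].var(0) + feats[bg].var(0) + 1e-8
        w_dim = num / den
        w_dim = w_dim / (w_dim.sum() + 1e-8)
    d2 = (((feats[:, None] - feats[None, :]) ** 2) * w_dim).sum(axis=2)
    return np.where(E, np.exp(-d2 / sigma ** 2), 0.0)
```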

On the basis of the proposed boundary-guided graph structure, we also propose an iterative propagation mechanism to further boost the quality of the coarse saliency map. In this mechanism, two new propagation matrices, a spatial compactness matrix and a background suppression matrix, are built to overcome the limitations of existing propagation mechanisms, ensuring that complicated background regions are better suppressed and that salient objects are more effectively highlighted than in previous works.
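The update rule below is a schematic in the spirit of cellular-automata propagation [23]; the paper's spatial compactness and background suppression matrices are approximated here by a single per-node background confidence term, so this shows the general form of the iteration rather than the authors' exact scheme.

```python
import numpy as np

def iterative_propagation(W, s0, bg_conf, n_iter=20, lam=0.7):
    """Diffuse saliency over the graph while suppressing likely background.

    W       -- edge weight matrix from the boundary-guided graph
    s0      -- coarse saliency value per superpixel
    bg_conf -- per-node background confidence in [0, 1] (illustrative stand-in)
    """
    P = W / (W.sum(axis=1, keepdims=True) + 1e-8)  # row-normalized propagation
    s = s0.copy()
    for _ in range(n_iter):
        # Keep part of the old value, absorb the weighted neighbor average.
        s = lam * s + (1.0 - lam) * (P @ s)
        # Background suppression: damp nodes that are confidently background.
        s = s * (1.0 - bg_conf)
        s = (s - s.min()) / (s.max() - s.min() + 1e-8)
    return s
```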

In summary, the contributions of our work are listed as follows:

  • (1)

    We integrate two prior maps to construct an effective coarse saliency map. A novel balance mechanism is proposed to adjust the proportion of the two prior maps according to their performance, and this proportion varies with the image content.

  • (2)

    We propose a novel boundary-guided graph model that better explores the relationships between superpixels through novel edge set construction and edge weight computation methods. This boundary-guided graph structure is the main contribution of this work.

  • (3)

    Based on the proposed boundary-guided graph, an effective iterative propagation mechanism is developed to further boost the quality of the coarse saliency map.

Section snippets

Proposed method

The framework of our method is shown in Fig. 3. First, the background prior and the objectness prior are integrated into a coarse saliency map. A novel weighting mechanism is proposed to balance the proportion of the two priors in the integration stage, so that the coarse saliency map adapts to different scenes. We then use a graph-based optimization framework to further boost the quality of the coarse saliency map. Here, we construct a novel boundary-guided graph

Experiments

We compare the proposed method with fourteen other state-of-the-art saliency detection methods: BSCA [23], LPS [15], MST [38], LDS [24], TLLT [25], MAP [26], MLSP [27], DGLS [14], SMD [28], GraM [41], FSP [42], S-CNN [3], LEGs [2], and HCA [4]. Among them, BSCA, LPS, TLLT, MAP, MLSP, GraM, and FSP are graph-based salient object detection methods; S-CNN, LEGs, and HCA are deep neural network (DNN) based methods. The remaining comparison methods, MST, LDS, DGLS

Conclusion

This paper presents a novel salient object detection method. First, we integrate two kinds of prior knowledge into a coarse saliency map by developing a novel weighting mechanism that balances the power of the two initial prior maps. Second, an efficacious boundary-guided graph model is employed to better exploit the local relationships between superpixels. Based on this, an effective iterative propagation mechanism is developed to generate a refined saliency map. We compare the proposed method with other

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This work was supported in part by the National Key Research and Development Project of China under Grant 2018YFB1404101.

References (48)

  • Y. Qin et al., Hierarchical cellular automata for visual saliency, Int. J. Comput. Vis. (2018)
  • J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for semantic segmentation, in: Proceedings of 2015 IEEE...
  • Y. Zeng et al., An unsupervised game-theoretic approach to saliency detection, IEEE Trans. Image Process. (2018)
  • K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, in: Proceedings of...
  • L. Itti et al., A model of saliency-based visual attention for rapid scene analysis, IEEE Trans. Pattern Anal. Mach. Intell. (1998)
  • M. Cheng et al., Global contrast based salient region detection, IEEE Trans. Pattern Anal. Mach. Intell. (2015)
  • R. Achanta, S. Hemami, F. Estrada, S. Susstrunk, Frequency-tuned salient region detection, in: Proceedings of 2009 IEEE...
  • C. Yang, L. Zhang, H. Lu, X. Ruan, M. Yang, Saliency detection via graph-based manifold ranking, in: Proceedings of...
  • X. Li, H. Lu, L. Zhang, X. Ruan, M. Yang, Saliency detection via dense and sparse reconstruction, in: Proceedings...
  • H. Li et al., Inner and inter label propagation: salient object detection in the wild, IEEE Trans. Image Process. (2015)
  • N. Tong, H. Lu, X. Ruan, M. Yang, Salient object detection via bootstrap learning, in: Proceedings of 2015 IEEE...
  • Y. Kong, L. Wang, X. Liu, H. Lu, X. Ruan, Pattern mining saliency, in: Proceedings of 2016 European 14th Conference on...
  • Q. Wang, W. Zheng, R. Piramuthu, GraB: Visual saliency via novel graph model and background priors, in: Proceedings of...
  • L. Zhang et al., Saliency detection via absorbing markov chain with learnt transition probability, IEEE Trans. Image Process. (2018)