
Neurocomputing

Volume 411, 21 October 2020, Pages 416-427

Multi-attention guided feature fusion network for salient object detection

https://doi.org/10.1016/j.neucom.2020.06.021

Abstract

Although salient object detection methods have achieved increasingly better performance with the rapid development of deep learning, how to obtain effective feature representations that yield more accurate saliency maps remains a pressing problem. To address it, most previous works focus on skip-based architectures that integrate hierarchical information across different scales and layers. However, a simple concatenation of high-level and low-level features is not a cure-all, because cluttered and noisy information can have negative consequences. Concerning this issue, we propose a Multi-Attention guided Feature-fusion network (MAF) that alleviates the problem from two aspects. First, a novel Channel-wise Attention Block (CAB) takes charge of message passing layer by layer from a global view, using the semantic cues in the higher convolutional block to guide feature selection in the lower block. Second, a Position Attention Block (PAB) operates on the integrated features to model pixel-wise relationships and capture rich contextual dependencies. Under the guidance of multi-attention, discriminative features are selected to build a new end-to-end, densely supervised encoder-decoder network that detects salient objects more uniformly and precisely. Experimental results on five benchmark datasets show that our method performs favorably against other state-of-the-art approaches.

Introduction

Saliency detection refers to extracting salient regions from images with intelligent algorithms that simulate the human visual system. The task is divided into two branches: eye-fixation detection [[1], [2], [3], [4], [5]] and salient object segmentation [6], [7], [8], [9], [10], [11], [12], [13], [14]. In this paper, we focus on the latter, with the purpose of separating salient object areas from input images. The results of this research usually serve as a pre-processing step in various computer vision tasks, such as video segmentation [15], visual tracking [16], image retrieval [17], thumbnail creation [18] and image captioning [19].

Owing to the importance of salient object detection, numerous methods have emerged over the past few decades. Conventional models [20], [21], deeply influenced by the algorithm proposed by Itti et al. [22], usually utilize hand-crafted features to calculate contrast between local and global regions. However, it is very difficult to segment salient objects from complex scenes using such simple low-level features as color and intensity.

Recently, substantial progress has been made in computer vision with the introduction of Convolutional Neural Networks (CNNs, e.g. VGG [25] and ResNet [26]). CNN-based methods, which can extract complex features carrying high-level semantic cues and low-level spatial structures simultaneously, are more feasible and effective than traditional algorithms. Even so, the repeated pooling operations in CNNs inevitably cause a loss of spatial details that cannot be recovered by upsampling and that harms dense prediction tasks. To address this problem, multi-scale feature aggregation mechanisms [23], [24] have been used to enhance detailed information and capture distinctive objectness. However, the result of simple skip and short connections is not quite satisfactory (see Fig. 1), because different features contribute differently to predicting salient pixels; in fact, some cluttered and noisy features may cause interference.

Therefore, to obtain an optimal and robust fused-feature representation for more precise prediction, we want the network to be able to select discriminative features and discard noisy ones automatically. The attention mechanism [27], [28], which assigns weights to image features at different positions and channels, was introduced for exactly this purpose and has benefited many computer vision tasks [29], [30], [31], [32]. Building on the strengths of attention, we apply multiple attention mechanisms to guide the message passing block by block in this paper. Different from the work [33] proposed in 2018, we use a novel Channel-wise Attention Block (CAB), which takes charge of the information transmission between every two contiguous blocks to learn more satisfactory aggregated features. In addition, we employ self-attention and spatial attention to improve the integrated features in the spatial dimension.

More specifically, our motivation is to solve two challenging problems in salient object detection via attention mechanisms. The first is how to preserve the spatial consistency of the salient object. As shown in the first row of Fig. 1, inconsistency within the salient area troubles many saliency methods, which may miss parts of the target object. To tackle this issue, we construct a CAB-based encoder-decoder network that learns a more robust fused feature representation, for two reasons. For one thing, in the CAB module we concatenate the features output by every two adjacent convolutional blocks and then use the semantic information of the higher block to compute the channel-wise weights of the lower block from a global perspective. Accordingly, the semantic cues in the deeper block can guide the shallower block to select more discriminative features, which strengthens the capacity to segment the whole object. For another, the inconsistency problem is also caused by the lack of sufficient context information, so we integrate multi-scale features in the decoder subnet to capture global and local context.
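To make the CAB idea above concrete, the following PyTorch-style sketch shows one plausible realization: the deeper (semantic) feature map is upsampled and concatenated with the shallower one, global average pooling summarizes the fused tensor, and a small bottleneck produces channel weights that re-scale the low-level feature. The layer sizes, the reduction ratio and the SE-style bottleneck are our assumptions for illustration, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelwiseAttentionBlock(nn.Module):
    """Illustrative sketch of the CAB idea: the deeper (semantic) block
    produces channel-wise weights that re-scale the shallower (detail) block."""

    def __init__(self, low_channels, high_channels, reduction=4):
        super().__init__()
        fused_channels = low_channels + high_channels
        # Bottleneck mapping the global context of the fused feature to one
        # weight per channel of the low-level feature (assumed layout).
        self.fc = nn.Sequential(
            nn.Linear(fused_channels, fused_channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(fused_channels // reduction, low_channels),
            nn.Sigmoid(),
        )

    def forward(self, low_feat, high_feat):
        # Resize the deeper feature map to the spatial size of the shallower one.
        high_up = F.interpolate(high_feat, size=low_feat.shape[2:],
                                mode='bilinear', align_corners=False)
        fused = torch.cat([low_feat, high_up], dim=1)            # (B, Cl+Ch, H, W)
        context = fused.mean(dim=(2, 3))                         # global average pooling
        weights = self.fc(context).unsqueeze(-1).unsqueeze(-1)   # (B, Cl, 1, 1)
        return low_feat * weights                                # re-weighted low-level feature
```

In the full network, the re-weighted low-level feature would then be fused with the upsampled high-level feature in the decoder; the bottleneck above is only one way to realize a "global view" of the concatenated blocks.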

The second problem is how to prevent the network from predicting redundant background areas as salient (see the second row of Fig. 1). This issue mainly results from cluttered background features and the lack of contrastive context information. To alleviate it, we design a Position Attention Block (PAB) composed of a self-attention module and a spatial attention module. First, the self-attention module captures relationships between every pair of pixels. For the feature vector at each spatial position, we calculate its similarity to all other feature vectors; these similarities are used to weight every feature vector over all spatial locations, and the sum of the weighted feature vectors updates the feature vector at the original position. As a result, similar feature vectors mutually reinforce each other irrespective of their distance in the feature map, so the model can capture long-range dependencies and contextual information. Second, we apply the spatial attention module to highlight salient areas and suppress background positions. Since not all feature vectors contribute to saliency detection and noisy features from background regions may cause interference, the spatial attention module avoids distraction from non-salient regions and makes features more distinctive.
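As a rough illustration of the two PAB components described above, the sketch below implements (i) a self-attention step that computes similarities between every pair of spatial positions and uses them to re-aggregate the features, and (ii) a simple spatial gate that suppresses background positions. The 1x1 convolutions, the reduction ratio and the learnable residual scale follow the common non-local/self-attention pattern and are assumptions; the paper does not spell out the exact layer layout here.

```python
import torch
import torch.nn as nn

class SelfAttention2d(nn.Module):
    """Non-local style self-attention over all spatial positions (sketch)."""

    def __init__(self, channels, reduction=8):
        super().__init__()
        inner = channels // reduction
        self.query = nn.Conv2d(channels, inner, kernel_size=1)
        self.key = nn.Conv2d(channels, inner, kernel_size=1)
        self.value = nn.Conv2d(channels, channels, kernel_size=1)
        self.gamma = nn.Parameter(torch.zeros(1))  # learnable residual scale

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)      # (B, HW, C')
        k = self.key(x).flatten(2)                        # (B, C', HW)
        attn = torch.softmax(q @ k, dim=-1)               # similarity of every pixel pair
        v = self.value(x).flatten(2)                       # (B, C, HW)
        out = (v @ attn.transpose(1, 2)).view(b, c, h, w)  # weighted sum over positions
        return self.gamma * out + x                        # update the original positions


class SpatialAttention(nn.Module):
    """Simple spatial gate that highlights salient positions and
    suppresses background ones (sketch)."""

    def __init__(self, channels):
        super().__init__()
        self.mask = nn.Conv2d(channels, 1, kernel_size=3, padding=1)

    def forward(self, x):
        return x * torch.sigmoid(self.mask(x))
```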

In conclusion, the feature fusion network proposed in this paper performs well under the guidance of the multi-attention mechanism. Our contributions are threefold:

  • We propose an encoder-decoder feature aggregation network with a novel channel-wise attention block, which utilizes features in the high-level block to guide the selection of features in the low-level block. The multi-scale fusion features are of great benefit to the spatial consistency of the salient object.

  • We also use self-attention and spatial attention to capture long-range contextual information and make features more distinctive and effective.

  • We test the model on five saliency benchmark datasets, and the experimental results validate the effectiveness of our proposed algorithm.


Related work

As a vital branch of dense prediction tasks, saliency detection has developed rapidly in recent decades. Early works [34], [35], [36], [37], [38], [39], [40], [41], [42] concentrated on extracting hand-crafted features, such as color, intensity and various priors. Limited by the imperfection of low-level visual features and by their designers' prior knowledge, these methods suffer from poor accuracy and generalization. Due to the efficiency of deep learning approaches in computer vision tasks [43], [44],

Proposed method

In this section, we elaborate on the proposed network for the saliency task. First, we describe the backbone of the architecture. Then, we focus on the channel-wise attention guided multi-scale feature fusion mechanism. Finally, we present the Position Attention Block (PAB), composed of a spatial attention module and a self-attention module, which filters features in the spatial dimension. As shown in Fig. 2, there are six side output predictions in the whole network. We concat
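Since the snippet above mentions six densely supervised side outputs, a minimal sketch of how such deep supervision is typically trained is given below: each side-output logit map is resized to the ground-truth resolution, penalized with binary cross-entropy, and the individual losses are summed. Equal loss weights and bilinear resizing are assumptions here; the paper may weight or fuse the side outputs differently.

```python
import torch.nn.functional as F

def deep_supervision_loss(side_outputs, gt):
    """Sum of BCE losses over the side-output saliency maps (sketch).

    side_outputs: list of (B, 1, h_i, w_i) logit maps (six in this network)
    gt:           (B, 1, H, W) binary ground-truth saliency mask
    """
    total = 0.0
    for logits in side_outputs:
        # Bring each side output to the ground-truth resolution before supervision.
        logits = F.interpolate(logits, size=gt.shape[2:],
                               mode='bilinear', align_corners=False)
        total = total + F.binary_cross_entropy_with_logits(logits, gt)
    return total
```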

Evaluation datasets

We evaluate the proposed network on five popular benchmark datasets: ECSSD [36], DUT-OMRON [35], HKU-IS [48], DUTS-test [66] and SOD [67]. The ECSSD dataset contains 1000 natural images with pixel-level annotations, collected from the Internet. The DUT-OMRON dataset contains 5168 complicated images with accurate ground truth and is very challenging. The HKU-IS dataset contains 4447 images, which usually feature multiple disconnected salient objects. The DUTS dataset is a large-scale dataset

Conclusion

In this paper, we propose a novel feature fusion network for the saliency detection task, using three kinds of attention mechanisms to guide the integration and selection of features. To enhance the spatial consistency of salient object areas, we introduce a novel CAB module that exploits the semantic cues in the high-level block to guide feature selection in the low-level block from a global view. Then we utilize spatial attention and self-attention to generate the position attention module,

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

CRediT authorship contribution statement

Anni Li: Writing - Original Draft. JinQing Qi: Writing - Review & Editing. Huchuan Lu: Supervision.


References (72)

  • W. Wang et al., Deep visual attention prediction, IEEE Trans. Image Process. (2017)

  • X. Huang et al., SALICON: reducing the semantic gap in saliency prediction by adapting deep neural networks

  • J. Pan et al., Shallow and deep convolutional networks for saliency prediction

  • M. Kummerer et al., Understanding low- and high-level contributions to fixation prediction

  • M. Cornia et al., Predicting human eye fixations via an LSTM-based saliency attentive model, IEEE Trans. Image Process. (2018)

  • J. Kim et al., A shape-based approach for salient object detection using deep learning

  • S.S. Kruthiventi et al., Saliency unified: a deep architecture for simultaneous eye fixation prediction and salient object segmentation

  • J. Han et al., Background prior-based salient object detection via deep reconstruction residual, IEEE Trans. Circ. Syst. Video Technol. (2014)

  • T. Wang et al., Detect globally, refine locally: a novel approach to saliency detection

  • R. Quan et al., Unsupervised salient object detection via inferring from imperfect saliency models, IEEE Trans. Multimedia (2017)

  • W. Wang et al., Correspondence driven saliency transfer, IEEE Trans. Image Process. (2016)

  • J. Han et al., Advanced deep-learning techniques for salient and category-specific object detection: a survey, IEEE Signal Process. Mag. (2018)

  • T.V. Nguyen et al., As-similar-as-possible saliency fusion, Multimedia Tools Appl. (2017)

  • T.V. Nguyen et al., Semantic prior analysis for salient object detection, IEEE Trans. Image Process. (2019)

  • W. Wang et al., Saliency-aware video object segmentation, IEEE Trans. Pattern Anal. Mach. Intell. (2017)

  • S. Hong et al., Online tracking by learning discriminative saliency map with convolutional neural network

  • Y. Gao et al., 3-D object retrieval and recognition with hypergraph analysis, IEEE Trans. Image Process. (2012)

  • W. Wang et al., Stereoscopic thumbnail creation via efficient stereo saliency detection, IEEE Trans. Visual. Comput. Graph. (2016)

  • H. Fang et al., From captions to visual concepts and back

  • M.-M. Cheng et al., Global contrast based salient region detection, IEEE Trans. Pattern Anal. Mach. Intell. (2014)

  • J. Han et al., Unsupervised extraction of visual attention objects in color images, IEEE Trans. Circ. Syst. Video Technol. (2005)

  • L. Itti et al., A model of saliency-based visual attention for rapid scene analysis, IEEE Trans. Pattern Anal. Mach. Intell. (1998)

  • P. Zhang et al., Amulet: aggregating multi-level convolutional features for salient object detection

  • Q. Hou et al., Deeply supervised salient object detection with short connections

  • K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint...

  • K. He et al., Deep residual learning for image recognition

  • V. Mnih, N. Heess, A. Graves, et al., Recurrent models of visual attention, in: Advances in Neural Information...

  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you...

  • M. Ren et al., End-to-end instance segmentation with recurrent attention

  • X. Chu et al., Multi-context attention for human pose estimation

  • J. Lu et al., Knowing when to look: adaptive attention via a visual sentinel for image captioning

  • D. Yu et al., Multi-level attention networks for visual question answering

  • X. Zhang et al., Progressive attention guided recurrent network for salient object detection

  • H. Jiang et al., Salient object detection: a discriminative regional feature integration approach

  • C. Yang et al., Saliency detection via graph-based manifold ranking

  • Q. Yan et al., Hierarchical saliency detection


    Anni Li received her B.E. degree in electrical and information engineering from Dalian University of Technology (DUT), China, in 2017. She is currently a master student in Signal and Information Processing at Dalian University of Technology (DUT). Her research interests include saliency detection and semantic segmentation.

    Jinqing Qi received the Ph.D. degree in communication and integrated systems from the Tokyo Institute of Technology, Tokyo, Japan, in 2004. He is currently an Associate Professor of Information and Communication Engineering at Dalian University of Technology (DUT), Dalian, China. His recent research interests focus on computer vision, pattern recognition and machine learning. He is a member of IEEE.

    Huchuan Lu received the M.S. degree from the Department of Electrical Engineering, Dalian University of Technology (DUT), China, in 1998, and the Ph.D. degree in System Engineering from DUT in 2008. Since 1998, he has been a faculty member of the School of Electronic and Information Engineering of DUT, and he has been an associate professor since 2006. He visited Ritsumeikan University from Oct. 2007 to Jan. 2008. His recent research interests focus on computer vision, artificial intelligence, pattern recognition and machine learning. He is a member of IEEE and IEIC.
