Contextual encoder-decoder network for visual saliency prediction, Neural Networks

Neural Networks (IF 7.8), Pub Date: 2020-05-08, DOI: 10.1016/j.neunet.2020.05.004
Alexander Kroner, Mario Senden, Kurt Driessens, Rainer Goebel

Predicting salient regions in natural images requires the detection of objects that are present in a scene. To develop robust representations for this challenging task, high-level visual features at multiple spatial scales must be extracted and augmented with contextual information. However, existing models aimed at explaining human fixation maps do not incorporate such a mechanism explicitly. Here we propose an approach based on a convolutional neural network pre-trained on a large-scale image classification task. The architecture forms an encoder–decoder structure and includes a module with multiple convolutional layers at different dilation rates to capture multi-scale features in parallel. Moreover, we combine the resulting representations with global scene information to predict visual saliency accurately. Our model achieves competitive and consistent results across multiple evaluation metrics on two public saliency benchmarks, and we demonstrate the effectiveness of the suggested approach on five datasets and selected examples. Compared to state-of-the-art approaches, the network is based on a lightweight image classification backbone and hence is a suitable choice for estimating human fixations across complex natural scenes in applications with limited computational resources, such as (virtual) robotic systems. Our TensorFlow implementation is openly available at https://github.com/alexanderkroner/saliency.
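
The idea of applying convolutions at several dilation rates in parallel and combining them with global scene information can be illustrated with a minimal pure-Python sketch. This is not the authors' TensorFlow implementation: the function names, the 3x3 averaging kernel, the dilation rates (1, 2, 4), and the use of an image-wide mean as the "global" branch are all assumptions chosen for illustration.

```python
def dilated_conv2d(image, kernel, rate):
    """Zero-padded, same-size 2D cross-correlation with a dilated kernel."""
    height, width = len(image), len(image[0])
    k = len(kernel)  # assumes a square kernel with odd side length
    half = k // 2
    output = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total = 0.0
            for i in range(k):
                for j in range(k):
                    # Kernel taps are spaced `rate` pixels apart;
                    # taps that fall outside the image contribute zero.
                    yy = y + (i - half) * rate
                    xx = x + (j - half) * rate
                    if 0 <= yy < height and 0 <= xx < width:
                        total += kernel[i][j] * image[yy][xx]
            output[y][x] = total
    return output


def global_context(image):
    """Broadcast the image-wide mean to every position (hypothetical global branch)."""
    height, width = len(image), len(image[0])
    mean = sum(sum(row) for row in image) / (height * width)
    return [[mean] * width for _ in range(height)]


def multi_scale_features(image, kernel, rates=(1, 2, 4)):
    """Apply one kernel at several dilation rates and append a global-context map."""
    branches = [dilated_conv2d(image, kernel, r) for r in rates]
    branches.append(global_context(image))
    return branches  # one feature map per branch, each the size of the input


# Toy 6x6 input and a 3x3 averaging kernel (illustrative values only).
image = [[float(x + y) for x in range(6)] for y in range(6)]
box = [[1.0 / 9.0] * 3 for _ in range(3)]
features = multi_scale_features(image, box)
```

Each branch keeps the input's spatial size, so the maps can be stacked as channels; larger rates widen the receptive field without adding kernel weights. In a trained network the kernels would be learned and the branches fused by further convolutional layers, which this sketch omits.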

Updated: 2020-05-08
