DeepVS2.0: A Saliency-Structured Deep Learning Method for Predicting Dynamic Visual Attention
International Journal of Computer Vision ( IF 11.6 ) Pub Date : 2020-08-28 , DOI: 10.1007/s11263-020-01371-6
Lai Jiang , Mai Xu , Zulin Wang , Leonid Sigal

Deep neural networks (DNNs) have exhibited great success in image saliency prediction. However, few works have applied DNNs to predict the saliency of generic videos. In this paper, we propose a novel DNN-based video saliency prediction method, called DeepVS2.0. Specifically, we establish a large-scale eye-tracking database of videos (LEDOV), which provides sufficient data to train DNN models for predicting video saliency. Through statistical analysis of LEDOV, we find that human attention is normally attracted by objects, particularly moving objects or the moving parts of objects. Accordingly, we propose an object-to-motion convolutional neural network (OM-CNN) in DeepVS2.0 to learn spatio-temporal features for predicting intra-frame saliency, by exploiting both objectness and object-motion information. We further find from our database that human attention is temporally correlated, with smooth saliency transitions across video frames. Therefore, a saliency-structured convolutional long short-term memory network (SS-ConvLSTM) is developed in DeepVS2.0 to predict inter-frame saliency, using the features extracted by OM-CNN as its input. Moreover, center-bias dropout and a sparsity-weighted loss are embedded in SS-ConvLSTM, to account for the center bias and sparsity of human attention maps. Finally, the experimental results show that our DeepVS2.0 method advances the state of the art in video saliency prediction.
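The center-bias dropout mentioned above can be pictured as a spatial dropout whose keep probability is highest at the frame center, mirroring the tendency of human gaze toward the middle of the screen. The following is a minimal NumPy sketch of one way such a mechanism could work; the Gaussian keep-probability map, its `sigma` parameter, and the inverted-dropout scaling are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def center_bias_keep_prob(h, w, sigma=0.3):
    """2D Gaussian keep-probability map, peaked at the frame center.

    Values range from 0.5 at the corners up to ~1.0 at the center,
    so central activations are rarely dropped (assumed design choice).
    """
    ys = np.linspace(-1.0, 1.0, h)[:, None]
    xs = np.linspace(-1.0, 1.0, w)[None, :]
    g = np.exp(-(ys**2 + xs**2) / (2.0 * sigma**2))
    return 0.5 + 0.5 * g

def center_bias_dropout(features, rng, sigma=0.3):
    """Apply a spatially biased Bernoulli mask to a (h, w, c) feature map.

    Each spatial location is kept with the center-weighted probability,
    then rescaled (inverted dropout) so the expected activation is unchanged.
    """
    h, w, _ = features.shape
    p = center_bias_keep_prob(h, w, sigma)
    mask = (rng.random((h, w)) < p).astype(features.dtype)
    return features * mask[:, :, None] / p[:, :, None]

rng = np.random.default_rng(0)
feat = np.ones((8, 8, 4), dtype=np.float32)
out = center_bias_dropout(feat, rng)  # same shape as feat, center rarely zeroed
```

At training time this encourages the recurrent saliency predictor not to rely solely on the central region, while at test time the dropout would simply be disabled, as with standard dropout.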

Updated: 2020-08-28