Spatiotemporal module for video saliency prediction based on self-attention
Image and Vision Computing (IF 4.2). Pub Date: 2021-05-20. DOI: 10.1016/j.imavis.2021.104216
Yuhao Wang, Zhuoran Liu, Yibo Xia, Chunbo Zhu, Danpei Zhao

Existing video saliency prediction methods still have limitations in learning the spatiotemporal correlation between features and salient regions. To address this, this paper proposes a spatiotemporal module for video saliency prediction based on self-attention. The proposed model addresses three essential problems. First, we propose a multi-scale feature-fusion network (MFN) for effective feature integration; the framework extracts and fuses features from four scales at low memory cost. Second, we cast the task as a global evaluation of pixel-level correlations, which allows human visual attention in task-driven scenes to be predicted more accurately; an adapted transformer encoder is designed for this spatiotemporal correlation learning. Finally, we introduce DConvLSTM to learn the temporal context in videos. Experimental results show that the proposed model achieves state-of-the-art performance on both driving scenes and natural scenes with multi-motion information, and competitive performance in natural scenes with multi-category objects. This demonstrates that our method is practical under both data-driven and task-driven conditions.
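
The abstract gives no implementation details, but the core idea of casting pixel-level spatiotemporal correlation as self-attention can be sketched in PyTorch. In the minimal sketch below, the module name, feature dimension, head count, and layer count are placeholder assumptions rather than the paper's values; it flattens per-frame feature maps (e.g., the fused MFN output) into space-time tokens and runs a standard transformer encoder over them:

    import torch
    import torch.nn as nn

    class SpatioTemporalSelfAttention(nn.Module):
        # Treats every (frame, pixel) feature vector as a token so that a
        # standard transformer encoder can learn global space-time correlations.
        # All hyperparameter defaults here are illustrative assumptions.
        def __init__(self, dim=256, heads=8, layers=2):
            super().__init__()
            layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                               batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

        def forward(self, x):  # x: (B, T, C, H, W) fused multi-scale features
            b, t, c, h, w = x.shape
            # Flatten space and time into one token sequence: (B, T*H*W, C).
            tokens = x.flatten(3).permute(0, 1, 3, 2).reshape(b, t * h * w, c)
            tokens = self.encoder(tokens)  # global pixel-level attention
            # Restore the (B, T, C, H, W) feature-map layout.
            return tokens.reshape(b, t, h, w, c).permute(0, 1, 4, 2, 3)

    # Example: four frames of 256-channel 12x16 feature maps.
    module = SpatioTemporalSelfAttention(dim=256)
    out = module(torch.randn(1, 4, 256, 12, 16))  # -> (1, 4, 256, 12, 16)

Attending over all T*H*W tokens at once is what makes the correlation learning global; in practice the downsampled feature resolution keeps the sequence length (here 4*12*16 = 768 tokens) tractable.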



Updated: 2021-06-01