Video Saliency Prediction Using Enhanced Spatiotemporal Alignment Network



Pattern Recognition (IF 7.5), Pub Date: 2021-01-01, DOI: 10.1016/j.patcog.2020.107615
Jin Chen, Huihui Song, Kaihua Zhang, Bo Liu, Qingshan Liu

Because motion varies widely across frames, learning an effective spatiotemporal representation for accurate video saliency prediction (VSP) is highly challenging. To address this issue, we develop a spatiotemporal feature alignment network tailored to VSP that comprises two key sub-networks: a multi-scale deformable convolutional alignment network (MDAN) and a bidirectional convolutional Long Short-Term Memory (Bi-ConvLSTM) network. The MDAN learns to align the features of neighboring frames to those of the reference frame in a coarse-to-fine manner, which allows it to handle a variety of motions. Specifically, the MDAN has a pyramidal feature hierarchy: it first uses deformable convolution (Dconv) to align the lower-resolution features across frames, and then aggregates the aligned features to align the higher-resolution features, progressively enhancing the features from top to bottom. The output of the MDAN is then fed into the Bi-ConvLSTM for further enhancement; the Bi-ConvLSTM captures useful long-term temporal information in both the forward and backward temporal directions to guide the prediction of attention shifts under complex scene changes. Finally, the enhanced features are decoded to generate the predicted saliency map. The proposed model is trained end-to-end without any intricate post-processing. Extensive evaluations on four VSP benchmark datasets demonstrate that the proposed method performs favorably against state-of-the-art methods. The source code and all results will be released.
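The coarse-to-fine idea behind the alignment step can be illustrated with a toy example. The sketch below is not the authors' MDAN and uses no deformable convolution: it is a hypothetical 1-D analogue in which a single integer shift is searched at the lowest resolution and then refined level by level, so that a small search radius per level still recovers a large displacement. All function names (`downsample`, `shift`, `alignment_error`, `coarse_to_fine_offset`) and the test signals are assumptions made for illustration.

```python
def downsample(signal):
    """Halve the resolution by averaging adjacent pairs of samples."""
    return [(signal[i] + signal[i + 1]) / 2.0 for i in range(0, len(signal) - 1, 2)]


def shift(signal, offset):
    """Shift a signal right by `offset` samples (left if negative), zero padded."""
    n = len(signal)
    return [signal[i - offset] if 0 <= i - offset < n else 0.0 for i in range(n)]


def alignment_error(reference, neighbor, offset):
    """Sum of squared differences after shifting the neighbor by `offset`."""
    shifted = shift(neighbor, offset)
    return sum((r - s) ** 2 for r, s in zip(reference, shifted))


def coarse_to_fine_offset(reference, neighbor, levels=3, radius=1):
    """Estimate the shift that aligns `neighbor` to `reference`.

    The shift is first estimated on the coarsest pyramid level, then doubled
    and refined within +/- `radius` samples at each finer level.
    """
    # Build the pyramids; index 0 is full resolution, the last index is coarsest.
    ref_pyramid, nb_pyramid = [reference], [neighbor]
    for _ in range(levels - 1):
        ref_pyramid.append(downsample(ref_pyramid[-1]))
        nb_pyramid.append(downsample(nb_pyramid[-1]))

    offset = 0
    for level in range(levels - 1, -1, -1):
        if level != levels - 1:
            offset *= 2  # one coarse sample spans two samples at the finer level
        candidates = range(offset - radius, offset + radius + 1)
        offset = min(
            candidates,
            key=lambda d: alignment_error(ref_pyramid[level], nb_pyramid[level], d),
        )
    return offset


# Toy "frames": the same 4-sample bump, displaced by 4 samples between frames.
reference = [1.0 if 16 <= i < 20 else 0.0 for i in range(32)]
neighbor = [1.0 if 12 <= i < 16 else 0.0 for i in range(32)]

offset = coarse_to_fine_offset(reference, neighbor)
print(offset)  # 4
```

With a radius of 1 at each of three levels, the search examines nine candidate shifts in total yet recovers a displacement of 4 samples, which a single full-resolution search with the same radius would miss. In the paper's setting the analogous quantities would be learned per-location offsets on 2-D feature maps rather than one global integer shift.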


Updated: 2021-01-01
