Transformer-based Cross Reference Network for video salient object detection
Pattern Recognition Letters (IF 3.9) Pub Date: 2022-06-15, DOI: 10.1016/j.patrec.2022.06.006
Kan Huang, Chunwei Tian, Jingyong Su, Jerry Chun-Wei Lin

Video salient object detection is a fundamental computer vision task that aims to highlight the most conspicuous objects in a video sequence. It presents two key challenges: (1) how to extract effective feature representations from appearance and motion cues, and (2) how to combine both into a robust saliency representation. To address these challenges, we propose a novel Transformer-based Cross Reference Network (TCRN), which fully exploits long-range context dependencies in both feature extraction and cross-modal (i.e., appearance and motion) integration. In contrast to existing CNN-based methods, our approach formulates video salient object detection as a sequence-to-sequence prediction task, in which deep feature extraction is performed by a pure vision transformer with multi-resolution token representations. Specifically, we design a Gated Cross Reference (GCR) module to effectively integrate appearance and motion into the saliency representation: the GCR first propagates global context information between the two modalities and then performs cross-modal fusion through a gate mechanism. Extensive evaluations on five widely used benchmarks show that the proposed Transformer-based method performs favorably against existing state-of-the-art methods.
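The abstract does not give the GCR module's exact formulation, but a minimal sketch of the two-step design it describes (cross-modal context propagation followed by gated fusion) might look like the following PyTorch module. The class name, the use of bidirectional cross-attention for context propagation, and the gate layout are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class GatedCrossReference(nn.Module):
    """Hypothetical sketch of a gated cross-modal fusion step:
    cross-attention propagates global context between the appearance
    and motion token streams, then a learned gate blends the two."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # Cross-attention in both directions (appearance <-> motion).
        self.app_from_motion = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.motion_from_app = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Gate predicts per-token mixing weights from the two context streams.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, app: torch.Tensor, motion: torch.Tensor) -> torch.Tensor:
        # app, motion: (batch, tokens, dim) token sequences from the backbone.
        app_ctx, _ = self.app_from_motion(app, motion, motion)   # appearance attends to motion
        motion_ctx, _ = self.motion_from_app(motion, app, app)   # motion attends to appearance
        g = self.gate(torch.cat([app_ctx, motion_ctx], dim=-1))  # (batch, tokens, dim), values in [0, 1]
        return g * app_ctx + (1.0 - g) * motion_ctx              # gated saliency representation


# Usage: fuse 196 tokens of width 384 from the two modalities.
gcr = GatedCrossReference(dim=384)
fused = gcr(torch.randn(2, 196, 384), torch.randn(2, 196, 384))
print(fused.shape)  # torch.Size([2, 196, 384])
```

A sigmoid gate of this kind lets the network weight appearance against motion per token, so static frames can lean on appearance while fast-moving scenes lean on motion; the actual TCRN gating may differ.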



Updated: 2022-06-15