Spreading Fine-Grained Prior Knowledge for Accurate Tracking
IEEE Transactions on Circuits and Systems for Video Technology (IF 8.4), Pub Date: 2022-03-28, DOI: 10.1109/tcsvt.2022.3162599
Jiahao Nie, Han Wu, Zhiwei He, Mingyu Gao, Zhekang Dong

With the widespread use of deep learning in the single object tracking task, mainstream tracking algorithms treat tracking as a combined classification and regression problem. Classification aims at locating an arbitrary target, and regression aims at estimating the corresponding bounding box. In this paper, we focus on regression and propose a novel box estimation network, which consists of a transformer encoder target pyramid guide (TPG) and a transformer decoder target pyramid spread (TPS). Specifically, the transformer encoder TPG is designed to generate fine-grained prior knowledge with an explicit representation of the template target. In contrast to the standard transformer encoder, we capture visual dependence through local-global self-attention and treat the multi-scale target regions as the "local" regions. Using this fine-grained prior knowledge, we design the transformer decoder TPS to spread it to the subsequent search regions with high affinity, so as to accurately estimate the bounding boxes. Considering that self-attention fails to model information interaction across channels between the template target and the search regions, we develop a channel-wise cross-attention block within the TPS as compensation. Extensive experiments on the OTB100, UAV123, NFS, VOT2020, VOT2021, LaSOT, LaSOT_ext, TrackingNet and GOT-10k benchmarks show that the proposed box estimation network outperforms most existing box estimation methods. Furthermore, our trackers based on this estimation network exhibit competitive performance against state-of-the-art trackers.
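The channel-wise cross-attention idea can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: learned query/key/value projections are omitted, and the template and search features are assumed to have been flattened to the same spatial length N. The point is only that the attention matrix here is C x C (channel against channel), so information is mixed across channels rather than across spatial positions as in standard self-attention.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def channel_cross_attention(search, template):
    """Hypothetical channel-wise cross-attention sketch.

    search:   (C, N) flattened search-region features
    template: (C, N) flattened template-target features
              (assumed to share spatial length N for simplicity)
    """
    C, N = search.shape
    # (C, C) affinity between each search channel and each template channel
    scores = search @ template.T / np.sqrt(N)
    weights = softmax(scores, axis=-1)  # each search channel attends over template channels
    # re-mix the search channels according to their template affinities
    return weights @ search

rng = np.random.default_rng(0)
z = rng.standard_normal((256, 64))   # template-target features
x = rng.standard_normal((256, 64))   # search-region features
out = channel_cross_attention(x, z)
print(out.shape)                     # (256, 64): same shape, channels re-weighted
```

In contrast, standard self-attention would form an N x N matrix over spatial positions, which is exactly the interaction the paper says cannot capture cross-channel dependence; the C x C matrix above is the compensating mechanism.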

Updated: 2022-03-28