Tracking based on scale-estimated deep networks with hierarchical correlation ensembling for cross-media understanding,Displays



Tracking based on scale-estimated deep networks with hierarchical correlation ensembling for cross-media understanding
Displays ( IF 3.7 ) Pub Date : 2021-08-21 , DOI: 10.1016/j.displa.2021.102055
Hanqiao Huang ₁, Yamin Han ₁,₂, Peng Zhang ₁, Wei Huang ₃


In many vision-based cross-media applications, the objects of interest within the visual regions must be accurately localized and tracked to support more effective understanding and generating image descriptions (UGID), for example in audio-visual lip recognition. Unfortunately, robust tracking in realistic scenarios is usually challenged by dynamic appearance variations while the object is in motion. Recent studies on deep neural networks for classification and recognition tasks have inspired great progress in visual tracking, but the intrinsic assumption of scale invariance during target modeling still limits further improvement in tracking performance. Motivated by learning the object appearance together with a scale estimate, this study proposes a scale-estimated deep network (SEN) to predict the object size more accurately during tracking. By incorporating the proposed SEN into a hierarchical correlation ensembling framework, a joint translation-scale tracking scheme is obtained that estimates the position and scale of the target object simultaneously. Extensive experiments on challenging benchmark datasets demonstrate that the proposed tracker achieves competitive results. Additionally, a performance evaluation on lip tracking shows that the proposed work can also support audio-visual recognition tasks in different types of cross-media applications.
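The general idea of joint translation-scale estimation can be illustrated with a minimal sketch. This is not the SEN or the hierarchical correlation ensembling framework from the paper: it assumes raw pixel intensities instead of deep features, a plain FFT cross-correlation instead of a learned correlation filter, and a brute-force search over a few candidate scales. All function names (`normalized_response`, `estimate_shift`, `crop_and_resize`, `track_step`) are hypothetical, and the crop is assumed to lie fully inside the frame.

```python
import numpy as np

def normalized_response(template, patch, eps=1e-12):
    """Circular cross-correlation of two equal-sized 2-D arrays via the FFT,
    normalized so that a perfect match peaks at about 1."""
    t = template - template.mean()
    p = patch - patch.mean()
    resp = np.real(np.fft.ifft2(np.fft.fft2(p) * np.conj(np.fft.fft2(t))))
    return resp / (np.linalg.norm(t) * np.linalg.norm(p) + eps)

def estimate_shift(template, patch):
    """Return the (dy, dx) shift of the patch relative to the template
    and the height of the correlation peak."""
    resp = normalized_response(template, patch)
    peak = np.unravel_index(np.argmax(resp), resp.shape)
    # Indices past the midpoint wrap around to negative shifts.
    shift = tuple(int(i) if i <= n // 2 else int(i) - n
                  for i, n in zip(peak, resp.shape))
    return shift, float(resp.max())

def crop_and_resize(frame, center, size, out_size):
    """Crop a size x size window around center and resample it to
    out_size x out_size with nearest-neighbor sampling.
    Assumption: the window lies fully inside the frame."""
    cy, cx = center
    top, left = cy - size // 2, cx - size // 2
    crop = frame[top:top + size, left:left + size]
    idx = ((np.arange(out_size) + 0.5) * size / out_size).astype(int)
    return crop[np.ix_(idx, idx)]

def track_step(template, frame, center, scales):
    """Try each candidate scale, keep the one with the strongest
    correlation peak, and update the center from its shift."""
    n = template.shape[0]
    best = None
    for s in scales:
        patch = crop_and_resize(frame, center, int(round(n * s)), n)
        shift, peak = estimate_shift(template, patch)
        if best is None or peak > best[0]:
            best = (peak, s, shift)
    peak, s, (dy, dx) = best
    # Shifts are measured in template pixels; scale them back to frame pixels.
    new_center = (center[0] + int(round(dy * s)),
                  center[1] + int(round(dx * s)))
    return new_center, s, peak
```

Calling `track_step(template, frame, center, [1.0, 1.5, 2.0])` returns the updated center, the selected relative scale, and the peak response. A tracker of the kind described in the abstract would replace the raw-pixel correlation with responses computed on learned features and the scale search with a learned estimator.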


Updated: 2021-08-27
