Multiple instance deep learning for weakly-supervised visual object tracking
Signal Processing: Image Communication ( IF 3.4 ) Pub Date : 2020-02-08 , DOI: 10.1016/j.image.2020.115807
Kaining Huang , Yan Shi , Fuqi Zhao , Zijun Zhang , Shanshan Tu

Intelligently tracking objects with varied shapes, colors, lighting conditions, and backgrounds is extremely useful in many HCI applications, such as human body motion capture, hand gesture recognition, and virtual reality (VR) games. However, accurately tracking different objects in uncontrolled environments is a tough challenge due to potentially dynamic object parts, varied lighting conditions, and cluttered backgrounds. In this work, we propose a novel semantically-aware object tracking framework whose key component is a weakly-supervised learning paradigm that optimally transfers video-level semantic tags to individual regions. More specifically, given a set of training video clips, each associated with multiple video-level semantic tags, we first propose a weakly-supervised learning algorithm to transfer the semantic tags to the corresponding video regions. The key is a MIL (Zhong et al., 2020) [1]-based manifold embedding algorithm that maps all video regions into a semantic space in which the video-level semantic tags are well encoded. Afterward, each video region is represented by its semantic feature combined with its appearance feature. We design a multi-view learning algorithm to optimally fuse these two types of features. Based on the fused feature, we learn a probabilistic Gaussian mixture model to predict the target probability of each candidate window, and the window with the maximal probability is output as the tracking result. Comprehensive comparative results on a challenging pedestrian tracking task as well as human hand gesture recognition demonstrate the effectiveness of our method. Moreover, visualized tracking results show that non-rigid objects with moderate occlusions can be well localized by our method.
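The final scoring step described above can be sketched in code. The snippet below is not the authors' implementation; it is a minimal illustration, assuming the fused (semantic + appearance) region features are plain concatenated vectors and using scikit-learn's `GaussianMixture` as the probabilistic model. The feature dimensions and synthetic data are hypothetical stand-ins for the paper's learned representations.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Hypothetical fused features (semantic + appearance, concatenated):
# training samples drawn around the tracked target's feature cluster.
d = 8
target_feats = rng.normal(loc=1.0, scale=0.1, size=(200, d))

# Learn a probabilistic Gaussian mixture model over target features.
gmm = GaussianMixture(n_components=2, random_state=0).fit(target_feats)

# Candidate windows in a new frame: one resembles the target,
# the other two are background clutter far from the target cluster.
candidates = np.vstack([
    rng.normal(1.0, 0.1, size=(1, d)),   # target-like window
    rng.normal(5.0, 0.1, size=(1, d)),   # background clutter
    rng.normal(-3.0, 0.1, size=(1, d)),  # background clutter
])

# Score each candidate by log-likelihood under the mixture; the
# window with the maximal probability is output as the tracking result.
scores = gmm.score_samples(candidates)
best = int(np.argmax(scores))
print(best)  # index of the target-like candidate window
```

In a real tracker, `candidates` would be feature vectors extracted from sliding or proposal windows in the current frame, and the mixture would be refit or updated as the target's appearance evolves.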




Updated: 2020-03-22