Synergic learning for noise-insensitive webly-supervised temporal action localization
Image and Vision Computing (IF 4.2), Pub Date: 2021-07-02, DOI: 10.1016/j.imavis.2021.104247
Can Zhang 1 , Meng Cao 1 , Dongming Yang 1 , Ji Jiang 1 , Yuexian Zou 1, 2

Webly-supervised temporal action localization (WebTAL) leverages web videos to train localization models without requiring manual temporal annotations. WebTAL is extremely challenging because video-level labels from the web are noisy, which seriously degrades overall performance. Most state-of-the-art methods filter out noise before training, which inevitably reduces the number of training samples. In contrast, we propose a preprocessing-free WebTAL framework along with a new synergic learning paradigm to alleviate the noise interference. Specifically, we introduce a synergic task called Spatio-Temporal Order Prediction (STOP) for spatio-temporal representation learning. This task requires a network to restore the order of permuted spatial crops and temporal clips, thereby learning the inherent spatial semantics and temporal interactions in videos. Instead of pre-extracting features with a fully trained STOP network, we design a novel synergic learning paradigm called Warm-up Synergic Training (WST) to iteratively generate better spatio-temporal representations and improve action localization results. Experimental results show that, trained in this synergic fashion, the model largely mitigates the interference caused by label noise. We demonstrate that our method outperforms all other WebTAL methods on two public benchmarks, THUMOS'14 and ActivityNet v1.2.
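The abstract does not say how STOP samples are built, so the following is only a minimal sketch of how an order-prediction pretext sample could be generated. It assumes a jigsaw-style formulation in which each permutation is one class label; the function names (`split_temporal_clips`, `split_spatial_crops`, `permute_with_label`, `restore_order`), the 2x2 crop grid, and the toy video are all hypothetical and are not taken from the paper.

```python
import itertools
import random


def split_temporal_clips(frames, num_clips):
    """Split a list of frames into `num_clips` equal-length consecutive clips."""
    clip_len = len(frames) // num_clips
    return [frames[i * clip_len:(i + 1) * clip_len] for i in range(num_clips)]


def split_spatial_crops(frame, grid=2):
    """Split one frame (a list of pixel rows) into grid x grid crops, row-major."""
    h, w = len(frame) // grid, len(frame[0]) // grid
    return [
        [row[c * w:(c + 1) * w] for row in frame[r * h:(r + 1) * h]]
        for r in range(grid)
        for c in range(grid)
    ]


def permute_with_label(items, rng):
    """Shuffle `items`; return the shuffled list, the permutation, and its class index.

    Assumption: the order-prediction target is the index of the applied
    permutation among all permutations of len(items) elements.
    """
    perms = list(itertools.permutations(range(len(items))))
    label = rng.randrange(len(perms))
    perm = perms[label]
    return [items[i] for i in perm], perm, label


def restore_order(shuffled, perm):
    """Invert `perm`, i.e. what a perfect order predictor would reconstruct."""
    restored = [None] * len(shuffled)
    for position, source_index in enumerate(perm):
        restored[source_index] = shuffled[position]
    return restored


rng = random.Random(0)
# Toy video: 6 frames of 4x4 "pixels"; each pixel records (frame, row, col).
video = [[[(t, r, c) for c in range(4)] for r in range(4)] for t in range(6)]

# Temporal branch: permute 3 clips (3! = 6 possible order classes).
clips = split_temporal_clips(video, num_clips=3)
shuffled_clips, clip_perm, temporal_label = permute_with_label(clips, rng)

# Spatial branch: permute the 4 crops of one frame (4! = 24 order classes).
crops = split_spatial_crops(video[0], grid=2)
shuffled_crops, crop_perm, spatial_label = permute_with_label(crops, rng)
```

In such a setup, a network would receive `shuffled_clips` or `shuffled_crops` and be trained to predict `temporal_label` or `spatial_label`. Because the labels come from the shuffling itself, they do not depend on the noisy video-level web labels.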




Updated: 2021-07-13