Image and Vision Computing

Volume 113, September 2021, 104247

Synergic learning for noise-insensitive webly-supervised temporal action localization

https://doi.org/10.1016/j.imavis.2021.104247

Highlights

  • We propose a framework for Webly-supervised Temporal Action Localization (WebTAL).

  • We introduce a synergic task called STOP for spatio-temporal feature learning.

  • WST is proposed to iteratively generate better features and improve TAL results.

  • Experiments on benchmarks show that our method achieves state-of-the-art results.

Abstract

Webly-supervised temporal action localization (WebTAL) leverages web videos to train localization models without requiring manual temporal annotations. WebTAL is extremely challenging since video-level labels on the web are always noisy, which seriously damages overall performance. Most state-of-the-art methods filter out noise before training, which inevitably reduces the number of training samples. In contrast, we propose a preprocessing-free WebTAL framework along with a new synergic learning paradigm to alleviate the noise interference. Specifically, we introduce a synergic task called Spatio-Temporal Order Prediction (STOP) for spatio-temporal representation learning. This task requires a network to arrange permuted spatial crops and temporal clips, thereby learning the inherent spatial semantics and temporal interactions in videos. Instead of pre-extracting features with a well-trained STOP model, we design a novel synergic learning paradigm called Warm-up Synergic Training (WST) to iteratively generate better spatio-temporal representations and improve action localization results. Experimental results show that, in this synergic fashion, the interference caused by label noise is largely mitigated. We demonstrate that our method outperforms all other WebTAL methods on two public benchmarks, THUMOS'14 and ActivityNet v1.2.

Introduction

Temporal action localization (TAL) aims at detecting the start and end times of action instances, as well as predicting their class labels, from untrimmed videos. It has become a research hotspot in the multimedia community, owing to its wide applications in surveillance analysis, video summarization, and many other areas [1], [2], [3]. Most existing TAL methods [4], [5], [6], [7] rely on strong supervision, i.e., the ground-truth temporal extent of action instances, from finely labeled benchmarks [8], [9]. Obviously, this manual labeling procedure is labor-intensive and time-consuming. One feasible way to reduce the requirement of strong supervision is weakly-supervised temporal action localization (WTAL), where only video-level labels are required during training. However, when facing large amounts of videos, video-level annotation still consumes tremendous manpower, especially as the number of action categories increases.

In contrast, with the growth of network bandwidth, there has been an explosive increase in web videos. For example, over 10^5 hours of video are uploaded to YouTube every day, and these videos often come with user-provided tags. Abundant videos can easily be accessed by querying keywords on search engines, so learning action localizers from such rich resources is far more labor-saving; this setting is referred to as webly-supervised temporal action localization (WebTAL). Compared with the fully-supervised (FTAL) and weakly-supervised (WTAL) settings, WebTAL is more scalable since it requires no human labeling effort, and thus the quantity of training videos can be scaled up greatly.

However, as shown in Fig. 2, utilizing raw web videos faces a common detriment: label noise. Videos retrieved from web search engines are generally noisy, containing either wrong labels or massive numbers of irrelevant (background) frames. Worse still, web videos are untrimmed and may last from minutes to hours, so the label noise in long videos seriously confuses the model during training and damages the final performance.

Only a few works have been devoted to addressing this challenge in WebTAL [10], [11], [12]. Sun et al. [10] utilize a domain transfer scheme between images and videos to filter out the noise in the other domain. Sultani et al. [11] use a random walk strategy to dampen image noise for webly-supervised spatio-temporal action localization. Rupprecht et al. [12] propose an independent filtering approach that does not rely on the training set for removing web data outliers. Although these works adopt several filtering strategies, the filters are either manually designed or introduce bias into the subsequent localization task [12]. Besides, filtering out noisy samples before training inevitably reduces the number of training samples.

Instead, in this paper, we design a preprocessing-free, noise-insensitive WebTAL framework along with a novel synergic learning paradigm. The proposed framework consists of two prediction branches, namely the Spatio-Temporal Order Prediction (STOP) branch and the Action Class Prediction (ACP) branch, which focus on spatio-temporal representation learning and action localization, respectively.

For the STOP branch, we exploit the self-contained structure of video data to avoid label noise interference. Our motivation stems from the observation that, regardless of whether the labels are correct, web videos are always meaningful in spatial appearance and temporal evolution. Namely, the inherent spatial and temporal order information is worth exploring. Thus we introduce a synergic task called STOP, which works in a self-supervised manner to force the network to understand both spatial semantics and temporal relations in a video. As shown in Fig. 1, given several clips from a video, we first spatially divide these clips into equal-sized tiles, then randomly swap some tiles and shuffle the clip order. The network is encouraged to predict the correct spatial arrangement and temporal order, so as to learn noise-insensitive spatio-temporal representations.
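To make the STOP construction concrete, the following is a minimal PyTorch sketch of how one training sample could be built: the clips of a video are shuffled in time, each clip is cut into an equal-sized tile grid whose tiles are randomly permuted, and the two permutations serve as the prediction targets. The 2x2 grid, tensor layout, and function name are illustrative assumptions, not the authors' exact protocol.

```python
import torch

def make_stop_sample(clips, grid=2):
    """Build one Spatio-Temporal Order Prediction (STOP) training sample.

    clips: list of clip tensors from one video, each of shape [C, T, H, W].
    Returns the permuted clips together with the spatial and temporal
    permutations that the network is trained to recover.
    """
    # Temporal shuffling: permute the order of the clips.
    temporal_perm = torch.randperm(len(clips)).tolist()
    shuffled = [clips[i] for i in temporal_perm]

    permuted_clips, spatial_perms = [], []
    for clip in shuffled:
        _, _, H, W = clip.shape
        h, w = H // grid, W // grid
        # Cut each frame into a grid of equal-sized spatial tiles.
        tiles = [clip[:, :, r * h:(r + 1) * h, c * w:(c + 1) * w]
                 for r in range(grid) for c in range(grid)]
        # Spatial shuffling: randomly swap the tiles.
        spatial_perm = torch.randperm(len(tiles)).tolist()
        tiles = [tiles[i] for i in spatial_perm]
        # Reassemble the shuffled tiles into a full-size clip.
        rows = [torch.cat(tiles[r * grid:(r + 1) * grid], dim=3) for r in range(grid)]
        permuted_clips.append(torch.cat(rows, dim=2))
        spatial_perms.append(spatial_perm)

    return permuted_clips, spatial_perms, temporal_perm
```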

For the ACP branch, since WebTAL is similar to WTAL in that only video-level labels are available during training, we follow the Multiple Instance Learning (MIL) structure commonly adopted in WTAL [13], [14], [15], [16], where a video is treated as a bag of clips in order to perform video-level action class prediction (ACP). Specifically, our ACP branch first takes clip-wise features as input and produces a temporal class activation sequence (T-CAS) that represents class scores for each clip. These clip-level scores are then aggregated to predict the video's action classes, so that the video-level classification loss can guide the clip-level predictions. Note that unlike approaches in which features are pre-extracted by a well-trained extractor (e.g., I3D pre-trained on Kinetics [17]), we generate features on the fly using the same feature extractor as our STOP branch, with shared weights. This ensures that the extracted features are not severely disturbed by label noise; moreover, the category information may in turn benefit the feature learning process, which will be verified in our experiments.
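As an illustration of the MIL-style aggregation described above, the sketch below turns a T-CAS into video-level scores with top-k mean pooling and applies a multi-label classification loss. Top-k pooling is one common choice in WTAL; the paper's exact aggregation and loss weighting may differ.

```python
import torch
import torch.nn.functional as F

def video_classification_loss(tcas, video_labels, k_ratio=8):
    """MIL-style video-level loss computed from a T-CAS.

    tcas:         [B, T, C] clip-level class logits (the T-CAS).
    video_labels: [B, C] multi-hot video-level labels.
    """
    T = tcas.shape[1]
    k = max(1, T // k_ratio)
    # Aggregate clip-level scores into video-level scores per class
    # by averaging the top-k activations along the temporal axis.
    topk_scores, _ = tcas.topk(k, dim=1)        # [B, k, C]
    video_scores = topk_scores.mean(dim=1)      # [B, C]
    # The video-level multi-label loss guides the clip-level predictions.
    return F.binary_cross_entropy_with_logits(video_scores, video_labels.float())
```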

Furthermore, to train the whole framework, we propose a novel Warm-up Synergic Training (WST) scheme, which optimizes the two branches alternately. In self-supervised learning, a task such as STOP is commonly used as a pretext task that provides better initialization parameters for the downstream ACP. In contrast, we train STOP and ACP in a synergic way. Specifically, the input of STOP depends on the T-CAS results of ACP, i.e., the clips that are likely to contain actions are selected accordingly. This serves as an essential feedback mechanism that prevents feature learning from being disturbed by non-action clips and ultimately facilitates temporal localization. By alternately training STOP with the basic ACP branch, the framework prevents the video representation learning from being largely affected by noisy labels. Besides, the introduction of STOP yields significant benefits in terms of video representation ability and localization performance, which will be verified through extensive experiments.
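The following is a high-level sketch of how such a warm-up-then-alternate schedule could be organized. The routines `acp_step`, `stop_step`, and `select_action_clips` are caller-supplied placeholders standing in for the branch updates and the T-CAS-guided clip selection; the per-iteration alternation and warm-up length are assumptions rather than the authors' exact schedule.

```python
def warm_up_synergic_training(loader, acp_step, stop_step, select_action_clips,
                              warmup_epochs=10, synergic_epochs=50):
    """High-level sketch of Warm-up Synergic Training (WST).

    loader yields (video, video_label) pairs. acp_step performs one ACP update
    and returns the resulting T-CAS; stop_step performs one STOP update on the
    given clips; select_action_clips picks likely-action clips from the T-CAS.
    All three are placeholder callables supplied by the caller.
    """
    # Warm-up: train the ACP branch alone until its T-CAS is reliable
    # enough to guide clip selection for the STOP branch.
    for _ in range(warmup_epochs):
        for video, label in loader:
            acp_step(video, label)

    # Synergic stage: alternate the two branches on the shared extractor.
    for _ in range(synergic_epochs):
        for video, label in loader:
            tcas = acp_step(video, label)
            # Feedback from ACP to STOP: keep only clips likely to contain actions.
            action_clips = select_action_clips(video, tcas)
            # The STOP update refines the shared spatio-temporal features,
            # which the next ACP update then benefits from.
            stop_step(action_clips)
```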

In a nutshell, the main contributions of our paper are as follows:

  • We propose a novel synergic learning scheme called WST to address WebTAL, which alternately generates better spatio-temporal representations and improves action localization results in a synergic fashion. Moreover, the interference caused by label noise is largely mitigated.

  • We introduce a self-supervised synergic task called STOP into WebTAL for spatio-temporal representation learning. STOP encourages a model to focus on the class-agnostic spatial semantics and temporal interactions, thereby learning label noise-insensitive representations.

  • We achieve state-of-the-art results on THUMOS'14 [8] and are the first to report WebTAL performance on the ActivityNet v1.2 [9] dataset. We also collect two new datasets, called WebAction-20 and WebAction-100, with 20 and 100 categories, respectively. These datasets will be made publicly available soon.

Section snippets

Weakly-supervised temporal action localization (WTAL)

WTAL requires only video-level annotations and has drawn extensive attention recently. Most existing methods adopt the Multiple Instance Learning formulation to predict temporal action score sequences from video-level labels [13], [14], [15], [16], [18], [19]. UntrimmedNets [13] addresses this problem by first classifying clip proposals and then selecting relevant segments in a soft or hard manner. STPN [14] imposes a sparsity constraint to enforce the sparsity of the selected

Method

In this section, we present the details of our proposed WebTAL framework. The overall architecture is illustrated in Fig. 3. Before the detailed description, we first formulate the webly-supervised temporal action localization (WebTAL) problem.

Given a set of $C$ action categories $\{c_i\}_{i=1}^{C}$, we use these category names as query keywords to retrieve web videos and construct the training dataset $\mathcal{V} = \{V_n, y_n\}_{n=1}^{N}$, where $V_n$ is a crawled web video and $y_n \in \mathbb{R}^{C}$ is its video-level label. For each

Experiments

In this section, we evaluate our method with extensive experiments. We first describe details of our constructed new training datasets, the testing benchmarks and the implementation details respectively, followed by the comparison with the state-of-the-art methods and ablation studies. Finally, we qualitatively demonstrate the superiority of our method in spatio-temporal representation learning and temporal action localization.

Conclusion

In this paper, we addressed the webly-supervised temporal action localization (WebTAL) task by designing a preprocessing-free and noise-insensitive framework along with a novel synergic learning paradigm. We proposed a synergic task called STOP that explores the self-contained spatio-temporal order of videos to prevent feature learning from being seriously disturbed by noisy labels. Our synergic learning scheme WST enables branch-level message passing, thereby iteratively generating better

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This paper was partially supported by the IER foundation (No. HT-JD-CXY-201904) and Shenzhen Municipal Development and Reform Commission (Disciplinary Development Program for Data Science and Intelligent Computing). Special acknowledgements are given to Aoto-PKUSZ Joint Lab for its support.

References (56)

  • M. Xu et al.

    G-TAD: sub-graph localization for temporal action detection

  • F. Caba Heilbron et al.

    ActivityNet: a large-scale video benchmark for human activity understanding

  • C. Sun et al.

    Temporal localization of fine-grained actions in videos by domain transfer from web images

  • W. Sultani et al.

    What if we do not have multiple videos of the same action? Video action localization using web images

  • L. Wang et al.

    UntrimmedNets for weakly supervised action recognition and detection

  • P. Nguyen et al.

    Weakly supervised action localization by sparse temporal pooling network

  • P.X. Nguyen et al.

    Weakly-supervised action localization with background modeling

  • P. Lee et al.

    Background suppression network for weakly-supervised temporal action localization

  • J. Carreira et al.

    Quo vadis, action recognition? A new model and the kinetics dataset

  • S. Narayan et al.

    3C-Net: category count and center loss for weakly-supervised action localization

  • C. Gan et al.

    You lead, we exceed: labor-free video concept learning by jointly exploiting web videos and images

  • K.-H. Lee et al.

    CleanNet: transfer learning for scalable image classifier training with label noise

  • J. Yang et al.

    Recognition from web data: a progressive filtering approach

    IEEE Trans. Image Process.

    (2018)
  • S. Guo et al.

    CurriculumNet: weakly supervised learning from large-scale web images

  • H. Duan et al.
  • S.K. Divvala et al.

    Learning everything about anything: webly-supervised visual concept learning

  • X. Chen et al.

    Webly supervised learning of convolutional networks

  • G. Ye et al.

    EventNet: a large-scale structured concept library for complex event detection in video
