Abstract
Detecting temporal actions in long, untrimmed videos is a challenging and important problem in computer vision, and generating high-quality proposals is a key step in temporal action detection. A high-quality proposal set has two main characteristics: the temporal overlap between proposals and action instances should be as large as possible, and the number of generated proposals should be as small as possible. Inspired by similarity comparison in face recognition and the similarity of frames within the same action segment, we design a module that compares the similarity of visual features extracted by a visual feature encoder. We locate time points where the feature similarity changes sharply and use them to generate candidate proposals. We then train a classifier to evaluate whether each candidate proposal contains an action instance. Experiments suggest that our method outperforms other temporal action proposal generation methods on the THUMOS-14 and ActivityNet-v1.3 datasets. In addition, our method still outperforms other methods when using visual features extracted by different networks.
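The core idea described in the abstract can be sketched in a few lines: compute the cosine similarity between visual features at adjacent time steps, and treat time points where the similarity drops sharply as candidate proposal boundaries. The following is a minimal illustrative sketch, not the paper's implementation; the function name, the fixed `threshold` parameter, and the toy feature array are assumptions for demonstration.

```python
import numpy as np

def candidate_boundaries(features, threshold=0.5):
    """Hypothetical sketch: find time points where the cosine
    similarity of adjacent feature vectors drops sharply."""
    # Normalize each per-snippet feature vector to unit length.
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-8, None)
    # Cosine similarity between consecutive time steps.
    sims = np.sum(unit[:-1] * unit[1:], axis=1)
    # A sharp similarity drop suggests a possible action boundary.
    return np.where(sims < threshold)[0] + 1

# Toy example: two homogeneous segments with an abrupt change at t=3.
feats = np.array([[1.0, 0.0]] * 3 + [[0.0, 1.0]] * 3)
print(candidate_boundaries(feats))  # prints [3]
```

In the paper's pipeline, boundaries found this way would pair up into candidate proposals, which a learned classifier then scores for the presence of an action instance; a fixed threshold is only a stand-in for that learned decision.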
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China under Grants 61673402, 61273270, and 60802069; in part by the Natural Science Foundation of Guangdong under Grants 2017A030311029, 2016B010109002, 2015B090912001, 2016B010123005, and 2017B090909005; in part by the Science and Technology Program of Guangzhou under Grants 201704020180 and 201604020024; and in part by the Fundamental Research Funds for the Central Universities of China.
Cite this article
Zheng, J., Chen, D. & Hu, H. Boundary Adjusted Network Based on Cosine Similarity for Temporal Action Proposal Generation. Neural Process Lett 53, 2813–2828 (2021). https://doi.org/10.1007/s11063-021-10500-2