Abstract
Detecting temporal actions in long, untrimmed videos is a challenging and important problem in computer vision, and generating high-quality proposals is a key step in temporal action detection. A high-quality proposal set has two main characteristics: the temporal overlap between proposals and action instances should be as large as possible, and the number of generated proposals should be as small as possible. Inspired by similarity comparison in face recognition and the similarity of frames within the same action segment, we design a module that compares the similarity of visual features extracted by a visual feature encoder. We locate time points where the feature similarity changes sharply and use them to generate candidate proposals. We then train a classifier to evaluate whether each candidate proposal contains an action instance. Experiments suggest that our method outperforms other temporal action proposal generation methods on the THUMOS-14 and ActivityNet-v1.3 datasets. In addition, our method still outperforms other methods when using visual features extracted by different networks.
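The core idea described in the abstract can be sketched in a few lines: compute the cosine similarity between visual features at adjacent time steps, and treat time points where the similarity drops sharply as candidate proposal boundaries. The following is a minimal illustrative sketch, not the paper's implementation; the function name, the fixed `threshold` parameter, and the toy feature array are assumptions for demonstration.

```python
import numpy as np

def candidate_boundaries(features, threshold=0.5):
    """Hypothetical sketch: find time points where the cosine
    similarity of adjacent feature vectors drops sharply."""
    # Normalize each per-snippet feature vector to unit length.
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-8, None)
    # Cosine similarity between consecutive time steps.
    sims = np.sum(unit[:-1] * unit[1:], axis=1)
    # A sharp similarity drop suggests a possible action boundary.
    return np.where(sims < threshold)[0] + 1

# Toy example: two homogeneous segments with an abrupt change at t=3.
feats = np.array([[1.0, 0.0]] * 3 + [[0.0, 1.0]] * 3)
print(candidate_boundaries(feats))  # prints [3]
```

In the paper's pipeline, boundaries found this way would pair up into candidate proposals, which a learned classifier then scores for the presence of an action instance; a fixed threshold is only a stand-in for that learned decision.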
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China under Grants 61673402, 61273270, and 60802069; in part by the Natural Science Foundation of Guangdong under Grants 2017A030311029, 2016B010109002, 2015B090912001, 2016B010123005, and 2017B090909005; in part by the Science and Technology Program of Guangzhou under Grants 201704020180 and 201604020024; and in part by the Fundamental Research Funds for the Central Universities of China.
Cite this article
Zheng, J., Chen, D. & Hu, H. Boundary Adjusted Network Based on Cosine Similarity for Temporal Action Proposal Generation. Neural Process Lett 53, 2813–2828 (2021). https://doi.org/10.1007/s11063-021-10500-2