Boundary Adjusted Network Based on Cosine Similarity for Temporal Action Proposal Generation

Neural Processing Letters

Abstract

Detecting temporal actions in long, untrimmed videos is a challenging and important problem in computer vision. Generating high-quality proposals is a key step in temporal action detection. A high-quality set of proposals has two main characteristics: the temporal overlap between proposals and action instances should be as large as possible, and the number of generated proposals should be as small as possible. Inspired by similarity comparison in face recognition and by the similarity of actions within the same action segment, we design a module that compares the similarity of visual features extracted by a visual feature encoder. We locate the time points at which the feature similarity changes sharply and use them to generate candidate proposals. We then train a classifier to evaluate whether each candidate proposal contains an action instance. Experiments suggest that our method outperforms other temporal action proposal generation methods on the THUMOS-14 and ActivityNet-v1.3 datasets. In addition, our method still outperforms other methods when using visual features extracted by different networks.
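The boundary idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no formula for a "sharp change", so the sketch assumes a boundary is any time step where the cosine similarity between adjacent snippet features falls below a fixed threshold. The function names, the threshold value of 0.5, and the toy two-dimensional features are all hypothetical, and the proposal classifier is omitted. A temporal IoU helper is included because the abstract judges proposals by their temporal overlap with action instances.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def find_boundaries(features, threshold=0.5):
    """Indices t where similarity between snippets t-1 and t drops below threshold.

    Assumption: a fixed threshold stands in for the paper's "sharp change" test.
    """
    return [
        t
        for t in range(1, len(features))
        if cosine_similarity(features[t - 1], features[t]) < threshold
    ]


def candidate_proposals(boundaries):
    """Pair every earlier boundary with every later one as (start, end)."""
    return list(combinations(boundaries, 2))


def temporal_iou(proposal, ground_truth):
    """Temporal intersection-over-union of two (start, end) segments."""
    inter = max(0.0, min(proposal[1], ground_truth[1]) - max(proposal[0], ground_truth[0]))
    union = (proposal[1] - proposal[0]) + (ground_truth[1] - ground_truth[0]) - inter
    return inter / union if union > 0 else 0.0


# Toy snippet features: the content changes at t=2 and again at t=4.
toy_features = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0], [1.0, 0.0]]
toy_boundaries = find_boundaries(toy_features)
toy_proposals = candidate_proposals(toy_boundaries)
```

On the toy features, adjacent similarity stays near 1 inside a segment and drops to roughly 0.1 at the two content changes, so only those two time steps are kept and a single candidate segment results. Generating candidates only at such points is one way to keep the number of proposals small, which is the second quality criterion the abstract names.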

Acknowledgements

This work was supported in part by the National Natural Science Foundation of China under Grants 61673402, 61273270, and 60802069; in part by the Natural Science Foundation of Guangdong under Grants 2017A030311029, 2016B010109002, 2015B090912001, 2016B010123005, and 2017B090909005; in part by the Science and Technology Program of Guangzhou under Grants 201704020180 and 201604020024; and in part by the Fundamental Research Funds for the Central Universities of China.

Author information

Corresponding author

Correspondence to Dihu Chen.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Zheng, J., Chen, D. & Hu, H. Boundary Adjusted Network Based on Cosine Similarity for Temporal Action Proposal Generation. Neural Process Lett 53, 2813–2828 (2021). https://doi.org/10.1007/s11063-021-10500-2

  • DOI: https://doi.org/10.1007/s11063-021-10500-2
