
Complementary Boundary Estimation Network for Temporal Action Proposal Generation


Abstract

Temporal action detection is an important yet challenging task, in which temporal action proposal generation plays an important part. Since the temporal boundaries of action instances in videos are often ambiguous, it is difficult to locate them precisely. Boundary Sensitive Network (BSN) (Lin et al. in ECCV, 2018) is a state-of-the-art corner-based method that can generate high-quality proposals with a high recall rate. It contains a temporal evaluation network and a proposal evaluation network that generate and evaluate proposals separately, so it can locate the temporal boundaries of action instances directly, produce proposals with flexible temporal intervals, and evaluate the quality of those proposals. However, BSN still has two issues: (1) due to the small receptive field of its temporal evaluation network, it often generates many false temporal boundaries; (2) evaluating the quality of proposals is a difficult task that is not well solved in the original work. To address these issues, we propose the Complementary Boundary Estimation Network (CBEN), an improved approach to temporal action proposal generation built on the BSN framework. Specifically, we improve BSN in two aspects. First, since the temporal evaluation network of BSN can only capture local information and tends to respond strongly to background segments, we combine it with a new network with a larger receptive field to better identify false temporal action boundaries. Second, to evaluate the quality of temporal action proposals more accurately, we propose a class-based proposal evaluation network and combine it with a tIoU-based proposal evaluation network to filter out low-quality proposals. Extensive experiments on the THUMOS14 and ActivityNet-1.3 datasets indicate that CBEN achieves better performance than current mainstream methods on temporal action proposal generation. We further combine CBEN with an off-the-shelf action classifier and show consistent performance improvements on the THUMOS14 dataset.



References

  1. Bodla N, Singh B, Chellappa R, Davis LS (2017) Soft-NMS: improving object detection with one line of code. In: ICCV, pp 5561–5569

  2. Buch S, Escorcia V, Shen C, Ghanem B, Niebles JC (2017) Sst: single-stream temporal action proposals. In: CVPR, pp 2911–2920

  3. Caba Heilbron F, Escorcia V, Ghanem B, Carlos Niebles J (2015) Activitynet: a large-scale video benchmark for human activity understanding. In: CVPR, pp 961–970

  4. Chao YW, Vijayanarasimhan S, Seybold B, Ross DA, Deng J, Sukthankar R (2018) Rethinking the faster r-cnn architecture for temporal action localization. In: CVPR, pp 1130–1139

  5. Deng J, Dong W, Socher R, Li LJ, Li K, Fei-Fei L (2009) Imagenet: a large-scale hierarchical image database. In: CVPR, pp 248–255

  6. Feichtenhofer C, Pinz A, Zisserman A (2016) Convolutional two-stream network fusion for video action recognition. In: CVPR, pp 1933–1941

  7. Gao J, Chen K, Nevatia R (2018) Ctap: complementary temporal action proposal generation. In: ECCV, pp 68–83

  8. Gao J, Yang Z, Sun C, Chen K, Nevatia R (2017) Turn tap: temporal unit regression network for temporal action proposals. In: ICCV, pp 3628–3636

  9. Gao J, Yang Z, Nevatia R (2017) Cascaded boundary regression for temporal action detection. arXiv preprint arXiv:1705.01180

  10. He K, Gkioxari G, Dollár P, Girshick R (2017) Mask r-cnn. In: ICCV, pp 2961–2969

  11. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: CVPR, pp 770–778

  12. Jiang YG, Liu J, Zamir AR, Toderici G, Laptev I, Shah M, Sukthankar R (2014) Thumos challenge: action recognition with a large number of classes

  13. Law H, Deng J (2018) Cornernet: detecting objects as paired keypoints. In: ECCV, pp 734–750

  14. Li X, Lin T, Liu X, Gan C, Zuo W, Li C, Long X, He D, Li F, Wen S (2019) Deep concept-wise temporal convolutional networks for action localization. In: ICCV

  15. Lin T, Zhao X, Shou Z (2017) Single shot temporal action detection. In: ACM international conference on multimedia, pp 988–996

  16. Lin T, Zhao X, Su H, Wang C, Yang M (2018) Bsn: boundary sensitive network for temporal action proposal generation. In: ECCV, pp 3–19

  17. Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu CY, Berg AC (2016) Ssd: single shot multibox detector. In: ECCV

  18. Lu X, Li B, Yue Y, Li Q, Yan J (2019) Grid r-cnn. In: CVPR, pp 7363–7372

  19. Luo W, Li Y, Urtasun R, Zemel RS (2016) Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp 4898–4906

  20. Newell A, Yang K, Deng J (2016) Stacked hourglass networks for human pose estimation. In: ECCV, pp 483–499

  21. Ren S, He K, Girshick R, Sun J (2017) Faster r-cnn: towards real-time object detection with region proposal networks. PAMI 39(6):1137–1149


  22. Shou Z, Chan J, Zareian A, Miyazawa K, Chang SF (2017) Cdc: convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos. In: CVPR, pp 5734–5743

  23. Shou Z, Wang D, Chang SF (2016) Temporal action localization in untrimmed videos via multi-stage cnns. In: CVPR, pp 1049–1058

  24. Tran D, Bourdev L, Fergus R, Torresani L, Paluri M (2015) Learning spatiotemporal features with 3d convolutional networks. In: ICCV, pp 4489–4497

  25. Xiong Y, Wang L, Wang Z, Zhang B, Song H, Li W, Lin D, Qiao Y, Van Gool L, Tang X (2016) CUHK & ETHZ & SIAT submission to ActivityNet challenge 2016. arXiv preprint arXiv:1608.00797

  26. Xiong Y, Zhao Y, Wang L, Lin D, Tang X (2017) A pursuit of temporal accuracy in general activity detection. arXiv preprint arXiv:1703.02716

  27. Xu H, Das A, Saenko K (2017) R-c3d: region convolutional 3d network for temporal activity detection. In: ICCV, pp 5783–5792

  28. Zhao Y, Xiong Y, Wang L, Wu Z, Tang X, Lin D (2017) Temporal action detection with structured segment networks. In: ICCV, pp 2914–2923


Acknowledgements

This work was supported in part by the National Natural Science Foundation of China under Grants 61673402, 61273270, and 60802069; in part by the Natural Science Foundation of Guangdong under Grant 2017A030311029; in part by the Science and Technology Program of Guangzhou under Grant 201704020180; and in part by the Fundamental Research Funds for the Central Universities of China.

Author information


Corresponding author

Correspondence to Haifeng Hu.


Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: Calculation of tIoU

tIoU is short for temporal Intersection over Union. Figure 13 shows the beginning and ending times of a temporal action proposal and of the ground truth. For the configuration shown there (\(t_1\le t_b\le t_2\le t_e\)), the tIoU between the proposal and the ground truth is given by

$$\begin{aligned} tIoU=\frac{\text{Intersection}}{\text{Union}}=\frac{t_2-t_b}{t_e-t_1} \end{aligned}$$
(11)

A higher tIoU indicates that the proposal is closer to the ground truth, i.e., that the proposal is of higher quality.

Fig. 13

Schematic diagram of the ground truth (blue) and the proposal (green). \(t_b\) and \(t_e\) are the beginning and ending times of the ground truth, and \(t_1\) and \(t_2\) are the beginning and ending times of the proposal. (Color figure online)
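For concreteness, here is a minimal Python sketch of this computation (the function name and interval convention are ours, not from the paper). It implements the general form of tIoU, which reduces to Eq. (11) in the configuration of Fig. 13 and returns 0 for disjoint intervals:

```python
def tiou(proposal, ground_truth):
    """Temporal IoU between two (start, end) intervals."""
    t1, t2 = proposal
    tb, te = ground_truth
    intersection = max(0.0, min(t2, te) - max(t1, tb))
    union = max(t2, te) - min(t1, tb)
    return intersection / union if union > 0 else 0.0

# Example: the proposal starts before the ground truth and ends inside it,
# matching Fig. 13, so tIoU = (t2 - tb) / (te - t1) = (4 - 2) / (6 - 1) = 0.4.
print(tiou((1.0, 4.0), (2.0, 6.0)))  # 0.4
```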

Appendix B: Derivative of Mean Square Error

Assume we use mean square error to optimize the tIoU-based PEN. Then the first-order derivative with respect to the weights of the output layer can be calculated by

$$\begin{aligned} f(\mathbf{w})&=\mathrm{sigmoid}\bigl(\mathbf{w}^\mathrm{T}\mathbf{x}^{(i)}\bigr)\\ J(\mathbf{w})&=\frac{1}{2n}\sum_{i=1}^{n}\bigl[y_i-f(\mathbf{w})\bigr]^2\\ \frac{\partial J}{\partial \mathbf{w}}&=-\frac{1}{n}\sum_{i=1}^{n}\bigl(f(\mathbf{w})-y_i\bigr)\bigl(f(\mathbf{w})-1\bigr)f(\mathbf{w})\,\mathbf{x}^{(i)} \end{aligned}$$
(12)

Note that \(g(x)=(x-y)(x-1)x\) is not an increasing function for \(x,y\in(0,1)\), so \(\frac{\partial J}{\partial \mathbf{w}}\) is not increasing either, which means \(J(\mathbf{w})\) is a non-convex function.
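This non-convexity is easy to confirm numerically. Below is a minimal sketch (ours, not from the paper) for a single training sample with \(x=1\) and \(y=1\): the discrete second derivative of \(J(w)\) changes sign along the weight axis, so \(J\) is non-convex even in this one-dimensional case:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Per-sample MSE loss J(w) = 0.5 * (y - sigmoid(w * x))^2 with x = 1, y = 1.
w = np.linspace(-8.0, 8.0, 2001)
J = 0.5 * (1.0 - sigmoid(w)) ** 2

# Discrete second derivative: a convex function would never have it negative.
curvature = np.diff(J, 2)
print(curvature.min() < 0 < curvature.max())  # True: the curvature changes sign
```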

Appendix C: Derivative of Softmax Cross Entropy

Assume softmax cross entropy is used to optimize the class-based PEN. Then, with \(m\) indexing one of the \(c\) classes, the first-order and second-order derivatives with respect to the output-layer weights \(\mathbf{w}_m\) can be calculated by

$$\begin{aligned} a_j&=\frac{e^{\mathbf{w}_j^\mathrm{T}\mathbf{x}}}{\sum_{k=1}^{c}e^{\mathbf{w}_k^\mathrm{T}\mathbf{x}}}\\ J(\mathbf{w})&=-\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{c}y_j^{(i)}\log a_j^{(i)}\\ \frac{\partial J}{\partial \mathbf{w}_m}&=-\frac{1}{n}\sum_{i=1}^{n}\mathbf{x}^{(i)}\bigl(y_m^{(i)}-a_m^{(i)}\bigr)\\ \frac{\partial^2 J}{\partial \mathbf{w}_m^2}&=\frac{1}{n}\sum_{i=1}^{n}a_m^{(i)}\bigl(1-a_m^{(i)}\bigr)\mathbf{x}^{(i)}\mathbf{x}^{(i)\mathrm{T}} \end{aligned}$$
(13)

Considering \(a_m^{(i)}\in(0,1)\), \(\frac{\partial^2 J}{\partial \mathbf{w}_m^2}\) is a positive semi-definite matrix, which means \(J(\mathbf{w})\) is a convex function. Hence, unlike the mean square error in Appendix B, softmax cross entropy gives the class-based PEN a convex optimization problem at the output layer.
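As a sanity check, the sketch below (ours, not from the paper) builds the Hessian block of Eq. (13) for random data and confirms that its smallest eigenvalue is non-negative; the dimensions and the class index \(m\) are arbitrary choices for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, c, m = 100, 8, 5, 2        # samples, feature dim, classes, class index

X = rng.normal(size=(n, d))       # rows are the inputs x^(i)
W = rng.normal(size=(c, d))       # one weight vector w_k per class

# Softmax outputs a_k^(i), computed with the usual max-shift for stability.
logits = X @ W.T
A = np.exp(logits - logits.max(axis=1, keepdims=True))
A /= A.sum(axis=1, keepdims=True)

# Hessian block from Eq. (13): (1/n) * sum_i a_m (1 - a_m) x^(i) x^(i)^T.
s = A[:, m] * (1.0 - A[:, m])
H = (X.T * s) @ X / n

print(np.linalg.eigvalsh(H).min() >= -1e-10)  # True: positive semi-definite
```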


About this article


Cite this article

Wang, J., Hu, H. Complementary Boundary Estimation Network for Temporal Action Proposal Generation. Neural Process Lett 52, 2275–2295 (2020). https://doi.org/10.1007/s11063-020-10349-x
