
Complementary spatiotemporal network for video question answering

  • Regular Paper

Abstract

Video question answering (VideoQA) is challenging because it requires models to capture motion and spatial semantics and to associate them with linguistic context. Recent methods usually treat space and time symmetrically. However, because spatial structures and temporal events in a video often change at different rates, such methods struggle to capture spatial details and to distinguish motion relationships at different scales. To this end, we propose a complementary spatiotemporal network (CST) that focuses on multi-scale motion relationships and essential spatial semantics. Our model comprises three modules. First, a multi-scale relation unit (MR) captures temporal information by modeling relationships between motions at different temporal distances. Second, a mask similarity (MS) operation captures discriminative spatial semantics in a less redundant manner. Third, a cross-modality attention (CMA) module strengthens the interaction between the visual and linguistic modalities. We evaluate our method on three benchmark datasets and conduct extensive ablation studies. The consistent performance improvements demonstrate the effectiveness of our approach.
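
To make the role of the CMA module concrete, below is a minimal PyTorch sketch of cross-modality attention, in which video clip features attend over question-token features. The class name, feature dimensions, and the use of standard scaled dot-product attention with a residual connection are illustrative assumptions; the paper's exact formulation may differ.

    # Hedged sketch of cross-modality attention (CMA). All names and
    # dimensions are assumptions for illustration, not the paper's design.
    import torch
    import torch.nn as nn

    class CrossModalityAttention(nn.Module):
        """Lets each video feature attend over question-token features."""

        def __init__(self, dim: int):
            super().__init__()
            self.q_proj = nn.Linear(dim, dim)  # queries from video
            self.k_proj = nn.Linear(dim, dim)  # keys from text
            self.v_proj = nn.Linear(dim, dim)  # values from text

        def forward(self, video: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
            # video: (B, T, D) clip features; text: (B, L, D) word features
            q = self.q_proj(video)
            k = self.k_proj(text)
            v = self.v_proj(text)
            # Scaled dot-product attention over the question tokens.
            attn = torch.softmax(q @ k.transpose(-2, -1) / q.size(-1) ** 0.5, dim=-1)
            # Residual fusion: video features enriched with text context.
            return video + attn @ v

    # Toy usage: 2 videos of 8 clips, questions of 6 tokens, 256-d features.
    cma = CrossModalityAttention(dim=256)
    fused = cma(torch.randn(2, 8, 256), torch.randn(2, 6, 256))
    print(fused.shape)  # torch.Size([2, 8, 256])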



Funding

This research was supported by the National Natural Science Foundation of China (Grant Nos. 61876130 and 61932009).

Author information

Corresponding author

Correspondence to Yahong Han.

Additional information

Communicated by Y. Zhang.



About this article


Cite this article

Li, X., Wu, A. & Han, Y. Complementary spatiotemporal network for video question answering. Multimedia Systems 28, 161–169 (2022). https://doi.org/10.1007/s00530-021-00805-6
