
Complementary spatiotemporal network for video question answering

  • Regular Paper

Abstract

Video question answering (VideoQA) is challenging because it requires models to capture motion and spatial semantics and to associate them with linguistic context. Recent methods usually treat space and time symmetrically. However, because spatial structures and temporal events in a video often change at different rates, such methods struggle to capture spatial details and to distinguish motion relationships at different scales. To this end, we propose a complementary spatiotemporal network (CST) that focuses on multi-scale motion relationships and essential spatial semantics. Our model comprises three modules. First, a multi-scale relation unit (MR) captures temporal information by modeling relationships between motions at different temporal distances. Second, a mask similarity (MS) operation captures discriminative spatial semantics in a less redundant manner. Third, a cross-modality attention (CMA) module strengthens the interaction between the visual and linguistic modalities. We evaluate our method on three benchmark datasets and conduct extensive ablation studies. The consistent performance improvements demonstrate the effectiveness of our approach.
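
To make the role of the CMA module concrete, below is a minimal PyTorch sketch of cross-modality attention, in which video clip features attend over question-token features. The class name, feature dimensions, and the use of standard scaled dot-product attention with a residual connection are illustrative assumptions; the paper's exact formulation may differ.

    # Hedged sketch of cross-modality attention (CMA). All names and
    # dimensions are assumptions for illustration, not the paper's design.
    import torch
    import torch.nn as nn

    class CrossModalityAttention(nn.Module):
        """Lets each video feature attend over question-token features."""

        def __init__(self, dim: int):
            super().__init__()
            self.q_proj = nn.Linear(dim, dim)  # queries from video
            self.k_proj = nn.Linear(dim, dim)  # keys from text
            self.v_proj = nn.Linear(dim, dim)  # values from text

        def forward(self, video: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
            # video: (B, T, D) clip features; text: (B, L, D) word features
            q = self.q_proj(video)
            k = self.k_proj(text)
            v = self.v_proj(text)
            # Scaled dot-product attention over the question tokens.
            attn = torch.softmax(q @ k.transpose(-2, -1) / q.size(-1) ** 0.5, dim=-1)
            # Residual fusion: video features enriched with text context.
            return video + attn @ v

    # Toy usage: 2 videos of 8 clips, questions of 6 tokens, 256-d features.
    cma = CrossModalityAttention(dim=256)
    fused = cma(torch.randn(2, 8, 256), torch.randn(2, 6, 256))
    print(fused.shape)  # torch.Size([2, 8, 256])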



Funding

This research was supported by the National Natural Science Foundation of China (Grant Nos. 61876130 and 61932009).

Author information

Corresponding author

Correspondence to Yahong Han.

Additional information

Communicated by Y. Zhang.



About this article


Cite this article

Li, X., Wu, A. & Han, Y. Complementary spatiotemporal network for video question answering. Multimedia Systems 28, 161–169 (2022). https://doi.org/10.1007/s00530-021-00805-6
