
PISEP²: pseudo-image sequence evolution-based 3D pose prediction

  • Original article
  • Published in The Visual Computer

Abstract

Pose prediction aims to forecast future poses from a window of previously observed poses. In this paper, we propose a new problem: predicting poses from the 3D joint positions of skeletal sequences. Unlike traditional pose prediction based on mocap frames, this formulation is convenient for real applications because the data can be captured with simple sensors. We also present a new framework, pseudo-image sequence evolution-based 3D pose prediction (PISEP²), to address this problem. Specifically, we propose a skeletal representation that transforms a 3D skeletal sequence into an image sequence, which can model the different correlations among different joints. With this image-based skeletal representation, we cast pose prediction as the evolution of an image sequence. Moreover, a novel inference network predicts multiple future poses in a non-recursive manner using decoders with independent parameters. In contrast to recursive sequence-to-sequence models, this significantly improves computational efficiency and avoids error accumulation. Extensive experiments on two benchmark datasets (G3D and FNTU) show that the proposed method achieves state-of-the-art performance on both, demonstrating its effectiveness.
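To make the two ideas in the abstract concrete, here is a minimal NumPy sketch of (1) turning a 3D skeletal sequence into an image-like tensor and (2) predicting several future poses in parallel with decoders that have independent parameters. The min-max normalization, the flattened "encoding", and the random linear decoders are placeholder assumptions for illustration only, not the paper's actual network.

```python
import numpy as np

rng = np.random.default_rng(0)

def to_pseudo_image(seq):
    """Map a 3D skeletal sequence (T frames, J joints, xyz) to an
    image-like tensor: values min-max normalized to [0, 255], so each
    frame can be treated as a J x 3 'pixel' grid by 2D convolutions."""
    seq = np.asarray(seq, dtype=np.float32)
    lo, hi = seq.min(), seq.max()
    return (seq - lo) / (hi - lo + 1e-8) * 255.0

T, J, K = 8, 20, 4                      # observed frames, joints, future frames
observed = rng.normal(size=(T, J, 3))   # toy skeletal sequence

pseudo = to_pseudo_image(observed)      # (T, J, 3) pseudo-image sequence
feat = pseudo.reshape(-1)               # stand-in for a shared learned encoding

# Non-recursive inference: K decoders with independent parameters, each
# mapping the shared feature directly to one future pose. All K poses are
# computed in parallel, so one step's error never feeds into the next.
decoders = [rng.normal(size=(J * 3, feat.size)) * 0.01 for _ in range(K)]
future = np.stack([(W @ feat).reshape(J, 3) for W in decoders])

print(pseudo.shape, future.shape)       # (8, 20, 3) (4, 20, 3)
```

A recursive sequence-to-sequence baseline would instead feed each predicted pose back in as input for the next step; the parallel layout above is what lets the non-recursive design avoid that error accumulation.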



Acknowledgements

This work was supported partly by the National Natural Science Foundation of China (Grant No. 61673192), the Fundamental Research Funds for the Central Universities (Grant No. 2020XD-A04-1, 2019RC27), and BUPT Excellent Ph.D. Students Foundation (CX2019111). The research in this paper used the NTU RGB+D Action Recognition Dataset made available by the ROSE Lab at the Nanyang Technological University, Singapore.

Author information


Corresponding author

Correspondence to Jianqin Yin.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Liu, X., Yin, J., Liu, H. et al. PISEP²: pseudo-image sequence evolution-based 3D pose prediction. Vis Comput 38, 2603–2616 (2022). https://doi.org/10.1007/s00371-021-02135-0
