
Spatio-semantic Task Recognition: Unsupervised Learning of Task-discriminative Features for Segmentation and Imitation

  • Regular Papers
  • Robot and Applications
International Journal of Control, Automation and Systems

Abstract

Discovering task subsequences from a continuous video stream facilitates robot imitation of sequential tasks. In this research, we develop unsupervised learning of task subsequences that does not require a human teacher to provide subsequence labels. A task-discriminative feature, in the form of sparsely activated cells called task capsules, is proposed and self-trained to preserve the spatio-semantic information of a visual input. The task capsules are sparsely and exclusively activated with respect to the spatio-semantic context of the task subsequence: the type and location of the object. Therefore, the purpose shared across multiple videos is discovered without supervision according to the spatio-semantic context, and the demonstration is segmented into task subsequences in an object-centric way. In comparison with existing studies on unsupervised task segmentation, our work makes the following distinct contributions: 1) the task provided as a video stream can be segmented without any pre-defined knowledge, and 2) the trained features preserve spatio-semantic information so that the segmentation is object-centric. Our experiments show that the recognized task subsequences can be applied to robot imitation of a sequential pick-and-place task by providing the semantic and location information of the object to be manipulated.
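To make the segmentation idea concrete, the following minimal sketch illustrates the flow the abstract describes: per-frame capsule activations are sparsified so that only the most active task capsule survives, and consecutive frames sharing the same dominant capsule are grouped into one object-centric subsequence. The encoder producing the activations is omitted, and the function names, capsule count, and minimum segment length are illustrative assumptions, not the authors' implementation.

import numpy as np

def sparsify(activations, k=1):
    # Keep only the top-k capsule activations per frame (winner-take-all when k=1).
    masked = np.zeros_like(activations)
    top = np.argsort(activations, axis=1)[:, -k:]        # indices of the k largest activations
    rows = np.arange(activations.shape[0])[:, None]
    masked[rows, top] = activations[rows, top]
    return masked

def segment_by_dominant_capsule(activations, min_len=3):
    # Group consecutive frames whose dominant capsule coincides into one task subsequence.
    dominant = np.argmax(activations, axis=1)
    segments, start = [], 0
    for t in range(1, len(dominant) + 1):
        if t == len(dominant) or dominant[t] != dominant[start]:
            if t - start >= min_len:                      # drop spurious, very short runs
                segments.append((start, t, int(dominant[start])))
            start = t
    return segments  # list of (first_frame, end_frame_exclusive, capsule_id)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy activations: 12 frames, 4 hypothetical task capsules.
    # Frames 0-5 excite capsule 2 (e.g., "pick"); frames 6-11 excite capsule 0 (e.g., "place").
    acts = rng.uniform(0.0, 0.2, size=(12, 4))
    acts[:6, 2] += 1.0
    acts[6:, 0] += 1.0
    print(segment_by_dominant_capsule(sparsify(acts)))   # -> [(0, 6, 2), (6, 12, 0)]

In the actual approach, the activations would come from the trained task-capsule network rather than the toy arrays above.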



Author information


Corresponding author

Correspondence to H. Jin Kim.

Additional information

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This work was supported by the Agency for Defense Development under Contract UD 190026RD.

J.hyeon Park received his B.S. degree in mechanical and aerospace engineering from Seoul National University in 2015. He is currently pursuing an integrated M.S./Ph.D. degree in the Department of Mechanical and Aerospace Engineering at Seoul National University. His research interests include deep learning in robotics for perception and action.

Jigang Kim received his B.S. degree in mechanical and aerospace engineering from Seoul National University in 2018. He is currently pursuing an integrated M.S./Ph.D. degree in the Department of Mechanical and Aerospace Engineering at Seoul National University. His research interests include robot learning, machine learning, and reinforcement learning.

H. Jin Kim received her B.S. degree from the Korea Advanced Institute of Science and Technology (KAIST) in 1995, and her M.S. and Ph.D. degrees in mechanical engineering from the University of California, Berkeley (UC Berkeley), in 1999 and 2001, respectively. In September 2004, she joined the Department of Mechanical and Aerospace Engineering at Seoul National University as an Assistant Professor, where she is currently a Professor. Her research interests include autonomous robotics and robot vision.


About this article


Cite this article

Park, J.H., Kim, J. & Kim, H.J. Spatio-semantic Task Recognition: Unsupervised Learning of Task-discriminative Features for Segmentation and Imitation. Int. J. Control Autom. Syst. 19, 3409–3418 (2021). https://doi.org/10.1007/s12555-020-0155-9

