
Spatio-semantic Task Recognition: Unsupervised Learning of Task-discriminative Features for Segmentation and Imitation

  • Regular Papers
  • Robot and Applications
International Journal of Control, Automation and Systems

Abstract

Discovering task subsequences from a continuous video stream facilitates robot imitation of sequential tasks. In this research, we develop unsupervised learning of task subsequences that does not require a human teacher to provide subsequence labels. A task-discriminative feature, in the form of sparsely activated cells called task capsules, is proposed and self-trained to preserve the spatio-semantic information of a visual input. The task capsules are sparsely and exclusively activated with respect to the spatio-semantic context of the task subsequence: the type and location of the object. Therefore, the purpose shared across multiple videos is discovered without supervision according to the spatio-semantic context, and the demonstration is segmented into task subsequences in an object-centric way. In comparison with existing studies on unsupervised task segmentation, our work makes the following distinct contributions: 1) the task provided as a video stream can be segmented without any pre-defined knowledge, and 2) the trained features preserve spatio-semantic information so that the segmentation is object-centric. Our experiments show that the recognized task subsequences can be applied to robot imitation of a sequential pick-and-place task by providing the semantic and location information of the object to be manipulated.
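To make the segmentation idea concrete, the following minimal sketch illustrates the flow the abstract describes: per-frame capsule activations are sparsified so that only the most active task capsule survives, and consecutive frames sharing the same dominant capsule are grouped into one object-centric subsequence. The encoder producing the activations is omitted, and the function names, capsule count, and minimum segment length are illustrative assumptions, not the authors' implementation.

import numpy as np

def sparsify(activations, k=1):
    # Keep only the top-k capsule activations per frame (winner-take-all when k=1).
    masked = np.zeros_like(activations)
    top = np.argsort(activations, axis=1)[:, -k:]        # indices of the k largest activations
    rows = np.arange(activations.shape[0])[:, None]
    masked[rows, top] = activations[rows, top]
    return masked

def segment_by_dominant_capsule(activations, min_len=3):
    # Group consecutive frames whose dominant capsule coincides into one task subsequence.
    dominant = np.argmax(activations, axis=1)
    segments, start = [], 0
    for t in range(1, len(dominant) + 1):
        if t == len(dominant) or dominant[t] != dominant[start]:
            if t - start >= min_len:                      # drop spurious, very short runs
                segments.append((start, t, int(dominant[start])))
            start = t
    return segments  # list of (first_frame, end_frame_exclusive, capsule_id)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy activations: 12 frames, 4 hypothetical task capsules.
    # Frames 0-5 excite capsule 2 (e.g., "pick"); frames 6-11 excite capsule 0 (e.g., "place").
    acts = rng.uniform(0.0, 0.2, size=(12, 4))
    acts[:6, 2] += 1.0
    acts[6:, 0] += 1.0
    print(segment_by_dominant_capsule(sparsify(acts)))   # -> [(0, 6, 2), (6, 12, 0)]

In the actual approach, the activations would come from the trained task-capsule network rather than the toy arrays above.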



Author information


Corresponding author

Correspondence to H. Jin Kim.

Additional information

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This work was supported by the Agency for Defense Development under Contract UD 190026RD.

J.hyeon Park received his B.S. degree in mechanical and aerospace engineering from Seoul National University in 2015. He is currently pursuing an integrated M.S./Ph.D. degree in the Department of Mechanical and Aerospace Engineering at Seoul National University. His research interests include deep learning in robotics for perception and action.

Jigang Kim received his B.S. degree in mechanical and aerospace engineering from Seoul National University in 2018. He is currently pursuing an integrated M.S./Ph.D. degree in the Department of Mechanical and Aerospace Engineering at Seoul National University. His research interests include robot learning, machine learning, and reinforcement learning.

H. Jin Kim received her B.S. degree from the Korea Advanced Institute of Science and Technology (KAIST) in 1995, and her M.S. and Ph.D. degrees in mechanical engineering from the University of California, Berkeley (UC Berkeley), in 1999 and 2001, respectively. In September 2004, she joined the Department of Mechanical and Aerospace Engineering at Seoul National University as an Assistant Professor, where she is currently a Professor. Her research interests include autonomous robotics and robot vision.


About this article


Cite this article

Park, J.H., Kim, J. & Kim, H.J. Spatio-semantic Task Recognition: Unsupervised Learning of Task-discriminative Features for Segmentation and Imitation. Int. J. Control Autom. Syst. 19, 3409–3418 (2021). https://doi.org/10.1007/s12555-020-0155-9

