Detecting human–object interaction with multi-level pairwise feature network

Liu, Hanchao; Mu, Tai-Jiang; Huang, Xiaolei

doi:10.1007/s41095-020-0188-2

Research Article
Open access
Published: 19 October 2020

Volume 7, pages 229–239, (2021)

Computational Visual Media

Hanchao Liu¹,
Tai-Jiang Mu¹ &
Xiaolei Huang²

Abstract

Human–object interaction (HOI) detection, which aims to infer ⟨human, action, object⟩ triplets within an image, is crucial for human-centric image understanding. Recent studies often exploit visual features and the spatial configuration of a human–object pair to learn the action linking the human and the object. We argue that this paradigm of pairwise feature extraction and action inference can be applied not only at the level of whole human and object instances, but also at the part level, where a body part interacts with an object, and at the semantic level, where the semantic label of the object is considered along with human appearance and human–object spatial configuration. We thus propose a multi-level pairwise feature network (PFNet) for detecting human–object interactions. The network consists of three parallel streams that characterize HOI using pairwise features at these three levels; the outputs of the three streams are fused to give the final action prediction. Extensive experiments show that PFNet outperforms other state-of-the-art methods on the V-COCO dataset and achieves results comparable to the state of the art on the HICO-DET dataset.
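To make the three-stream idea concrete, the following is a minimal sketch of late fusion over per-action scores. It is not the authors' implementation. The abstract does not state the fusion rule, so averaging the logits and applying a sigmoid is an assumption. The names `HOITriplet`, `fuse_streams`, and `predict_triplets`, and the 0.5 threshold, are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed box format


@dataclass(frozen=True)
class HOITriplet:
    """One <human, action, object> detection (hypothetical container)."""
    human_box: Box
    action: str
    object_box: Box
    object_label: str
    score: float


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def fuse_streams(
    instance_logits: Sequence[float],
    part_logits: Sequence[float],
    semantic_logits: Sequence[float],
) -> List[float]:
    """Fuse per-action logits from the three streams.

    Assumption: logits are averaged and passed through a sigmoid.
    The abstract only says the streams are fused, not how.
    """
    if not (len(instance_logits) == len(part_logits) == len(semantic_logits)):
        raise ValueError("all streams must score the same set of actions")
    return [
        sigmoid((a + b + c) / 3.0)
        for a, b, c in zip(instance_logits, part_logits, semantic_logits)
    ]


def predict_triplets(
    human_box: Box,
    object_box: Box,
    object_label: str,
    actions: Sequence[str],
    fused_scores: Sequence[float],
    threshold: float = 0.5,  # hypothetical cut-off
) -> List[HOITriplet]:
    """Keep every action whose fused score reaches the threshold."""
    return [
        HOITriplet(human_box, action, object_box, object_label, score)
        for action, score in zip(actions, fused_scores)
        if score >= threshold
    ]


# Toy example with made-up logits for one human-object pair.
actions = ["ride", "eat"]
fused = fuse_streams([3.0, -3.0], [3.0, -3.0], [3.0, -3.0])
triplets = predict_triplets(
    (10.0, 10.0, 50.0, 90.0), (20.0, 60.0, 80.0, 100.0), "bicycle", actions, fused
)
```

In this toy example only the "ride" action survives the threshold. An action is scored independently of the others because a human can perform several actions on the same object at once.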

Acknowledgements

We thank the reviewers for their constructive comments. This work was supported by the National Natural Science Foundation of China (Project No. 61902210), a Research Grant of Beijing Higher Institution Engineering Research Center, and the Tsinghua-Tencent Joint Laboratory for Internet Innovation Technology.

Author information

Authors and Affiliations

Key Laboratory of Pervasive Computing, Ministry of Education, BNRist, Department of Computer Science and Technology, Tsinghua University, Beijing, 100084, China
Hanchao Liu & Tai-Jiang Mu
College of Information Sciences and Technology, Pennsylvania State University, University Park, PA, 16802, USA
Xiaolei Huang

Corresponding author

Correspondence to Tai-Jiang Mu.

Additional information

Hanchao Liu is a master's student in the Department of Computer Science and Technology, Tsinghua University. His research interests include image and video processing, and computer vision.

Tai-Jiang Mu is an assistant researcher in the Department of Computer Science and Technology, Tsinghua University, where he received his B.S. and Ph.D. degrees in computer science and technology in 2011 and 2016, respectively. His research interests include visual media learning, SLAM, and human–robot interaction.

Xiaolei Huang is an associate professor in the College of Information Sciences and Technology at Pennsylvania State University. Her research interests lie at the intersection of biomedical image analysis, machine learning, and computer vision. She has over 140 publications and holds 7 patents in these areas. She is an associate editor of the Computer Vision and Image Understanding journal. She received her bachelor's degree in computer science from Tsinghua University, and her master's and doctoral degrees in computer science from Rutgers University.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Other papers from this open access journal are available free of charge from http://www.springer.com/journal/41095. To submit a manuscript, please go to https://www.editorialmanager.com/cvmj.

About this article

Cite this article

Liu, H., Mu, TJ. & Huang, X. Detecting human–object interaction with multi-level pairwise feature network. Comp. Visual Media 7, 229–239 (2021). https://doi.org/10.1007/s41095-020-0188-2

Received: 26 June 2020
Accepted: 20 July 2020
Published: 19 October 2020
Issue Date: June 2021
DOI: https://doi.org/10.1007/s41095-020-0188-2
