Abstract
Human–object interaction (HOI) detection, which aims to infer ⟨human, action, object⟩ triplets within an image, is crucial for human-centric image understanding. Recent studies often exploit visual features and the spatial configuration of a human–object pair to learn the action linking the human and the object. We argue that this paradigm of pairwise feature extraction and action inference can be applied not only at the level of whole human and object instances, but also at the part level, where a body part interacts with an object, and at the semantic level, where the semantic label of an object is considered along with human appearance and human–object spatial configuration to infer the action. We thus propose a multi-level pairwise feature network (PFNet) for detecting human–object interactions. The network consists of three parallel streams that characterize HOI using pairwise features at these three levels; the three streams are finally fused to give the action prediction. Extensive experiments show that the proposed PFNet outperforms other state-of-the-art methods on the V-COCO dataset and achieves results comparable to the state of the art on the HICO-DET dataset.
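As a rough illustration of the three-stream design described above, the following sketch fuses per-action scores from an instance-level, a part-level, and a semantic-level stream. The abstract does not specify the fusion rule, so averaging per-action sigmoid probabilities is an assumption made here for illustration only; the function name `fuse_streams`, the action names, and the logit values are all hypothetical and are not taken from the paper.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    """Map a raw score (logit) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def fuse_streams(
    instance_logits: Sequence[float],
    part_logits: Sequence[float],
    semantic_logits: Sequence[float],
) -> List[float]:
    """Late-fuse three streams of per-action logits for one human-object pair.

    ASSUMPTION: fusion is the mean of per-action sigmoid probabilities.
    The paper's actual fusion rule is not given in the abstract.
    """
    if not (len(instance_logits) == len(part_logits) == len(semantic_logits)):
        raise ValueError("all streams must score the same set of actions")
    return [
        (sigmoid(a) + sigmoid(b) + sigmoid(c)) / 3.0
        for a, b, c in zip(instance_logits, part_logits, semantic_logits)
    ]


# Hypothetical action vocabulary and stream outputs for a single pair.
actions = ["hold", "ride", "eat"]
fused = fuse_streams(
    instance_logits=[2.0, -1.0, 0.5],
    part_logits=[1.5, -2.0, 0.0],
    semantic_logits=[1.0, -0.5, -1.0],
)
best = max(range(len(actions)), key=lambda i: fused[i])
print(actions[best])
```

Each action is scored independently here because a person can perform several actions on the same object at once; that multi-label treatment is likewise an assumption of this sketch.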
Acknowledgements
We thank the reviewers for their constructive comments. This work was supported by the National Natural Science Foundation of China (Project No. 61902210), a Research Grant of Beijing Higher Institution Engineering Research Center, and the Tsinghua-Tencent Joint Laboratory for Internet Innovation Technology.
Author information
Additional information
Hanchao Liu is a master's student in the Department of Computer Science and Technology, Tsinghua University. His research interests include image and video processing, and computer vision.
Tai-Jiang Mu is an assistant researcher in the Department of Computer Science and Technology, Tsinghua University, where he received his B.S. and Ph.D. degrees in computer science and technology in 2011 and 2016, respectively. His research interests include visual media learning, SLAM, and human–robot interaction.
Xiaolei Huang is an associate professor in the College of Information Sciences and Technology at Pennsylvania State University. Her research interests lie at the intersection of biomedical image analysis, machine learning, and computer vision. She has over 140 publications and holds 7 patents in these areas. She is an associate editor of the Computer Vision and Image Understanding journal. She received her bachelor's degree in computer science from Tsinghua University, and her master's and doctoral degrees in computer science from Rutgers University.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Other papers from this open access journal are available free of charge from http://www.springer.com/journal/41095. To submit a manuscript, please go to https://www.editorialmanager.com/cvmj.
About this article
Cite this article
Liu, H., Mu, TJ. & Huang, X. Detecting human–object interaction with multi-level pairwise feature network. Comp. Visual Media 7, 229–239 (2021). https://doi.org/10.1007/s41095-020-0188-2
DOI: https://doi.org/10.1007/s41095-020-0188-2