Adding Knowledge to Unsupervised Algorithms for the Recognition of Intent

International Journal of Computer Vision

Abstract

The performance of computer vision algorithms is now near or superior to that of humans on visual problems including object recognition (especially of fine-grained categories), segmentation, and 3D object reconstruction from 2D views. Humans are, however, capable of higher-level image analyses. A clear example, involving theory of mind, is our ability to determine whether a perceived behavior or action was performed intentionally or not. In this paper, we derive an algorithm that can infer whether the behavior of an agent in a scene is intentional or unintentional based on its 3D kinematics, using knowledge of self-propelled motion, Newtonian motion, and the relationship between the two. We show how the addition of this basic knowledge leads to a simple, unsupervised algorithm. To test the derived algorithm, we constructed three dedicated datasets, ranging from abstract geometric animations to realistic videos of agents performing intentional and non-intentional actions. Experiments on these datasets show that our algorithm can recognize whether an action is intentional or not, even without training data. Quantitatively, its performance is comparable to that of various supervised baselines; qualitatively, it produces sensible intentionality segmentations.
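The core inference described in the abstract can be illustrated as a physics-consistency check: motion that deviates from what passive Newtonian dynamics (e.g., free fall under gravity) would predict is evidence of self-propulsion, which in turn is a cue to intentional behavior. The sketch below is a hypothetical simplification, not the paper's actual algorithm; the function name, gravity direction, sampling rate, and tolerance threshold are all assumptions made for illustration.

```python
import numpy as np

GRAVITY = np.array([0.0, -9.81, 0.0])  # m/s^2, y-up world frame (assumption)

def self_propelled_mask(positions, dt, tol=1.0):
    """Flag time steps whose acceleration deviates from passive
    Newtonian (free-fall) motion; such deviations suggest
    self-propulsion, a kinematic cue to intentionality.

    positions: (T, 3) array of 3D positions of a tracked point.
    dt: sampling interval in seconds.
    tol: deviation threshold in m/s^2 (hypothetical value).
    """
    # Two numerical derivatives give per-frame acceleration.
    acc = np.gradient(np.gradient(positions, dt, axis=0), dt, axis=0)
    # Residual acceleration not explained by gravity alone.
    residual = np.linalg.norm(acc - GRAVITY, axis=1)
    return residual > tol

# Toy example: a point in free fall vs. one accelerating horizontally.
t = np.arange(0.0, 1.0, 0.01)[:, None]
free_fall = np.hstack([np.zeros_like(t), -0.5 * 9.81 * t**2, np.zeros_like(t)])
powered = np.hstack([2.0 * t**2, np.zeros_like(t), np.zeros_like(t)])

print(self_propelled_mask(free_fall, 0.01).mean())  # near 0: Newtonian
print(self_propelled_mask(powered, 0.01).mean())    # 1.0: self-propelled
```

For an articulated agent, the same test would be applied per joint of the recovered 3D pose rather than to a single point; the free-fall trajectory is classified as passive (up to finite-difference error at the sequence boundaries), while the powered trajectory is flagged throughout.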



Notes

  1. Standalone means that this concept focuses only on the movement at a specific time point rather than on the relationship between actions.

  2. Here we are using the computational, algorithmic, and implementational levels from David Marr (Marr, 1982). The implementational level is not discussed, since our work does not contribute to that level.

  3. https://www.mixamo.com/.

  4. However, one should also note that acting to appear non-intentional does not mean that the agent's action and kinematics lack the characteristics of genuine non-intentional movement.

  5. This experiment was added during the revision phase of this paper.

References

  • Aditya, S., Yang, Y., Baral, C., Fermuller, C., & Aloimonos, Y. (2015) Visual commonsense for scene understanding using perception, semantic parsing and reasoning. In 2015 AAAI spring symposium series.

  • Aristotle. (1926). The art of rhetoric (Vol. 2). Cambridge, MA: Harvard University Press.

  • Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

  • Cao, Z., Hidalgo, G., Simon, T., Wei, S.E., & Sheikh, Y. (2018). OpenPose: Realtime multi-person 2D pose estimation using part affinity fields. arXiv preprint arXiv:1812.08008.

  • Chambon, V., Domenech, P., Jacquet, P. O., Barbalat, G., Bouton, S., Pacherie, E., et al. (2017). Neural coding of prior expectations in hierarchical intention inference. Scientific Reports, 7(1), 1278.

  • Chambon, V., Domenech, P., Pacherie, E., Koechlin, E., Baraduc, P., & Farrer, C. (2011). What are they up to? The role of sensory evidence and prior knowledge in action understanding. PloS One, 6(2), e17133.

  • Chen, K., Pang, J., Wang, J., Xiong, Y., Li, X., Sun, S., Feng, W., Liu, Z., Shi, J., & Ouyang, W., et al. (2019). Hybrid task cascade for instance segmentation. arXiv preprint arXiv:1901.07518.

  • Del Rincón, J. M., Santofimia, M. J., & Nebel, J. C. (2013). Common-sense reasoning for human action recognition. Pattern Recognition Letters, 34(15), 1849–1860.


  • Descartes, R., & Lafleur, L. J. (1960). Meditations on first philosophy. New York: Bobbs-Merrill.


  • Epstein, D., Chen, B., & Vondrick, C. (2020). Oops! Predicting unintentional action in video. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 919–929).

  • Fang, Z., & López, A. M. (2019). Intention recognition of pedestrians and cyclists by 2D pose estimation. IEEE Transactions on Intelligent Transportation Systems.

  • Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning: Data mining, inference, and prediction (2nd ed., pp. 37–38). New York: Springer.

  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770–778).

  • Heider, F., & Simmel, M. (1944). An experimental study of apparent behavior. The American Journal of Psychology, 57, 243–259.


  • Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.


  • Luo, Y., & Baillargeon, R. (2005). Can a self-propelled box have a goal? Psychological reasoning in 5-month-old infants. Psychological Science, 16(8), 601–608.

  • Marr, D. (1982). Vision: A computational investigation into the human representation and processing of visual information. USA: Henry Holt and Co., Inc.

  • Martinez, J., Hossain, R., Romero, J., & Little, J. J. (2017). A simple yet effective baseline for 3D human pose estimation. In Proceedings of the IEEE international conference on computer vision (pp. 2640–2649).

  • Miller, G. (1998). WordNet: An electronic lexical database. Cambridge: MIT press.


  • Ravichandar, H. C., & Dani, A. P. (2017). Human intention inference using expectation-maximization algorithm with online model learning. IEEE Transactions on Automation Science and Engineering, 14(2), 855–868.


  • Rudenko, A., Palmieri, L., Herman, M., Kitani, K.M., Gavrila, D.M., & Arras, K.O. (2019). Human motion trajectory prediction: A survey. arXiv preprint arXiv:1905.06113.

  • Sartori, L., Becchio, C., & Castiello, U. (2011). Cues to intention: The role of movement information. Cognition, 119(2), 242–252.


  • Speer, R., Chin, J., & Havasi, C. (2017). ConceptNet 5.5: An open multilingual graph of general knowledge. In Thirty-first AAAI conference on artificial intelligence.

  • Tozeren, A. (2000). Human body dynamics: Classical mechanics and human movement. New York: Springer Publishing.


  • Tran, D., Wang, H., Torresani, L., Ray, J., LeCun, Y., & Paluri, M. (2018). A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 6450–6459).

  • Ullman, T., Baker, C., Macindoe, O., Evans, O., Goodman, N., & Tenenbaum, J. B. (2009). Help or hinder: Bayesian models of social goal inference. In Advances in neural information processing systems (pp. 1874–1882).

  • Varytimidis, D., Alonso-Fernandez, F., Duran, B., & Englund, C. (2018). Action and intention recognition of pedestrians in urban traffic. In 2018 14th International conference on signal-image technology & internet-based systems (SITIS) (pp. 676–682). IEEE.

  • Vondrick, C., Oktay, D., Pirsiavash, H., & Torralba, A. (2016). Predicting motivations of actions by leveraging text. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2997–3005).

  • Wei, P., Liu, Y., Shu, T., Zheng, N., & Zhu, S. C. (2018). Where and why are they looking? Jointly inferring human attention and intentions in complex tasks. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 6801–6809).

  • Wilson, G., & Shpall, S. (2016). Action. In E. N. Zalta (Ed.), The Stanford Encyclopedia of Philosophy (Winter 2016 ed.). Metaphysics Research Lab, Stanford University.

  • Yeung, S., Russakovsky, O., Jin, N., Andriluka, M., Mori, G., & Fei-Fei, L. (2018). Every moment counts: Dense detailed labeling of actions in complex videos. International Journal of Computer Vision, 126(2–4), 375–389.


  • You, D., Hamsici, O. C., & Martinez, A. M. (2011). Kernel optimization in discriminant analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(3), 631–638.


  • Zellers, R., Bisk, Y., Farhadi, A., & Choi, Y. (2018). From recognition to cognition: Visual commonsense reasoning. arXiv preprint arXiv:1811.10830.


Acknowledgements

This research was supported by the National Institutes of Health (NIH), Grants R01-DC-014498 and R01-EY-020834, the Human Frontier Science Program (HFSP), Grant RGP0036/2016, and a grant from Ohio State’s Center for Cognitive and Brain Sciences.

Author information

Corresponding author

Correspondence to Qianli Feng.

Additional information

Communicated by Deva Ramanan.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article


Cite this article

Synakowski, S., Feng, Q. & Martinez, A. Adding Knowledge to Unsupervised Algorithms for the Recognition of Intent. Int J Comput Vis 129, 942–959 (2021). https://doi.org/10.1007/s11263-020-01404-0

