Crossmodal attentive skill learner: learning in Atari and beyond with audio–video inputs

Published in Autonomous Agents and Multi-Agent Systems (2020)

Abstract

This paper introduces the Crossmodal Attentive Skill Learner (CASL), integrated with the recently introduced Asynchronous Advantage Option-Critic architecture [Harb et al. in When waiting is not an option: learning options with a deliberation cost. arXiv preprint arXiv:1709.04571, 2017] to enable hierarchical reinforcement learning across multiple sensory inputs. Agents trained using our approach learn to attend to their various sensory modalities (e.g., audio, video) at the appropriate moments, thereby executing actions based on multiple sensory streams without reliance on supervisory data. We demonstrate empirically that the sensory attention mechanism anticipates and identifies useful latent features, while filtering irrelevant sensor modalities during execution. Further, we provide concrete examples in which the approach not only improves performance in a single task, but also accelerates transfer to new tasks. We modify the Arcade Learning Environment [Bellemare et al. in J Artif Intell Res 47:253–279, 2013] to support audio queries (ALE-audio code available at https://github.com/shayegano/Arcade-Learning-Environment), and conduct evaluations of crossmodal learning in the Atari 2600 games H.E.R.O. and Amidar. Finally, building on the recent work of Babaeizadeh et al. [in: International Conference on Learning Representations (ICLR), 2017], we open-source a fast hybrid CPU–GPU implementation of CASL (CASL code available at https://github.com/shayegano/CASL).
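
To make the attention mechanism concrete, the following is a minimal sketch of crossmodal attention over per-modality feature vectors (e.g., the outputs of video and audio encoders): each modality receives a scalar score, the scores are normalized by a softmax into attention weights, and the features are fused by weighted sum before being passed downstream (e.g., to a recurrent layer). The class, names, and dimensions below are illustrative assumptions, not the released CASL implementation.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

class CrossmodalAttention:
    """Hypothetical sketch of crossmodal attention: scores each modality's
    embedding, normalizes the scores into weights, and fuses the embeddings
    by weighted sum. Not the authors' released implementation."""

    def __init__(self, feat_dim, num_modalities, seed=0):
        rng = np.random.default_rng(seed)
        # One scoring vector per modality (randomly initialized here;
        # trained jointly with the rest of the network in practice).
        self.W = rng.normal(scale=0.1, size=(num_modalities, feat_dim))
        self.b = np.zeros(num_modalities)

    def __call__(self, features):
        # features: (num_modalities, feat_dim), one embedding per sensor.
        scores = (self.W * features).sum(axis=-1) + self.b   # (num_modalities,)
        alpha = softmax(scores)                              # attention weights
        fused = (alpha[:, None] * features).sum(axis=0)      # (feat_dim,)
        return fused, alpha

# Example: one video and one audio embedding, both 128-dimensional.
attn = CrossmodalAttention(feat_dim=128, num_modalities=2)
rng = np.random.default_rng(1)
video_feat = rng.normal(size=128)
audio_feat = rng.normal(size=128)
fused, alpha = attn(np.stack([video_feat, audio_feat]))
print("attention over (video, audio):", alpha)
```

Inspecting weights like alpha over time is how one can read off which modality the agent relies on at each moment, in the spirit of the attention analysis described in the abstract.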

Notes

  1. Note that HRL with a single option is equivalent to standard RL with primitive actions. A single-option CASL therefore corresponds to A3C [30] augmented with the crossmodal attention mechanism, as sketched below.
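
As a concrete reading of this note, one decision step of an option hierarchy can be sketched as follows: the policy over options picks an option, then that option's intra-option policy picks a primitive action. With exactly one option, the first choice is forced and the agent behaves as a flat actor-critic (A3C) over primitive actions. All function and variable names below are hypothetical, not the released code.

```python
import numpy as np

rng = np.random.default_rng(0)

def act(policy_over_options, option_policies, fused_feat):
    """One illustrative decision step of an option hierarchy: sample an
    option, then sample a primitive action from its intra-option policy."""
    omega = rng.choice(len(option_policies), p=policy_over_options(fused_feat))
    action_probs = option_policies[omega](fused_feat)
    return rng.choice(len(action_probs), p=action_probs)

# Single-option case: the policy over options must pick option 0, so the
# step reduces to sampling from one intra-option policy over 4 actions.
single_option = [lambda feat: np.full(4, 0.25)]   # uniform, for illustration
forced_choice = lambda feat: np.array([1.0])
action = act(forced_choice, single_option, fused_feat=np.zeros(8))
print("primitive action:", action)
```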

References

  1. Al-Shedivat, M., Bansal, T., Burda, Y., Sutskever, I., Mordatch, I., & Abbeel, P. (2018). Continuous adaptation via meta-learning in nonstationary and competitive environments. In International Conference on Learning Representations (ICLR).

  2. Alvis, C. D., Ott, L., & Ramos, F. (2017). Online learning for scene segmentation with laser-constrained CRFs. International Conference on Robotics and Automation (ICRA), 4639–4643.

  3. Andreas, J., Klein, D., & Levine, S. (2016). Modular multitask reinforcement learning with policy sketches. arXiv preprint arXiv:1611.01796.

  4. Ba, J., Mnih, V., & Kavukcuoglu, K. (2014). Multiple object recognition with visual attention. arXiv preprint arXiv:1412.7755.

  5. Babaeizadeh, M., Frosio, I., Tyree, S., Clemons, J., & Kautz, J. (2017). Reinforcement learning through asynchronous advantage actor-critic on a GPU. In International Conference on Learning Representations (ICLR).

  6. Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

  7. Beal, M. J., Attias, H., & Jojic, N. (2002). Audio-video sensor fusion with probabilistic graphical models. European Conference on Computer Vision (ECCV), 736–750.

  8. Bellemare, M. G., Naddaf, Y., Veness, J., & Bowling, M. (2013). The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47, 253–279.

  9. Bengio, S. (2002). An asynchronous hidden Markov model for audio-visual speech recognition. Advances in Neural Information Processing Systems (NIPS), 1237–1244.

  10. Cadena, C., & Košecká, J. (2014). Semantic segmentation with heterogeneous sensor coverages. International Conference on Robotics and Automation (ICRA), 2639–2645.

  11. Caglayan, O., Barrault, L., & Bougares, F. (2016). Multimodal attention for neural machine translation. arXiv preprint arXiv:1609.03976.

  12. Carrasco, M. (2011). Visual attention: The past 25 years. Vision Research, 51(13), 1484–1525.

  13. Chambers, A. D., Scherer, S., Yoder, L., Jain, S., Nuske, S. T., & Singh, S. (2014). Robust multi-sensor fusion for micro aerial vehicle navigation in GPS-degraded/denied environments. In American Control Conference (ACC).

  14. Da Silva, B., Konidaris, G., & Barto, A. (2012). Learning parameterized skills. arXiv preprint arXiv:1206.6398.

  15. Eitel, A., Springenberg, J. T., Spinello, L., Riedmiller, M., & Burgard, W. (2015). Multimodal deep learning for robust RGB-D object recognition. In International Conference on Intelligent Robots and Systems (IROS).

  16. Harb, J., Bacon, P.-L., Klissarov, M., & Precup, D. (2017). When waiting is not an option: Learning options with a deliberation cost. arXiv preprint arXiv:1709.04571.

  17. Hausknecht, M., & Stone, P. (2015). Deep recurrent Q-learning for partially observable MDPs. arXiv preprint arXiv:1507.06527.

  18. He, K., Zhang, X., Ren, S., & Sun, J. (2015). Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385.

  19. Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.

  20. Kaelbling, L. P., Littman, M. L., & Cassandra, A. R. (1998). Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(1), 99–134.

  21. Kiros, R., Salakhutdinov, R., & Zemel, R. S. (2014). Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539.

  22. Konidaris, G., & Barto, A. G. (2009). Skill discovery in continuous reinforcement learning domains using skill chaining. Advances in Neural Information Processing Systems (NIPS), 1015–1023.

  23. Kulkarni, T. D., Narasimhan, K., Saeedi, A., & Tenenbaum, J. (2016). Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. Advances in Neural Information Processing Systems (NIPS), 3675–3683.

  24. Lafferty, J. D., McCallum, A., & Pereira, F. C. N. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. International Conference on Machine Learning (ICML), 282–289.

  25. Leong, Y. C., Radulescu, A., Daniel, R., DeWoskin, V., & Niv, Y. (2017). Dynamic interaction between reinforcement learning and attention in multidimensional environments. Neuron, 93(2), 451–463.

  26. Luong, M.-T., Pham, H., & Manning, C. D. (2015). Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.

  27. Lynen, S., Achtelik, M. W., Weiss, S., Chli, M., & Siegwart, R. (2013). A robust and modular multi-sensor fusion approach applied to MAV navigation. International Conference on Intelligent Robots and Systems (IROS), 3923–3929.

  28. Machado, M. C., Bellemare, M. G., & Bowling, M. (2017). A Laplacian framework for option discovery in reinforcement learning. arXiv preprint arXiv:1703.00956.

  29. Mackintosh, N. J. (1975). A theory of attention: Variations in the associability of stimuli with reinforcement. Psychological Review, 82(4), 276.

  30. Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., et al. (2016). Asynchronous methods for deep reinforcement learning. International Conference on Machine Learning (ICML), 1928–1937.

  31. Mnih, V., Heess, N., Graves, A., et al. (2014). Recurrent models of visual attention. Advances in Neural Information Processing Systems (NIPS), 2204–2212.

  32. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529–533.

  33. Nachum, O., Gu, S. S., Lee, H., & Levine, S. (2018). Data-efficient hierarchical reinforcement learning. Advances in Neural Information Processing Systems (NIPS), 3306–3317.

  34. Ngiam, J., Khosla, A., Kim, M., Nam, J., Lee, H., & Ng, A. Y. (2011). Multimodal deep learning. International Conference on Machine Learning (ICML), 689–696.

  35. Niv, Y., Daniel, R., Geana, A., Gershman, S. J., Leong, Y. C., Radulescu, A., et al. (2015). Reinforcement learning in multidimensional environments relies on attention mechanisms. Journal of Neuroscience, 35(21), 8145–8157.

  36. Nobili, S., Camurri, M., Barasuol, V., Focchi, M., Caldwell, D. G., Semini, C., & Fallon, M. (2017). Heterogeneous sensor fusion for accurate state estimation of dynamic legged robots. In Robotics: Science and Systems (RSS).

  37. Pearce, J. M., & Hall, G. (1980). A model for Pavlovian learning: Variations in the effectiveness of conditioned but not of unconditioned stimuli. Psychological Review, 87(6), 532.

  38. Pearce, J. M., & Mackintosh, N. J. (2010). Two theories of attention: A review and a possible integration. Attention and Associative Learning: From Brain to Behaviour, 11–39.

  39. Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., et al. (2017). Mastering the game of Go without human knowledge. Nature, 550(7676), 354–359.

  40. Sorokin, I., Seleznev, A., Pavlov, M., Fedorov, A., & Ignateva, A. (2015). Deep attention recurrent Q-network. arXiv preprint arXiv:1512.01693.

  41. Srivastava, N., & Salakhutdinov, R. (2014). Multimodal learning with deep Boltzmann machines. Journal of Machine Learning Research, 15, 2949–2980.

  42. Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. Cambridge, MA: MIT Press.

  43. Sutton, R. S., Precup, D., & Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1–2), 181–211.

  44. Vezhnevets, A. S., Osindero, S., Schaul, T., Heess, N., Jaderberg, M., Silver, D., & Kavukcuoglu, K. (2017). Feudal networks for hierarchical reinforcement learning. arXiv preprint arXiv:1703.01161.

  45. Vinyals, O., Kaiser, Ł., Koo, T., Petrov, S., Sutskever, I., & Hinton, G. (2015). Grammar as a foreign language. Advances in Neural Information Processing Systems (NIPS), 2773–2781.

  46. Yang, Z., Yang, D., Dyer, C., He, X., Smola, A. J., & Hovy, E. H. (2016). Hierarchical attention networks for document classification. In North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 1480–1489.

  47. Yeung, S., Russakovsky, O., Jin, N., Andriluka, M., Mori, G., & Fei-Fei, L. (2015). Every moment counts: Dense detailed labeling of actions in complex videos. International Journal of Computer Vision, 1–15.

Author information

Corresponding author

Correspondence to Dong-Ki Kim.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Jason Pazis: This work was done prior to Jason Pazis's involvement with Amazon and does not reflect the views of Amazon.

The work was supported by Boeing Research and Technology, ONR MURI Grant N000141110688, BRC Grant N000141712072, IBM (as part of the MIT-IBM Watson AI Lab initiative), and AWS Machine Learning Research Awards program. We also thank the three anonymous reviewers for their helpful suggestions.

About this article

Cite this article

Kim, DK., Omidshafiei, S., Pazis, J. et al. Crossmodal attentive skill learner: learning in Atari and beyond with audio–video inputs. Auton Agent Multi-Agent Syst 34, 16 (2020). https://doi.org/10.1007/s10458-019-09439-5
