Crossmodal attentive skill learner: learning in Atari and beyond with audio–video inputs

Published in Autonomous Agents and Multi-Agent Systems (2020)

Abstract

This paper introduces the Crossmodal Attentive Skill Learner (CASL), integrated with the recently introduced Asynchronous Advantage Option-Critic architecture [Harb et al. in When waiting is not an option: learning options with a deliberation cost. arXiv preprint arXiv:1709.04571, 2017] to enable hierarchical reinforcement learning across multiple sensory inputs. Agents trained using our approach learn to attend to their various sensory modalities (e.g., audio, video) at the appropriate moments, thereby executing actions based on multiple sensory streams without reliance on supervisory data. We demonstrate empirically that the sensory attention mechanism anticipates and identifies useful latent features, while filtering irrelevant sensor modalities during execution. Further, we provide concrete examples in which the approach not only improves performance in a single task, but also accelerates transfer to new tasks. We modify the Arcade Learning Environment [Bellemare et al. in J Artif Intell Res 47:253–279, 2013] to support audio queries (ALE-audio code available at https://github.com/shayegano/Arcade-Learning-Environment), and conduct evaluations of crossmodal learning in the Atari 2600 games H.E.R.O. and Amidar. Finally, building on the recent work of Babaeizadeh et al. [in: International Conference on Learning Representations (ICLR), 2017], we open-source a fast hybrid CPU–GPU implementation of CASL (CASL code available at https://github.com/shayegano/CASL).
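
To make the attention mechanism concrete, the following is a minimal sketch of crossmodal attention over per-modality feature vectors (e.g., the outputs of video and audio encoders): each modality receives a scalar score, the scores are normalized by a softmax into attention weights, and the features are fused by weighted sum before being passed downstream (e.g., to a recurrent layer). The class, names, and dimensions below are illustrative assumptions, not the released CASL implementation.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

class CrossmodalAttention:
    """Hypothetical sketch of crossmodal attention: scores each modality's
    embedding, normalizes the scores into weights, and fuses the embeddings
    by weighted sum. Not the authors' released implementation."""

    def __init__(self, feat_dim, num_modalities, seed=0):
        rng = np.random.default_rng(seed)
        # One scoring vector per modality (randomly initialized here;
        # trained jointly with the rest of the network in practice).
        self.W = rng.normal(scale=0.1, size=(num_modalities, feat_dim))
        self.b = np.zeros(num_modalities)

    def __call__(self, features):
        # features: (num_modalities, feat_dim), one embedding per sensor.
        scores = (self.W * features).sum(axis=-1) + self.b   # (num_modalities,)
        alpha = softmax(scores)                              # attention weights
        fused = (alpha[:, None] * features).sum(axis=0)      # (feat_dim,)
        return fused, alpha

# Example: one video and one audio embedding, both 128-dimensional.
attn = CrossmodalAttention(feat_dim=128, num_modalities=2)
rng = np.random.default_rng(1)
video_feat = rng.normal(size=128)
audio_feat = rng.normal(size=128)
fused, alpha = attn(np.stack([video_feat, audio_feat]))
print("attention over (video, audio):", alpha)
```

Inspecting weights like alpha over time is how one can read off which modality the agent relies on at each moment, in the spirit of the attention analysis described in the abstract.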

Notes

  1. Note that HRL with a single option is equivalent to standard RL with primitive actions. A single-option CASL therefore corresponds to A3C [30] augmented with the crossmodal attention mechanism, as sketched below.
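
As a concrete reading of this note, one decision step of an option hierarchy can be sketched as follows: the policy over options picks an option, then that option's intra-option policy picks a primitive action. With exactly one option, the first choice is forced and the agent behaves as a flat actor-critic (A3C) over primitive actions. All function and variable names below are hypothetical, not the released code.

```python
import numpy as np

rng = np.random.default_rng(0)

def act(policy_over_options, option_policies, fused_feat):
    """One illustrative decision step of an option hierarchy: sample an
    option, then sample a primitive action from its intra-option policy."""
    omega = rng.choice(len(option_policies), p=policy_over_options(fused_feat))
    action_probs = option_policies[omega](fused_feat)
    return rng.choice(len(action_probs), p=action_probs)

# Single-option case: the policy over options must pick option 0, so the
# step reduces to sampling from one intra-option policy over 4 actions.
single_option = [lambda feat: np.full(4, 0.25)]   # uniform, for illustration
forced_choice = lambda feat: np.array([1.0])
action = act(forced_choice, single_option, fused_feat=np.zeros(8))
print("primitive action:", action)
```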

References

  1. Al-Shedivat, M., Bansal, T., Burda, Y., Sutskever, I., Mordatch, I., & Abbeel, P. (2018). Continuous adaptation via meta-learning in nonstationary and competitive environments. In International Conference on Learning Representations (ICLR).

  2. Alvis, C. D., Ott, L., & Ramos, F. (2017). Online learning for scene segmentation with laser-constrained CRFs. International Conference on Robotics and Automation (ICRA), 4639–4643.

  3. Andreas, J., Klein, D., & Levine, S. (2016). Modular multitask reinforcement learning with policy sketches. arXiv preprint arXiv:1611.01796.

  4. Ba, J., Mnih, V., & Kavukcuoglu, K. (2014). Multiple object recognition with visual attention. arXiv preprint arXiv:1412.7755.

  5. Babaeizadeh, M., Frosio, I., Tyree, S., Clemons, J., & Kautz, J. (2017). Reinforcement learning through asynchronous advantage actor-critic on a GPU. In International Conference on Learning Representations (ICLR).

  6. Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

  7. Beal, M. J., Attias, H., & Jojic, N. (2002). Audio-video sensor fusion with probabilistic graphical models. European Conference on Computer Vision (ECCV), 736–750.

  8. Bellemare, M. G., Naddaf, Y., Veness, J., & Bowling, M. (2013). The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47, 253–279.

  9. Bengio, S. (2002). An asynchronous hidden Markov model for audio-visual speech recognition. Advances in Neural Information Processing Systems (NIPS), 1237–1244.

  10. Cadena, C., & Košecká, J. (2014). Semantic segmentation with heterogeneous sensor coverages. International Conference on Robotics and Automation (ICRA), 2639–2645.

  11. Caglayan, O., Barrault, L., & Bougares, F. (2016). Multimodal attention for neural machine translation. arXiv preprint arXiv:1609.03976.

  12. Carrasco, M. (2011). Visual attention: The past 25 years. Vision Research, 51(13), 1484–1525.

  13. Chambers, A. D., Scherer, S., Yoder, L., Jain, S., Nuske, S. T., & Singh, S. (2014). Robust multi-sensor fusion for micro aerial vehicle navigation in GPS-degraded/denied environments. In American Control Conference (ACC).

  14. Da Silva, B., Konidaris, G., & Barto, A. (2012). Learning parameterized skills. arXiv preprint arXiv:1206.6398.

  15. Eitel, A., Springenberg, J. T., Spinello, L., Riedmiller, M., & Burgard, W. (2015). Multimodal deep learning for robust RGB-D object recognition. In International Conference on Intelligent Robots and Systems (IROS).

  16. Harb, J., Bacon, P.-L., Klissarov, M., & Precup, D. (2017). When waiting is not an option: Learning options with a deliberation cost. arXiv preprint arXiv:1709.04571.

  17. Hausknecht, M., & Stone, P. (2015). Deep recurrent Q-learning for partially observable MDPs. arXiv preprint arXiv:1507.06527.

  18. He, K., Zhang, X., Ren, S., & Sun, J. (2015). Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385.

  19. Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.

  20. Kaelbling, L. P., Littman, M. L., & Cassandra, A. R. (1998). Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(1), 99–134.

  21. Kiros, R., Salakhutdinov, R., & Zemel, R. S. (2014). Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539.

  22. Konidaris, G., & Barto, A. G. (2009). Skill discovery in continuous reinforcement learning domains using skill chaining. Advances in Neural Information Processing Systems (NIPS), 1015–1023.

  23. Kulkarni, T. D., Narasimhan, K., Saeedi, A., & Tenenbaum, J. (2016). Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. Advances in Neural Information Processing Systems (NIPS), 3675–3683.

  24. Lafferty, J. D., McCallum, A., & Pereira, F. C. N. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. International Conference on Machine Learning (ICML), 282–289.

  25. Leong, Y. C., Radulescu, A., Daniel, R., DeWoskin, V., & Niv, Y. (2017). Dynamic interaction between reinforcement learning and attention in multidimensional environments. Neuron, 93(2), 451–463.

  26. Luong, M.-T., Pham, H., & Manning, C. D. (2015). Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.

  27. Lynen, S., Achtelik, M. W., Weiss, S., Chli, M., & Siegwart, R. (2013). A robust and modular multi-sensor fusion approach applied to MAV navigation. International Conference on Intelligent Robots and Systems (IROS), 3923–3929.

  28. Machado, M. C., Bellemare, M. G., & Bowling, M. (2017). A Laplacian framework for option discovery in reinforcement learning. arXiv preprint arXiv:1703.00956.

  29. Mackintosh, N. J. (1975). A theory of attention: Variations in the associability of stimuli with reinforcement. Psychological Review, 82(4), 276.

  30. Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., et al. (2016). Asynchronous methods for deep reinforcement learning. International Conference on Machine Learning (ICML), 1928–1937.

  31. Mnih, V., Heess, N., Graves, A., et al. (2014). Recurrent models of visual attention. Advances in Neural Information Processing Systems (NIPS), 2204–2212.

  32. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529–533.

  33. Nachum, O., Gu, S. S., Lee, H., & Levine, S. (2018). Data-efficient hierarchical reinforcement learning. Advances in Neural Information Processing Systems (NIPS), 3306–3317.

  34. Ngiam, J., Khosla, A., Kim, M., Nam, J., Lee, H., & Ng, A. Y. (2011). Multimodal deep learning. International Conference on Machine Learning (ICML), 689–696.

  35. Niv, Y., Daniel, R., Geana, A., Gershman, S. J., Leong, Y. C., Radulescu, A., et al. (2015). Reinforcement learning in multidimensional environments relies on attention mechanisms. Journal of Neuroscience, 35(21), 8145–8157.

  36. Nobili, S., Camurri, M., Barasuol, V., Focchi, M., Caldwell, D. G., Semini, C., & Fallon, M. (2017). Heterogeneous sensor fusion for accurate state estimation of dynamic legged robots. In Robotics: Science and Systems (RSS).

  37. Pearce, J. M., & Hall, G. (1980). A model for Pavlovian learning: Variations in the effectiveness of conditioned but not of unconditioned stimuli. Psychological Review, 87(6), 532.

  38. Pearce, J. M., & Mackintosh, N. J. (2010). Two theories of attention: A review and a possible integration. Attention and Associative Learning: From Brain to Behaviour, 11–39.

  39. Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., et al. (2017). Mastering the game of Go without human knowledge. Nature, 550(7676), 354–359.

  40. Sorokin, I., Seleznev, A., Pavlov, M., Fedorov, A., & Ignateva, A. (2015). Deep attention recurrent Q-network. arXiv preprint arXiv:1512.01693.

  41. Srivastava, N., & Salakhutdinov, R. (2014). Multimodal learning with deep Boltzmann machines. Journal of Machine Learning Research, 15, 2949–2980.

  42. Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. Cambridge, MA: MIT Press.

  43. Sutton, R. S., Precup, D., & Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1–2), 181–211.

  44. Vezhnevets, A. S., Osindero, S., Schaul, T., Heess, N., Jaderberg, M., Silver, D., & Kavukcuoglu, K. (2017). Feudal networks for hierarchical reinforcement learning. arXiv preprint arXiv:1703.01161.

  45. Vinyals, O., Kaiser, Ł., Koo, T., Petrov, S., Sutskever, I., & Hinton, G. (2015). Grammar as a foreign language. Advances in Neural Information Processing Systems (NIPS), 2773–2781.

  46. Yang, Z., Yang, D., Dyer, C., He, X., Smola, A. J., & Hovy, E. H. (2016). Hierarchical attention networks for document classification. In North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 1480–1489.

  47. Yeung, S., Russakovsky, O., Jin, N., Andriluka, M., Mori, G., & Fei-Fei, L. (2015). Every moment counts: Dense detailed labeling of actions in complex videos. International Journal of Computer Vision, 1–15.

Author information

Corresponding author

Correspondence to Dong-Ki Kim.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Jason Pazis: This work was done prior to Jason Pazis's involvement with Amazon and does not reflect the views of Amazon.

The work was supported by Boeing Research and Technology, ONR MURI Grant N000141110688, BRC Grant N000141712072, IBM (as part of the MIT-IBM Watson AI Lab initiative), and AWS Machine Learning Research Awards program. We also thank the three anonymous reviewers for their helpful suggestions.

About this article

Cite this article

Kim, DK., Omidshafiei, S., Pazis, J. et al. Crossmodal attentive skill learner: learning in Atari and beyond with audio–video inputs. Auton Agent Multi-Agent Syst 34, 16 (2020). https://doi.org/10.1007/s10458-019-09439-5
