
Model primitives for hierarchical lifelong reinforcement learning

Published in Autonomous Agents and Multi-Agent Systems (2020).

Abstract

Learning interpretable and transferable subpolicies and performing task decomposition from a single, complex task is difficult. Such decomposition can lead to immense sample-efficiency gains in lifelong learning. Some traditional hierarchical reinforcement learning techniques enforce this decomposition in a top-down manner, while meta-learning techniques require a task distribution at hand to learn such decompositions. This article presents a framework for using diverse suboptimal world models to decompose complex task solutions into simpler modular subpolicies. Given these world models, the framework decomposes a single source task in a bottom-up manner, concurrently learning the required modular subpolicies as well as a controller to coordinate them. We perform a series of experiments on high-dimensional continuous-action control tasks to demonstrate the effectiveness of this approach at both complex single-task learning and lifelong learning. Finally, we perform ablation studies to understand the importance and robustness of different elements of the framework and the limitations of this approach.
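As a rough illustration of the idea only (not the authors' implementation), the minimal sketch below shows one way modular subpolicies could be coordinated by a gating signal derived from a set of suboptimal world models: each model scores how well it predicts an observed transition, the scores become soft responsibilities, and the responsibilities weight the corresponding subpolicies. The linear model and policy forms, the dimensions, and every function name here are illustrative assumptions.

```python
# Hypothetical sketch: gating modular subpolicies by how well each
# (deliberately simple) world model explains an observed transition.
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, ACTION_DIM, N_PRIMITIVES = 4, 2, 3

def model_prediction_error(model_params, s, a, s_next):
    """Squared prediction error of an assumed linear world model s' ~ A s + B a."""
    A, B = model_params
    pred = A @ s + B @ a
    return float(np.sum((pred - s_next) ** 2))

def gating_weights(errors, temperature=1.0):
    """Softmax over negative errors: models that predict the transition
    better receive more responsibility for it."""
    logits = -np.asarray(errors) / temperature
    logits -= logits.max()          # numerical stability
    w = np.exp(logits)
    return w / w.sum()

def mixture_action(subpolicy_params, weights, s):
    """Responsibility-weighted mixture of (deterministic linear) subpolicy actions."""
    actions = np.stack([K @ s for K in subpolicy_params])   # (N_PRIMITIVES, ACTION_DIM)
    return weights @ actions

# Toy data: random world models and subpolicies, one observed transition.
world_models = [(0.1 * rng.normal(size=(STATE_DIM, STATE_DIM)),
                 0.1 * rng.normal(size=(STATE_DIM, ACTION_DIM)))
                for _ in range(N_PRIMITIVES)]
subpolicies = [0.1 * rng.normal(size=(ACTION_DIM, STATE_DIM))
               for _ in range(N_PRIMITIVES)]

s, a, s_next = (rng.normal(size=STATE_DIM), rng.normal(size=ACTION_DIM),
                rng.normal(size=STATE_DIM))
errors = [model_prediction_error(m, s, a, s_next) for m in world_models]
w = gating_weights(errors)
print("gating weights:", np.round(w, 3))
print("mixed action:  ", np.round(mixture_action(subpolicies, w, s), 3))
```

This is meant only to make the abstract's "controller coordinating subpolicies" concrete; the paper's actual training procedure is not reproduced here.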





Acknowledgements

We are thankful to the anonymous reviewers and everyone at the Stanford Intelligent Systems Laboratory for useful comments and suggestions. This work is supported in part by DARPA under Agreement Number D17AP00032. The content is solely the responsibility of the authors and does not necessarily represent the official views of DARPA. We are also grateful for the support from Google Cloud in scaling our experiments.

Author information


Corresponding author

Correspondence to Jayesh K. Gupta.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Wu, B., Gupta, J.K. & Kochenderfer, M. Model primitives for hierarchical lifelong reinforcement learning. Auton Agent Multi-Agent Syst 34, 28 (2020). https://doi.org/10.1007/s10458-020-09451-0

