Probabilistic Policy Reuse for Safe Reinforcement Learning
ACM Transactions on Autonomous and Adaptive Systems (IF 2.2) Pub Date: 2019-03-15, DOI: 10.1145/3310090
Javier García, Fernando Fernández

This work introduces Policy Reuse for Safe Reinforcement Learning, an algorithm that combines Probabilistic Policy Reuse and teacher advice for safe exploration in dangerous, continuous-state and continuous-action reinforcement learning problems in which the dynamics are reasonably smooth and the state space is Euclidean. The algorithm uses a continuous, monotonically increasing risk function that estimates the probability of ending up in a failure state from a given state. This risk function is defined in terms of how far the state is from the region of the state space already known to the learning agent. Probabilistic Policy Reuse is used to safely balance the exploitation of previously learned knowledge, the exploration of new actions, and requests for teacher advice in parts of the state space considered dangerous. Specifically, the π-reuse exploration strategy is used. Experiments on the helicopter hover task and a business management problem show that the π-reuse exploration strategy can completely avoid visits to undesirable situations while maintaining the performance (in terms of the classical long-term accumulated reward) of the final policy.
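The sketch below illustrates, under stated assumptions, how a distance-based risk function and a π-reuse-style action selection could fit together; it is not the authors' implementation. Names such as `known_states`, `risk_threshold`, `teacher_policy`, and the distance thresholds `d_min`/`d_max` are illustrative assumptions, not details from the paper.

```python
import numpy as np


def risk(state, known_states, d_min=0.5, d_max=2.0):
    """Monotonically increasing risk in [0, 1], based on the Euclidean
    distance from `state` to the closest state the agent has already visited.
    The thresholds d_min/d_max are hypothetical tuning parameters."""
    if len(known_states) == 0:
        return 1.0  # nothing is known yet: treat every state as risky
    d = np.min(np.linalg.norm(np.asarray(known_states) - np.asarray(state), axis=1))
    # 0 risk while close to known experience, rising to 1 beyond d_max.
    return float(np.clip((d - d_min) / (d_max - d_min), 0.0, 1.0))


def pi_reuse_action(state, greedy_policy, teacher_policy, known_states,
                    psi=0.5, epsilon=0.1, risk_threshold=0.8, rng=None):
    """π-reuse-style selection: reuse the teacher with probability psi
    (or whenever the state looks risky), otherwise act epsilon-greedily
    on the learned policy."""
    rng = rng or np.random.default_rng()
    if risk(state, known_states) >= risk_threshold or rng.random() < psi:
        return teacher_policy(state)      # request/reuse teacher advice
    if rng.random() < epsilon:
        return rng.uniform(-1.0, 1.0)     # explore a new (continuous) action
    return greedy_policy(state)           # exploit learned knowledge
```

In this reading, the teacher acts as the reused policy of π-reuse: it is followed probabilistically in general and deterministically in states the risk function flags as dangerous, while the learned policy is exploited elsewhere.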
