Safe Reinforcement Learning via Projection on a Safe Set: How to Achieve Optimality?
arXiv - CS - Systems and Control. Pub Date: 2020-04-02, DOI: arxiv-2004.00915
Sebastien Gros; Mario Zanon; Alberto Bemporad

For all its successes, Reinforcement Learning (RL) still struggles to deliver formal guarantees on the closed-loop behavior of the learned policy. Among other things, guaranteeing the safety of RL when it is applied to safety-critical systems is a very active research topic. Some recent contributions propose to project the inputs delivered by the learned policy onto a safe set, ensuring that the system safety is never jeopardized. Unfortunately, it is unclear whether this operation can be performed without disrupting the learning process. This paper addresses this issue. The problem is analysed in the context of $Q$-learning and policy gradient techniques. We show that the projection approach is generally disruptive in the context of $Q$-learning, though a simple alternative solves the issue, and that simple corrections can be used in the context of policy gradient methods to ensure that the policy gradients remain unbiased. The proposed results extend to safe projections based on robust MPC techniques.
Updated: 2020-04-03
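To illustrate the projection idea described in the abstract, the following is a minimal sketch (not taken from the paper): the action proposed by a learned policy is projected onto a safe set before being applied to the system. The box-shaped safe set, the `project_to_safe_set` helper, and the placeholder linear policy are all simplifying assumptions for illustration; the paper considers more general safe sets, e.g. constructed from robust MPC.

```python
import numpy as np

# Hypothetical box-shaped safe set A_safe = {a : a_min <= a <= a_max}.
# This is only an illustrative assumption; the paper's safe sets can be
# more general (e.g. derived from robust MPC).
A_MIN = np.array([-1.0, -0.5])
A_MAX = np.array([ 1.0,  0.5])

def project_to_safe_set(action: np.ndarray) -> np.ndarray:
    """Euclidean projection of a proposed action onto the box safe set."""
    return np.clip(action, A_MIN, A_MAX)

def safe_policy(state: np.ndarray, learned_policy) -> np.ndarray:
    """Evaluate the learned policy, then project its output so that the
    action actually sent to the system always lies in the safe set."""
    raw_action = learned_policy(state)
    return project_to_safe_set(raw_action)

if __name__ == "__main__":
    # Placeholder (hypothetical) linear policy, standing in for a learned one.
    rng = np.random.default_rng(0)
    K = rng.standard_normal((2, 3))
    policy = lambda s: K @ s

    state = np.array([0.8, -1.2, 0.3])
    print(safe_policy(state, policy))  # action after the safety projection
```

The paper's question is precisely whether wrapping the learned policy this way biases learning: applying the projected action while updating the learner as if the raw action had been taken can distort $Q$-learning targets and policy gradients, which is why the authors propose an alternative for $Q$-learning and gradient corrections for policy gradient methods.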

 
