State Augmented Constrained Reinforcement Learning: Overcoming the Limitations of Learning with Rewards
arXiv - CS - Robotics Pub Date : 2021-02-23 , DOI: arxiv-2102.11941 Miguel Calvo-Fullana, Santiago Paternain, Luiz F. O. Chamon, Alejandro Ribeiro
Constrained reinforcement learning involves multiple rewards that must
individually accumulate to given thresholds. In this class of problems, we show
a simple example in which the desired optimal policy cannot be induced by any
linear combination of rewards. Hence, there exist constrained reinforcement
learning problems for which neither regularized nor classical primal-dual
methods yield optimal policies. This work addresses this shortcoming by
augmenting the state with Lagrange multipliers and reinterpreting primal-dual
methods as the portion of the dynamics that drives the multipliers' evolution.
This approach provides a systematic state augmentation procedure that is
guaranteed to solve reinforcement learning problems with constraints. Thus,
while primal-dual methods can fail at finding optimal policies, running the
dual dynamics while executing the augmented policy yields an algorithm that
provably samples actions from the optimal policy.
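The core mechanism described above can be illustrated with a minimal sketch. This is not the authors' implementation: the environment, the sigmoid policy parameterization, and the moving-average constraint estimate are all hypothetical simplifications. The sketch only shows the two interacting pieces the abstract names: a policy conditioned on the augmented state (here, the multiplier itself) and the dual dynamics that update the multiplier from the observed constraint reward.

```python
import numpy as np

def dual_update(lmbda, constraint_reward, threshold, step_size=0.1):
    """Dual dynamics (projected gradient ascent on the multiplier).

    When the constraint reward falls short of its threshold, lambda
    grows, pushing the augmented policy toward satisfying it; the
    max(0, .) projection keeps the multiplier nonnegative.
    """
    return max(0.0, lmbda + step_size * (threshold - constraint_reward))

def augmented_policy(lmbda, rng):
    """Toy two-action policy conditioned on the augmented state.

    Action 1 earns constraint reward. Its probability is a sigmoid
    in lambda, so the executed policy adapts as the multiplier
    evolves, rather than following a fixed linear reward weighting.
    """
    p_constraint_action = 1.0 / (1.0 + np.exp(-lmbda))
    return 1 if rng.random() < p_constraint_action else 0

rng = np.random.default_rng(0)
lmbda, threshold = 0.0, 0.7  # require constraint reward rate >= 0.7
rewards = []
for t in range(2000):
    action = augmented_policy(lmbda, rng)
    rewards.append(float(action == 1))  # toy constraint reward signal
    # Estimate the accumulated constraint reward with a moving average
    # (a simplification of the thresholded accumulation in the paper).
    avg = float(np.mean(rewards[-100:]))
    lmbda = dual_update(lmbda, avg, threshold)

print(round(float(np.mean(rewards[-500:])), 2))  # long-run rate near 0.7
```

Running the dual dynamics while executing the multiplier-conditioned policy drives the long-run constraint reward rate toward the threshold, which is the behavior the abstract attributes to the state-augmented approach.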
Updated: 2021-02-25