RMIX: Learning Risk-Sensitive Policies for Cooperative Reinforcement Learning Agents
arXiv - CS - Multiagent Systems Pub Date : 2021-02-16 , DOI: arxiv-2102.08159 Wei Qiu, Xinrun Wang, Runsheng Yu, Xu He, Rundong Wang, Bo An, Svetlana Obraztsova, Zinovi Rabinovich
Current value-based multi-agent reinforcement learning (MARL) methods optimize
individual Q values to guide individuals' behaviours via centralized training
with decentralized execution (CTDE). However, such expected, i.e.,
risk-neutral, Q values are not sufficient even with CTDE, because the
randomness of rewards and the uncertainty of environments cause these methods
to fail to train coordinated agents in complex environments. To address these
issues, we propose RMIX, a novel cooperative MARL method that applies the
Conditional Value at Risk (CVaR) measure to the learned distributions of
individuals' Q values. Specifically, we first learn the return distributions of
individuals so that CVaR can be calculated analytically for decentralized
execution. Then, to handle the temporal nature of stochastic outcomes during
execution, we propose a dynamic risk level predictor that tunes the risk level
over time. Finally, we optimize the CVaR policies: CVaR values are used to
estimate the target in the TD error during centralized training, and they also
serve as auxiliary local rewards to update the local distributions via a
quantile regression loss. Empirically, we show that our method significantly
outperforms state-of-the-art methods on challenging StarCraft II tasks,
demonstrating enhanced coordination and improved sample efficiency.
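The two distributional ingredients named in the abstract, computing CVaR analytically from a learned quantile-based return distribution and fitting that distribution with a quantile regression (Huber) loss, can be sketched as below. This is a minimal illustration under standard QR-DQN-style assumptions (N equally spaced quantile fractions), not the paper's actual implementation; the function names and discretization are our own.

```python
import numpy as np

def cvar_from_quantiles(quantiles, alpha):
    """Approximate CVaR_alpha from N quantile estimates of a return
    distribution, placed at midpoint fractions tau_i = (i + 0.5) / N.

    CVaR_alpha(Z) = (1/alpha) * integral_0^alpha F_Z^{-1}(u) du, approximated
    here by averaging the sorted quantile values whose fractions fall at or
    below alpha (the worst-case alpha-tail of the distribution).
    """
    n = len(quantiles)
    taus = (np.arange(n) + 0.5) / n            # midpoint fractions tau_i
    mask = taus <= alpha
    if not mask.any():                         # alpha below the first midpoint:
        return float(np.min(quantiles))        # fall back to the lowest quantile
    return float(np.sort(quantiles)[mask].mean())

def quantile_huber_loss(pred_quantiles, target, kappa=1.0):
    """Quantile regression (Huber) loss, as in QR-DQN, for fitting the
    predicted quantiles of a local return distribution to a scalar target."""
    n = len(pred_quantiles)
    taus = (np.arange(n) + 0.5) / n
    u = target - pred_quantiles                # TD-style residuals per quantile
    huber = np.where(np.abs(u) <= kappa,
                     0.5 * u ** 2,
                     kappa * (np.abs(u) - 0.5 * kappa))
    weight = np.abs(taus - (u < 0).astype(float))  # asymmetric tau weighting
    return float((weight * huber).mean())

# Risk-neutral limit: alpha = 1 averages over all quantiles, recovering the
# ordinary expectation; smaller alpha focuses on the low-return tail.
q = np.array([0.0, 1.0, 2.0, 3.0])
print(cvar_from_quantiles(q, 1.0))   # mean of all quantiles
print(cvar_from_quantiles(q, 0.5))   # mean of the lower half
```

Tuning `alpha` per timestep, as the abstract's dynamic risk level predictor does, interpolates between risk-averse (small `alpha`) and risk-neutral (`alpha = 1`) behaviour without changing the underlying distribution estimate.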
Updated: 2021-02-17