RMIX: Learning Risk-Sensitive Policies for Cooperative Reinforcement Learning Agents
arXiv - CS - Multiagent Systems. Pub Date: 2021-02-16. DOI: arxiv-2102.08159
Wei Qiu, Xinrun Wang, Runsheng Yu, Xu He, Rundong Wang, Bo An, Svetlana Obraztsova, Zinovi Rabinovich

Current value-based multi-agent reinforcement learning (MARL) methods optimize individual Q values to guide individual agents' behaviours under centralized training with decentralized execution (CTDE). However, such an expected, i.e., risk-neutral, Q value is insufficient even with CTDE, because the randomness of rewards and the uncertainty of the environment cause these methods to fail at training coordinated agents in complex settings. To address these issues, we propose RMIX, a novel cooperative MARL method that applies the Conditional Value at Risk (CVaR) measure over the learned distributions of individual agents' Q values. Specifically, we first learn each agent's return distribution so that CVaR can be computed analytically for decentralized execution. Then, to handle the temporal nature of stochastic outcomes during execution, we propose a dynamic risk level predictor that tunes the risk level over time. Finally, during centralized training we optimize the CVaR policies: the CVaR values are used to estimate the target in the TD error, and they also serve as auxiliary local rewards that update the local distributions via the quantile regression loss. Empirically, we show that our method significantly outperforms state-of-the-art methods on challenging StarCraft II tasks, demonstrating enhanced coordination and improved sample efficiency.
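To make the core quantities concrete, the following is a minimal sketch, not the authors' implementation, of how CVaR can be computed analytically from a quantile-based return distribution and how such a distribution is typically trained with the quantile regression (Huber) loss. The quantile representation, the function names (cvar_from_quantiles, quantile_huber_loss), and all parameters are illustrative assumptions.

import math
import torch

def cvar_from_quantiles(quantiles: torch.Tensor, alpha: float) -> torch.Tensor:
    # CVaR_alpha estimated as the mean of the lowest ceil(alpha * N) of N
    # equally weighted quantile estimates; alpha = 1 recovers the
    # risk-neutral mean (the ordinary expected Q value).
    n = quantiles.shape[-1]
    k = max(1, math.ceil(alpha * n))
    sorted_q, _ = torch.sort(quantiles, dim=-1)
    return sorted_q[..., :k].mean(dim=-1)

def quantile_huber_loss(pred: torch.Tensor, target: torch.Tensor,
                        kappa: float = 1.0) -> torch.Tensor:
    # Quantile regression (Huber) loss between predicted quantiles
    # pred of shape (batch, N) and target samples of shape (batch, M).
    n = pred.shape[-1]
    taus = (torch.arange(n, dtype=pred.dtype, device=pred.device) + 0.5) / n
    td = target.unsqueeze(-2) - pred.unsqueeze(-1)   # (batch, N, M) pairwise TD errors
    huber = torch.where(td.abs() <= kappa,
                        0.5 * td.pow(2),
                        kappa * (td.abs() - 0.5 * kappa))
    weight = (taus.view(1, -1, 1) - (td.detach() < 0).float()).abs()
    return (weight * huber / kappa).mean()

# Example: 32 quantiles per agent; alpha = 0.25 averages only the worst 8,
# which is what makes the resulting policy risk-sensitive.
q = torch.randn(4, 32)
cvar = cvar_from_quantiles(q, alpha=0.25)

With this representation, shrinking alpha focuses the policy on lower-tail returns, while the dynamic risk level predictor described above would adjust alpha over time during execution.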

Updated: 2021-02-17