Competitive Policy Optimization
arXiv - CS - Multiagent Systems. Pub Date: 2020-06-18, DOI: arxiv-2006.10611
Manish Prajapat, Kamyar Azizzadenesheli, Alexander Liniger, Yisong Yue, Anima Anandkumar

A core challenge in policy optimization in competitive Markov decision processes is the design of efficient optimization methods with desirable convergence and stability properties. To tackle this, we propose competitive policy optimization (CoPO), a novel policy gradient approach that exploits the game-theoretic nature of competitive games to derive policy updates. Motivated by the competitive gradient optimization method, we derive a bilinear approximation of the game objective. In contrast, off-the-shelf policy gradient methods utilize only linear approximations, and hence do not capture interactions among the players. We instantiate CoPO in two ways: (i) competitive policy gradient, and (ii) trust-region competitive policy optimization. We theoretically study these methods, and empirically investigate their behavior on a set of comprehensive, yet challenging, competitive games. We observe that they provide stable optimization, convergence to sophisticated strategies, and higher scores when played against baseline policy gradient methods.
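To make the linear-versus-bilinear contrast concrete, the following is a minimal sketch, not the paper's exact derivation; the notation (theta and phi for the two players' policy parameters, eta(theta, phi) for the game objective) is assumed here for illustration rather than taken from the paper.

Linear local model (standard policy gradient):
\eta(\theta + \Delta\theta, \phi + \Delta\phi) \approx \eta(\theta, \phi) + \Delta\theta^\top \nabla_\theta \eta + \Delta\phi^\top \nabla_\phi \eta

Bilinear local model (the kind of approximation CoPO builds on, following competitive gradient optimization):
\eta(\theta + \Delta\theta, \phi + \Delta\phi) \approx \eta(\theta, \phi) + \Delta\theta^\top \nabla_\theta \eta + \Delta\phi^\top \nabla_\phi \eta + \Delta\theta^\top \nabla^2_{\theta\phi} \eta \, \Delta\phi

The mixed second-order term \Delta\theta^\top \nabla^2_{\theta\phi} \eta \, \Delta\phi is what lets each player's update account for the opponent's simultaneous update, which the purely linear model omits.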

Updated: 2020-06-19