Distributed Asynchronous Policy Iteration for Sequential Zero-Sum Games and Minimax Control
arXiv - CS - Systems and Control. Pub Date: 2021-07-22, DOI: arxiv-2107.10406
Dimitri Bertsekas

We introduce a contractive abstract dynamic programming framework and related policy iteration algorithms, specifically designed for sequential zero-sum games and minimax problems with a general structure. Aside from greater generality, the advantage of our algorithms over alternatives is that they resolve some long-standing convergence difficulties of the "natural" policy iteration algorithm, which have been known since the Pollatschek and Avi-Itzhak method [PoA69] for finite-state Markov games. Mathematically, this "natural" algorithm is a form of Newton's method for solving Bellman's equation, but Newton's method, contrary to the case of single-player DP problems, is not globally convergent in the case of a minimax problem, because the Bellman operator may have components that are neither convex nor concave. Our algorithms address this difficulty by introducing alternating player choices, and by using a policy-dependent mapping with a uniform sup-norm contraction property, similar to earlier works by Bertsekas and Yu [BeY10], [BeY12], [YuB13]. Moreover, our algorithms allow a convergent and highly parallelizable implementation, which is based on state space partitioning, and distributed asynchronous policy evaluation and policy improvement operations within each set of the partition. Our framework is also suitable for the use of reinforcement learning methods based on aggregation, which may be useful for large-scale problem instances.

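The abstract's objects can be made concrete with a small sketch. The Python code below is an illustration under simplifying assumptions, not the paper's algorithm: it sets up a finite-state zero-sum Markov game restricted to pure strategies and implements the minimax Bellman operator together with the "natural" simultaneous policy iteration attributed to Pollatschek and Avi-Itzhak. All names (bellman_operator, natural_policy_iteration, g, P, alpha) are illustrative, and general Markov games require mixed strategies (matrix-game values) rather than a pure-strategy min-max. As the abstract notes, this simultaneous scheme need not converge, which is the difficulty the paper's alternating, contraction-based algorithms are designed to remove.

```python
import numpy as np

# Hypothetical toy setup: a finite-state zero-sum Markov game with pure strategies.
# g[x, u, v]    : stage cost paid by the minimizer (action u) to the maximizer (action v)
# P[x, u, v, y] : probability of moving from state x to state y under actions (u, v)
# alpha         : discount factor in (0, 1)

def bellman_operator(J, g, P, alpha):
    """Minimax Bellman operator (pure strategies):
       (T J)(x) = min_u max_v [ g(x,u,v) + alpha * sum_y P(x,u,v,y) J(y) ]."""
    Q = g + alpha * (P @ J)            # Q[x, u, v]: one-stage cost-to-go
    return Q.max(axis=2).min(axis=1)

def natural_policy_iteration(g, P, alpha, iters=50):
    """Sketch of the 'natural' (Pollatschek/Avi-Itzhak-style) policy iteration:
       evaluate the current policy pair exactly, then improve both players'
       policies simultaneously.  For illustration only; this scheme is known
       not to converge in general."""
    n, m_u, m_v, _ = P.shape
    mu = np.zeros(n, dtype=int)        # minimizer's policy
    nu = np.zeros(n, dtype=int)        # maximizer's policy
    J = np.zeros(n)
    for _ in range(iters):
        # Policy evaluation: solve (I - alpha * P_{mu,nu}) J = g_{mu,nu}
        P_pol = P[np.arange(n), mu, nu]          # (n, n) transition matrix under (mu, nu)
        g_pol = g[np.arange(n), mu, nu]          # (n,) stage costs under (mu, nu)
        J = np.linalg.solve(np.eye(n) - alpha * P_pol, g_pol)
        # Simultaneous policy improvement (pure-strategy simplification)
        Q = g + alpha * (P @ J)                  # Q[x, u, v]
        best_response_v = Q.argmax(axis=2)       # maximizer's best response per (x, u)
        new_mu = Q.max(axis=2).argmin(axis=1)    # minimizer's minimax choice
        new_nu = best_response_v[np.arange(n), new_mu]
        if np.array_equal(new_mu, mu) and np.array_equal(new_nu, nu):
            break
        mu, nu = new_mu, new_nu
    return J, mu, nu

# Tiny random instance (hypothetical data) just to exercise the sketch.
rng = np.random.default_rng(0)
n, m_u, m_v = 4, 2, 3
g = rng.standard_normal((n, m_u, m_v))
P = rng.random((n, m_u, m_v, n))
P /= P.sum(axis=-1, keepdims=True)               # normalize rows into probabilities
J, mu, nu = natural_policy_iteration(g, P, alpha=0.9)
```
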
Updated: 2021-07-23