Multiagent Reinforcement Learning: Rollout and Policy Iteration
IEEE/CAA Journal of Automatica Sinica (IF 11.8), Pub Date: 2021-01-08, DOI: 10.1109/jas.2021.1003814
Dimitri Bertsekas

We discuss the solution of complex multistage decision problems using methods that are based on the idea of policy iteration (PI), i.e., start from some base policy and generate an improved policy. Rollout is the simplest method of this type, where just one improved policy is generated. We can view PI as repeated application of rollout, where the rollout policy at each iteration serves as the base policy for the next iteration. In contrast with PI, rollout has a robustness property: it can be applied on-line and is suitable for on-line replanning. Moreover, rollout can use as base policy one of the policies produced by PI, thereby improving on that policy. This is the type of scheme underlying the prominently successful AlphaZero chess program. In this paper we focus on rollout and PI-like methods for problems where the control consists of multiple components, each selected (conceptually) by a separate agent. This is the class of multiagent problems where the agents have a shared objective function and shared, perfect state information. Based on a problem reformulation that trades off control space complexity with state space complexity, we develop an approach whereby, at every stage, the agents sequentially (one at a time) execute a local rollout algorithm that uses a base policy together with some coordinating information from the other agents. The amount of total computation required at every stage grows linearly with the number of agents. By contrast, in the standard rollout algorithm, the amount of total computation grows exponentially with the number of agents. Despite the dramatic reduction in required computation, we show that our multiagent rollout algorithm has the fundamental cost improvement property of standard rollout: it guarantees improved performance relative to the base policy. We also discuss autonomous multiagent rollout schemes that allow the agents to make decisions autonomously through the use of precomputed signaling information, which is sufficient to maintain the cost improvement property, without any on-line coordination of control selection between the agents. For discounted and other infinite horizon problems, we also consider exact and approximate PI algorithms involving a new type of one-agent-at-a-time policy improvement operation. For one of our PI algorithms, we prove convergence to an agent-by-agent optimal policy, thus establishing a connection with the theory of teams. For another PI algorithm, which is executed over a more complex state space, we prove convergence to an optimal policy. Approximate forms of these algorithms are also given, based on the use of policy and value neural networks. These PI algorithms, in both their exact and their approximate forms, are strictly off-line methods, but they can be used to provide a base policy for use in an on-line multiagent rollout scheme.
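To make the one-agent-at-a-time scheme concrete, the following is a minimal Python sketch of a single stage of multiagent rollout, not an implementation from the paper. The helper names base_policy, candidate_controls, and simulate_cost are hypothetical placeholders: base_policy(state) is assumed to return the base policy's full m-component control, candidate_controls(state, i) enumerates the choices available to agent i, and simulate_cost(state, control, base_policy) estimates the cost of applying the given control now and following the base policy thereafter (e.g., by simulation).

```python
def multiagent_rollout_control(state, m, base_policy,
                               candidate_controls, simulate_cost):
    """One stage of one-agent-at-a-time rollout (a sketch, not the paper's code).

    Agents optimize their control components sequentially: agent i keeps the
    components already fixed by agents 0..i-1, leaves components i+1..m-1 at
    the base policy's choice, and picks its own component by minimizing the
    simulated cost of following the base policy afterwards.
    """
    control = list(base_policy(state))            # start from the base policy's control
    for i in range(m):                            # agent i selects its component
        # the base component is always a candidate, which preserves the
        # cost improvement property relative to the base policy
        best_u = control[i]
        best_cost = simulate_cost(state, tuple(control), base_policy)
        for u in candidate_controls(state, i):
            trial = list(control)
            trial[i] = u                          # earlier agents' choices stay fixed,
                                                  # later components stay at the base policy
            cost = simulate_cost(state, tuple(trial), base_policy)
            if cost < best_cost:
                best_u, best_cost = u, cost
        control[i] = best_u                       # fix component i before moving on
    return tuple(control)
```

Under these assumptions the per-stage work is the sum over agents of their candidate-set sizes (one simulation per candidate), i.e., linear in the number of agents, whereas standard rollout would search the Cartesian product of all agents' candidate sets, which grows exponentially with the number of agents.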

Updated: 2021-01-12