Causal Policy Gradients
arXiv - CS - Multiagent Systems Pub Date : 2021-02-20 , DOI: arxiv-2102.10362
Thomas Spooner, Nelson Vadori, Sumitra Ganesh

Policy gradient methods can solve complex tasks but often fail when the dimensionality of the action space or the number of objectives grows very large. This occurs, in part, because the variance of score-based gradient estimators scales quadratically with the number of targets. In this paper, we propose a causal baseline which exploits independence structure encoded in a novel action-target influence network. Causal policy gradients (CPGs), which follow, provide a common framework for analysing key state-of-the-art algorithms, are shown to generalise traditional policy gradients, and yield a principled way of incorporating prior knowledge of a problem domain's generative processes. We provide an analysis of the proposed estimator and identify the conditions under which variance is guaranteed to improve. The algorithmic aspects of CPGs are also discussed, including optimal policy factorisations, their complexity, and the use of conditioning to efficiently scale to extremely large, concurrent tasks. The performance advantages of two variants of the algorithm are demonstrated on large-scale bandit and concurrent inventory management problems.
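
For reference, the standard score-function (REINFORCE) estimator with a baseline b, written here in generic notation not taken from the paper, is

\nabla_\theta J(\theta) = \mathbb{E}_{a \sim \pi_\theta}\left[ \big(R(a) - b\big)\, \nabla_\theta \log \pi_\theta(a) \right].

When the return decomposes over M targets, R(a) = \sum_{m=1}^{M} R_m(a), every target's noise is multiplied by the full score term, which is the source of the quadratic variance growth mentioned in the abstract. The causal idea described above is that, for a factorised policy \pi_\theta(a) = \prod_i \pi_\theta(a_i), each action component's score need only be paired with the targets it can actually influence:

\nabla_\theta J(\theta) = \mathbb{E}\left[ \sum_i \nabla_\theta \log \pi_\theta(a_i) \sum_{m \in \mathcal{C}(i)} \big( R_m(a) - b_{i,m} \big) \right],

where \mathcal{C}(i) denotes the set of targets influenced by a_i in the action-target influence network and b_{i,m} is a per-pair baseline; dropping non-influenced targets preserves unbiasedness because \mathbb{E}[\nabla_\theta \log \pi_\theta(a_i)] = 0. This is an illustrative sketch of the mechanism, not the paper's exact estimator or notation.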

Updated: 2021-02-23