Value targets in off-policy AlphaZero: a new greedy backup
Neural Computing and Applications (IF 6) Pub Date: 2021-06-16, DOI: 10.1007/s00521-021-05928-5
Daniel Willemsen , Hendrik Baier , Michael Kaisers

This article presents and evaluates a family of AlphaZero value targets, subsuming previous variants and introducing AlphaZero with greedy backups (A0GB). Current state-of-the-art algorithms for playing board games use sample-based planning, such as Monte Carlo Tree Search (MCTS), combined with deep neural networks (NN) to approximate the value function. These algorithms, of which AlphaZero is a prominent example, are computationally extremely expensive to train, due to their reliance on many neural network evaluations. This limits their practical performance. We improve the training process of AlphaZero by using more effective training targets for the neural network. We introduce a three-dimensional space to describe a family of training targets, covering the original AlphaZero training target as well as the soft-Z and A0C variants from the literature. We demonstrate that A0GB, using a specific new value target from this family, is able to find the optimal policy in a small tabular domain, whereas the original AlphaZero target fails to do so. In addition, we show that soft-Z, A0C and A0GB achieve better performance and faster training than the original AlphaZero target on two benchmark board games (Connect-Four and Breakthrough). Finally, we juxtapose tabular learning with neural network-based value function approximation in Tic-Tac-Toe, and compare the effects of learning targets therein.
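To make the contrast between these value targets concrete, here is a minimal Python sketch of three ways a training target for the value network could be extracted from a self-play game, assuming a toy MCTS node structure. The names (Node, q, n, the target functions) are illustrative assumptions, not the authors' code; the exact A0GB backup rule and the sign handling for alternating players in the paper are more involved than this simplification.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Node:
    """Toy MCTS node: mean simulation return, visit count, children by action."""
    q: float                                   # mean value estimate at this node
    n: int                                     # visit count
    children: Dict[int, "Node"] = field(default_factory=dict)


def alphazero_target(game_outcome: float) -> float:
    # Original AlphaZero: the value target is the final game outcome z,
    # independent of what the search tree estimated along the way.
    return game_outcome


def root_value_target(root: Node) -> float:
    # soft-Z / A0C-style target (as grouped in the paper's target family):
    # use the root's own MCTS value estimate instead of the game outcome.
    return root.q


def greedy_backup_target(root: Node) -> float:
    # A0GB-style greedy backup (one simplified reading): follow the
    # most-visited child at each level and back up the value estimate of the
    # node reached, so the target reflects the greedy policy rather than the
    # exploratory self-play policy. Sign flips between players are omitted.
    node = root
    while node.children:
        node = max(node.children.values(), key=lambda c: c.n)
    return node.q
```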




Updated: 2021-06-16