Reinforcement Learning to Create Evaluation and Policy Functions using Minimax Tree Search in Hex
IEEE Transactions on Games (IF 2.3), Pub Date: 2020-03-01, DOI: 10.1109/tg.2019.2893343
Kei Takada , Hiroyuki Iizuka , Masahito Yamamoto

Recently, reinforcement-learning algorithms have been proposed for creating value and policy functions, and their effectiveness has been demonstrated on Go, Chess, and Shogi. In previous studies, the policy function was trained to predict the search probabilities output by Monte Carlo tree search for each move; thus, many simulations were required to obtain those search probabilities. We propose a reinforcement-learning algorithm based on self-play that creates value and policy functions such that the policy function is trained directly from game results, without search probabilities. In this study, we use Hex, a board game developed by Piet Hein, to evaluate the proposed method. We demonstrate the effectiveness of the proposed learning algorithm in terms of policy-function accuracy, and we play a tournament between the proposed computer Hex program, DeepEZO, and the 2017 world-champion programs. The tournament results show that DeepEZO outperforms all of these programs: it achieved a winning percentage of 79.3% against the world-champion program MoHex2.0 under the same search conditions on a $13 \times 13$ board. We also show that highly accurate policy functions can be created by training the policy functions to increase the number of moves to be searched in losing positions.
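The central idea is that the policy target comes directly from the game outcome rather than from MCTS visit counts. As an illustration only, below is a minimal NumPy sketch of a self-play update in that spirit, using a REINFORCE-style rule on a toy linear softmax policy over board cells. The class and function names (LinearPolicy, update_from_game) are hypothetical and are not from the paper, which trains its policy through minimax tree search rather than this simplified rule.

```python
import numpy as np

BOARD_SIZE = 13                      # 13 x 13 Hex board, as in the paper's tournament setting
N_CELLS = BOARD_SIZE * BOARD_SIZE

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

class LinearPolicy:
    """Toy softmax policy over board cells; a stand-in for a learned policy function."""
    def __init__(self, n_features, lr=0.01):
        self.w = np.zeros((N_CELLS, n_features))
        self.lr = lr

    def probs(self, features, legal_mask):
        logits = self.w @ features
        logits[~legal_mask] = -np.inf    # mask occupied cells
        return softmax(logits)

    def update_from_game(self, trajectory, winner):
        """Update directly from the game result: reinforce the winner's moves,
        discourage the loser's, with no MCTS search probabilities involved.

        `trajectory` is a list of (features, legal_mask, move, player) tuples
        recorded during one self-play game; `winner` is +1 or -1.
        """
        for features, legal_mask, move, player in trajectory:
            reward = 1.0 if player == winner else -1.0
            p = self.probs(features, legal_mask)
            grad = -p[:, None] * features[None, :]   # d log pi / d w over all cells
            grad[move] += features                   # chosen cell gets +features
            self.w += self.lr * reward * grad
```

This sketch only conveys the training signal (win/loss from self-play); the paper's actual method combines the learned policy with minimax tree search and additionally broadens the set of moves searched in losing positions.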

Updated: 2020-03-01