Optimistic Monte Carlo Tree Search with Sampled Information Relaxation Dual Bounds

Operations Research (IF 2.2), Pub Date: 2020-09-11, DOI: 10.1287/opre.2019.1939
Daniel R. Jiang, Lina Al-Kanj, Warren B. Powell

Monte Carlo Tree Search (MCTS), most famously used in game-play artificial intelligence (e.g., the game of Go), is a well-known strategy for constructing approximate solutions to sequential decision problems. Its primary innovation is the use of a heuristic, known as a default policy, to obtain Monte Carlo estimates of downstream values for states in a decision tree. This information is used to iteratively expand the tree towards regions of states and actions that an optimal policy might visit. However, to guarantee convergence to the optimal action, MCTS requires the entire tree to be expanded asymptotically. In this paper, we propose a new technique called Primal-Dual MCTS that utilizes sampled information relaxation upper bounds on potential actions, creating the possibility of "ignoring" parts of the tree that stem from highly suboptimal choices. This allows us to prove that despite converging to a partial decision tree in the limit, the recommended action from Primal-Dual MCTS is optimal. The new approach shows significant promise when used to optimize the behavior of a single driver navigating a graph while operating on a ride-sharing platform. Numerical experiments on a real dataset of 7,000 trips in New Jersey suggest that Primal-Dual MCTS improves upon standard MCTS by producing deeper decision trees and exhibits a reduced sensitivity to the size of the action space.
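
The two ingredients named in the abstract, a default-policy rollout and a sampled information relaxation upper bound, can be illustrated on a toy problem. The sketch below is not the authors' algorithm or their ride-sharing model: the ring-shaped graph, the uniform rewards, the random default policy, and every function name are assumptions made for illustration, and the relaxation uses a zero penalty for simplicity. On each sampled scenario, the rollout gives the reward of a policy that cannot see future rewards, while the hindsight value gives the best reward attainable when the whole scenario is known in advance, so the averaged hindsight value bounds the averaged rollout value from above.

```python
import random

ACTIONS = (-1, 0, 1)  # hypothetical moves on a ring: left, stay, right


def sample_rewards(rng, horizon, n_nodes):
    """Sample one scenario of exogenous rewards, indexed as rewards[t][node]."""
    return [[rng.random() for _ in range(n_nodes)] for _ in range(horizon)]


def rollout_value(start, rewards, policy, rng):
    """Reward collected on one scenario by a policy that cannot see the rewards."""
    n_nodes = len(rewards[0])
    node, total = start, 0.0
    for step_rewards in rewards:
        node = (node + policy(node, rng)) % n_nodes
        total += step_rewards[node]  # reward is revealed only after moving
    return total


def hindsight_value(start, rewards):
    """Best reward on one scenario when all rewards are known in advance.

    This is a perfect information relaxation with zero penalty (an assumed
    simplification), solved by backward dynamic programming.
    """
    n_nodes = len(rewards[0])
    value = [0.0] * n_nodes  # value-to-go after the final step
    for step_rewards in reversed(rewards):
        value = [
            max(
                step_rewards[(n + a) % n_nodes] + value[(n + a) % n_nodes]
                for a in ACTIONS
            )
            for n in range(n_nodes)
        ]
    return value[start]


def random_policy(node, rng):
    """Assumed default policy: pick a move uniformly at random."""
    return rng.choice(ACTIONS)


def estimate_bounds(start=0, horizon=5, n_nodes=6, n_samples=2000, seed=0):
    """Average the rollout value and the hindsight value over sampled scenarios."""
    rng = random.Random(seed)
    lower = upper = 0.0
    for _ in range(n_samples):
        rewards = sample_rewards(rng, horizon, n_nodes)
        lower += rollout_value(start, rewards, random_policy, rng)
        upper += hindsight_value(start, rewards)
    return lower / n_samples, upper / n_samples


if __name__ == "__main__":
    lower, upper = estimate_bounds()
    print(f"default-policy estimate: {lower:.3f}")
    print(f"sampled hindsight upper bound: {upper:.3f}")
```

In this toy setting the rewards are independent across steps and revealed only after each move, so every non-anticipative policy earns 0.5 per step in expectation, and the gap between the two estimates is entirely the value of foresight under a zero penalty. A search procedure could use such an upper bound to skip an action whose bound falls below the estimated value of another action, which is the kind of pruning the abstract describes.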


Updated: 2020-09-11
