RLCFR: Minimize Counterfactual Regret by Deep Reinforcement Learning
arXiv - CS - Computer Science and Game Theory. Pub Date: 2020-09-10. DOI: arxiv-2009.06373
Huale Li, Xuan Wang, Fengwei Jia, Yifan Li, Yulin Wu, Jiajia Zhang, Shuhan Qi

Counterfactual regret minimization (CFR) is a popular method for solving decision-making problems in two-player zero-sum games with imperfect information. Unlike existing studies, which mostly focus on solving larger-scale problems or accelerating convergence, we propose a framework, RLCFR, that aims to improve the generalization ability of the CFR method. In RLCFR, the game strategy is solved by CFR within a reinforcement learning framework, and the dynamic procedure of iterative strategy updating is modeled as a Markov decision process (MDP). RLCFR then learns a policy that selects the appropriate regret-updating rule at each step of the iteration. In addition, a stepwise reward function, proportional to how good the iteration strategy is at each step, is formulated to learn the action policy. Extensive experimental results on various games show that the generalization ability of our method is significantly improved compared with existing state-of-the-art methods.
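To make the framework concrete, here is a minimal, self-contained sketch of the idea described above, not the authors' implementation: a toy two-player zero-sum game (rock-paper-scissors), a hypothetical two-rule action set (plain CFR accumulation versus CFR+-style clipping), a tabular Q-learning agent standing in for the paper's deep network, and a stepwise reward equal to the per-iteration drop in exploitability of the average strategy. All identifiers and hyperparameters below are illustrative assumptions.

```python
# Toy sketch of the RLCFR idea (our illustration, not the authors' code):
# iterative regret updating is treated as an MDP whose agent picks an
# update rule each iteration and is rewarded by the drop in exploitability.
import numpy as np

rng = np.random.default_rng(0)

# Rock-paper-scissors payoff for the row player (zero-sum, antisymmetric).
PAYOFF = np.array([[ 0., -1.,  1.],
                   [ 1.,  0., -1.],
                   [-1.,  1.,  0.]])

def regret_matching(cum_regret):
    """Play positive regrets in proportion; fall back to uniform."""
    pos = np.maximum(cum_regret, 0.0)
    return pos / pos.sum() if pos.sum() > 1e-12 else np.full(3, 1 / 3)

def exploitability(strategy):
    """Best pure-response value against `strategy`; 0 at the Nash equilibrium."""
    return float(np.max(PAYOFF @ strategy))

# Hypothetical two-rule action set: plain CFR accumulation vs. CFR+-style
# clipping of negative cumulative regrets.
UPDATE_RULES = [
    lambda cum, inst: cum + inst,
    lambda cum, inst: np.maximum(cum + inst, 0.0),
]

# Tabular Q-learner over a coarse bucketing of the current exploitability.
Q = np.zeros((10, len(UPDATE_RULES)))
ALPHA, GAMMA, EPS = 0.1, 0.9, 0.1
bucket = lambda e: min(int(e * 10), 9)

cum_regret = rng.random(3)          # asymmetric start so the dynamics move
strategy_sum = np.zeros(3)
state, prev_expl, action = 9, 1.0, 0

for t in range(1, 2001):
    strategy = regret_matching(cum_regret)   # current iterate
    strategy_sum += strategy
    expl = exploitability(strategy_sum / t)  # quality of the average strategy

    # Stepwise reward: proportional to the improvement the last update made.
    reward = prev_expl - expl
    next_state = bucket(expl)
    if t > 1:
        Q[state, action] += ALPHA * (reward + GAMMA * Q[next_state].max()
                                     - Q[state, action])
    state, prev_expl = next_state, expl

    # The agent chooses which regret-update rule to apply this iteration.
    if rng.random() < EPS:
        action = int(rng.integers(len(UPDATE_RULES)))
    else:
        action = int(np.argmax(Q[state]))
    # Instantaneous regret in symmetric self-play: u(a, s) - u(s, s) = (A s)_a,
    # since u(s, s) = 0 for an antisymmetric payoff matrix.
    cum_regret = UPDATE_RULES[action](cum_regret, PAYOFF @ strategy)

print("average strategy:", strategy_sum / 2000)   # ~ uniform (1/3, 1/3, 1/3)
print("exploitability:", exploitability(strategy_sum / 2000))
```

In this MDP view, each regret-update iteration is one environment step: the state summarizes how good the current average strategy is, the action is the choice of update rule, and the reward signals whether that choice moved the average strategy closer to equilibrium.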

Last updated: 2020-09-15