Exploration in policy optimization through multiple paths
Autonomous Agents and Multi-Agent Systems (IF 1.9), Pub Date: 2021-06-26, DOI: 10.1007/s10458-021-09518-6
Ling Pan, Qingpeng Cai, Longbo Huang

Recent years have witnessed tremendous improvements in deep reinforcement learning. However, a challenging problem is that an agent may suffer from inefficient exploration, particularly for on-policy methods. Previous exploration methods either rely on complex structures to estimate the novelty of states, or introduce sensitive hyper-parameters that cause instability. We propose an efficient exploration method, Multi-Path Policy Optimization (MP-PO), which does not incur high computational cost and ensures stability. MP-PO maintains an efficient mechanism that effectively utilizes a population of diverse policies to enable better exploration, especially in sparse environments. We also provide a theoretical guarantee of stable performance. We build our scheme upon two widely adopted on-policy methods, the Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) algorithms. We conduct extensive experiments on several MuJoCo tasks and their sparsified variants to fairly evaluate the proposed method. Results show that MP-PO significantly outperforms state-of-the-art exploration methods in terms of both sample efficiency and final performance.
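For intuition only, here is a minimal, self-contained sketch (not the authors' implementation) of the population-based exploration idea the abstract describes: several policies are maintained in parallel, each is updated with a PPO-style clipped surrogate objective, and the best-performing member is tracked. The toy sparse-reward bandit, the `ppo_update` helper, and the best-return selection rule are illustrative assumptions; the actual MP-PO mechanism for coordinating the policy population is specified in the paper, not here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sparse-reward bandit: K arms, only one of them ever pays off.
K, GOOD_ARM = 10, 7

def pull(arm):
    return 1.0 if arm == GOOD_ARM else 0.0

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def ppo_update(logits, arms, rewards, old_probs, clip=0.2, lr=0.5):
    """One PPO-style clipped policy-gradient step for a softmax bandit policy."""
    probs = softmax(logits)
    baseline = rewards.mean()          # simple value baseline
    grad = np.zeros_like(logits)
    for a, r, old_p in zip(arms, rewards, old_probs):
        ratio = probs[a] / old_p
        adv = r - baseline
        # Where the clipped term is active, the PPO objective has zero gradient.
        if (adv > 0 and ratio > 1 + clip) or (adv < 0 and ratio < 1 - clip):
            continue
        one_hot = np.zeros_like(logits)
        one_hot[a] = 1.0
        # d(ratio * adv)/d(logits) = adv * ratio * (one_hot - probs) for softmax.
        grad += adv * ratio * (one_hot - probs)
    return logits + lr * grad / len(arms)

# A small population of independently initialized policies ("multiple paths").
N_POLICIES, BATCH, ITERS = 4, 32, 200
population = [rng.normal(scale=0.5, size=K) for _ in range(N_POLICIES)]

best = 0
for _ in range(ITERS):
    returns = []
    for i, logits in enumerate(population):
        probs = softmax(logits)
        arms = rng.choice(K, size=BATCH, p=probs)      # on-policy rollouts
        rewards = np.array([pull(a) for a in arms])
        population[i] = ppo_update(logits, arms, rewards, probs[arms])
        returns.append(rewards.mean())
    best = int(np.argmax(returns))                     # track the best member

print(f"best policy: {best}, "
      f"P(good arm) = {softmax(population[best])[GOOD_ARM]:.3f}")
```

With a single policy, an early near-deterministic commitment to a zero-reward arm can stall learning entirely; maintaining several independently initialized policies makes it more likely that at least one member discovers the rewarding arm, which is the intuition behind exploring through multiple paths.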




Updated: 2021-06-28