Obtaining accurate estimated action values in categorical distributional reinforcement learning
Knowledge-Based Systems (IF 7.2) Pub Date: 2020-01-18, DOI: 10.1016/j.knosys.2020.105511
Yingnan Zhao, Peng Liu, Chenjia Bai, Wei Zhao, Xianglong Tang

Categorical Distributional Reinforcement Learning (CDRL) models the full distribution of returns with a categorical distribution over evenly spaced outcomes and achieves state-of-the-art empirical performance. However, inappropriate bounds in CDRL can produce inaccurate estimated action values, which degrade the policy update step and the final performance. In CDRL, the bounds of the distribution specify the range of action values the agent can obtain in a task, independent of the policy's performance and of the particular state–action pairs. The action values the agent actually obtains are often far from these bounds, which reduces the accuracy of the estimated action values. This paper describes a method for obtaining more accurate estimated action values in CDRL through adaptive bounds, which lets the bounds of the distribution adjust automatically to the policy and the state–action pairs. To achieve this, we save the weights of the critic network over a fixed number of time steps and then apply a bootstrapping method. This yields confidence intervals for the upper and lower bounds, and the endpoints of these intervals become the new bounds of the distribution. The new bounds fit the agent better and yield more accurate estimated action values. To further correct the estimated action values, a distributional target policy is proposed as a smoothing method. Experiments show that our method outperforms many state-of-the-art methods on OpenAI Gym tasks.
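
To make the bootstrapping step concrete, the sketch below illustrates one way the adaptive bounds could be computed, assuming the saved critic snapshots have already been evaluated into a buffer of recent action-value estimates. The function names (bootstrap_bounds, categorical_support) and all parameter choices are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def bootstrap_bounds(recent_values, n_boot=200, alpha=0.05, rng=None):
    """Bootstrap confidence intervals for the lower/upper bound of the value
    distribution from a buffer of recently observed value estimates.

    Assumption: the buffer stands in for values produced by the saved critic
    weights described in the abstract. Returns (v_min, v_max): the lower end
    of the CI on the minimum and the upper end of the CI on the maximum.
    """
    rng = np.random.default_rng() if rng is None else rng
    values = np.asarray(recent_values, dtype=np.float64)
    mins, maxs = [], []
    for _ in range(n_boot):
        sample = rng.choice(values, size=values.size, replace=True)
        mins.append(sample.min())
        maxs.append(sample.max())
    v_min = np.quantile(mins, alpha / 2)        # lower endpoint of CI on the minimum
    v_max = np.quantile(maxs, 1 - alpha / 2)    # upper endpoint of CI on the maximum
    return float(v_min), float(v_max)

def categorical_support(v_min, v_max, n_atoms=51):
    """Evenly spaced atoms of the categorical distribution on [v_min, v_max]."""
    return np.linspace(v_min, v_max, n_atoms)

if __name__ == "__main__":
    # Hypothetical buffer of value estimates collected over recent time steps.
    buffer = np.random.normal(loc=50.0, scale=10.0, size=1000)
    v_min, v_max = bootstrap_bounds(buffer)
    atoms = categorical_support(v_min, v_max)
    print(v_min, v_max, atoms[:3])
```

Under these assumptions, the resulting [v_min, v_max] would replace the fixed support bounds used by standard CDRL before the projected distributional Bellman update.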




Updated: 2020-01-18