Reducing Estimation Bias via Triplet-Average Deep Deterministic Policy Gradient
IEEE Transactions on Neural Networks and Learning Systems (IF 10.2) Pub Date: 2020-01-14, DOI: 10.1109/tnnls.2019.2959129
Dongming Wu, Xingping Dong, Jianbing Shen, Steven C. H. Hoi

The overestimation caused by function approximation is a well-known property of Q-learning algorithms, especially in single-critic models, and it leads to poor performance in practical tasks. However, the opposite property, underestimation, which often occurs in Q-learning methods with double critics, has been largely left untouched. In this article, we investigate the underestimation phenomenon in the recent twin delayed deep deterministic (TD3) actor-critic algorithm and theoretically demonstrate its existence. We also observe that this underestimation bias does indeed hurt performance in various experiments. Considering the opposite biases of single-critic and double-critic methods, we propose a novel triplet-average deep deterministic policy gradient algorithm that takes a weighted action value over three target critics to reduce the estimation bias. Given the connection between estimation bias and approximation error, we further suggest averaging previous target values to reduce the per-update error and improve performance. Extensive empirical results on various continuous control tasks in OpenAI Gym show that our approach outperforms state-of-the-art methods.
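The abstract describes two mechanisms (a weighted action value over three target critics, and averaging of previous target values) without giving the exact update rule. The NumPy sketch below illustrates one plausible reading: the TD3-style minimum of two target critics is pessimistic (tends to underestimate), a single critic is optimistic (tends to overestimate), so a weighted mix can pull the target toward the true value. The pairing choice, the mixing weight `beta`, and the moving-average window `k` are assumptions for illustration, not the paper's stated formulation.

```python
import numpy as np

def triplet_average_target(reward, not_done, gamma, q1, q2, q3, beta=0.7):
    """Hypothetical triplet-average target value.

    q1, q2, q3 are the three target critics' values at the next
    state-action pair. The TD3-style min(q1, q2) underestimates,
    a lone critic q3 overestimates; beta (assumed) mixes the two.
    """
    weighted_q = beta * np.minimum(q1, q2) + (1.0 - beta) * q3
    return reward + not_done * gamma * weighted_q

def averaged_target(history, new_target, k=5):
    """Average the current target with up to k-1 previous targets.

    A simple moving average to smooth the per-update error; the
    paper's exact averaging scheme may differ.
    """
    history.append(new_target)
    if len(history) > k:
        history.pop(0)
    return np.mean(history, axis=0)

# Toy usage with scalar critic outputs for a single transition.
history = []
y = triplet_average_target(reward=1.0, not_done=1.0, gamma=0.99,
                           q1=10.2, q2=9.8, q3=11.0)
y_smoothed = averaged_target(history, y)
```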
