Importance sampling in reinforcement learning with an estimated behavior policy
Machine Learning (IF 4.3) Pub Date: 2021-05-07, DOI: 10.1007/s10994-020-05938-9
Josiah P. Hanna, Scott Niekum, Peter Stone

In reinforcement learning, importance sampling is a widely used method for evaluating an expectation under the data distribution of one policy when the data has in fact been generated by a different policy. Importance sampling requires computing the likelihood ratio between the action probabilities of a target policy and those of the data-producing behavior policy. In this article, we study importance sampling where the behavior policy's action probabilities are replaced by the maximum likelihood estimate of these probabilities under the observed data. We show that this general technique reduces variance due to sampling error in Monte Carlo-style estimators. We introduce two novel estimators that use this technique to estimate expected values that arise in the RL literature. We find that these general estimators reduce the variance of Monte Carlo sampling methods, leading to faster learning for policy gradient algorithms and more accurate off-policy policy evaluation. We also provide theoretical analysis showing that our new estimators are consistent and have asymptotically lower variance than Monte Carlo estimators.
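The core idea is easiest to see in the tabular case: instead of dividing by the behavior policy's true action probabilities, divide by their maximum likelihood estimate, i.e., the empirical per-state action frequencies computed from the same observed data. The sketch below is a minimal illustration under assumptions not stated in the abstract (discrete states and actions, undiscounted returns, trajectories given as lists of (state, action, reward) tuples); the function names `ordinary_is_estimate`, `estimated_behavior_is_estimate`, `pi_target`, and `pi_behavior` are hypothetical and do not correspond to the authors' two estimators.

```python
import numpy as np
from collections import defaultdict

def ordinary_is_estimate(trajectories, pi_target, pi_behavior):
    """Ordinary Monte Carlo importance sampling: weight each trajectory's
    return by the product of target / true-behavior action probabilities."""
    estimates = []
    for traj in trajectories:
        ratio, ret = 1.0, 0.0
        for (s, a, r) in traj:
            ratio *= pi_target(a, s) / pi_behavior(a, s)
            ret += r
        estimates.append(ratio * ret)
    return np.mean(estimates)

def estimated_behavior_is_estimate(trajectories, pi_target):
    """Importance sampling with an estimated behavior policy: the true
    behavior probabilities are replaced by their maximum likelihood
    (count-based) estimate computed from the observed data itself."""
    # MLE of a tabular behavior policy: empirical action frequencies per state.
    sa_counts = defaultdict(lambda: defaultdict(int))
    for traj in trajectories:
        for (s, a, _) in traj:
            sa_counts[s][a] += 1

    def pi_hat(a, s):
        return sa_counts[s][a] / sum(sa_counts[s].values())

    estimates = []
    for traj in trajectories:
        ratio, ret = 1.0, 0.0
        for (s, a, r) in traj:
            ratio *= pi_target(a, s) / pi_hat(a, s)
            ret += r
        estimates.append(ratio * ret)
    return np.mean(estimates)
```

Both functions return an estimate of the target policy's expected return; the second replaces the denominator of the likelihood ratio with the count-based estimate, which is the variance-reduction mechanism the abstract describes.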




Updated: 2021-05-08