Numerical Quadrature for Probabilistic Policy Search.
IEEE Transactions on Pattern Analysis and Machine Intelligence (IF 23.6). Pub Date: 2018-11-02, DOI: 10.1109/tpami.2018.2879335
Julia Vinogradska, Bastian Bischoff, Jan Achterhold, Torsten Koller, Jan Peters

Learning control policies has become an appealing alternative to deriving control laws from classic control theory. Model-based approaches have demonstrated outstanding data efficiency, especially when combined with probabilistic models to eliminate model bias. However, a major difficulty for these methods is that multi-step-ahead predictions typically become intractable for longer planning horizons and can only be approximated poorly. In this paper, we propose the use of numerical quadrature to overcome this drawback and provide significantly more accurate multi-step-ahead predictions. As a result, our approach increases data efficiency and enhances the quality of learned policies. Furthermore, policy learning is not restricted to optimizing locally around one trajectory, as numerical quadrature provides a principled way to extend optimization to all trajectories starting in a specified starting state region. Thus, manual effort, such as choosing informative starting points for simultaneous policy optimization, is significantly decreased. Furthermore, learning is highly robust to the choice of initial policy, and thus interaction time with the system is minimized. Empirical evaluations on simulated benchmark problems show the efficiency of the proposed approach and support our theoretical results.
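The core idea the abstract describes — using numerical quadrature to propagate state uncertainty through a dynamics model over multiple steps — can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: it assumes a one-dimensional Gaussian state, a hypothetical dynamics function `f`, and a Gauss-Hermite quadrature rule (the paper's choice of quadrature rule and probabilistic dynamics model are not specified in the abstract).

```python
import numpy as np

def propagate_gauss_hermite(mean, var, f, n_points=10):
    """Approximate the mean and variance of f(x) for x ~ N(mean, var)
    using Gauss-Hermite quadrature instead of a moment-matching
    linearization."""
    # Probabilists' Hermite nodes/weights: integrate against exp(-x^2 / 2).
    nodes, weights = np.polynomial.hermite_e.hermegauss(n_points)
    # Rescale the standard-normal nodes to the input distribution.
    xs = mean + np.sqrt(var) * nodes
    fx = f(xs)
    # Normalize the weights so they sum to 1 (a discrete probability).
    w = weights / np.sqrt(2.0 * np.pi)
    m = np.sum(w * fx)
    v = np.sum(w * (fx - m) ** 2)
    return m, v

# Multi-step-ahead prediction: iterate the one-step approximation
# through a toy nonlinear dynamics function (here, sin).
m, v = 0.5, 0.01
for _ in range(5):
    m, v = propagate_gauss_hermite(m, v, np.sin)
```

Each iteration replaces the intractable predictive distribution at the next step with a Gaussian whose moments are computed by the quadrature rule; with enough nodes this is far more accurate than a single linearization around the mean, which is the accuracy gain the abstract refers to.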

中文翻译:

概率策略搜索的数值求积。

学习控制策略已成为基于经典控制理论推导控制律之外一种有吸引力的替代方案。基于模型的方法已展现出出色的数据效率,尤其是与概率模型结合以消除模型偏差时。然而,这些方法的一个主要困难在于,对于较长的规划范围,多步预测通常变得难以处理,并且只能得到较差的近似。在本文中,我们建议使用数值求积来克服这一缺点,从而提供显著更准确的多步预测。因此,我们的方法提高了数据效率,并提升了所学策略的质量。此外,策略学习不再局限于围绕单条轨迹的局部优化,因为数值求积提供了一种有原则的方法,可将优化扩展到从指定起始状态区域出发的所有轨迹。因此,诸如为同时进行的策略优化选择信息性起点之类的手动工作大大减少。此外,学习对初始策略的选择具有很高的鲁棒性,从而使与系统的交互时间最小化。对模拟基准问题的实证评估表明了所提方法的有效性,并支持了我们的理论结果。
更新日期:2019-12-06