Numerical Quadrature for Probabilistic Policy Search
IEEE Transactions on Pattern Analysis and Machine Intelligence (IF 20.8). Pub Date: 2018-11-06. DOI: 10.1109/tpami.2018.2879335
Julia Vinogradska, Bastian Bischoff, Jan Achterhold, Torsten Koller, Jan Peters

Learning control policies has become an appealing alternative to deriving control laws from classical control theory. Model-based approaches have demonstrated outstanding data efficiency, especially when combined with probabilistic models to eliminate model bias. However, a major difficulty for these methods is that multi-step-ahead predictions typically become intractable for longer planning horizons and can be approximated only poorly. In this paper, we propose the use of numerical quadrature to overcome this drawback and to provide significantly more accurate multi-step-ahead predictions. As a result, our approach increases data efficiency and enhances the quality of learned policies. Furthermore, policy learning is not restricted to optimizing locally around one trajectory, as numerical quadrature provides a principled way to extend optimization to all trajectories starting in a specified starting-state region. This significantly reduces manual effort, such as choosing informative starting points for simultaneous policy optimization. In addition, learning is highly robust to the choice of initial policy, so interaction time with the system is minimized. Empirical evaluations on simulated benchmark problems demonstrate the efficiency of the proposed approach and support our theoretical results.
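The abstract does not specify the quadrature rule or the prediction scheme in detail, so the following is only an illustrative sketch of the general idea: propagating a Gaussian state distribution through a probabilistic dynamics model by approximating the intractable moment integrals with Gauss-Hermite quadrature, then moment-matching the result back to a Gaussian and iterating over the horizon. The one-dimensional `dynamics_mean_var` model and the moment-matching step are assumptions for illustration, not the authors' exact method.

```python
import numpy as np

# Hypothetical 1-D probabilistic dynamics model: for a state x it returns the
# mean and variance of a Gaussian predictive distribution over the next state
# (a stand-in for, e.g., a Gaussian process posterior).
def dynamics_mean_var(x):
    return np.sin(x) + 0.9 * x, 0.01 + 0.001 * x**2

def propagate_gauss_hermite(mu, var, n_nodes=20):
    """One-step moment-matched prediction via Gauss-Hermite quadrature.

    Approximates E[x'] and Var[x'] when x ~ N(mu, var) is pushed through the
    probabilistic dynamics model, then fits a Gaussian to the result.
    """
    nodes, weights = np.polynomial.hermite.hermgauss(n_nodes)
    # Change of variables so the rule integrates against N(mu, var)
    # instead of the Hermite weight exp(-t^2).
    xs = mu + np.sqrt(2.0 * var) * nodes
    w = weights / np.sqrt(np.pi)  # normalized weights, sum to 1
    m, v = dynamics_mean_var(xs)
    mean_next = np.sum(w * m)
    # Law of total variance: Var[x'] = E[Var(x'|x)] + Var[E(x'|x)].
    var_next = np.sum(w * (v + m**2)) - mean_next**2
    return mean_next, var_next

# Multi-step-ahead prediction: iterate the one-step quadrature rule.
mu, var = 0.5, 0.05
for t in range(10):
    mu, var = propagate_gauss_hermite(mu, var)
    print(f"t={t + 1}: mean={mu:.4f}, var={var:.4f}")
```

In higher-dimensional state spaces the same idea applies with multivariate rules (e.g., tensor-product or sparse-grid constructions), at correspondingly higher cost, which is why the design of the quadrature rule matters for longer planning horizons.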

Updated: 2024-08-22