Risk-averse policy optimization via risk-neutral policy optimization
Artificial Intelligence (IF 5.1) Pub Date: 2022-07-14, DOI: 10.1016/j.artint.2022.103765
Lorenzo Bisi, Davide Santambrogio, Federico Sandrelli, Andrea Tirinzoni, Brian D. Ziebart, Marcello Restelli

Keeping risk under control is a primary objective in many critical real-world domains, including finance and healthcare. The literature on risk-averse reinforcement learning (RL) has mostly focused on designing ad-hoc algorithms for specific risk measures. As such, most of these algorithms do not easily generalize to measures other than the one they are designed for. Furthermore, it is often unclear whether state-of-the-art risk-neutral RL algorithms can be extended to reduce risk. In this paper, we take a step towards overcoming these limitations, proposing a single framework to optimize some of the most popular risk measures, including conditional value-at-risk, utility functions, and mean-variance. Leveraging recent theoretical results on state augmentation, we transform the decision-making process so that optimizing the chosen risk measure in the original environment is equivalent to optimizing the expected cost in the transformed one. We then present a simple risk-sensitive meta-algorithm that transforms the trajectories it collects from the environment and feeds these into any risk-neutral policy optimization method. Finally, we provide extensive experiments that show the benefits of our approach over existing ad-hoc methodologies in different domains, including the MuJoCo robotic suite and a real-world trading dataset.
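To make the meta-algorithm concrete, here is a minimal, hypothetical sketch (not the authors' code) for the expected-utility case: each trajectory's return is passed through a concave utility U before a standard risk-neutral REINFORCE update, so the score-function estimator yields an unbiased gradient of E[U(G)]. The `env` interface follows the classic Gym convention; `policy` (with `sample`, `grad_log_prob`, and `params`), `exponential_utility`, and all hyperparameters are illustrative placeholders.

```python
# Minimal sketch of the risk-sensitive meta-algorithm for the
# expected-utility case: collect trajectories, transform their returns
# with a utility U, then apply an ordinary risk-neutral REINFORCE update
# to the transformed returns.
import numpy as np

def exponential_utility(g, lam=1.0):
    # Concave utility: maximizing E[U(G)] penalizes variability in the return G.
    return -np.exp(-lam * g) / lam

def collect_trajectory(env, policy, horizon=200):
    # `env` follows the classic Gym interface; `policy` is a hypothetical
    # object exposing sample(state) and grad_log_prob(state, action).
    states, actions, rewards = [], [], []
    s = env.reset()
    for _ in range(horizon):
        a = policy.sample(s)
        states.append(s)
        actions.append(a)
        s, r, done, _ = env.step(a)
        rewards.append(r)
        if done:
            break
    return states, actions, rewards

def risk_averse_update(env, policy, n_traj=16, lam=1.0, lr=1e-2):
    grads = []
    for _ in range(n_traj):
        states, actions, rewards = collect_trajectory(env, policy)
        g = sum(rewards)                  # raw trajectory return
        u = exponential_utility(g, lam)   # transformed return
        # Score-function (REINFORCE) estimator applied to the transformed
        # return: an unbiased estimate of the gradient of E[U(G)].
        score = sum(policy.grad_log_prob(s, a)
                    for s, a in zip(states, actions))
        grads.append(u * score)
    policy.params += lr * np.mean(grads, axis=0)
```

For measures such as CVaR or mean-variance, the same recipe applies after the state-augmentation step described above (e.g., carrying the accumulated cost in the state), so any risk-neutral policy optimizer can consume the transformed trajectories unchanged.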



Updated: 2022-07-14