Inverse reinforcement learning in contextual MDPs
Machine Learning (IF 4.3) Pub Date: 2021-05-12, DOI: 10.1007/s10994-021-05984-x
Stav Belogolovsky, Philip Korsunsky, Shie Mannor, Chen Tessler, Tom Zahavy

We consider the task of Inverse Reinforcement Learning in Contextual Markov Decision Processes (MDPs). In this setting, contexts, which define the reward and transition kernel, are sampled from a distribution. In addition, although the reward is a function of the context, it is not provided to the agent. Instead, the agent observes demonstrations from an optimal policy. The goal is to learn the reward mapping, such that the agent acts optimally even when encountering previously unseen contexts, also known as zero-shot transfer. We formulate this problem as a non-differentiable convex optimization problem and propose a novel algorithm to compute its subgradients. Based on this scheme, we analyze several methods both theoretically, comparing their sample complexity and scalability, and empirically. Most importantly, we show both theoretically and empirically that our algorithms perform zero-shot transfer (generalize to new and unseen contexts). Specifically, we present empirical experiments in a dynamic treatment regime, where the goal is to learn a reward function that explains the behavior of expert physicians based on recorded data of them treating patients diagnosed with sepsis.
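
To make the subgradient scheme alluded to above concrete, here is a minimal, hypothetical sketch assuming a linear mapping from context to reward weights and an MDP solver that returns the feature expectations of an optimal policy for a given reward; names such as `solve_mdp` and the projection radius are illustrative assumptions, not the authors' actual API or algorithm:

```python
import numpy as np

def subgradient_step(W, context, expert_feat_exp, solve_mdp, lr=0.1):
    """One projected-subgradient update for a linear context-to-reward mapping.

    W               : (d_reward, d_context) matrix mapping context -> reward weights
    context         : (d_context,) observed context vector
    expert_feat_exp : (d_reward,) expert feature expectations for this context
    solve_mdp       : callable(reward_weights) -> feature expectations of an
                      optimal policy under that reward (assumed given)
    """
    reward_weights = W @ context                # context-specific reward parameters
    agent_feat_exp = solve_mdp(reward_weights)  # feature expectations of the induced policy
    # Subgradient of the convex, non-differentiable loss: the gap between the
    # agent's and the expert's feature expectations, lifted by the context.
    subgrad = np.outer(agent_feat_exp - expert_feat_exp, context)
    W = W - lr * subgrad
    # Project back onto a bounded set (here, a Frobenius-norm ball of radius 1).
    norm = np.linalg.norm(W)
    if norm > 1.0:
        W = W / norm
    return W
```

In this sketch, repeating the update over demonstrations drawn from many contexts drives the learned mapping toward one whose induced optimal policies match the expert's feature expectations, which is what enables acting well on unseen contexts.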




Last updated: 2021-05-13