Human–agent transfer from observations
The Knowledge Engineering Review (IF 2.1), Pub Date: 2020-11-27, DOI: 10.1017/s0269888920000387
Bikramjit Banerjee, Sneha Racharla

Learning from human demonstration (LfD) is one of many speedup techniques for reinforcement learning (RL) and has seen many successful applications. We consider one LfD technique called human–agent transfer (HAT), in which a model of the human demonstrator's decision function is induced via supervised learning and used as an initial bias for RL. Some recent work in LfD has investigated learning from observations only, that is, when only the demonstrator's states (and not its actions) are available to the learner. Since the demonstrator's actions serve as the labels in HAT, supervised learning becomes untenable in their absence. We adapt the idea of learning an inverse dynamics model from data acquired through the learner's own interactions with the environment and use it to fill in the demonstrator's missing actions. The resulting version of HAT, called state-only HAT (SoHAT), is experimentally shown to preserve some advantages of HAT in benchmark domains with both discrete and continuous actions. This paper also establishes principled modifications of an existing baseline algorithm, A3C, to create the HAT and SoHAT variants used in our experiments.
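The pipeline described in the abstract (learn an inverse dynamics model from the learner's own interactions, use it to label the demonstrator's state-only trajectories, then fit a supervised model of the demonstrator's decision function to serve as the RL prior) can be sketched roughly as below. This is a minimal illustration only, not the authors' implementation: the Gymnasium-style environment, the scikit-learn classifiers, the discrete-action assumption, and all function names are assumptions introduced here for exposition.

# Minimal sketch of the SoHAT idea, assuming a Gymnasium-style env with
# discrete actions and scikit-learn models; not the paper's actual code.
import numpy as np
from sklearn.neural_network import MLPClassifier

def collect_self_play(env, num_steps):
    """Gather (s, s') -> a samples from the learner's own (random) interaction."""
    X, y = [], []
    s, _ = env.reset()
    for _ in range(num_steps):
        a = env.action_space.sample()
        s_next, _, terminated, truncated, _ = env.step(a)
        X.append(np.concatenate([s, s_next]))
        y.append(a)
        s = env.reset()[0] if (terminated or truncated) else s_next
    return np.array(X), np.array(y)

def fit_inverse_dynamics(X, y):
    """Inverse dynamics model: infer the action taken between consecutive states."""
    model = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=500)
    model.fit(X, y)
    return model

def label_demonstrations(inv_model, demo_trajectories):
    """Fill in the demonstrator's missing actions from state-only trajectories."""
    pairs = np.array([np.concatenate([s, s_next])
                      for traj in demo_trajectories
                      for s, s_next in zip(traj[:-1], traj[1:])])
    states = np.array([s for traj in demo_trajectories for s in traj[:-1]])
    actions = inv_model.predict(pairs)
    return states, actions

def fit_demo_policy(states, actions):
    """Supervised model of the demonstrator's decision function (the HAT prior)."""
    policy = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=500)
    policy.fit(states, actions)
    return policy  # would then bias early action selection in the RL learner

In this sketch the returned policy plays the role that the supervised demonstrator model plays in HAT: it is consulted as an initial bias for the RL learner (here, a HAT/SoHAT variant of A3C in the paper's experiments), with the key difference that its action labels come from the learned inverse dynamics model rather than from recorded demonstrator actions.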
