Convergence of a Human-in-the-Loop Policy-Gradient Algorithm With Eligibility Trace Under Reward, Policy, and Advantage Feedback
arXiv - CS - Data Structures and Algorithms. Pub Date: 2021-09-15, arXiv: 2109.07054
Ishaan Shah, David Halpern, Kavosh Asadi, Michael L. Littman

Fluid human-agent communication is essential for the future of human-in-the-loop reinforcement learning. An agent must respond appropriately to feedback from its human trainer even before the two have significant experience working together. It is therefore important that learning agents respond well to the various feedback schemes human trainers are likely to provide. This work analyzes the COnvergent Actor-Critic by Humans (COACH) algorithm under three different types of feedback: policy feedback, reward feedback, and advantage feedback. For these three feedback types, we find that COACH can behave sub-optimally. We propose a variant of COACH, episodic COACH (E-COACH), which we prove converges for all three types. We compare our COACH variant with two other reinforcement-learning algorithms: Q-learning and TAMER.
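To make the setting concrete, the snippet below is a minimal sketch of a COACH-style actor update: a tabular softmax policy is adjusted along an eligibility trace of log-policy gradients, with the human feedback signal standing in for the advantage. The class and function names (SoftmaxPolicy, coach_step), the hyperparameter values, and the tabular representation are illustrative assumptions, not details taken from the paper.

```python
# A sketch of a COACH-style policy-gradient update with an eligibility trace.
# Assumes a small discrete state/action space and a tabular softmax policy;
# all names and hyperparameters are hypothetical.
import numpy as np

class SoftmaxPolicy:
    def __init__(self, n_states, n_actions):
        self.theta = np.zeros((n_states, n_actions))  # policy parameters

    def probs(self, s):
        z = self.theta[s] - self.theta[s].max()       # stabilized softmax
        p = np.exp(z)
        return p / p.sum()

    def sample(self, s, rng):
        return rng.choice(self.theta.shape[1], p=self.probs(s))

    def grad_log(self, s, a):
        # d/dtheta log pi(a|s) for a tabular softmax policy:
        # +1 at (s, a), minus pi(.|s) across row s, zero elsewhere.
        g = np.zeros_like(self.theta)
        g[s] = -self.probs(s)
        g[s, a] += 1.0
        return g

def coach_step(policy, trace, s, a, feedback, alpha=0.1, lam=0.9):
    """One actor update: the trainer's feedback signal plays the role of
    the advantage in a vanilla policy-gradient update."""
    trace = lam * trace + policy.grad_log(s, a)   # decaying eligibility trace
    policy.theta += alpha * feedback * trace      # feedback-weighted update
    return trace

# Example usage with hypothetical state/action counts:
rng = np.random.default_rng(0)
pi = SoftmaxPolicy(n_states=4, n_actions=2)
trace = np.zeros_like(pi.theta)
s = 0
a = pi.sample(s, rng)
trace = coach_step(pi, trace, s, a, feedback=+1.0)  # trainer approves action a
```

The paper's analysis concerns how updates of this general form behave when the feedback signal encodes a policy, a reward, or an advantage; the episodic E-COACH variant it proposes is not reproduced here.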

Updated: 2021-09-16