Towards Effective Context for Meta-Reinforcement Learning: an Approach based on Contrastive Learning
arXiv - CS - Machine Learning. Pub Date: 2020-09-29, DOI: arxiv-2009.13891
Haotian Fu, Hongyao Tang, Jianye Hao, Chen Chen, Xidong Feng, Dong Li, Wulong Liu

Context, the embedding of previously collected trajectories, is a powerful construct for Meta-Reinforcement Learning (Meta-RL) algorithms. By conditioning on an effective context, Meta-RL policies can easily generalize to new tasks within a few adaptation steps. We argue that improving the quality of context involves answering two questions: 1. How to train a compact and sufficient encoder that can embed the task-specific information contained in prior trajectories? 2. How to collect informative trajectories whose corresponding context reflects the specification of the tasks? To this end, we propose a novel Meta-RL framework called CCM (Contrastive learning augmented Context-based Meta-RL). We first focus on the contrastive nature of different tasks and leverage it to train a compact and sufficient context encoder. Further, we train a separate exploration policy and theoretically derive a new information-gain-based objective that aims to collect informative trajectories in a few steps. Empirically, we evaluate our approach on common benchmarks as well as several complex sparse-reward environments. The experimental results show that CCM outperforms state-of-the-art algorithms by addressing the two problems above.
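The contrastive training of the context encoder described above is typically instantiated with an InfoNCE-style objective: embeddings of trajectories drawn from the same task serve as positive pairs, while embeddings from other tasks serve as negatives. The following is a minimal sketch of such a loss, not the paper's exact formulation; the function name, the fixed temperature, and the use of cosine similarity are illustrative assumptions.

```python
import math

def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def _normalize(v):
    # Unit-normalize so dot products become cosine similarities.
    n = math.sqrt(_dot(v, v)) or 1.0
    return [a / n for a in v]

def info_nce_loss(anchor, positive, negatives, temperature=0.1):
    """Hypothetical InfoNCE-style contrastive objective.

    anchor:    context embedding of a trajectory from some task
    positive:  embedding of another trajectory from the SAME task
    negatives: embeddings of trajectories from DIFFERENT tasks

    Returns the cross-entropy of picking the positive among all
    candidates, which pulls same-task contexts together and pushes
    different-task contexts apart.
    """
    a = _normalize(anchor)
    # Similarity logits: positive first, then all negatives.
    logits = [_dot(a, _normalize(positive)) / temperature]
    logits += [_dot(a, _normalize(n)) / temperature for n in negatives]
    # Numerically stable softmax cross-entropy with the positive at index 0.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    return -math.log(exps[0] / sum(exps))
```

With an aligned positive and an orthogonal negative the loss is near zero; swapping them makes it large, which is the gradient signal that shapes the task embedding space.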

Updated: 2020-10-08