Simplifying Deep Reinforcement Learning via Self-Supervision
arXiv - CS - Machine Learning Pub Date: 2021-06-10, DOI: arxiv-2106.05526
Daochen Zha, Kwei-Herng Lai, Kaixiong Zhou, Xia Hu

Supervised regression to demonstrations has been shown to be a stable way to train deep policy networks. We are motivated to study how we can take full advantage of supervised loss functions to stably train deep reinforcement learning agents. This is challenging because it is unclear how the training data should be collected to enable policy improvement. In this work, we propose Self-Supervised Reinforcement Learning (SSRL), a simple algorithm that optimizes policies with purely supervised losses. We demonstrate that, without policy gradients or value estimation, an iterative procedure of "labeling" data and supervised regression is sufficient to drive stable policy improvement. By selecting and imitating trajectories with high episodic rewards, SSRL is surprisingly competitive with contemporary algorithms, with more stable performance and shorter running time, showing the potential of solving reinforcement learning with supervised learning techniques. The code is available at https://github.com/daochenzha/SSRL
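The abstract only outlines the iterate-label-imitate loop; the sketch below is a minimal, hypothetical illustration of such a procedure, not the authors' implementation from https://github.com/daochenzha/SSRL. It assumes a discrete-action environment with the classic `gym` reset/step API, and all names (`PolicyNet`, `run_episode`) and hyperparameters are illustrative.

```python
# Illustrative sketch of an SSRL-style loop: collect episodes, keep the
# trajectories with the highest episodic rewards, and fit the policy to
# their state-action pairs with a purely supervised (cross-entropy) loss.
# Assumes the classic gym API (reset -> obs, step -> 4-tuple).
import gym
import torch
import torch.nn as nn
import torch.nn.functional as F


class PolicyNet(nn.Module):
    """Small MLP mapping observations to action logits."""
    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs):
        return self.net(obs)


def run_episode(env, policy):
    """Roll out one episode; return its (obs, action) pairs and total reward."""
    obs, done, total_reward, transitions = env.reset(), False, 0.0, []
    while not done:
        with torch.no_grad():
            logits = policy(torch.as_tensor(obs, dtype=torch.float32))
        action = torch.distributions.Categorical(logits=logits).sample().item()
        next_obs, reward, done, _ = env.step(action)
        transitions.append((obs, action))
        total_reward += reward
        obs = next_obs
    return transitions, total_reward


def train_ssrl_like(env_name="CartPole-v1", iterations=50,
                    episodes_per_iter=20, top_k=5, epochs=5, lr=1e-3):
    env = gym.make(env_name)
    policy = PolicyNet(env.observation_space.shape[0], env.action_space.n)
    optimizer = torch.optim.Adam(policy.parameters(), lr=lr)
    buffer = []  # best trajectories seen so far, as (transitions, return)

    for _ in range(iterations):
        # 1. "Label" data: collect episodes with the current policy.
        for _ in range(episodes_per_iter):
            buffer.append(run_episode(env, policy))
        # 2. Select the trajectories with the highest episodic rewards.
        buffer.sort(key=lambda item: item[1], reverse=True)
        buffer = buffer[:top_k]
        # 3. Supervised regression: imitate the selected trajectories.
        obs = torch.tensor([o for traj, _ in buffer for o, _ in traj],
                           dtype=torch.float32)
        acts = torch.tensor([a for traj, _ in buffer for _, a in traj])
        for _ in range(epochs):
            loss = F.cross_entropy(policy(obs), acts)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return policy
```

No policy gradient or value function appears anywhere in this loop; the only learning signal is the supervised loss on the retained high-reward trajectories, which is the point the abstract makes.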

Updated: 2021-06-11