PROTO: Iterative Policy Regularized Offline-to-Online Reinforcement Learning
arXiv - CS - Machine Learning. Pub Date: 2023-05-25, DOI: arxiv-2305.15669
Jianxiong Li, Xiao Hu, Haoran Xu, Jingjing Liu, Xianyuan Zhan, Ya-Qin Zhang

Offline-to-online reinforcement learning (RL), by combining the benefits of offline pretraining and online finetuning, promises enhanced sample efficiency and policy performance. However, existing methods, effective as they are, suffer from suboptimal performance, limited adaptability, and unsatisfactory computational efficiency. We propose a novel framework, PROTO, which overcomes the aforementioned limitations by augmenting the standard RL objective with an iteratively evolving regularization term. Performing a trust-region-style update, PROTO yields stable initial finetuning and optimal final performance by gradually evolving the regularization term to relax the constraint strength. By adjusting only a few lines of code, PROTO can bridge any offline policy pretraining and standard off-policy RL finetuning to form a powerful offline-to-online RL pathway, yielding great adaptability to diverse methods. Simple yet elegant, PROTO imposes minimal additional computation and enables highly efficient online finetuning. Extensive experiments demonstrate that PROTO achieves superior performance over SOTA baselines, offering an adaptable and efficient offline-to-online RL framework.
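To make the "few lines of code" claim concrete, the following is a minimal sketch of how an iteratively evolving policy regularizer could be attached to a standard off-policy actor update. It is not the authors' implementation: the specific form of the regularizer (a KL penalty toward a Polyak-averaged previous-iteration policy, with a coefficient annealed toward zero to relax the constraint) is an assumption based on the abstract's description, and PolicyNet, proto_style_actor_loss, q_fn, and all hyperparameters are purely illustrative.

# Illustrative sketch only: iteratively regularized actor update (PyTorch).
import copy
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Toy Gaussian policy: outputs mean and log-std per action dimension."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                  nn.Linear(64, 2 * act_dim))

    def dist(self, obs):
        mean, log_std = self.body(obs).chunk(2, dim=-1)
        return torch.distributions.Normal(mean, log_std.clamp(-5, 2).exp())

def proto_style_actor_loss(policy, anchor, q_fn, obs, coef):
    """Standard policy-improvement loss plus a KL penalty toward the
    previous-iteration (anchor) policy, scaled by an annealed coefficient."""
    dist = policy.dist(obs)
    act = dist.rsample()                               # reparameterized sample
    with torch.no_grad():
        anchor_dist = anchor.dist(obs)                 # regularization target pi_k
    kl = torch.distributions.kl_divergence(dist, anchor_dist).sum(-1)
    return (-q_fn(obs, act) + coef * kl).mean()

# --- toy usage (stand-ins for a real critic and replay buffer) --------------
obs_dim, act_dim = 8, 2
policy = PolicyNet(obs_dim, act_dim)                   # would be offline-pretrained
anchor = copy.deepcopy(policy)                         # previous-iteration policy
optim = torch.optim.Adam(policy.parameters(), lr=3e-4)
q_fn = lambda o, a: -((a - 0.1) ** 2).sum(-1)          # placeholder critic

total_steps, tau = 1000, 0.005
for step in range(total_steps):
    obs = torch.randn(32, obs_dim)                     # stand-in for a replay batch
    coef = 1.0 * (1 - step / total_steps)              # relax constraint strength
    loss = proto_style_actor_loss(policy, anchor, q_fn, obs, coef)
    optim.zero_grad(); loss.backward(); optim.step()
    # Polyak-average the anchor so the regularization term evolves iteratively.
    with torch.no_grad():
        for p, p_anchor in zip(policy.parameters(), anchor.parameters()):
            p_anchor.mul_(1 - tau).add_(tau * p)

In this hypothetical setup, the only additions to a vanilla off-policy actor update are the KL term, the annealed coefficient, and the Polyak update of the anchor policy, which is how a trust-region-style, gradually relaxed constraint could plausibly be realized with a few extra lines.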

Updated: 2023-05-26