Policy Sharing using Aggregation Trees for Q-learning in a Continuous State and Action Spaces
IEEE Transactions on Cognitive and Developmental Systems (IF 5) Pub Date: 2020-09-01, DOI: 10.1109/tcds.2019.2926477
Yu-Jen Chen , Wei-Cheng Jiang , Ming-Yi Ju , Kao-Shing Hwang

$Q$-learning is a generic approach that uses a finite discrete state and action domain to estimate action values with tabular or function-approximation methods. An intelligent agent eventually learns policies from continuous sensory inputs and encodes these environmental inputs onto a discrete state space. The application of $Q$-learning in a continuous state/action domain is the subject of many studies. This paper uses a tree structure to approximate a $Q$-function in a continuous state domain. The agent selects a discretized action with the maximum $Q$-value, and this discretized action is then extended to a continuous action using an action bias function. Reinforcement learning is difficult for a single agent when the state space is huge. The proposed architecture is also applied to a multiagent system, wherein an individual agent transfers its useful $Q$-values to other agents to accelerate the learning process. Policy is shared between agents by grafting the branches of trees in which $Q$-values are stored onto other trees. Simulation results show that the proposed architecture performs better than tabular $Q$-learning and significantly accelerates the learning process because all agents use the sharing mechanisms to cooperate with each other.
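To make the pipeline described in the abstract concrete, the sketch below is a minimal Python illustration of the general idea, not the authors' implementation: a state-aggregation tree whose leaves store $Q$-values for a set of discretized actions, an action-bias step that turns the greedy discrete action into a continuous one, and a simple $Q$-value-sharing step standing in for the paper's branch grafting. The class names, the bias formula, and the blending rule in `graft_from` are all illustrative assumptions.

```python
# Minimal sketch (assumptions, not the paper's exact algorithm): Q-learning over a
# state-aggregation tree, with an action-bias step and a simple policy-sharing step.
import numpy as np

class Node:
    """A region of the continuous state space; leaves store Q-values per discrete action."""
    def __init__(self, low, high, n_actions):
        self.low, self.high = np.asarray(low, float), np.asarray(high, float)
        self.q = np.zeros(n_actions)           # Q-values for the discretized actions
        self.left = self.right = None          # children after a split
        self.split_dim = self.split_val = None

    def leaf(self, s):
        """Descend to the leaf whose region contains state s."""
        node = self
        while node.left is not None:
            node = node.left if s[node.split_dim] <= node.split_val else node.right
        return node

    def split(self, dim):
        """Refine this leaf by halving it along one dimension; children inherit its Q-values."""
        mid = 0.5 * (self.low[dim] + self.high[dim])
        self.split_dim, self.split_val = dim, mid
        lo_hi = self.high.copy(); lo_hi[dim] = mid
        hi_lo = self.low.copy();  hi_lo[dim] = mid
        self.left  = Node(self.low, lo_hi, len(self.q));  self.left.q  = self.q.copy()
        self.right = Node(hi_lo, self.high, len(self.q)); self.right.q = self.q.copy()


class TreeQAgent:
    def __init__(self, low, high, action_centers, alpha=0.1, gamma=0.95, eps=0.1):
        self.root = Node(low, high, len(action_centers))
        self.action_centers = np.asarray(action_centers, float)  # discretized (1-D) actions
        self.alpha, self.gamma, self.eps = alpha, gamma, eps

    def act(self, s):
        """Pick the discrete action with maximal Q, then bias it toward a continuous action."""
        leaf = self.root.leaf(s)
        a = np.random.randint(len(leaf.q)) if np.random.rand() < self.eps else int(np.argmax(leaf.q))
        return a, self.action_centers[a] + self.action_bias(leaf, a)

    def action_bias(self, leaf, a):
        """Hypothetical bias: shift toward the better neighbouring action, weighted by its Q-value."""
        q = leaf.q
        nb = min(a + 1, len(q) - 1) if (a == 0 or (a + 1 < len(q) and q[a + 1] >= q[a - 1])) else a - 1
        gap = self.action_centers[nb] - self.action_centers[a]
        w = np.exp(q[nb]) / (np.exp(q[a]) + np.exp(q[nb]))
        return 0.5 * w * gap

    def update(self, s, a, r, s_next):
        """One-step Q-learning backup on the leaf that covers s."""
        leaf, nxt = self.root.leaf(s), self.root.leaf(s_next)
        leaf.q[a] += self.alpha * (r + self.gamma * nxt.q.max() - leaf.q[a])

    def graft_from(self, other):
        """Policy sharing: pull the other agent's Q-values into matching leaves of this tree
        (a simplified stand-in for grafting whole branches between trees)."""
        stack = [self.root]
        while stack:
            node = stack.pop()
            if node.left is None:
                centre = 0.5 * (node.low + node.high)
                src = other.root.leaf(centre)
                node.q = 0.5 * (node.q + src.q)   # blend rather than overwrite (an assumption)
            else:
                stack += [node.left, node.right]
```

In this sketch, sharing is periodic and symmetric: each agent calls `graft_from` on its peers so that useful $Q$-values propagate without retraining from scratch, which mirrors the acceleration effect the abstract attributes to the sharing mechanism.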

Updated: 2020-09-01