Deterministic policy gradient algorithms for semi-Markov decision processes
International Journal of Intelligent Systems (IF 5.0), Pub Date: 2021-10-13, DOI: 10.1002/int.22709
Ashkan Haji Hosseinloo, Munther A. Dahleh

A large class of sequential decision-making problems under uncertainty, with broad applications from preventive maintenance to event-triggered control, can be modeled in the framework of semi-Markov decision processes (SMDPs). Unlike Markov decision processes (MDPs), SMDPs are underexplored in the online and reinforcement learning (RL) settings. In this paper, we extend the well-known deterministic policy gradient (DPG) theorem in MDPs to SMDPs under the average-reward criterion. Existing stochastic policy gradient methods not only require, in general, a large number of samples for training, but also suffer from high variance in the gradient estimation when applied to problems with a deterministic optimal policy. Our DPG method can potentially remedy these issues. On the basis of this method, and depending on the choice of critic, different actor–critic algorithms can easily be developed in the RL setup. We present two example actor–critic algorithms. Both algorithms employ our policy gradient theorem for their actors but use two different critics: one uses a simple SARSA update, while the other uses the same on-policy update but with compatible function approximators. We demonstrate the efficacy of our method both mathematically and via simulations.
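To make the described actor–critic structure concrete, below is a minimal sketch in the spirit of the abstract: a deterministic linear actor updated with a DPG-style rule and a SARSA-style differential critic in which the sojourn time scales the average-reward term, as is common in average-reward SMDP reinforcement learning. The toy environment (ToySMDP), the feature map, the specific update equations, and all learning rates are illustrative assumptions, not the paper's own algorithms or equations.

```python
import numpy as np

# Sketch of a DPG-style actor-critic for an SMDP under the average-reward
# criterion.  All names and update forms here are illustrative assumptions.

class ToySMDP:
    """Toy 1-D SMDP: continuous state and action, random sojourn times."""
    def __init__(self, seed=0):
        self.rng = np.random.default_rng(seed)
        self.s = 0.0

    def reset(self):
        self.s = self.rng.normal()
        return self.s

    def step(self, a):
        # Reward favours driving the state toward zero with small actions.
        reward = -(self.s ** 2) - 0.1 * a ** 2
        tau = self.rng.exponential(1.0) + 0.1        # random sojourn time
        self.s = 0.8 * self.s + a + 0.1 * self.rng.normal()
        return self.s, reward, tau


def features(s, a):
    """Quadratic features for the critic q_w(s, a) = w . phi(s, a)."""
    return np.array([s, a, s * a, s ** 2, a ** 2, 1.0])


def mu(theta, s):
    """Deterministic linear policy mu_theta(s) = theta[0] * s + theta[1]."""
    return theta[0] * s + theta[1]


def grad_mu(s):
    """Gradient of mu_theta(s) with respect to theta."""
    return np.array([s, 1.0])


def grad_q_a(w, s, a):
    """Partial derivative of q_w(s, a) with respect to the action a."""
    return w[1] + w[2] * s + 2.0 * w[4] * a


def train(episodes=200, steps=200, alpha_w=1e-2, alpha_theta=1e-3, alpha_rho=1e-2):
    env = ToySMDP()
    theta = np.zeros(2)          # actor parameters
    w = np.zeros(6)              # critic parameters
    rho = 0.0                    # average reward per unit time (estimate)
    for _ in range(episodes):
        s = env.reset()
        a = mu(theta, s)
        for _ in range(steps):
            s_next, r, tau = env.step(a)
            a_next = mu(theta, s_next)
            # SARSA-style differential TD error; the sojourn time tau
            # multiplies the average-reward term (an assumed SMDP form).
            delta = (r - rho * tau
                     + w @ features(s_next, a_next) - w @ features(s, a))
            w += alpha_w * delta * features(s, a)
            rho += alpha_rho * delta
            # DPG-style actor step:
            # grad_theta J ~ grad_theta mu(s) * grad_a q_w(s, a)|a=mu(s)
            theta += alpha_theta * grad_mu(s) * grad_q_a(w, s, a)
            s, a = s_next, a_next
    return theta, rho


if __name__ == "__main__":
    theta, rho = train()
    print("policy parameters:", theta, "average-reward estimate:", rho)
```

A natural variant, corresponding to the second algorithm mentioned in the abstract, would replace the generic quadratic critic with compatible function approximation, i.e., critic features built from the actor's policy gradient; the sketch above keeps the simpler SARSA critic.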

Updated: 2021-10-13