Prospect-theoretic Q-learning
Systems & Control Letters (IF 2.1) · Pub Date: 2021-08-14 · DOI: 10.1016/j.sysconle.2021.105009
Vivek S. Borkar, Siddharth Chandak
We consider a prospect-theoretic version of the classical Q-learning algorithm for discounted-reward Markov decision processes, wherein the controller perceives a distorted and noisy future reward, modeled by a nonlinearity that accentuates gains and under-represents losses relative to a reference point. We analyze the asymptotic behavior of the scheme via its limiting differential equation, using the theory of monotone dynamical systems. Specifically, we show convergence to equilibria and establish some qualitative facts about the equilibria themselves.
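To make the scheme concrete, here is a minimal sketch of a tabular Q-learning iteration in which the one-step target is perceived through a distortion that accentuates gains and is steeper on losses, in the spirit of the Kahneman-Tversky value function. The specific distortion, its parameters (`alpha`, `lam`, the reference point), the toy MDP, and the step-size schedule are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def distort(x, alpha=0.88, lam=2.25, ref=0.0):
    """Kahneman-Tversky style value function (assumed form):
    concave on gains, steeper (loss-averse) on losses,
    measured relative to a reference point."""
    d = x - ref
    return d ** alpha if d >= 0 else -lam * (-d) ** alpha

def prospect_q_learning(P, R, gamma=0.9, steps=20000, seed=0):
    """Tabular Q-learning where the sampled one-step target is
    passed through distort(.) before the update -- a hedged
    reading of 'perceiving a distorted future reward'.
    P: transition kernel, shape (nS, nA, nS); R: rewards, shape (nS, nA)."""
    rng = np.random.default_rng(seed)
    nS, nA = R.shape
    Q = np.zeros((nS, nA))
    s = 0
    for n in range(1, steps + 1):
        a = int(rng.integers(nA))            # uniform exploration
        s2 = int(rng.choice(nS, p=P[s, a]))  # sample next state
        # distorted perception of the usual Bellman target
        target = distort(R[s, a] + gamma * Q[s2].max())
        step = 1.0 / (1 + n // 100)          # slowly decreasing step size
        Q[s, a] += step * (target - Q[s, a])
        s = s2
    return Q
```

Because the distortion is monotone, the associated distorted Bellman operator remains order-preserving, which is what lets the monotone-dynamical-systems machinery apply to the limiting ODE of this iteration.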