The impact of environmental stochasticity on value-based multiobjective reinforcement learning
Neural Computing and Applications (IF 4.5) Pub Date: 2021-03-12, DOI: 10.1007/s00521-021-05859-1
Peter Vamplew, Cameron Foale, Richard Dazeley

A common approach to addressing multiobjective problems with reinforcement learning is to extend model-free, value-based algorithms such as Q-learning to use a vector of Q-values in combination with an appropriate action selection mechanism, often based on scalarisation. Most prior empirical evaluation of these approaches has focused on deterministic environments. This study examines the impact of stochasticity in rewards and state transitions on the behaviour of multiobjective Q-learning. It shows that the nature of the optimal solution depends on these environmental characteristics, and also on whether we desire to maximise the Expected Scalarised Return (ESR) or the Scalarised Expected Return (SER). We also identify a novel aim that may arise in some applications, maximising SER subject to satisfying constraints on the variation in return, and show that this may require different solutions from those for ESR or conventional SER. The analysis of the interaction between environmental stochasticity and multiobjective Q-learning is supported by empirical evaluations on several simple multiobjective Markov Decision Processes with varying characteristics. This includes a demonstration of a novel approach to learning deterministic SER-optimal policies for environments with stochastic rewards. In addition, we report a previously unidentified issue with model-free, value-based approaches to multiobjective reinforcement learning in the context of environments with stochastic state transitions. Having highlighted the limitations of value-based model-free MORL methods, we discuss several alternative methods that may be more suitable for maximising SER in MOMDPs with stochastic transitions.
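
For readers unfamiliar with the two criteria, the standard formulations from the MORL literature (stated here for context rather than quoted from the paper) are

ESR:  maximise  $\mathbb{E}\!\left[ f\!\left( \sum_{t} \gamma^{t} \mathbf{r}_{t} \right) \right]$
SER:  maximise  $f\!\left( \mathbb{E}\!\left[ \sum_{t} \gamma^{t} \mathbf{r}_{t} \right] \right)$

where $f$ is the scalarisation function and $\mathbf{r}_{t}$ is the vector-valued reward at time $t$. The two criteria coincide when $f$ is linear, but can prefer different policies once $f$ is non-linear and the return is stochastic, which is the setting examined in this paper.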
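
To make the first sentence concrete, the following is a minimal Python sketch of the kind of value-based multiobjective agent the abstract describes: a table of vector-valued Q-estimates combined with scalarised (here, linearly weighted) epsilon-greedy action selection. It is an illustration rather than the authors' implementation; the environment interface (step returning a reward vector) and the weight vector w are assumptions made for the example.

    import numpy as np

    def scalarised_q_learning(env, n_states, n_actions, n_objectives, w,
                              episodes=1000, alpha=0.1, gamma=0.95, epsilon=0.1):
        # One Q-vector (one entry per objective) for every state-action pair.
        Q = np.zeros((n_states, n_actions, n_objectives))
        for _ in range(episodes):
            s = env.reset()
            done = False
            while not done:
                # Action selection: scalarise each action's Q-vector with the
                # weight vector w and act epsilon-greedily on the scalar values.
                if np.random.rand() < epsilon:
                    a = np.random.randint(n_actions)
                else:
                    a = int(np.argmax(Q[s] @ w))
                s_next, r_vec, done = env.step(a)  # assumed to return a reward vector
                r_vec = np.asarray(r_vec, dtype=float)
                # Bootstrap from the greedy follow-on action under the same scalarisation.
                a_next = int(np.argmax(Q[s_next] @ w))
                target = r_vec + (0.0 if done else gamma) * Q[s_next, a_next]
                Q[s, a] += alpha * (target - Q[s, a])
                s = s_next
        return Q

With a linear scalarisation this per-step greedy bootstrapping is unproblematic; the issues the paper analyses arise when the scalarisation is non-linear and rewards or state transitions are stochastic, where ESR and SER diverge and this style of update may no longer identify the SER-optimal policy.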


