Variance reduced value iteration and faster algorithms for solving Markov decision processes
Naval Research Logistics (IF 2.3). Pub Date: 2021-04-22. DOI: 10.1002/nav.21992
Aaron Sidford, Mengdi Wang, Xian Wu, Yinyu Ye

In this paper we provide faster algorithms for approximately solving discounted Markov decision processes in multiple parameter regimes. Given a discounted Markov decision process (DMDP) with |S| states, |A| actions, discount factor γ ∈ (0, 1), and rewards in the range [−M, M], we show how to compute an ϵ-optimal policy, with probability 1 − δ, in time (Note: we use Õ to hide polylogarithmic factors in the input parameters, that is, Õ(f(x)) = O(f(x) · log(f(x))^O(1)).)

Õ( (|S|²|A| + |S||A| / (1 − γ)³) · log(M/ϵ) · log(1/δ) ).

This contribution reflects the first nearly linear time, nearly linearly convergent algorithm for solving DMDPs for intermediate values of γ. We also show how to obtain improved sublinear time algorithms provided we can sample from the transition function in O(1) time. Under this assumption we provide an algorithm which computes an ϵ-optimal policy for ϵ ∈ (0, M/√(1 − γ)] with probability 1 − δ in time

Õ( |S||A|M² / ((1 − γ)⁴ ϵ²) · log(1/δ) ).

Furthermore, we extend both of these algorithms to solve finite horizon MDPs. Our algorithms improve upon the previous best for approximately computing optimal policies for fixed-horizon MDPs in multiple parameter regimes. Interestingly, we obtain our results by a careful modification of approximate value iteration. We show how to combine classic approximate value iteration analysis with new techniques in variance reduction. Our fastest algorithms leverage further insights to ensure that they make monotonic progress towards the optimal value. This paper is one of the few instances of using sampling to obtain a linearly convergent linear programming algorithm, and we hope that the analysis may be useful more broadly.
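
As a concrete illustration of the variance-reduction idea behind these results, the following Python sketch runs approximate value iteration on a tabular DMDP: once per epoch it computes an exact Bellman backup against a fixed reference value, and within the epoch it only estimates the small-range correction P(v − v0) by sampling next states. This is a minimal sketch of the general SVRG-style technique, not the authors' exact algorithm; the array layout, the epoch and iteration counts, and the sample sizes are illustrative assumptions rather than the paper's tuned choices.

```python
import numpy as np

def variance_reduced_value_iteration(P, r, gamma, epochs=10, iters_per_epoch=50,
                                     samples_per_pair=64, seed=0):
    """Sketch of variance-reduced approximate value iteration (illustrative only).

    P: transition tensor of shape (A, S, S), P[a, s, s'] = Pr(s' | s, a)
    r: reward matrix of shape (A, S), entries in [-M, M]
    gamma: discount factor in (0, 1)
    """
    rng = np.random.default_rng(seed)
    A, S, _ = P.shape
    v = np.zeros(S)

    for _ in range(epochs):
        v0 = v.copy()
        # Expensive step, done once per epoch: exact expectations (P v0)[a, s].
        Pv0 = P @ v0                      # shape (A, S)

        for _ in range(iters_per_epoch):
            # Cheap inner step: estimate P (v - v0) by sampling next states.
            # Because v stays close to v0 within an epoch, v - v0 has small
            # range, so few samples suffice (the variance-reduction effect).
            diff = v - v0
            est = np.zeros((A, S))
            for a in range(A):
                for s in range(S):
                    nxt = rng.choice(S, size=samples_per_pair, p=P[a, s])
                    est[a, s] = diff[nxt].mean()
            q = r + gamma * (Pv0 + est)   # approximate Bellman backup
            v = q.max(axis=0)

    policy = (r + gamma * (P @ v)).argmax(axis=0)
    return v, policy
```

The paper's actual algorithms add further ingredients on top of this epoch structure, in particular updates guaranteed to make monotonic progress towards the optimal value and a sampling-only variant that never forms the full transition matrix, but the pattern of one exact reference backup followed by many low-variance sampled corrections is the core mechanism behind the running time bounds quoted above.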

Updated: 2021-04-22